Conversion types and approaches
Uptranslation is a hard nut to crack. There is no generic solution. We cook with whatever we have at home, meaning we have to use all the information that is available in our input documents:
- text patterns
- relevant external knowledge
Every uptranslation project is different. We need to know the input quite well, so start with a thorough analysis. If possible, it is also worth collecting data about the editorial workflow behind our input documents:
- How many authors worked on these? Usually fewer is better from our perspective, since their style and skills then do not vary much.
- Did they use common templates?
- Any standard or custom styles?
- Strict workflow? Quality assurance?
One-time versus integrated conversion
One-time uptranslation is usually done to “jump up” to a structured information platform and stay there by using editorial tools that support this structure. This is the typical scenario when editorial work used to be done in an unstructured way and the company decides to move to structured authoring.
In this case the uptranslation code will only be used once, and it is likely run by the developer directly or under the developer's close supervision. The developer knows “all” the input and can react and adapt quickly to corner cases that do not pass. Also, at this point the structure is likely still malleable and can be adjusted to use cases that were unknown until now.
Integrated uptranslation has to be built much more robustly, since:
- It’ll be used in the long run. Maintenance must be easy.
- The input is more unpredictable, the solution has to be generic.
- The structure is more rigid at this point, therefore it must be quite mature.
- The developer does not supervise the process, so it must be robust and have good error handling and reporting.
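As a rough illustration of the error handling and reporting point, here is a minimal Python sketch of a batch run that records failures instead of stopping at the first one. The names `ConversionReport`, `convert_all` and `demo_convert` are hypothetical, and `demo_convert` is only a stand-in for a real conversion step.

```python
from dataclasses import dataclass, field


@dataclass
class ConversionReport:
    converted: list = field(default_factory=list)   # names of documents that passed
    failures: dict = field(default_factory=dict)    # name -> error message


def convert_all(documents, convert):
    """Run `convert` on each (name, content) pair and record the outcome.

    A failing document is recorded with its error message; the batch continues.
    """
    report = ConversionReport()
    for name, content in documents:
        try:
            convert(content)
        except ValueError as error:
            report.failures[name] = str(error)
        else:
            report.converted.append(name)
    return report


def demo_convert(content):
    # Hypothetical stand-in for the real conversion step.
    if not content.strip():
        raise ValueError("empty document")
    return content.upper()


report = convert_all([("a.docx", "Some text"), ("b.docx", "   ")], demo_convert)
```

Since nobody supervises an integrated run, the report is what the editors or operators will actually see, so it pays to keep the messages specific to the document and the rule that failed.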
Templates, styles, user contribution
Typically the unstructured input is an office document. It might be based on a template with standard headlines that we can match. Of course this template could have changed over time, so be aware that you might need to use alternative match expressions.
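Matching template headlines with alternative expressions could look roughly like the following Python sketch. The headline variants and section keys are invented for illustration; in a real project they would come out of the analysis of the actual template versions.

```python
import re

# Hypothetical headline variants, one pattern per target section.
# Alternatives cover wording that changed between template versions.
SECTION_PATTERNS = {
    "introduction": re.compile(r"^(introduction|overview|preface)\s*:?\s*$", re.IGNORECASE),
    "requirements": re.compile(r"^(requirements|prerequisites)\s*:?\s*$", re.IGNORECASE),
}


def classify_headline(line):
    """Return the section key for a headline line, or None if nothing matches."""
    text = line.strip()
    for key, pattern in SECTION_PATTERNS.items():
        if pattern.match(text):
            return key
    return None
```

Here `classify_headline("Overview:")` and `classify_headline("Introduction")` both map to the same section, while ordinary body text returns `None` and can be handled by other rules.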
If pattern matching cannot be relied on and the layout or formatting does not help either, so the input basically does not carry the necessary information about the structure in any form, then we should ask the user to add at least some markers. For office documents, we can define custom styles that hold structural or semantic information, which can be used during the conversion to build up the desired structure.
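A minimal Python sketch of the custom-style idea follows. It assumes the paragraphs have already been extracted from the office document as (style name, text) pairs; the style names (`XTitle`, `XWarning`, `XStep`) and the target element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from custom paragraph style names to target elements.
STYLE_TO_ELEMENT = {
    "XTitle": "title",
    "XWarning": "warning",
    "XStep": "step",
}


def build_structure(paragraphs):
    """Turn (style_name, text) pairs into a flat XML tree.

    Paragraphs with an unmapped style fall back to a generic <p> element.
    """
    root = ET.Element("topic")
    for style, text in paragraphs:
        element = ET.SubElement(root, STYLE_TO_ELEMENT.get(style, "p"))
        element.text = text
    return root


paragraphs = [
    ("XTitle", "Replacing the filter"),
    ("Normal", "Follow these steps."),
    ("XWarning", "Switch off the power first."),
    ("XStep", "Open the cover."),
]
xml_text = ET.tostring(build_structure(paragraphs), encoding="unicode")
```

The sketch only produces a flat sequence of elements; nesting (for example grouping steps into a procedure) would need a further pass, but the style names already give that pass reliable markers to work from.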