Choosing the right technology and designing a generic framework that runs the business logic is essential. We typically use a pipeline approach: conversion involves a great amount of detail and complexity, so we want to split the complex task into small, easy steps. Executing the separate conversion steps one by one, in a specific order, forms the pipeline.
There are many advantages to adopting a standard XML pipeline framework such as XProc.
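As a minimal sketch of such a pipeline (the stylesheet names are hypothetical, and this assumes an XProc 3.0 processor), two XSLT steps can be chained so that the result of the first feeds the second:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Step 1: build lists from the input's layout -->
  <p:xslt>
    <p:with-input port="stylesheet" href="build-lists.xsl"/>
  </p:xslt>

  <!-- Step 2: normalize the lists produced by step 1 -->
  <p:xslt>
    <p:with-input port="stylesheet" href="normalize-lists.xsl"/>
  </p:xslt>
</p:declare-step>
```

The default readable port wires each step's output to the next step's input, so the pipeline reads top to bottom just like the conversion itself.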
Also, with simplicity and maintainability in mind, it is best to use declarative programming. XSLT is a good tool for the job: very powerful and fast. Keep the complexity low: do one relatively simple thing in each step and use its output as the input for the next step. Each step transforms the input a little and brings us a bit closer to the desired outcome. For example, one step can build lists based on the input's layout, and the next step can normalize them, merging two lists into one if there is no content in between.
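The list-merging step mentioned above could look like this in XSLT 3 (a hedged sketch: the `list` and `item` element names are hypothetical placeholders for your vocabulary):

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Wherever lists occur, collapse each run of adjacent lists into one -->
  <xsl:template match="*[list]">
    <xsl:copy>
      <xsl:for-each-group select="*" group-adjacent="boolean(self::list)">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <list>
              <xsl:apply-templates select="current-group()/item"/>
            </list>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Note how small the step is: it does one thing (merging), and trusts the previous step to have built the lists correctly.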
The pipeline engine can keep the parsed XML document model in memory and avoid serializing it to XML syntax at each step, unless you are in debug mode and want to examine closely what each step did.
If the input is a huge document that cannot be kept in memory, add a step that splits it up into smaller chunks… or, if you must keep it as one document, use streaming (XSLT 3), but be aware that streaming constrains the power of the language considerably.
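A splitting step might look like the following streamable XSLT 3 sketch (the `chapter` element name is hypothetical, and whether a given construct is streamable should be verified with your processor):

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

  <!-- Write each chapter to its own chunk file;
       xsl:copy-of makes a grounded (in-memory) copy of just this subtree -->
  <xsl:template match="chapter">
    <xsl:result-document href="chunk-{position()}.xml">
      <xsl:copy-of select="."/>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
```

Only one chapter at a time is materialized in memory, so the whole document never has to fit.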
Do not lose content!
The worst thing ever is when the conversion silently loses content. It may be acceptable if it cannot process something and notifies the user about it, but if (important) content is lost without a trace… that is a fatal error. It is mission critical, because even if the user reviews the result of the conversion, human beings (most of the time) simply cannot notice small differences.
So we need to build up the conversion in such a way that nothing can slip through. That is the typical streaming design: the content flows through the processing, and even when some unexpected input arrives, we mark it up as well, perhaps with an "unknown" tag, and deal with it later, consult the customer, etc. The main point is that we have uncovered the unexpected.
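A catch-all template can implement this safety net. In this hedged fragment the `unknown` wrapper element and the `convert` mode are hypothetical names; the low priority guarantees that any element you did write a specific template for wins:

```xml
<!-- Anything no other template claims is preserved and flagged,
     never silently dropped -->
<xsl:template match="*" mode="convert" priority="-10">
  <unknown original-name="{name()}">
    <xsl:apply-templates mode="#current"/>
  </unknown>
</xsl:template>
```

A later step (or a Schematron check) can then report every `unknown` element, so nothing escapes review.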
Validation is the most obvious way of testing. The structure you convert to must have a schema; always validate against this schema at the end. You can also turn the schema into rules that the conversion uses all the way through, deciding what may come next and what must not. So the schema can also control the conversion; it is not only useful at the end, when the moment of truth comes: does the result validate?
We might have to test more than a structural schema can do for us. Then we go for rule-based testing, for example using Schematron. As Wikipedia puts it: "Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees." It is very powerful and based on XPath.
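For example, a rule that a structural schema typically cannot express cleanly, such as "no list may be empty and no item may be blank", is a one-liner in Schematron (element names are again hypothetical):

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="list">
      <!-- assert fires when the test is FALSE -->
      <assert test="item">A list must contain at least one item.</assert>
      <!-- report fires when the test is TRUE -->
      <report test="item[not(normalize-space())]">Blank list item found.</report>
    </rule>
  </pattern>
</schema>
```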
Beyond these we often need to take a close look and do some manual checking. Of course, since uptranslation is mostly heuristic, we usually cannot build it as robust as we would like. But how often can we check a big data set manually? Repeating that frequently would be very time-consuming and error-prone. So do it the smart way: use baseline testing, which is a kind of regression test. Run a conversion on a document, test the output, correct it; good? Then add both the input and the output to your version control system. You can store these in the same project, together with your uptranslation source code (in svn, git, etc.). As you make progress with development and test new features, keep adding these cases to your test baseline (that is what we trust); then run all test cases in bulk and compare (using your version control system) whether the output is still the same. If there is a change, evaluate it and update your code or your baseline.
Start collecting test cases as early as possible, right when you start coding. Each time you finish a new feature or fix a bug, run all tests and check that your code change led only to the desired content changes. This is extremely useful when you have to fix bugs in a complex, integrated conversion.
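The bulk comparison itself can be very simple. As a hedged sketch (directory layout, file extension, and the function name are all hypothetical), a Python helper can report every output file that no longer matches its baseline:

```python
from pathlib import Path


def diff_against_baseline(baseline_dir: str, output_dir: str) -> list[str]:
    """Return names of files whose fresh output differs from the baseline.

    A file counts as changed when it is missing from the output directory
    or when its text differs from the stored baseline copy.
    """
    changed = []
    for expected in Path(baseline_dir).glob("*.xml"):
        actual = Path(output_dir) / expected.name
        if not actual.exists() or actual.read_text() != expected.read_text():
            changed.append(expected.name)
    return sorted(changed)
```

In practice your version control system gives you the same answer with a richer diff view; a helper like this is mainly useful for wiring the check into a CI job.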
Usually a standard text-level comparison is fine, as long as you keep using the same XML serializer. If you make essential changes or replace the serializer, that can lead to too many differences to evaluate manually. Then you need to normalize both sides of the comparison (the old and the new baselines), for example:
- convert both sides to the canonical XML format
- … or just use an XML compare tool
- a simple, independent step can temporarily remove newly added markup if it is spread everywhere
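The canonical-format option can be done with many tools; as a minimal sketch, Python's standard library (3.8+) ships W3C Canonical XML support, so attribute order and self-closing-tag differences stop producing false diffs:

```python
from xml.etree.ElementTree import canonicalize


def same_after_c14n(old_xml: str, new_xml: str) -> bool:
    """Compare two documents after canonicalization (C14N), so that
    serializer-level differences such as attribute order or <a/> vs.
    <a></a> no longer register as changes."""
    return canonicalize(old_xml) == canonicalize(new_xml)
```

Run both baselines through the same canonicalizer before the text comparison, and only real content differences remain.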