Content / data flow control
The flow architecture
Complex content and data flows require sophisticated control as well. Let’s take a fairly general but slightly simplified scenario, so we can avoid overly abstract language: a content publisher company acquires content and data from external sources, also authors content internally with its own editors, merges all of this together, and publishes it on the net — this is their “live data”.
So let’s sketch this scenario:
We split the “playground” into three zones:
- external: data we can acquire but cannot control; its quality is questionable
- internal: our company’s domain; here lie our data manufacture, assembly line, checkpoints…
- live: our data exposed to the public; it must be of the highest possible quality
It looks fine so far, but if we zoom in a bit, a gradually built-up system can look like this:
When a system evolves gradually, each small extension tends to be done the easiest, cheapest, shortest way, without much thought about how the whole process will be controlled in the long run.
In this diagram the data comes in at a single point, guarded by border control, and goes out at a single point, also “guarded”. Between these two points the data “flows” on a “stream”. If we have many system components with direct data flows between them — each using different data formats, integration technologies, testing, and monitoring — we can easily end up with an uncontrollable mess. If instead we have one stream, and every system component puts data on it and pulls data from it following certain rules, then we can keep our data/content flow under control.
Publishing content repository
So what is this “stream” that all system components put data on and pull data from? A better name for it is the publishing content repository. It can store the different
- stages of the data flowing through the pipelines
- revisions of the data
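To make the two dimensions concrete, here is a minimal in-memory sketch (the class and method names are hypothetical, not from any real product): every component reads and writes through one store, keyed by document and stage, and every write becomes a new revision.

```python
from collections import defaultdict

class PublishingRepository:
    """Toy sketch of a central publishing content repository.

    Content is addressed by (document id, pipeline stage); every write
    is kept as a new revision so earlier states stay retrievable.
    """

    def __init__(self):
        # {(doc_id, stage): [revision 0, revision 1, ...]}
        self._store = defaultdict(list)

    def put(self, doc_id, stage, content):
        revisions = self._store[(doc_id, stage)]
        revisions.append(content)
        return len(revisions) - 1        # revision number of this write

    def get(self, doc_id, stage, revision=-1):
        # Default: the latest revision of the given stage.
        return self._store[(doc_id, stage)][revision]

repo = PublishingRepository()
repo.put("article-42", "raw", "<p>draft</p>")
repo.put("article-42", "normalized", "<p>Draft.</p>")
rev = repo.put("article-42", "normalized", "<p>Draft, edited.</p>")
print(repo.get("article-42", "normalized"))     # latest revision
print(repo.get("article-42", "normalized", 0))  # an earlier revision
```

The point of the sketch is the interface, not the storage: a real system would back this with git, a document store, or an XML database, as discussed below.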
Whether this repository plays a central role in the system, or the system is rather built around the CMS, depends on the project. When you design the system and ask yourself this question, be aware that the component playing the role of the central content store has the most dependencies, so it will be the hardest to replace later on. In exchange, every other component becomes easier to replace, since each one integrates with only this single component instead of several others.
If the repository is internal only and stores pre-generated content (structured content and formatted documents), then even a file-based version control system with a proper API, such as [git](https://en.wikipedia.org/wiki/Git_(software)), can serve us well.
If the repository also serves directly as the backend of live web applications (for humans) or web services (for machines), then we need quick access to our content, especially if web pages are generated from structured content on demand. In that case a document store or an XML database can be a good choice.
Content flow testing
The flow of diverse content can become overly complex because of the countless small details in the business logic. Without a proper, automated test environment it is impossible to stay in control.
Publishing pipe code testing
The pipe transformation code must be tested thoroughly, and not only with the current production data: at any given point in time the production data might not contain all the data corner cases, and those cases are easily forgotten as time goes by.
The only proper way is to set up specific test data that covers all the important cases and corner cases. We can store both the content input and the correct output in a version control system (alongside the code), and whenever we modify the transformation code, we run our tests and use the version control system to check whether any content output changed. Some output changes are good, some are not; either way, the test system notifies us how our code change affected the content transformation. This is regression testing. I suggest building up a proper test set gradually: as we develop, we add test cases for every new feature.
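The regression loop above can be sketched in a few lines. This is a simplified stand-in, not a real harness: the directory layout (`<case>/input.xml` plus `<case>/expected.xml`) and the function name are assumptions for illustration, and in practice the expected outputs would live in version control so that intentional changes can be committed as the new baseline.

```python
import difflib
from pathlib import Path

def run_regression(transform, cases_dir):
    """Run `transform` on every stored input and diff the result against
    the stored expected output. Returns {case name: unified diff} for
    every case whose output changed.

    Assumed (hypothetical) layout: one directory per corner case,
    containing input.xml and expected.xml.
    """
    report = {}
    for case in sorted(Path(cases_dir).iterdir()):
        actual = transform((case / "input.xml").read_text())
        expected = (case / "expected.xml").read_text()
        if actual != expected:
            diff = difflib.unified_diff(
                expected.splitlines(), actual.splitlines(),
                "expected", "actual", lineterm="")
            report[case.name] = "\n".join(diff)
    return report
```

An empty report means the code change left every stored output untouched; a non-empty report is what the team reviews, deciding case by case whether the change was intended.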
Every programming language has a unit test framework, and XSLT is no exception. The most widely used one is XSpec.
Content structure testing
Content structure testing is called validation, which we have already covered. The structure of the content can be validated against a DTD or a schema. Beyond that, Schematron can check complex structural and data logic via XPath expressions.
Content testing via the baseline approach
Beyond code and structure testing we also need to track how the content itself is evolving. This looks like a classic review process, but it also double-checks that the publishing pipes did not make any unexpected changes.
The idea is simple: we assume that our live content is perfect (it likely is not, but it is the best we have), and before we push new changes out to live, we always compare the two and check the differences. It is a manual process and works only with pre-generated content.
Our publishing repository does version control, so by comparing revisions it can tell:
- which documents got changed
- what are the changes in each document
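Both questions reduce to diffing two repository snapshots. A minimal sketch, assuming each snapshot is simply a mapping from document name to content (the function name is made up; a git-backed repository would get this from `git diff` instead):

```python
import difflib

def baseline_report(live, candidate):
    """Compare the live baseline against the candidate release.

    `live` and `candidate` are {document name: content} snapshots,
    i.e. two revisions of the publishing repository. Returns
    {document name: unified diff lines} for every changed document —
    which answers both "which documents changed" and "what changed".
    """
    report = {}
    for name in sorted(set(live) | set(candidate)):
        old = live.get(name, "").splitlines()       # missing = new document
        new = candidate.get(name, "").splitlines()  # missing = deleted document
        if old != new:
            report[name] = list(difflib.unified_diff(
                old, new, "live", "candidate", lineterm=""))
    return report

live = {"a.html": "<p>Hello</p>", "b.html": "<p>World</p>"}
candidate = {"a.html": "<p>Hello!</p>", "b.html": "<p>World</p>"}
print(list(baseline_report(live, candidate)))  # only a.html changed
```

The report is what the editors review before the candidate revision is promoted to live.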
Document changes can be of different types:
If there are markup changes only, that can indicate the changes were made programmatically, i.e. the publishing pipes changed. It is good practice to isolate human and programmatic changes into separate revisions: let the editors publish all human content changes first, then deploy the pipeline code change and run the content regeneration separately. A pipeline code change can spread markup changes across many documents, which makes such a change hard to test manually.
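One way to classify a change as markup-only is to strip the markup from both revisions and compare only the remaining text. A small sketch of that idea (the function names are invented, and real pipelines would also need to cope with attributes and whitespace normalization):

```python
import xml.etree.ElementTree as ET

def text_only(xml_text):
    """Concatenate the human-readable text, discarding all markup."""
    return "".join(ET.fromstring(xml_text).itertext())

def change_kind(old, new):
    """Classify a document change: unchanged, markup-only, or text."""
    if old == new:
        return "unchanged"
    if text_only(old) == text_only(new):
        return "markup-only"   # likely programmatic: the pipes changed
    return "text"              # likely editorial: a human changed content

print(change_kind("<p>Hi</p>", "<p><b>Hi</b></p>"))  # markup-only
print(change_kind("<p>Hi</p>", "<p>Hello</p>"))      # text
```

Flagging markup-only changes this way lets the review interface separate suspected pipeline effects from genuine editorial work, which is exactly why the two should land in separate revisions.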
This test is largely human driven, so it is highly important to provide an intuitive, easy-to-use interface. Pushing an overly complex workflow onto the editors can backfire and provide only false safety, not real control.
Baseline testing has to sit at the very end of the pipeline, as close to the end product as possible. If we have many publishing channels, this can be quite challenging, since manually checking the changes for every channel is laborious. If we move the control point from the publishing endpoint closer to the editorial process, before the content gets split into multiple channels, then we save work and still control the editorial output — but then the pipe automation must be trusted. Of course, the pipe automation can be tested quite thoroughly with the methods described earlier: regression and unit tests.