I wrote about the standard versus proprietary structure dilemma before. Your information analysis will determine the vocabulary you want to use. You should always try a standard structure first, but if your domain-specific vocabulary is huge, you could end up implementing an overly complex custom layer on top of DITA, which gives you more trouble than benefit. In that case it's better to start from scratch and reuse only selected pieces, for example the CALS table structure. Your structure and publication code will be cleaner and simpler.
If you don't need much semantics, just some structure, then you should not reinvent the wheel by creating yet another DocBook- or HTML-like structure; go for a standard vocabulary instead.
I've also seen a project that moved the editorial process from DocBook to DITA because DocBook was hard to customize. Sure, DITA is better suited for customization, but the project used only a small part of the DITA standard (the most generic level, topic) and implemented a DocBook-like structure on top of it. What's the point? It was a huge effort, and where are the benefits?
Always think twice about the overall goal you want to achieve with your project. Your information platform design is a fundamental, long-term decision. It should have a (much) longer lifecycle than your CMS software, so choose wisely.
Let's assume you want several content types; you'll still likely have only one structure. Creating a separate structure for each object type is unusual, since they share at least the bottom level.
Top level
- Often present only in the publishing layer.
- Connects editorial units.
- In DocBook it's the chapter level; in a dictionary it's the list of word articles…

Semantic level
- The real semantics live here.

Bottom level
- Non-semantic elements, often common constructions like lists or basic formatting (bold, italic, subscript, superscript, etc.). Keep formatting non-specific and to a minimum in your semantic XML. Use semantic elements instead of formatting wherever possible. Specific formatting, like fonts, font sizes, and indentation, belongs in the style sheet, not in the semantic XML markup.
- This is the level where you'll most likely use mixed content (the element content model allows both text and sub-elements). It's possible to avoid mixed content, but the result won't be human-readable XML that users can edit directly.
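To make the levels concrete, here is a minimal sketch of a dictionary-style structure; all element names are hypothetical, invented purely for illustration:

```xml
<!-- Hypothetical dictionary structure; element names are invented for illustration -->
<wordList>                          <!-- top level: connects editorial units -->
  <article>                         <!-- semantic level: the real semantics -->
    <headword>run</headword>
    <definition>                    <!-- bottom level: mixed content -->
      To move <emphasis>quickly</emphasis> on foot.
    </definition>
  </article>
</wordList>
```

Note how the non-semantic `emphasis` element sits inside mixed content at the bottom, while everything above it stays purely structural.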
Naming the elements in the structure correctly is probably the most important task. Spend time on it! Names must come from the terminology the editors use every day; then they will be less confused when they start using the structure. Don't translate the names into another language: use the local language. XML can deal with Unicode characters, so don't be afraid of using special characters from your language. They make the XML more human readable.
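For example, a German editorial team might use its own terms directly, non-ASCII characters included (all names here are invented for illustration):

```xml
<!-- Hypothetical local-language vocabulary with non-ASCII element names -->
<wörterbuch>
  <stichwort>laufen</stichwort>
  <erklärung>sich schnell zu Fuß bewegen</erklärung>
</wörterbuch>
```

Editors who already say "Stichwort" every day will recognize the markup instantly, where an English translation would add a mental mapping step.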
Short element names are good, but consider whether an element is generic - used throughout the whole structure - or specific, present only in a substructure. Use short names for generic elements and longer names for specific ones. Don't overdo abbreviations.
If your vocabulary is really huge (for example, several hundred elements), then you can consider introducing several namespaces to express the semantics cleanly, instead of using very long specific element names.
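A sketch of how namespaces can carry the semantics instead of long element names; the prefixes and namespace URIs are invented:

```xml
<!-- Instead of <grammaticalNote> and <usageNote>, let the namespace carry the domain -->
<article xmlns:gram="http://example.com/ns/grammar"
         xmlns:usage="http://example.com/ns/usage">
  <gram:note>irregular verb</gram:note>
  <usage:note>informal</usage:note>
</article>
```

The element name stays short and generic (`note`), while the namespace says which domain it belongs to.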
Keeping to a naming convention is also important; it improves readability. I'm not aware of any de facto naming convention in XML: some people use camel case, others separate words with dashes or underscores. Just choose one style and be consistent.
If your content is diverse and the structure therefore large - several hundred elements - then you should split it into modules. First of all you can split by levels, as discussed above, but you can also modularize by content type.
Using multiple namespaces will also define your structure modules. The bottom-level, non-semantic elements will likely be referenced by all higher-level modules.
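One way to wire such modules together is schema composition. A sketch using XSD - file names and namespace URIs are invented, and a Relax NG or DTD setup would follow the same shape:

```xml
<!-- dictionary-module.xsd: a higher-level module reusing the common bottom-level module -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:base="http://example.com/ns/base"
           targetNamespace="http://example.com/ns/dictionary">
  <!-- the shared bottom-level module, imported by every higher-level module -->
  <xs:import namespace="http://example.com/ns/base"
             schemaLocation="base-module.xsd"/>
  <xs:element name="definition">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="base:emphasis" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Each content-type module imports the same base module, so formatting elements are defined exactly once.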
Your content will flow through several stages - via XML pipelines - until it gets published. For example:
- Editorial - clean, non-redundant structure
- Export - editorial content mixed with data/content from different 3rd-party sources
- Publishing - redundant structure prepared for publishing, generated indexes, etc.
- Application-specific format - prepared for a concrete publishing platform, for example:
- Web (HTML)
- Paper publishing or ebook, for consumption by page-oriented rendering software like XSL-FO, FrameMaker, InDesign, etc.
- Voice publishing
- etc.
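As an illustration, here is the same (invented) record at the editorial stage and again after the publishing pipeline has enriched it; all attribute and element names are hypothetical:

```xml
<!-- Editorial stage: clean, non-redundant -->
<article id="run-1">
  <headword>run</headword>
</article>

<!-- Publishing stage: redundant, enriched for rendering -->
<article id="run-1" sortKey="run" pageRef="412">
  <headword>run</headword>
  <indexEntry>run, verb</indexEntry>
</article>
```

The sort key, page reference, and index entry are all derivable, so they belong in the generated publishing structure, never in the editorial source.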
Mixing in 3rd party vocabulary
Often XML data/content from external 3rd-party sources gets converted into a "local language", since the company wants to store it in its local silo, and using the local XML dialect seems to simplify things: the XML is "more consistent", etc. Well, beyond its clear benefits, this can also raise many issues. Let's assume the 3rd-party XML that we want to integrate into our local silo has an element called "A". Our local data already uses the name "A", but with a slightly different content model or semantics. So we end up renaming the 3rd-party "A" to something else (call it "B")... because it's so easy to do. Yes, converting it is indeed easy, but the long-term maintenance won't be.
Developers - and actually end users too - work with this terminology, and they'll get confused. These ad-hoc conversions usually don't get properly documented, or people don't take the time to read the documentation. In the end, remapping between the different vocabularies leads to confusion and inefficient work.
In my opinion, keeping the original 3rd-party terminology with its namespace is the better way to go, provided the structure is sound and the source is trusted.
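Kept under its own namespace, the third-party element can live alongside the local vocabulary without any renaming; the namespace URIs below are invented:

```xml
<!-- Local document embedding supplier content verbatim, under its own namespace -->
<article xmlns="http://example.com/ns/local"
         xmlns:ext="http://example.org/ns/supplier">
  <headword>run</headword>
  <!-- third-party element kept with its original name "A"; no clash with a local "A" -->
  <ext:A>supplier-provided data</ext:A>
</article>
```

The namespace prefix makes the origin of every element obvious, so no mapping table is needed and the supplier's documentation still applies as written.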