We need to extract:
- data - numeric, text & binary
- metadata - list of tables, column types, indexes
Sometimes it is best to filter the data and extract only what we need. Think twice before leaving data behind, though: you might realise later that you need it after all. Besides archiving the data, it should also be backed up "as is". In the worst case, the data can then be re-archived from the backup.
The data can be filtered by:
- tables (make an inclusion / exclusion list)
- arbitrary business logic (for example, older than five years)
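Both filters can be combined in the extraction query. A minimal sketch, assuming SQLite, an illustrative table inclusion list, and a hypothetical `created_at` date column for the "older than five years" rule:

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical inclusion list and age cutoff; adjust to the actual schema.
INCLUDED_TABLES = ["orders", "invoices"]
CUTOFF = date.today() - timedelta(days=5 * 365)  # "older than five years"

def extract_rows(conn):
    """Yield (table, row) pairs for the tables and rows selected for archiving."""
    for table in INCLUDED_TABLES:
        # Assumes each included table has a 'created_at' date column (illustrative).
        cur = conn.execute(
            f"SELECT * FROM {table} WHERE created_at < ?",
            (CUTOFF.isoformat(),),
        )
        for row in cur:
            yield table, row
```

Tables not on the inclusion list are simply never queried, while the business-logic filter is pushed into the `WHERE` clause so the database does the work.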
The extracted data can be serialized into different formats, for example XML, JSON, CSV, etc. We use XML as the basis here.
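As a sketch of what such serialization could look like, here is one possible row-to-XML layout using Python's standard library (the element and attribute names are illustrative assumptions, not a fixed archive format):

```python
import xml.etree.ElementTree as ET

def rows_to_xml(table_name, column_names, rows):
    """Serialize the rows of one table into an XML string (illustrative layout)."""
    table_el = ET.Element("table", name=table_name)
    for row in rows:
        row_el = ET.SubElement(table_el, "row")
        for col, value in zip(column_names, row):
            cell = ET.SubElement(row_el, "column", name=col)
            cell.text = "" if value is None else str(value)
    return ET.tostring(table_el, encoding="unicode")
```

Using a proper XML library rather than string concatenation also takes care of escaping, which matters for the pitfalls listed below.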
Pay attention to:
- Character encoding - The legacy source data may use any old character encoding, which can cause problems. Ideally, the extraction tool handles this properly and converts the data to a modern encoding, such as UTF-8.
- Escape XML syntax characters - XML syntax characters, such as <, can occur in the data. These must be escaped with character entities, such as &lt;.
- Invalid XML characters - The XML standard excludes some characters; for example, most characters below the space character (decimal code 32), which legacy applications used as control characters. Here you find the specification. There are possible workarounds, such as base64-encoding the affected values, but it is rarely worth it: such data is usually garbage and holds no value. Make sure it is worthless before you delete it.
Please note that XML version 1.1 extends the supported character set compared to version 1.0; still, it is not recommended, since it is not widely implemented by XML vendors.
- Data formats - Numeric values and date formats are locale-specific.
- Binary data - XML can embed binary data as base64-encoded text, and the extraction tool might use this approach. The binary string must then be extracted from the XML, decoded, and saved as a binary file. Binary values are usually not stored in an XML database.
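The character-level pitfalls above can be handled with a few lines of Python. A minimal sketch, assuming a Latin-1 source encoding (an assumption; the real legacy encoding must be determined per database) and using the XML 1.0 allowed character ranges:

```python
import base64
import re

# Characters allowed by XML 1.0: Tab, LF, CR, and the ranges below; everything
# else (mostly legacy control characters) is stripped.
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def clean_text(raw_bytes, source_encoding="latin-1"):
    """Decode legacy text to Unicode and drop characters XML 1.0 forbids."""
    text = raw_bytes.decode(source_encoding)  # source encoding is an assumption
    return _INVALID_XML10.sub("", text)

def decode_binary_column(b64_text, path):
    """Save a base64-encoded XML text node as a standalone binary file."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_text))
```

Stripping rather than base64-encoding the invalid control characters reflects the advice above: verify first that they carry no value, then drop them.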
5.1.3 Extraction tools
There are many ETL tools to choose from, ranging from heavy, expensive enterprise software to free, open-source tools. Which one is best depends on the project and the skills of the project members.
If you have to archive many databases, automation is your best friend. Extracting many databases manually with a GUI-driven tool is inefficient and error-prone.
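Such automation can be as simple as a driver script that loops over the databases and records failures instead of stopping. A minimal sketch, where `extract_database` is a hypothetical placeholder for the real extraction step:

```python
import logging

logging.basicConfig(level=logging.INFO)

def extract_database(url):
    """Placeholder for the real extraction step (tool invocation, SQL dump, ...)."""
    logging.info("extracting %s", url)
    return f"{url}.xml"  # pretend we produced an archive file

def extract_all(urls):
    """Run extraction over many databases, collecting failures instead of aborting."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(extract_database(url))
        except Exception as exc:
            failures.append((url, exc))
    return results, failures
```

Collecting failures and reviewing them at the end avoids one bad database halting an overnight batch run.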
Some examples and a brief comparison: