5.2.1 Data types
Unfortunately data types are defined redundantly, at multiple places in the archive workflow:
- source database - If it’s (typically) a relational database, then it uses standard SQL data types, but also vendor specific data types.
- archive system - The archive system itself defines data types on the high level, they can appear in the configuration, admin GUI defines search forms, where each search field has a data type.
- archive database - If it’s an XML database, then likely it uses data types from a schema language it supports. For example W3C Schema has a long list of primitive and derived data types. Being aware of the differences between the integer and the int types is far from trivial.
These three data type definitions must be harmonized. For a huge data set, there are plenty of details in this equation, dealing with them is error-prone. Please don't underestimate.
5.2.2 Syntax & granularity
Transformation must convert the extracted data to a syntax the archive system knows. If the data is extracted as XML and the archive also stores XML, still they likely use a different schema. Therefore the extracted XML must be converted.
If the conversion does not remodel the data (for example from relational to hierarchical), then it's a relatively simple task. The only challenge is the data size. The conversion must also work with huge input files, so it must use a relatively low memory footprint during the conversion.
The archive system might require different granularity what the extraction tool provides. For example, the extractor tool creates a single output file per database table, but the archive system has a file size limit it's able to load. Then the transformation also has to split the files.
5.2.3 Record aggregation
If we want to store hierarchical, aggregated records in the archive system, but we have relational data in the source databases, then we must remodel the data.
It's a complex task, requires good understanding of the data on the database schema level and its business logic.
There is probably one exception: if we use the legacy application itself to do the aggregation by triggering its export/report functionality. If we're lucky, then this will give us structured aggregated data (for example in hierarchical XML format). If we're a bit less lucky, then it'll only produce unstructured reports for human consumption, as a PDF invoice for example. It's presentable for the user, but it's much harder to search its content. We can archive this together with its metadata what is better searchable.
If we the legacy application does not have export/report function what we can use, then we have to implement the record aggregation on a lower level. It's tough to do this on the already extracted data, serialised as XML files because these can be large files and for collecting all the data for the aggregated records, we would need to seek those files back and forth, which can be slow. We need a database for fast lookup.
Therefore we can use the source relational database and SQL to query the source and construct the aggregated records.
We can also do this in XML if we load all exported data (serialised as XML) into an XML database. In this database, we can remodel the data using XQuery.
Remodelling a big dataset can take long machine time also. It could still be worth it, since then we don't have to do the same in real time when the end user runs a search on an archive which uses the relational model. So if we remodel in advance and store aggregated records in our archive, we could get far better performance on the end user searches.
Summary of all options for record aggregation:
- Application level - Use the legacy application’s export/report functionality, then you either get
- structured aggregated data (like XML)
- unstructured aggregated data (like PDF), then you should add structured metadata bound to this also
- Data level - two possible options:
- SQL - close to the source
- XQuery in an XML database, after data extraction
The data we’ve extracted and transformed needs to be packaged before loading it into the archive.
If you archive relational data directly, then you have "table files". Those might don't require special packaging, except you need to keep a naming convention to make them easier to process and add a manifest/configuration file to describe the data on the meta level.
If you use the OAIS standard and have to create SIP packages, then use a packaging tool or APU, for example: OpenText InfoArchive SIP SDK.