6.1 Does the archive contain all data correctly?
This question asks whether the ETL process, that is, the data migration from the source database into the archive system, worked correctly:
- all the data got transferred
- the data did not get corrupted
Some archive systems have built-in tests (like the "chain of custody" test in InfoArchive), but these only verify the load/ingest process.
To answer the question raised in this section, we need to compare the source data with the data present in the archive: an end-to-end test.
If the data is not remodelled (for example, the relational model is kept), then the comparison is more straightforward, since the data structure is preserved. We still have corresponding concepts for tables, rows, and columns in the archive, and it does not matter much that they are represented as XML.
We can test by
- gathering statistics
- comparing values
The statistical approach is simpler, but not very comprehensive. We can write scripts for both ends (source and target repositories) that report how many tables, columns, and rows we have; these reports can then be compared. If the data source is relational, SQL can be used to generate the report, while on the archive end, for example, XQuery can be used.
If we want to compare values instead, then we generate checksums at the value, row, or table level and include these in the reports we compare.
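A minimal sketch of row- and table-level checksums, assuming both ends normalize values to text the same way before hashing (the concrete normalization rules are an assumption here and must match whatever SQL and XQuery produce on the two ends):

```python
import hashlib

def row_checksum(values):
    """Checksum a single row: normalize each value to text and hash it.

    The same normalization must be applied on both ends (SQL on the
    source, XQuery on the archive), otherwise identical data yields
    different checksums.
    """
    # A unit separator avoids ambiguity between ('ab', 'c') and ('a', 'bc').
    normalized = "\x1f".join("" if v is None else str(v) for v in values)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def table_checksum(rows):
    """Table-level checksum that is independent of row order:
    sort the row checksums, then hash the concatenated list."""
    row_hashes = sorted(row_checksum(r) for r in rows)
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()
```

Making the table checksum order-independent matters because the archive rarely guarantees the same row order as the source export.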
If we alter the data during the ETL process, for example, by filtering out legacy control characters (invalid in XML), then the comparison is harder still, because the source data must be preprocessed the same way before its checksums are computed.
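For instance, if the ETL strips control characters that are invalid in XML 1.0, the comparison script has to run the source values through the same filter first. A sketch of such a filter (the exact character set stripped by a given ETL tool is an assumption and should be checked against its documentation):

```python
import re

# Control characters invalid in XML 1.0: everything below U+0020
# except tab (\x09), newline (\x0a) and carriage return (\x0d).
# Surrogates and U+FFFE/U+FFFF are also invalid but rarely appear
# in legacy text columns.
_XML_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def normalize_for_xml(text):
    """Apply the same filtering the ETL applies, so that checksums of
    the source data match checksums of the archived XML."""
    return _XML_INVALID.sub("", text)
```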
If the archive is SIP-based and the data was remodelled and aggregated during the ETL process, then verifying that all the data was archived correctly is even harder. In that case we need an alternative aggregation process: if the data was aggregated after the export, we run a test aggregation before the export, in SQL, directly on the relational database, or we use the legacy application's report functionality to do it for us.
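A sketch of such a test aggregation, run directly on the source database. The `invoice_lines` schema is hypothetical: it stands in for a case where the archive stores one aggregated record per invoice, so the same aggregate computed independently in SQL gives us a reference to compare the SIPs against:

```python
import sqlite3

# Hypothetical schema: the ETL aggregates line items into one archived
# record per invoice. Reproducing that aggregation in SQL, directly on
# the source database, yields an independent reference result.
TEST_AGGREGATION = """
    SELECT invoice_id,
           COUNT(*)    AS line_count,
           SUM(amount) AS total_amount
    FROM invoice_lines
    GROUP BY invoice_id
    ORDER BY invoice_id
"""

def reference_aggregate(conn):
    """Run the test aggregation and return (invoice_id, line_count,
    total_amount) tuples to compare against the archived SIPs."""
    return conn.execute(TEST_AGGREGATION).fetchall()
```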
Testing is a must, but it's still always hard to justify investments into writing tests. If you archive many databases, then investing in automating the ETL process pays off quickly. Automation also helps make the ETL process more reliable, although it does not make testing unnecessary.