4.3 Accessing the data

Accessing the data usually happens via search or navigation. Search is the most common approach, but how it is defined and how it works depend on whether the data is:

structured
- aggregated
- not-aggregated
unstructured

4.3.1 Structured and aggregated

This is the easiest scenario. The user wants to select one or a few pre-aggregated records. These records are typically hierarchical XML (like example 3). We identify some of their elements as search fields, and these make up the query form. In our example, the employee name and travel date are natural search fields.

In an archive system based on an XML database, the query can easily be defined in a GUI, and the query itself (XQuery) can be auto-generated by the system.
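As an illustration of such auto-generation, the following minimal sketch builds an XQuery string from the values entered in a query form. It is not the generator of any particular archive product: the function name build_xquery, the collection name travel-archive and the element names travel, employeeName and travelDate are all assumptions made for this example.

```python
def build_xquery(collection, record_element, criteria):
    """Build an XQuery FLWOR expression selecting pre-aggregated records.

    criteria maps a search-field element name to the value entered in the
    query form; empty form fields are skipped.
    """
    def quote(value):
        # In an XQuery string literal, '&' is written as an entity reference
        # and a double quote is escaped by doubling it.
        text = str(value).replace("&", "&amp;").replace('"', '""')
        return '"' + text + '"'

    conditions = [
        f"$r/{field} = {quote(value)}"
        for field, value in criteria.items()
        if value not in (None, "")
    ]
    query = f"for $r in collection({quote(collection)})//{record_element}"
    if conditions:
        query += "\nwhere " + "\n  and ".join(conditions)
    return query + "\nreturn $r"


# Hypothetical form input: all names below are assumptions for this sketch.
print(build_xquery(
    "travel-archive",
    "travel",
    {"employeeName": "John Smith", "travelDate": "2017-10-01"},
))
```

The field names are assumed to come from the search configuration, not from the user, so only the values are escaped.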

4.3.2 Structured and not-aggregated

If the data is not aggregated (like example 4), then the search has to do two things:

filter the data
aggregate the data

Aggregation is the more challenging of the two, since it requires good knowledge of the data. The query can be complex to write, and it also uses more machine resources to execute.
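The two steps can be sketched as follows, assuming the relational rows were archived as flat XML elements. The element names employee, travel, id and employeeId, as well as the sample data, are hypothetical and only stand in for the real table schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical un-aggregated archive: one element per relational row.
ARCHIVE = """
<archive>
  <employee><id>1</id><name>Ann Lee</name></employee>
  <employee><id>2</id><name>Bob Ray</name></employee>
  <travel><id>10</id><employeeId>1</employeeId><date>2017-10-03</date><destination>Berlin</destination></travel>
  <travel><id>11</id><employeeId>1</employeeId><date>2017-11-20</date><destination>Paris</destination></travel>
  <travel><id>12</id><employeeId>2</employeeId><date>2017-10-15</date><destination>Rome</destination></travel>
</archive>
"""


def travels_by_employee(root, name):
    """Filter employees by name, then aggregate their travel rows."""
    # Step 1: filter the employee rows.
    employee_ids = {
        e.findtext("id") for e in root.iter("employee")
        if e.findtext("name") == name
    }
    # Step 2: aggregate by joining the travel rows on the foreign key.
    records = []
    for employee_id in sorted(employee_ids):
        trips = [
            {"date": t.findtext("date"),
             "destination": t.findtext("destination")}
            for t in root.iter("travel")
            if t.findtext("employeeId") == employee_id
        ]
        records.append({"name": name, "travels": trips})
    return records


root = ET.fromstring(ARCHIVE)
print(travels_by_employee(root, "Ann Lee"))
```

The join on employeeId is exactly the piece of schema knowledge that has to be preserved along with the data for such a search to be written later.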

However, not pre-aggregating the data also brings more flexibility: the search can aggregate it in different ways. This is certainly an advantage if the user requirements for the archive are not crystal clear at the time of archiving. The data can then be archived un-aggregated (keeping the relational model), and searches can be added later, whenever the demand arises.

There is only one problem with this approach: in a few years' time, maybe nobody will know the internal representation (table schema) of the data.

The structured XML data of a search hit can be rendered in different ways, for example:

using a schema-specific formatting stylesheet (XSL)
returning a result table as XML, which the archive system GUI can render in a user-configurable way

Some data items present in the archive might not have to be displayed to the user, for example the internal IDs that the relational model uses to link tables (see employee ID or travel ID in examples 1 and 4). This is just data noise generated by the relational model.
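A minimal sketch of hiding such columns when rendering a result table is shown below. The layout of the result XML (result, row) and the column names are assumptions for this example, not the format of a specific archive system.

```python
import xml.etree.ElementTree as ET

# Hypothetical result table returned by a search.
RESULT_TABLE = """<result>
  <row><employeeId>1</employeeId><name>Ann Lee</name><travelId>10</travelId><date>2017-10-03</date></row>
  <row><employeeId>2</employeeId><name>Bob Ray</name><travelId>12</travelId><date>2017-10-15</date></row>
</result>"""

# Internal keys of the relational model that the user does not need to see.
HIDDEN_COLUMNS = {"employeeId", "travelId"}


def visible_rows(result_xml, hidden=HIDDEN_COLUMNS):
    """Return each row as a dict, leaving out the hidden columns."""
    root = ET.fromstring(result_xml)
    return [
        {cell.tag: cell.text for cell in row if cell.tag not in hidden}
        for row in root.iter("row")
    ]


for row in visible_rows(RESULT_TABLE):
    print(row)
```

The IDs stay in the archive and remain usable for joins; they are only left out at display time.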

4.3.3 Unstructured

Whether archiving data in unstructured form is enough depends on the nature of reports and views of the data that we are interested in keeping. Searching through unstructured content (i.e. full-text search) is far less powerful than searching in structured data, so we have to decide whether full-text search is enough or whether we need to filter the archived data in more complex ways. It might be helpful to have a search tool that understands the semantics of natural language, finds relationships in the unstructured data and allows us to do smarter searches.

One could discuss, for example, how to give more weight to headings or chapter titles, and other heuristics that let some words count more than others in the full-text search. Words from the first sentences also tend to convey more of the meaning of the document.
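The weighting idea can be illustrated with a small scoring function. This is only a sketch: the field names title, first_sentence and body and the weights 3, 2 and 1 are arbitrary assumptions, and real full-text engines use more elaborate ranking.

```python
import re

# Hypothetical weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "first_sentence": 2.0, "body": 1.0}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(document, query):
    """Sum the weighted number of query-term occurrences per field."""
    terms = set(tokenize(query))
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(document.get(field, ""))
        total += weight * sum(1 for token in tokens if token in terms)
    return total


doc = {
    "title": "Conference receipt",
    "first_sentence": "Receipt for the conference fee.",
    "body": "Paid by card. The conference took place in Berlin.",
}
print(score(doc, "conference"))  # one hit per field: 3.0 + 2.0 + 1.0 = 6.0
```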

Still, this is a last-resort search that should always be assisted by metadata restrictions, as in “search for receipts about conferences that happened between 10-1-2017 and 12-1-2017”, where the date being restricted can be the creation date of the document.
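A sketch of such a metadata-assisted search follows. The metadata fields type and created and the sample documents are hypothetical, and the dates from the example are read as month-day-year, which is an assumption because the text does not state the format.

```python
from datetime import date

# Hypothetical archived documents: metadata plus unstructured text.
DOCUMENTS = [
    {"type": "receipt", "created": date(2017, 10, 20),
     "text": "Registration fee for the XML conference"},
    {"type": "receipt", "created": date(2017, 6, 2),
     "text": "Conference dinner"},
    {"type": "manual", "created": date(2017, 11, 5),
     "text": "Conference room booking manual"},
    {"type": "receipt", "created": date(2017, 11, 12),
     "text": "Taxi to the airport"},
]


def search(documents, text, doc_type, created_from, created_to):
    """Restrict by metadata first, then run the text match on what is left."""
    needle = text.lower()
    candidates = [
        d for d in documents
        if d["type"] == doc_type
        and created_from <= d["created"] <= created_to
    ]
    return [d for d in candidates if needle in d["text"].lower()]


hits = search(DOCUMENTS, "conference", "receipt",
              date(2017, 10, 1), date(2017, 12, 1))
for hit in hits:
    print(hit["created"], hit["text"])
```

Of the three documents that mention a conference, only one passes both the type and the date restriction, which is what makes the search precise enough to find a single document.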

We have to think about why users run text searches over unstructured data. On the web, users want to search for topics, i.e. documents that talk about a certain subject. The search therefore doesn’t have to be exact; it just has to bring up a certain number of relevant documents. In the archiving world we are talking about, users want to find a single document: a certain receipt, a manual, a proof. For this kind of precision search, full-text search over unstructured data should always be assisted by metadata/facet search.
