2.2 Unstructured data
Unstructured data is, for example:
- textual content (PDF, HTML, office documents, etc.)
- media assets (pictures, videos)
It is usually made for human consumption. It has no explicit semantics, only implicit meaning that humans interpret easily but machines still find challenging. AI may change this, but time will tell when.
Let's reuse the same example and present the travel report as unstructured content:
"Ole Normann has travelled to Oslo for a customer meeting on 2018-01-11. He spent 100 NOK on the bus ticket and 1000 NOK on the accommodation."
For us human beings, this text is easier to read than the XML fragment above. Still, the XML representation has explicit semantics, so it is much easier for a computer to interpret it and extract facts (like the price of the bus ticket) than it is from the unstructured text.
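To make the contrast concrete, here is a minimal Python sketch that extracts the bus ticket price from both representations. The XML layout used here (the travelReport and expense elements and the type attribute) is an assumption made for illustration and may differ from the fragment shown earlier; the function names are hypothetical as well.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML for the travel report; element and attribute
# names are assumptions, not taken from the original fragment.
STRUCTURED = """
<travelReport>
  <traveller>Ole Normann</traveller>
  <destination>Oslo</destination>
  <date>2018-01-11</date>
  <expense type="bus" currency="NOK">100</expense>
  <expense type="accommodation" currency="NOK">1000</expense>
</travelReport>
"""

UNSTRUCTURED = (
    "Ole Normann has travelled to Oslo for a customer meeting on 2018-01-11. "
    "He spent 100 NOK on the bus ticket and 1000 NOK on the accommodation."
)


def bus_price_from_xml(doc: str) -> int:
    """Explicit semantics: the markup says which number is the bus fare."""
    root = ET.fromstring(doc.strip())
    node = root.find("expense[@type='bus']")
    return int(node.text)


def bus_price_from_text(text: str):
    """Implicit semantics: we have to guess at the wording of the sentence."""
    match = re.search(r"(\d+) NOK on the bus ticket", text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(bus_price_from_xml(STRUCTURED))
    print(bus_price_from_text(UNSTRUCTURED))
    # Same fact, different wording: the pattern no longer matches.
    print(bus_price_from_text("The bus ticket cost him 100 NOK."))
```

The XML lookup depends only on the markup, so it keeps working however the surrounding report is worded. The regular expression is tied to one particular phrasing and returns nothing as soon as the sentence is rewritten, which is the difficulty machines face with unstructured text.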
If the report is presented not as plain text but as a PDF with a table listing all transactions, interpreting it is just as complex for a computer. A PDF file typically does not contain semantic markup that tags each piece of information; it only has layout and formatting, because it is optimized for human consumption.