The challenge

We call a document unstructured if its structure is not explicit. Human readers can still perceive the structure and semantics, because they usually bring a lot of background knowledge about the context. For example, they know that Oslo is a city. For computers, we need to bring the semantics into the document:

<city>Oslo</city>

… in order to make document processing simple and independent of external sources. There are also complex cases where the meaning of a word is ambiguous and only contextual or background knowledge can help with the interpretation. It's worth taking a look at the [Paris disambiguation](https://en.wikipedia.org/wiki/Paris_(disambiguation)) page on Wikipedia: Paris is also the name of a mythological figure. So seemingly easy tasks, like "list all the cities mentioned in some documents", are relatively hard for computers (or for the programmers writing the code). It is not enough to match a list of city names against the documents, because distinguishing a city name from a person's name requires understanding the context.
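To make the problem concrete, here is a minimal Python sketch of the naive approach: matching a list of city names against plain text. The `CITY_NAMES` list, the function name and the two sample sentences are made up for illustration.

```python
import re

# Hypothetical gazetteer: a plain list of known city names.
CITY_NAMES = ["Oslo", "Paris"]


def naive_city_mentions(text, city_names=CITY_NAMES):
    """Return every whole-word occurrence of a listed name, in text order."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, city_names)) + r")\b")
    return pattern.findall(text)


# Made-up sample sentences: the second one is about the mythological figure.
documents = [
    "The conference moved from Oslo to Paris last year.",
    "Paris carried Helen off to Troy.",
]

for doc in documents:
    print(naive_city_mentions(doc), "<-", doc)
```

The second sentence is reported as a city mention too, although it talks about a person. Nothing in the plain text tells the matcher which Paris is meant.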

So then I can mark it up:

<city>Paris</city>

… but is it explicit then? Well, it depends on what I want to achieve with the semantic markup. We learnt from the Wikipedia page that there are several cities called Paris in the US and also one in Denmark, so the tag alone does not say which Paris is meant.
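One possible way to go a step further is to put the distinguishing information into the markup itself. The sketch below assumes hypothetical `country` and `region` attributes on `<city>` (they are not part of any schema discussed here) and uses Python's standard `xml.etree.ElementTree` to list the cities without consulting any external source.

```python
import xml.etree.ElementTree as ET

# The `country` and `region` attributes are hypothetical, invented for this sketch.
SAMPLE = """<doc>
  <p>She flew from <city country="NO">Oslo</city> to
     <city country="FR">Paris</city>, then on to
     <city country="US" region="TX">Paris</city>.</p>
</doc>"""


def cities_in(xml_text, country=None):
    """List (name, country, region) for each <city>, optionally filtered by country."""
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iter("city"):
        if country is None or el.get("country") == country:
            found.append((el.text, el.get("country"), el.get("region")))
    return found


print(cities_in(SAMPLE))                # all marked-up cities
print(cities_in(SAMPLE, country="US"))  # only the ones marked as being in the US
```

Whether this level of detail is enough again depends on the goal: a second Paris in the same country and region would need yet another attribute or a unique identifier.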

We could keep adding more and more semantics to the content almost without limit, but that is a lot of work, so we need to draw the line somewhere and find the right balance of how much is worth investing. We must always keep in mind what we want to achieve:

  • more controlled editorial process
  • publishing automation
