Abstract

Wikipedia has become synonymous with encyclopedic knowledge for a broad public, and Wikipedias are increasingly being used for a wide range of applications, among them as source material for research projects. For many languages of the world, the respective Wikipedia is the only freely available digital language resource.

Wikipedias are authored in a so-called lightweight markup language, Wiki markup, whose simple syntax is meant to ease the editing of web content that is translated directly into HTML. Unfortunately, Wiki markup has a serious drawback: it is applied inconsistently, mainly because it is written by hand without programs that check a text's structural integrity (well-formedness) or logical consistency (validity). While in the past the processing of digital texts often started from plain text, XML technologies have become pervasive in many applications: well-formed XML can be processed in many ways and ensures a higher degree of interoperability.
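To illustrate the distinction, the following minimal Python sketch (our own invented snippets, not part of the paper's tooling) shows how an XML parser rejects a structurally broken fragment, whereas comparable slips in hand-written Wiki markup pass unnoticed; checking validity against a TEI schema would additionally require a schema-aware validator.

import xml.etree.ElementTree as ET

# Two hypothetical fragments: the first is well-formed, the second has
# crossed element boundaries of the kind an unchecked markup easily produces.
snippets = {
    "well-formed": "<p>The <hi rend='bold'>Rhine</hi> is a river in Europe.</p>",
    "not well-formed": "<p>The <hi rend='bold'>Rhine</p> is a river in Europe.</hi>",
}

for label, snippet in snippets.items():
    try:
        ET.fromstring(snippet)          # structural check only (well-formedness)
        print(label, "-> parses")
    except ET.ParseError as err:        # the parser refuses the broken fragment
        print(label, "-> rejected:", err)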

Many projects have aimed to convert Wikipedias into other formats. Probably the best known is DBpedia, the machine-readable extract of Wikipedias' structured portions, which is a cornerstone of knowledge modeling in the Semantic Web. The goal of our project was to put together a workflow that transforms the texts contained in Wikipedias into a format usable by our corpus creation and processing tools: tokenizer, tagger, indexer, and digital reading environment. As all workflow steps in our environment are geared towards TEI, we worked on routines to convert Wikipedias into this more expressive, more reusable format.
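As a rough illustration of the kind of conversion routine described above, the following much-simplified Python sketch maps just three Wiki markup constructs (headings, bold, internal links) onto TEI-like elements and emits a well-formed fragment; the function name, regular expressions, and sample text are ours and do not reproduce the project's actual converter.

import re
import xml.etree.ElementTree as ET

def wiki_to_tei_fragment(wiki_text: str) -> ET.Element:
    """Turn a tiny subset of Wiki markup into a well-formed TEI <div>."""
    div = ET.Element("div")
    for line in wiki_text.splitlines():
        heading = re.match(r"^==\s*(.+?)\s*==$", line)
        if heading:
            # == Heading == becomes <head>Heading</head>
            ET.SubElement(div, "head").text = heading.group(1)
            continue
        # '''bold''' -> <hi rend="bold">, [[target|label]] / [[target]] -> <ref target="...">
        line = re.sub(r"'''(.+?)'''", r'<hi rend="bold">\1</hi>', line)
        line = re.sub(r"\[\[([^|\]]+)\|([^\]]+)\]\]", r'<ref target="\1">\2</ref>', line)
        line = re.sub(r"\[\[([^\]]+)\]\]", r'<ref target="\1">\1</ref>', line)
        # parsing the rewritten line guarantees the result is well-formed
        div.append(ET.fromstring(f"<p>{line}</p>"))
    return div

sample = "== Geography ==\nThe '''Rhine''' flows through [[Basel]] and [[Cologne|Köln]]."
print(ET.tostring(wiki_to_tei_fragment(sample), encoding="unicode"))

A real converter must of course cope with templates, tables, nested and inconsistent markup, which is precisely where the well-formedness guarantees of the XML toolchain pay off.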

Our paper will discuss existing tools for this task and our own approach to converting Wiki markup into TEI, and it will give examples of how the resulting data can be used for research.