Title Converting and Representing Social Media Corpora into TEI: Schema and best practices from CLARIN-D

Encoded by   Vanessa Hannesschläger

Encoded by   Daniel Schopper


The Creative Commons Attribution 4.0 International (CC BY 4.0) License applies to this text.


The paper presents results from a curation project within CLARIN-D in which an existing one-million-word corpus of German chat communication has been integrated into the DEREKO and DWDS corpus infrastructures of the CLARIN-D centres at the Institute for the German Language (IDS, Mannheim) and at the Berlin-Brandenburg Academy of Sciences (BBAW, Berlin). The focus is on the solutions developed for converting and representing the corpus in a TEI format.

The corpus, which was collected and built between 2002 and 2008, was originally annotated in a home-grown XML format that describes the main structural features of chat log files and user postings as well as selected linguistic phenomena of computer-mediated communication (CMC). To ensure the sustainability of the resource and its interoperability with the corpus collections already available in CLARIN-D, one important subtask of the project was to define a schema and workflow for remodeling the resource in TEI. Since the current version of TEI P5 does not include any models for representing CMC and social media genres, the project adopted and extended the modeling suggestions defined and discussed in previous work of the TEI-SIG "computer-mediated communication (CMC)" and defined a workflow for the automatic, lossless conversion of the source into the target schema.
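To illustrate the kind of conversion step such a workflow involves, the following minimal sketch maps a chat log in an invented source format onto TEI-namespaced posting elements. All element and attribute names on both sides (`chatlog`, `message`, `nickname`, `time`, `post`, `who`, `when`) are assumptions made for illustration; they are not the project's actual source format or target schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(name: str) -> str:
    """Return a tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{name}"


def convert_chat_log(source_xml: str) -> ET.Element:
    """Convert a hypothetical <chatlog> document into a TEI-like <body>.

    Each source <message> becomes one <post> element (hypothetical name),
    keeping the author, the timestamp, and the message text unchanged.
    """
    source = ET.fromstring(source_xml)
    body = ET.Element(tei("body"))
    div = ET.SubElement(body, tei("div"), {"type": "chat"})
    for index, message in enumerate(source.iter("message"), start=1):
        post = ET.SubElement(
            div,
            tei("post"),
            {"n": str(index), "who": "#" + message.get("nickname", "unknown")},
        )
        timestamp = message.get("time")
        if timestamp:
            post.set("when", timestamp)
        # Copy the text verbatim so that no content is lost in conversion.
        post.text = message.text or ""
    return body


# Invented sample input in the hypothetical source format.
SAMPLE = """<chatlog>
  <message nickname="anna" time="12:01:05">hallo zusammen</message>
  <message nickname="ben" time="12:01:09">hi anna :-)</message>
</chatlog>"""

tei_body = convert_chat_log(SAMPLE)
posts = tei_body.findall(f".//{tei('post')}")
print(ET.tostring(tei_body, encoding="unicode"))
```

A simple check on losslessness in this sketch would be to compare the text and attributes of every source message with those of the corresponding converted element.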

The target schema has been tested not only with data from the chat corpus but also with data from a range of other CMC and social media genres (WhatsApp interactions, Wikipedia talk pages, tweets, Usenet posts), so that it can serve as a useful solution for encoding other corpora of this type as well. The schema and conversion workflow will be used to integrate further CMC and social media corpora into the CLARIN-D infrastructures in the near future.


  • Beißwenger, Michael/Ermakova, Maria/Geyken, Alexander/Lemnitzer, Lothar/Storrer, Angelika: A TEI Schema for the Representation of Computer-mediated Communication. In: Journal of the Text Encoding Initiative, 3, 2012.
  • Chanier, Thierry/Poudat, Celine/Sagot, Benoit/Antoniadis, Georges/Wigham, Ciara/Hriba, Linda/Longhi, Julien/Seddah, Djamé: The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. In: Journal of Language Technology and Computational Linguistics (JLCL), 29/2, 1-30, 2014.
  • Margaretha, Eliza/Lüngen, Harald: Building Linguistic Corpora from Wikipedia Articles and Discussions. In: Journal of Language Technology and Computational Linguistics (JLCL), 29/2, 59-82, 2014.