Abstract

The paper will provide an overview of, and an update on, the ongoing proposal to create a standOff component within the TEI architecture. It will elucidate the conceptual background of embedding stand-off annotations within a TEI document and the consequences in terms of primary source preservation, multiple annotation views and the possible export of annotation content into autonomous TEI documents. It will demonstrate the range of possible use cases, from manual annotation to fully automated information extraction processes, and show the importance of implementing, right from the outset, the possibility of using any kind of internal or external vocabulary for representing annotation bodies (e.g. to deal with structural or conceptual annotations). An important prospect here is that the standOff construct could simplify the development of TEI-aware online services such as Named Entity Recognizers.
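The stand-off principle described above can be illustrated with a minimal sketch. It does not reproduce the syntax of the standOff proposal. The class `StandoffAnnotation`, the helpers `annotate` and `view`, the layer names and the sample sentence are all hypothetical and chosen for illustration only. The sketch assumes that annotations point into the primary text by character offsets, so the source is never modified and each layer can be read as a separate view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    """Hypothetical stand-off annotation: offsets into the primary text plus a body."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    layer: str   # name of the annotation view, e.g. "entities"
    body: str    # annotation content, drawn from any vocabulary


def annotate(text: str, span: str, layer: str, body: str) -> StandoffAnnotation:
    """Build an annotation for the first occurrence of `span` in `text`."""
    start = text.index(span)
    return StandoffAnnotation(start, start + len(span), layer, body)


def view(text: str, annotations: list, layer: str) -> list:
    """Resolve one annotation layer against the untouched primary text."""
    return [(text[a.start:a.end], a.body) for a in annotations if a.layer == layer]


# Illustrative primary source; it is never edited by the annotation process.
PRIMARY_TEXT = "Marie Curie worked in Paris."

ANNOTATIONS = [
    annotate(PRIMARY_TEXT, "Marie Curie", "entities", "person"),
    annotate(PRIMARY_TEXT, "Paris", "entities", "place"),
    annotate(PRIMARY_TEXT, PRIMARY_TEXT, "structure", "sentence"),
]

if __name__ == "__main__":
    print(view(PRIMARY_TEXT, ANNOTATIONS, "entities"))
    print(view(PRIMARY_TEXT, ANNOTATIONS, "structure"))
```

Because the two layers share the same primary text but are stored apart from it, either one could be produced manually or by an automated tool, and either could be exported on its own.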

We will relate the proposal to ongoing initiatives and show the necessity of aligning it with the Web Annotation Data Model (W3C) as well as with the recent introduction of the annotationBlock element for speech transcription (as part of the work carried out on the ISO standard 24624) as an elementary annotation crystal in the sense of Romary and Wegstein (2012). In this context, we will tackle the issue of implicitness in the representation of annotations and open the debate on the trade-off between a terse model and a highly flexible one.
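To give a sense of what alignment with the Web Annotation Data Model involves, the following sketch serialises one annotation using that model's body/target structure and a `TextPositionSelector`. The function name `to_web_annotation`, the source URI and the offsets are hypothetical. The mapping from a standOff construct to this structure is an assumption made for illustration, not something defined by the proposal.

```python
import json


def to_web_annotation(source_uri: str, start: int, end: int, label: str) -> dict:
    """Build a Web Annotation (as a JSON-LD dict) targeting a character range."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        # The body carries the annotation content; here a plain textual tag.
        "body": {"type": "TextualBody", "value": label, "purpose": "tagging"},
        # The target identifies the annotated range without touching the source.
        "target": {
            "source": source_uri,
            "selector": {"type": "TextPositionSelector", "start": start, "end": end},
        },
    }


if __name__ == "__main__":
    # Hypothetical document URI and offsets, for illustration only.
    annotation = to_web_annotation("https://example.org/doc.xml", 22, 27, "place")
    print(json.dumps(annotation, indent=2))
```

The sketch spells out every field explicitly. A terser model would leave some of this information implicit, which is the trade-off mentioned above.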

We will conclude by illustrating how the current proposal is already applied in various projects related to data mining or scientific information, in particular for the representation of annotated scholarly content.

Further material

  • Minutes of the January 2014 meeting in: polytechnic, 2014. http://download2.polytechnic.edu.na/pub7/sourceforge/l/li/lingsig/Documents/Standoff%20in%20Berlin,%2001.2014/standoff-minutesBerlin2014.pdf
  • The TEI GitHub ticket in: GitHub, 2012. https://github.com/TEIC/TEI/issues/374
  • The standOff proposal on GitHub in: GitHub, 2016. https://github.com/laurentromary/stdfSpec

References

  • Why TEI standoff annotation doesn’t quite work: and why you might want to use it nevertheless in: Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, 2010, 5.
  • ISO/DIS 24624: Language resource management -- Transcription of spoken language in: ISO, 2016. http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=37338
  • Pose, Javier/Lopez, Patrice/Romary, Laurent: A Generic Formalism for Encoding Stand-off annotations in TEI, 2014. https://hal.inria.fr/hal-01061548
  • Romary, Laurent: TEI challenges in an accelerating digital world in: DiXiT Convention week, September 2015. The Hague, Netherlands, September 2015. https://hal.inria.fr/hal-01254365
  • Romary, Laurent/Wegstein, Werner: Consistent Modeling of Heterogeneous Lexical Structures in: Journal of the Text Encoding Initiative, 3, 2012. http://jtei.revues.org/540
  • W3C Web Annotation Data Model https://www.w3.org/TR/annotation-model/