Title Capturing the crowd-sourcing process: storing different stages of crowd-sourced transcriptions in TEI

Encoded by   Vanessa Hannesschläger

Encoded by   Daniel Schopper


The Creative Commons Attribution 4.0 International (CC BY 4.0) License applies to this text.


The Letters of 1916 is a project to create a collection of correspondence from around the time of the Easter Rising, written in Ireland or with an Irish context. The project uses a crowd-sourcing approach to transcription, inviting members of the public to contribute by transcribing letters and correcting those that have already been transcribed. Transcribers use a transcription desk with features borrowed from the Transcribe Bentham project. The backend, based on MediaWiki, stores each saved revision separately, along with relevant metadata.

During our editing workflow, all data is extracted from MediaWiki’s database and injected into TEI documents for long-term storage and web presentation. The final crowd-sourced transcription is checked by a member of the editorial team prior to inclusion in our online archive. In addition to storing the final marked-up version of the text, each revision stage is injected and logged in the TEI file. This gives researchers a valuable resource for studying the progress of crowd-encoding and its efficacy and accuracy over time.

Storing the different versions of transcriptions in TEI documents is a challenge because, being crowd-sourced, they are seldom well-formed. As an interim measure, to enable storage of and limited access to the crowd-sourced versions, the revisionDesc element is used to record the ID of the transcriber or editor and the time of each revision. The revision itself is "dumped" into the revisionDesc element inside comment tags to sidestep issues of well-formedness.
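The interim approach described above might be sketched roughly as follows. This is an illustration, not the project's actual code: it assumes each revision is recorded as a TEI change element carrying who and when attributes, and the helper names (comment_safe, add_revision), the user ID, and the sample fragment are all hypothetical. Because an XML comment may not contain "--" or end in "-", the sketch rewrites such sequences, which makes the stored copy slightly lossy.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def comment_safe(text):
    """Make arbitrary text legal inside an XML comment.

    XML comments may not contain "--" or end with "-", so hyphen runs
    are spaced out. Hypothetical helper; the rewrite is not reversible.
    """
    while "--" in text:
        text = text.replace("--", "- -")
    if text.endswith("-"):
        text += " "
    return text


def add_revision(revision_desc, who, when, raw_markup):
    """Append a <change> recording who/when, holding the raw revision as a comment.

    Assumption: one <change> per saved revision; the source only states that
    revisionDesc records the contributor ID and the revision time.
    """
    change = ET.SubElement(
        revision_desc, f"{{{TEI_NS}}}change", {"who": who, "when": when}
    )
    # ElementTree writes comment text verbatim, so the fragment's own
    # angle brackets survive even though the fragment is not well-formed.
    change.append(ET.Comment(comment_safe(raw_markup)))
    return change


revision_desc = ET.Element(f"{{{TEI_NS}}}revisionDesc")
add_revision(
    revision_desc,
    "#user42",                                    # hypothetical transcriber ID
    "2015-03-01T10:15:00",                        # hypothetical save time
    "<p>Dear <hi>Mary -- I hope you are well",    # unclosed tags: not well-formed
)
serialized = ET.tostring(revision_desc, encoding="unicode")
print(serialized)
```

The enclosing document stays well-formed because a parser treats the comment as opaque text. That same opacity is the limitation: the fragment's markup cannot be queried with ordinary XML tools.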

This paper will explore more robust solutions for storing and marking up these XML-like fragments within a TEI document; it will examine possibilities and issues for storing crowd-sourced transcription versions, and how they might be mined for insight into transcription habits.