Encoding a Dictionary of Russian Dialects in TEI and linking to LOD Resources
by
Daniel Schopper
Kira Kovalenko
Thierry Declerck
Eveline Wandl-Vogt
Info
Title | Encoding a Dictionary of Russian Dialects in TEI and linking to LOD Resources |
---|---|
responsible | Encoded by Vanessa Hannesschläger; encoded by Daniel Schopper |
License | The Creative Commons Attribution 4.0 International (CC BY 4.0) License applies to this text. |
Abstract
Within a cooperation project between the Russian and Austrian Academies of Sciences, we
are investigating the TEI encoding of the Dictionary of Russian dialects, which contains
more than 300,000 entries distributed over 48 volumes. The goal of the study is to
increase accessibility, interoperability and reusability of this rich source of
dialectal data. Our current proposal for a TEI representation consists of encoding the
official Russian word as a TEI <entry> element and using a <cit> element for each
occurrence of a dialect form (within the <quote> element). Within the <cit> element, we
also include a <usg> element giving the geo-location of the region in which the dialect
form is used. Finally, we include in the <cit> block any available bibliographical
information (<bibl>), in most cases the source from which the dialect word was
collected.
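The proposed entry structure can be illustrated with a minimal sketch. The sample below is hypothetical: the dialect forms, regions and sources are placeholders, and the `form`/`orth` wrapper for the headword and the `type` attribute values are assumptions that the abstract does not specify. The Python code only shows how such an entry could be read back into flat records once it is encoded this way.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical entry. All uppercase strings are placeholders, not real data.
# Assumption: the headword sits in <form>/<orth>; the abstract only states
# that it is encoded as a TEI <entry>.
SAMPLE_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:lang="ru">
  <form type="lemma"><orth>одуванчик</orth></form>
  <cit type="dialectForm">
    <quote>DIALECT-FORM-1</quote>
    <usg type="geo">REGION-A</usg>
    <bibl>SOURCE-A</bibl>
  </cit>
  <cit type="dialectForm">
    <quote>DIALECT-FORM-2</quote>
    <usg type="geo">REGION-B</usg>
    <bibl>SOURCE-B</bibl>
  </cit>
</entry>
"""


def dialect_forms(entry_xml):
    """Return one record per <cit>: dialect form, region and source."""
    entry = ET.fromstring(entry_xml.strip())
    records = []
    for cit in entry.findall("tei:cit", NS):
        records.append({
            "form": cit.findtext("tei:quote", default="", namespaces=NS),
            "region": cit.findtext("tei:usg", default="", namespaces=NS),
            "source": cit.findtext("tei:bibl", default="", namespaces=NS),
        })
    return records


if __name__ == "__main__":
    for record in dialect_forms(SAMPLE_ENTRY):
        print(record)
```

Keeping form, region and source together inside one `cit` means each dialect attestation can be extracted as a self-contained record, which is what makes the data straightforward to reuse.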
The meaning of the entry is given in the original dictionary in the form of free text.
We are currently working on adding more structure to this part of the original
entries, with relevant parts tagged as <name> elements or, if more flexibility is needed
(not only proper nouns), as "referring strings" (<rs>). Within the <name> element, we
also include conceptual information, for example that an entry is the name of a family
of plants (… <name type="botanicFamily">сложноцветных</name> …), which we can
then link to the scientific name of this family: <name type="plant"
subtype="scientific" key="taxonid:.." xml:lang="la">Taraxacum officinale
Wegg.</name>. This way, we can easily link the original entry in the
dialectal dictionary to taxonomic datasets that are available in the Linked Open Data
cloud, and to other language data included in the Linguistic Linked Open Data cloud.
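How the `key` attribute could serve as a link into an external dataset can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the wrapping `def` element, the prefix table and the `https://example.org/taxon/` base URI are hypothetical, and the identifier after `taxonid:` is left elided exactly as in the abstract.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical prefix table: the real base URI depends on which taxonomic
# dataset is chosen as the link target.
PREFIXES = {"taxonid": "https://example.org/taxon/"}

# Assumption: the tagged names sit inside a <def> element. The key value
# "taxonid:.." is elided in the source and is kept as-is.
SAMPLE_DEF = """
<def xmlns="http://www.tei-c.org/ns/1.0" xml:lang="ru">
  … <name type="botanicFamily">сложноцветных</name> …
  <name type="plant" subtype="scientific" key="taxonid:.."
        xml:lang="la">Taraxacum officinale Wegg.</name>
</def>
"""


def linked_names(def_xml, prefixes=PREFIXES):
    """List every <name> with its type, language and expanded key, if any."""
    root = ET.fromstring(def_xml.strip())
    names = []
    for name in root.iter(TEI + "name"):
        key = name.get("key")
        uri = None
        if key and ":" in key:
            prefix, local_id = key.split(":", 1)
            base = prefixes.get(prefix)
            if base is not None:
                uri = base + local_id  # expand "prefix:id" to a full URI
        names.append({
            "text": name.text,
            "type": name.get("type"),
            # Only an xml:lang set directly on <name> is reported here;
            # ElementTree does not resolve inherited language values.
            "lang": name.get(XML_LANG),
            "uri": uri,
        })
    return names


if __name__ == "__main__":
    for entry in linked_names(SAMPLE_DEF):
        print(entry)
```

Names without a `key` come back with `uri` set to `None`, so tagged but not yet linked strings remain visible for later enrichment.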