Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek
The result's identifiers
Result code in IS VaVaI
RIV/00216208:11320/25:8LSUG892 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3A8LSUG892)
Result on the web
https://doi.org/10.48550/arXiv.2404.00739
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek
Original language description
In this article, the beta version 0.1.0 of Opera Graeca Adnotata (OGA), the largest open-access multilayer corpus for Ancient Greek (AG), is presented. OGA consists of 1,687 literary works and 34M+ tokens coming from the PerseusDL and OpenGreekAndLatin GitHub repositories, which host AG texts ranging from about 800 BCE to about 250 CE. The texts have been enriched with seven annotation layers: (i) tokenization layer; (ii) sentence segmentation layer; (iii) lemmatization layer; (iv) morphological layer; (v) dependency layer; (vi) dependency function layer; (vii) Canonical Text Services (CTS) citation layer. The creation of each layer is described by highlighting the main technical and annotation-related issues encountered. Tokenization, sentence segmentation, and CTS citation are performed by rule-based algorithms, while morphosyntactic annotation is the output of the COMBO parser trained on the data of the Ancient Greek Dependency Treebank. For the sake of scalability and reusability, the corpus is released in the standoff formats PAULA XML and its offspring LAULA XML.
Czech name
—
Czech description
—
Classification
Type
Jost - Miscellaneous article in a specialist periodical
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
ArXiv
ISSN
2331-8422
e-ISSN
—
Volume of the periodical
2024
Issue of the periodical within the volume
2024
Country of publishing house
US - UNITED STATES
Number of pages
7
Pages from-to
1-7
UT code for WoS article
—
EID of the result in the Scopus database
—