Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus

Result code in IS VaVaI
RIV/00216208:11320/20:10426932 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F20%3A10426932)
Result on the web
https://www.aclweb.org/anthology/2020.lrec-1.481
DOI - Digital Object Identifier
—

Result language
English
Original language name
Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus
Original language description
Multilingual inflectional corpora are a scarce resource in the NLP community, especially corpora with annotated morpheme boundaries. We evaluate a multilingual inflectional corpus with morpheme boundaries, generated from the English Wiktionary (Metheniti and Neumann, 2018), against the largest multilingual, high-quality inflectional corpus, that of the UniMorph project (Kirov et al., 2018). We confirm that the generated Wikinflection corpus does not match the quality of UniMorph, but we were able to extract a significant number of words from the intersection of the two corpora. Our Wikinflection corpus benefits from the morpheme segmentations of Wiktionary/Wikinflection and from the manually evaluated morphological feature tags of the UniMorph project, and has 216K lemmas and 5.4M word forms across 68 languages.
Czech name
—
Czech description
—

Type
O - Miscellaneous
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Publication year
2020
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
