Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus

Identifikátory výsledku

Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F20%3A10426932" target="_blank" >RIV/00216208:11320/20:10426932 - isvavai.cz</a>
Výsledek na webu
<a href="https://www.aclweb.org/anthology/2020.lrec-1.481" target="_blank" >https://www.aclweb.org/anthology/2020.lrec-1.481</a>
DOI - Digital Object Identifier
—

Alternativní jazyky

Jazyk výsledku
angličtina
Název v původním jazyce
Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus
Popis výsledku v původním jazyce
Multilingual, inflectional corpora are a scarce resource in the NLP community, especially corpora with annotated morpheme boundaries. We are evaluating a generated, multilingual inflectional corpus with morpheme boundaries, generated from the English Wiktionary (Metheniti and Neumann, 2018), against the largest, multilingual, high-quality inflectional corpus of the UniMorph project (Kirov et al., 2018). We confirm that the generated Wikinflection corpus is not of such quality as UniMorph, but we were able to extract a significant amount of words from the intersection of the two corpora. Our Wikinflection corpus benefits from the morpheme segmentations of Wiktionary/Wikinflection and from the manually-evaluated morphological feature tags of the UniMorph project, and has 216K lemmas and 5.4M word forms, in a total of 68 languages.
Název v anglickém jazyce
Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus
Popis výsledku anglicky
Multilingual, inflectional corpora are a scarce resource in the NLP community, especially corpora with annotated morpheme boundaries. We are evaluating a generated, multilingual inflectional corpus with morpheme boundaries, generated from the English Wiktionary (Metheniti and Neumann, 2018), against the largest, multilingual, high-quality inflectional corpus of the UniMorph project (Kirov et al., 2018). We confirm that the generated Wikinflection corpus is not of such quality as UniMorph, but we were able to extract a significant amount of words from the intersection of the two corpora. Our Wikinflection corpus benefits from the morpheme segmentations of Wiktionary/Wikinflection and from the manually-evaluated morphological feature tags of the UniMorph project, and has 216K lemmas and 5.4M word forms, in a total of 68 languages.

Klasifikace

Druh
O - Ostatní výsledky
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Návaznosti výsledku

Projekt
—
Návaznosti
—

Ostatní

Rok uplatnění
2020
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Podobné výsledky(10)

Enhancing the PARSEME Turkish Corpus of Verbal Multiword Expressions Unsupervised Extraction of Morphological Categories for Morphemes Experiments with the recognition of highly inflected spoken language (czech) in the large vocabulary task

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus

Identifikátory výsledku

Alternativní jazyky

Klasifikace

Návaznosti výsledku

Ostatní

Podobné výsledky(10)

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Popis výsledku

Identifikátory výsledku

Identifikátory výsledku

Alternativní jazyky

Alternativní jazyky

Klasifikace

Klasifikace

Návaznosti výsledku

Návaznosti výsledku

Ostatní

Ostatní

Podobné výsledky(10)