Subwords-Only Alternatives to fastText for Morphologically Rich Languages

Result identifiers

Result code in IS VaVaI
RIV/00216208:11320/21:10439963 (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F21%3A10439963)
Result on the web
https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=HUjN.9Fn4B
DOI - Digital Object Identifier
10.1134/S0361768821010059 (http://dx.doi.org/10.1134/S0361768821010059)

Alternative languages

Result language
English
Title in the original language
Subwords-Only Alternatives to fastText for Morphologically Rich Languages
Result description in the original language
Abstract: In this work, we present purely subword-based alternatives to the fastText word embedding algorithm. The alternatives are modifications of the original fastText model, but rely on subword information only, eliminating the reliance on word-level vectors and at the same time helping to dramatically reduce the size of embeddings. The proposed models differ in their subword information extraction method: character n-grams, suffixes, and byte-pair encoding units. We test the models on the task of morphological analysis and lemmatization for three morphologically rich languages: Finnish, Russian, and German. The results are compared with other recent subword-based models, demonstrating consistently higher results.
Title in English
Subwords-Only Alternatives to fastText for Morphologically Rich Languages
Result description in English
Abstract: In this work, we present purely subword-based alternatives to the fastText word embedding algorithm. The alternatives are modifications of the original fastText model, but rely on subword information only, eliminating the reliance on word-level vectors and at the same time helping to dramatically reduce the size of embeddings. The proposed models differ in their subword information extraction method: character n-grams, suffixes, and byte-pair encoding units. We test the models on the task of morphological analysis and lemmatization for three morphologically rich languages: Finnish, Russian, and German. The results are compared with other recent subword-based models, demonstrating consistently higher results.
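
As a rough illustration of the idea in the abstract, the Python sketch below extracts two of the named subword unit types (character n-grams and suffixes) and composes a word vector from subword vectors alone, with no word-level vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 3 to 6 n-gram range, the suffix length limit, the plain dictionary lookup, and the averaging step are all illustrative choices, and byte-pair encoding is left out because it requires a learned merge table.

```python
from typing import Dict, List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def suffixes(word: str, max_len: int = 4) -> List[str]:
    """Word-final substrings of length 1..max_len (hypothetical scheme)."""
    return [word[-k:] for k in range(1, min(max_len, len(word)) + 1)]


def subword_vector(
    units: List[str], table: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of known units; no word-level vector is involved."""
    known = [table[u] for u in units if u in table]
    if not known:
        # Assumption: fall back to a zero vector when no unit is known.
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


if __name__ == "__main__":
    # Toy embedding table with made-up values, for illustration only.
    toy_table = {"a": [1.0, 0.0], "sa": [0.0, 1.0]}
    print(char_ngrams("cat", 3, 3))
    print(suffixes("talossa", 3))
    print(subword_vector(suffixes("talossa", 3), toy_table, dim=2))
```

Because a word is represented only through its subword units, the embedding table needs one row per unit rather than one per word form, which is the source of the size reduction the abstract mentions.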

Classification

Type
Jimp - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Others

Year of publication
2021
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Periodical name
Programming and Computer Software
ISSN
0361-7688
e-ISSN
1608-3261
Periodical volume
47
Issue of the periodical within the volume
1
Publisher country
RU - Russian Federation
Number of pages
11
Pages from-to
56-66
UT WoS code of the article
000620610400009
EID of the result in the Scopus database
2-s2.0-85101570469

Similar results (10)

One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages
Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi and Nepali
Lexically Grounded Subword Segmentation
