Discovering Continuous Multi-word Expressions in Czech

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F18%3A00109727" target="_blank" >RIV/00216224:14330/18:00109727 - isvavai.cz</a>
Result on the web
<a href="http://doi.org/10.13053/CyS-22-3-3022" target="_blank" >http://doi.org/10.13053/CyS-22-3-3022</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.13053/CyS-22-3-3022" target="_blank" >10.13053/CyS-22-3-3022</a>

Alternative languages

Result language
English
Title in the original language
Discovering Continuous Multi-word Expressions in Czech
Result description in the original language
Multi-word expressions frequently cause incorrect annotations in corpora, since they often contain foreign words or syntactic anomalies. In the case of foreign material, the annotation quality depends on whether the correct language of the sequence is detected. In the case of inter-lingual homographs, this problem becomes difficult. In previous work, we created a dataset of Czech continuous multi-word expressions (MWEs). The candidates were discovered automatically from a Czech web corpus, taking their orthographic variability into account. The candidates were classified and annotated manually. Afterwards, the dataset was extended automatically by generating all word forms of those MWEs that were annotated as nouns. In this work, we used the dataset as positive examples and filtered out negative examples from the MWE candidates. We trained a classifier with a mean accuracy of 92.7%. We have shown that the combined approach slightly outperforms approaches based only on association measures, mainly on MWEs containing inter-lingual homographs and out-of-vocabulary words. The discovery methods can be applied to other languages that encounter orthographic variability in web corpora.

Classification

Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result linkages

Project
<a href="/cs/project/EF16_013%2F0001781" target="_blank" >EF16_013/0001781: LINDAT/CLARIN - Research infrastructure for language technologies - extension of the repository and computing capacity</a><br>
Linkages
P - Research and development project financed from public funds (with a link to CEP)

Other

Year of publication
2018
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Periodical name
Computación y Sistemas
ISSN
1405-5546
e-ISSN
2007-9737
Periodical volume
22
Issue number within the volume
3
Publisher's country
MX - United Mexican States
Number of pages
8
Pages from-to
845-852
UT WoS code of the article
000471005100013
EID of the result in the Scopus database
2-s2.0-85055482087

Similar results (10)

Annotation of Multi-Word Expressions in Czech Texts
Annotation of Czech Texts with Language Mixing
Enhancing the PARSEME Turkish Corpus of Verbal Multiword Expressions
