Discovering Continuous Multi-word Expressions in Czech
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F18%3A00109727" target="_blank" >RIV/00216224:14330/18:00109727 - isvavai.cz</a>
Result on the web
<a href="http://doi.org/10.13053/CyS-22-3-3022" target="_blank" >http://doi.org/10.13053/CyS-22-3-3022</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.13053/CyS-22-3-3022" target="_blank" >10.13053/CyS-22-3-3022</a>
Alternative languages
Result language
English
Original language name
Discovering Continuous Multi-word Expressions in Czech
Original language description
Multi-word expressions frequently cause incorrect annotations in corpora, since they often contain foreign words or syntactic anomalies. In the case of foreign material, the annotation quality depends on whether the correct language of the sequence is detected. In the case of inter-lingual homographs, this problem becomes difficult. In previous work, we created a dataset of Czech continuous multi-word expressions (MWEs). The candidates were discovered automatically from a Czech web corpus, taking their orthographic variability into account. The candidates were classified and annotated manually. Afterwards, the dataset was extended automatically by generating all word forms of those MWEs that were annotated as nouns. In this work, we used the dataset as positive examples and filtered out negative examples from the MWE candidates. We trained a classifier with a mean accuracy of 92.7%. We show that the combined approach slightly outperforms approaches based only on association measures, mainly on MWEs containing inter-lingual homographs and out-of-vocabulary words. The discovery methods can be applied to other languages that exhibit orthographic variability in web corpora.
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
<a href="/en/project/EF16_013%2F0001781" target="_blank" >EF16_013/0001781: LINDAT/CLARIN - Research infrastructure for language technologies - extension of the repository and its computational power</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2018
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Computación y Sistemas
ISSN
1405-5546
e-ISSN
2007-9737
Volume of the periodical
22
Issue of the periodical within the volume
3
Country of publishing house
MX - MEXICO
Number of pages
8
Pages from-to
845-852
UT code for WoS article
000471005100013
EID of the result in the Scopus database
2-s2.0-85055482087