Annotation of Multi-Word Expressions in Czech Texts
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14210%2F15%3A00085165" target="_blank" >RIV/00216224:14210/15:00085165 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Annotation of Multi-Word Expressions in Czech Texts
Original language description
Multi-word expressions (MWEs) are difficult to define and also difficult to annotate. Some of them cause serious errors in the traditional annotation pipeline of tokenization, morphological analysis, and morphological disambiguation. Many cases of incorrect annotation in Czech corpora are known. To narrow the research topic, we focus only on fixed MWEs: those with fixed word order and no elidable components. In this paper, we propose a corpus-based method that reveals fixed MWE candidates. From the web-based corpus of Czech, we extracted 25,091 expressions; 2,140 of them were identified as MWEs, 332 as probable MWEs, and 174 as either MWEs or single words. Our method is based on the corpus observation that people writing an MWE are unsure whether it is one word, a word with dashes, or several words. The result is a list of MWE candidates and an application that classifies the input as an MWE, a probable MWE, or a non-MWE.
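The observation behind the method (the same expression turning up as one word, with dashes, or as several words) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `variant_counts` and `is_candidate`, the toy text, and the decision rule (the spaced form plus at least one other spelling must be attested) are assumptions made only for illustration.

```python
import re

def variant_counts(text, words):
    """Count how often a word sequence occurs spaced, hyphenated, or joined.

    Hypothetical helper: the real method works on a large web corpus,
    not on a plain string.
    """
    lowered = text.lower()
    forms = {
        "spaced": " ".join(words),
        "hyphenated": "-".join(words),
        "joined": "".join(words),
    }
    counts = {}
    for name, form in forms.items():
        # Require that the match is not glued to a neighbouring word or dash.
        pattern = r"(?<![\w-])" + re.escape(form) + r"(?![\w-])"
        counts[name] = len(re.findall(pattern, lowered))
    return counts

def is_candidate(counts):
    """Assumed rule: the spaced form and at least one other spelling occur."""
    other_spellings = counts["hyphenated"] + counts["joined"]
    return counts["spaced"] > 0 and other_spellings > 0

# Toy text with three spellings of the same farewell (illustrative only).
toy_text = "Na shledanou! Napsal nashledanou. Na-shledanou, na shledanou."
counts = variant_counts(toy_text, ["na", "shledanou"])
print(counts, is_candidate(counts))
```

A real system would replace the string search with corpus frequency queries and would need thresholds to separate the paper's three classes (MWE, probable MWE, non-MWE); the sketch does not attempt to reproduce those criteria.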
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
AI - Linguistics
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/7F14047" target="_blank" >7F14047: Harvesting big text data for under-resourced languages</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
S - Specific research at universities
Others
Publication year
2015
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Ninth Workshop on Recent Advances in Slavonic Natural Language Processing
ISBN
9788026309741
ISSN
2336-4289
e-ISSN
—
Number of pages
10
Pages from-to
103-112
Publisher name
Tribun EU
Place of publication
Brno
Event location
Karlova Studánka
Event date
Jan 1, 2015
Type of event by nationality
EUR - European event
UT code for WoS article
—