Annotation of Czech Texts with Language Mixing
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F16%3A00091344" target="_blank" >RIV/00216224:14330/16:00091344 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-319-45510-5_32" target="_blank" >http://dx.doi.org/10.1007/978-3-319-45510-5_32</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-319-45510-5_32" target="_blank" >10.1007/978-3-319-45510-5_32</a>
Alternative languages
Result language
English
Original language name
Annotation of Czech Texts with Language Mixing
Original language description
Language mixing (the use of foreign-language chunks in a native-language utterance) occurs frequently. Foreign-language chunks have to be detected because they are often annotated incorrectly. The standard pipelines for annotating Czech texts perform no such detection. Before morphological disambiguation, unrecognized words are processed by a Czech guesser, which works well on Czech words (e.g. neologisms, typos) but makes no sense to apply to foreign words. We propose a new pipeline that adds detection of foreign-language chunks and multi-word expressions (MWEs). We experimented with a small corpus, comparing the original (semi-automatic) annotation (including foreign words and MWEs) with the results of the new pipelines. The new pipeline produced fewer incorrect annotations of interlingual homographs and foreign-language chunks than the standard one, and it reduced the number of tokens that have to be processed by the guesser.
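The pipeline idea in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicons, the function names `detect_foreign_chunks` and `annotate`, the labels, and the minimum chunk length are all assumptions made for illustration. It shows only the routing step: runs of foreign-lexicon words are marked as foreign chunks before the guesser is consulted, and a run made up entirely of interlingual homographs is left as Czech.

```python
from typing import List, Set, Tuple

# Toy lexicons: illustrative assumptions, not the resources used in the paper.
CZECH_LEXICON: Set[str] = {"to", "je", "fakt", "top", "film", "a"}
FOREIGN_LEXICON: Set[str] = {"the", "best", "of", "show", "top", "fakt"}


def detect_foreign_chunks(tokens: List[str], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) spans of consecutive foreign-lexicon tokens.

    A span is kept only if it has at least `min_len` tokens and contains at
    least one token that is not also a Czech word, so a run consisting only
    of interlingual homographs is not treated as a foreign chunk.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    # The empty-string sentinel closes a run that reaches the end of the input.
    for i, tok in enumerate(tokens + [""]):
        if tok.lower() in FOREIGN_LEXICON:
            if start is None:
                start = i
            continue
        if start is not None:
            run = tokens[start:i]
            has_non_czech = any(t.lower() not in CZECH_LEXICON for t in run)
            if len(run) >= min_len and has_non_czech:
                spans.append((start, i))
            start = None
    return spans


def annotate(tokens: List[str]) -> List[str]:
    """Label each token FOREIGN, CZECH, or GUESSER (unknown, sent to the guesser)."""
    foreign = {i for s, e in detect_foreign_chunks(tokens) for i in range(s, e)}
    labels: List[str] = []
    for i, tok in enumerate(tokens):
        if i in foreign:
            labels.append("FOREIGN")   # bypasses the Czech guesser
        elif tok.lower() in CZECH_LEXICON:
            labels.append("CZECH")
        else:
            labels.append("GUESSER")   # unrecognized word outside any chunk
    return labels


mixed = annotate(["To", "je", "the", "best", "of", "show", "a", "blogísek"])
homographs = annotate(["fakt", "top", "film"])
```

In this sketch only `blogísek` reaches the guesser, while the four-token foreign chunk bypasses it, which mirrors the reduction in guesser input that the description reports.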
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
IN - Informatics
OECD FORD branch
—
Result continuities
Project
Result was created during the realization of more than one project. More information in the Projects tab.
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2016
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12–16, 2016, Proceedings
ISBN
9783319455099
ISSN
0302-9743
e-ISSN
—
Number of pages
8
Pages from-to
279-286
Publisher name
Springer International Publishing
Place of publication
Switzerland
Event location
Brno, Czech Republic
Event date
Sep 12–16, 2016
Type of event by nationality
CST - National event
UT code for WoS article
—