Annotation of Multi-Word Expressions in Czech Texts

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216224:14210/15:00085165 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14210%2F15%3A00085165)

  • Result on the web

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Annotation of Multi-Word Expressions in Czech Texts

  • Original language description

    Multi-word expressions (MWEs) are difficult to define and also difficult to annotate. Some of them cause serious errors in the traditional annotation pipeline of tokenization, morphological analysis, and morphological disambiguation. Many cases of incorrect annotation in Czech corpora are known. To narrow the research topic, we focus only on fixed MWEs: those with fixed word order and no elidable components. In this paper, we propose a corpus-based method that reveals fixed MWE candidates. From the web-based corpus of Czech, we extracted 25,091 expressions; 2,140 of them were identified as MWEs, 332 as probable MWEs, and 174 can be either MWEs or a single word. Our method is based on an observation from corpus data: when writing an MWE, people are unsure whether it is one word, a word with dashes, or several words. The result is a list of MWE candidates and also an application that classifies the input as MWE, probable MWE, or non-MWE.

  • Czech name

  • Czech description

Classification

  • Type

    D - Article in proceedings

  • CEP classification

    AI - Linguistics

  • OECD FORD branch

Result continuities

  • Project

    7F14047: Harvesting big text data for under-resourced languages

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)
    S - Specific research at universities

Others

  • Publication year

    2015

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    Ninth Workshop on Recent Advances in Slavonic Natural Language Processing

  • ISBN

    9788026309741

  • ISSN

    2336-4289

  • e-ISSN

  • Number of pages

    10

  • Pages from-to

    103-112

  • Publisher name

    Tribun EU

  • Place of publication

    Brno

  • Event location

    Karlova Studánka

  • Event date

    Jan 1, 2015

  • Type of event by nationality

    EUR - European event

  • UT code for WoS article