LD14117

Parsing and multi-word expressions. Towards linguistic precision and computational efficiency in natural language processing (PARSEME)

Project goals

The goal of the proposed project follows from the overall aim of COST Action IC1207. This Action aims at increasing and enhancing the support that Information and Communication Technologies (ICT) provide for the European multilingual heritage. This general aim is addressed by improving the linguistic representativeness, precision and computational efficiency of Natural Language Processing (NLP) applications. The Action focuses on the major bottleneck of these applications: Multi-Word Expressions (MWEs), i.e. sequences of words with unpredictable properties, such as "to count somebody in" or "to take a haircut". A breakthrough in their modeling and processing can only result from a coordinated effort of multidisciplinary experts in different languages, and COST is the most suitable framework for meeting this need. Fourteen European languages will be addressed from a cross-theoretical and cross-methodological perspective, which is necessary for coping with current fragmentation issues. Expected deliverables include enhanced language resources and tools, as well as recommendations of best practices for cutting-edge MWE-aware language models. The Action will lead to a better understanding of the nature of MWEs, establish a long-lasting collaboration within a multilingual network of MWE specialists, and pave the way towards competitive next-generation text processing tools that pay greater attention to language phenomena.

Specifically, the proposed project will concentrate on the specification of MWE annotation over a large corpus (naturally focusing on the Czech language), the annotation of a Czech corpus, and MWE extraction in the form of an electronic dictionary formatted for future NLP applications. Our goal is to publish all electronic language resources openly (under a CC license) so that they are freely accessible for future research and applications.

Keywords

Natural language processing, Czech language, multiword entities, parsing, analysis, corpus, dictionary, language resources, language annotation, morphology, syntax, semantics, meaning

Public support

  • Provider

    Ministry of Education, Youth and Sports

  • Programme

    COST CZ

  • Call for proposals

    COST CZ 4 (SMSM2014LD4)

  • Main participants

    Univerzita Karlova / Matematicko-fyzikální fakulta

  • Contest type

    VS - Public tender

  • Contract ID

    MSMT-8634/2014-1

Alternative language

  • Project name in Czech

    PARSEME: Parsing a víceslovné výrazy - k jazykovědné přesnosti a výpočetní efektivitě ve zpracování přirozeného jazyka

  • Annotation in Czech (English translation)

    The aim of the project is to significantly strengthen our own research in natural language processing, specifically in the analysis of multiword expressions (multiword entities, MWE) from the morphological, syntactic and especially semantic point of view, in cooperation with foreign partners and drawing on their experience. This aim has several successive (sub)goals: a research methodology for this specific area that integrates the so far fragmented findings of the international consortium of partners; the preparation of expert linguistically annotated data (text corpora analysed with respect to MWEs); the extraction of an MWE dictionary from the data prepared in this way; and the preparation of pilot experiments in identifying MWEs in text. A secondary, yet very important goal from the point of view of follow-up research is the preparation of publicly available annotated data and of the aforementioned dictionary in a format suitable for further study and for follow-up machine learning methods.

Scientific branches

  • R&D category

    ZV - Basic research

  • CEP classification - main branch

    AI - Linguistics

  • CEP - secondary branch

    IN - Informatics

  • CEP - another secondary branch

  • 10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
    60201 - General language studies
    60202 - Specific languages
    60203 - Linguistics

Completed project evaluation

  • Provider evaluation

    V - Outstanding project results (of international significance, etc.)

  • Project results evaluation

    We have published seven proceedings papers, a journal paper, a PhD thesis and an annotated corpus. A training school took place in Prague. Strong international cooperation has started, as shown by our high attendance at international events, our activity in working groups and, above all, the collaboration that has continued after the project funding ended.

Solution timeline

  • Realization period - beginning

    Apr 1, 2014

  • Realization period - end

    Mar 31, 2017

  • Project status

    U - Finished project

  • Latest support payment

    Feb 28, 2017

Data delivery to CEP

  • Confidentiality

    S - Complete and truthful project data are not subject to protection under special legal regulations

  • Data delivery code

    CEP18-MSM-LD-U/01:1

  • Data delivery date

    Jun 12, 2018

Finance

  • Total approved costs

    2,152 thou. CZK

  • Public financial support

    2,152 thou. CZK

  • Other public sources

    0 thou. CZK

  • Non public and foreign sources

    0 thou. CZK
