Czech MWE Database
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F08%3A00024204" target="_blank" >RIV/00216224:14330/08:00024204 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Czech MWE Database
Original language description
In this paper we present a recently developed large Czech MWE database that currently contains 160,000 MWEs (treated as lexical units). We describe the structure of the database and outline the basic types of MWEs according to the domains they belong to. We compare the MWE database with corpus data from the Czech National Corpus (approx. 100 million tokens) and present the results of this comparison. To obtain a more complete list of MWEs, we propose and use a technique based on the Word Sketch Engine, which allows us to work with statistical parameters such as the frequency of MWEs and their components as well as the salience of whole MWEs. We also discuss how the database can be used to improve tagging and lemmatization. The final goal is to recognize MWEs in corpus text and lemmatize them as complete lexical units, i.e. to make tagging and lemmatization more adequate.
Czech name
Česká databáze víceslovných výrazů
Czech description
The article describes the structure and content of the Czech database of multiword expressions, which currently contains more than 160,000 items, and compares it with data from the Czech National Corpus. It also proposes how to extend the database using the Word Sketch Engine.
Classification
Type
D - Article in proceedings
CEP classification
IN - Informatics
OECD FORD branch
—
Result continuities
Project
Result was created during the realization of more than one project. More information in the Projects tab.
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2008
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC '08)
ISBN
2-9517408-4-0
ISSN
—
e-ISSN
—
Number of pages
5
Pages from-to
—
Publisher name
European Language Resources Association (ELRA)
Place of publication
Marrakech, Morocco
Event location
Marrakech, Morocco
Event date
May 28, 2008
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—