Finding Terms in Corpora for Many Languages with the Sketch Engine
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F14%3A00075387" target="_blank" >RIV/00216224:14330/14:00075387 - isvavai.cz</a>
Result on the web
<a href="http://aclweb.org/anthology/E/E14/E14-2014.pdf" target="_blank" >http://aclweb.org/anthology/E/E14/E14-2014.pdf</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Finding Terms in Corpora for Many Languages with the Sketch Engine
Original language description
Term candidates for a domain, in a language, can be found by: taking a corpus for the domain, and a reference corpus for the language; identifying the grammatical shape of a term in the language; tokenising, lemmatising and POS-tagging both corpora; identifying (and counting) the items in each corpus which match the grammatical shape; and, for each item in the domain corpus, comparing its frequency with its frequency in the reference corpus. Then, the items with the highest frequency in the domain corpus in comparison to the reference corpus will be the top term candidates. None of the steps above are unusual or innovative for NLP (see, e.g., Aker et al., 2013; Gojun et al., 2012). However, it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non-programmers to find the terms in a domain. This is what we have done in the Sketch Engine (Kilgarriff et al., 2004), and will demonstrate.
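The steps listed in the description can be illustrated with a minimal Python sketch. It is not the Sketch Engine implementation. It assumes both corpora are already tokenised, lemmatised and POS-tagged as lists of (lemma, tag) pairs with one-letter tags. The term shape (adjectives or nouns followed by a noun), the names TERM_SHAPE, count_candidates and rank_terms, and the scoring formula (a smoothed ratio of per-million frequencies) are all assumptions made for illustration; the description does not state which statistic is used.

```python
import re
from collections import Counter

# Hypothetical term grammar over one-letter POS tags:
# any run of adjectives (A) or nouns (N), ending in a noun.
TERM_SHAPE = re.compile(r"[AN]*N")


def count_candidates(tagged, max_len=3):
    """Count lemma n-grams (up to max_len tokens) whose tags match TERM_SHAPE."""
    counts = Counter()
    for start in range(len(tagged)):
        for end in range(start + 1, min(start + max_len, len(tagged)) + 1):
            window = tagged[start:end]
            tags = "".join(tag for _, tag in window)
            if TERM_SHAPE.fullmatch(tags):
                counts[" ".join(lemma for lemma, _ in window)] += 1
    return counts


def rank_terms(domain, reference, smoothing=1.0):
    """Rank domain-corpus items by smoothed relative frequency against the reference."""
    dom_counts = count_candidates(domain)
    ref_counts = count_candidates(reference)
    dom_size = max(len(domain), 1)
    ref_size = max(len(reference), 1)
    scores = {}
    for item, freq in dom_counts.items():
        dom_fpm = freq * 1_000_000 / dom_size              # frequency per million tokens
        ref_fpm = ref_counts[item] * 1_000_000 / ref_size  # 0 if unseen in the reference
        # Assumed score: the smoothing constant avoids division by zero.
        scores[item] = (dom_fpm + smoothing) / (ref_fpm + smoothing)
    return sorted(scores, key=scores.get, reverse=True)


# Toy corpora, invented for illustration only.
domain = [("neural", "A"), ("network", "N"), ("learn", "V"),
          ("neural", "A"), ("network", "N")]
reference = [("network", "N"), ("be", "V"), ("big", "A"),
             ("house", "N"), ("house", "N")]

ranked = rank_terms(domain, reference)
print(ranked)
```

In this toy example the item "neural network" never occurs in the reference corpus, so it ranks above "network", which occurs in both.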
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
IN - Informatics
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/LM2010013" target="_blank" >LM2010013: LINDAT-CLARIN: Institute for analysis, processing and distribution of linguistic data</a><br>
Continuities
P - Research and development project financed from public funds (with a link to CEP)<br>S - Specific research at universities
Others
Publication year
2014
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics
ISBN
9781937284756
ISSN
—
e-ISSN
—
Number of pages
4
Pages from-to
53-56
Publisher name
The Association for Computational Linguistics
Place of publication
Gothenburg, Sweden
Event location
Gothenburg, Sweden
Event date
Jan 1, 2014
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—