Yet Another Language Identifier
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F12%3A10130078" target="_blank" >RIV/00216208:11320/12:10130078 - isvavai.cz</a>
Result on the web
<a href="http://aclweb.org/anthology-new/E/E12/E12-3006.pdf" target="_blank" >http://aclweb.org/anthology-new/E/E12/E12-3006.pdf</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Yet Another Language Identifier
Original language description
Language identification of written text has been studied for several decades. Despite this, most research focuses on a few of the most widely spoken languages, whereas minor ones are ignored. Identifying a larger number of languages brings new difficulties that do not arise with only a few languages, and these difficulties decrease accuracy. The objective of this paper is to investigate the sources of this degradation. To isolate the impact of individual factors, 5 different algorithms and 3 different numbers of languages are used. The Support Vector Machine algorithm achieved an accuracy of 98% for 90 languages, and the YALI algorithm, based on a scoring function, achieved an accuracy of 95.4%. The YALI algorithm has slightly lower accuracy but classifies around 17 times faster, and its training is more than 4000 times faster. Three different data sets with various numbers of languages and sample sizes were prepared to overcome the lack of standardized data sets. These
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
IN - Informatics
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/7E11042" target="_blank" >7E11042: Knowledge Helper for Medical and Other Information users</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2012
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics
ISBN
978-1-937284-19-0
ISSN
—
e-ISSN
—
Number of pages
9
Pages from-to
46-54
Publisher name
Association for Computational Linguistics
Place of publication
Avignon, France
Event location
Avignon, France
Event date
Apr 23, 2012
Type of event by nationality
CST - Nationwide event
UT code for WoS article
—