Segmentation from 97% to 100%: Is It Time for Some Linguistics?
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F12%3A00062085" target="_blank" >RIV/00216224:14330/12:00062085 - isvavai.cz</a>
Result on the web
<a href="http://www.fi.muni.cz/usr/sojka/papers/sojka-raslan2012.pdf" target="_blank" >http://www.fi.muni.cz/usr/sojka/papers/sojka-raslan2012.pdf</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Segmentation from 97% to 100%: Is It Time for Some Linguistics?
Original language description
Many tasks in natural language processing (NLP) require segmentation algorithms: segmentation of paragraphs into sentences, segmentation of sentences into words (needed in languages like Chinese or Thai), and segmentation of words into syllables (hyphenation) or into morphological parts (e.g. getting the word stem for indexing). Many other tasks (e.g. tagging) can also be formulated as segmentation problems. We evaluate the methodology of using competing patterns for these tasks and determine the complexity of creating space-optimal (minimal) patterns that completely (100%) implement the segmentation task. We formally define this task and prove that it is in the class of non-polynomial optimization problems. However, finding space-efficient competing patterns for real NLP tasks is feasible and gives efficient, scalable solutions to the segmentation task: segmentation is done in constant time with respect to the size of the segmented dictionary.
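The description above does not spell out how competing patterns are applied, so the following is only a minimal sketch under stated assumptions. It assumes the Liang-style convention known from TeX hyphenation: digits inside a pattern give a level to the gap between two letters, the highest level over all matching patterns wins, odd levels allow a break and even levels forbid it. The pattern set `PATTERNS` and the function names are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import re

# Hypothetical toy patterns (assumption, not from the paper):
# "g1m" and "1t" propose breaks, "s2t." overrides "1t" at a word end.
PATTERNS = ["g1m", "1t", "s2t."]


def parse_pattern(pattern):
    """Split a pattern such as 'g1m' into its letters and per-gap levels."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)  # level of the gap before letters[pos]
        else:
            pos += 1
    return letters, levels


def segment(word, patterns):
    """Segment a word using competing patterns (highest level per gap wins)."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)  # scores[j] = level of gap before padded[j]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = table.get(padded[start:end])
            if levels is None:
                continue
            for offset, level in enumerate(levels):
                scores[start + offset] = max(scores[start + offset], level)
    pieces, last = [], 0
    for i in range(1, len(word)):
        # The gap between word[i-1] and word[i] is the gap before padded[i+1].
        if scores[i + 1] % 2 == 1:  # odd level: break allowed
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


print(segment("segmentation", PATTERNS))
print(segment("test", PATTERNS))
```

In this toy setup the general pattern `1t` covers most cases, while the more specific `s2t.` acts as an exception that suppresses a break before a word-final `t`. The substring lookup depends only on the word length and not on how many words the patterns were derived from, which is one way to read the constant-time claim in the description.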
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
IN - Informatics
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/LA09016" target="_blank" >LA09016: Czech Republic membership in the European Research Consortium for Informatics and Mathematics (ERCIM)</a><br>
Continuities
P - Research and development project financed from public funds (with a link to CEP)<br>S - Specific research at universities
Others
Publication year
2012
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012
ISBN
9788026303138
ISSN
—
e-ISSN
—
Number of pages
11
Pages from-to
121-131
Publisher name
Tribun EU
Place of publication
Brno
Event location
Karlova Studánka
Event date
Dec 7, 2011
Type of event by nationality
EUR - European event
UT code for WoS article
—