Crowd Sourcing as an Improvement of N-Grams Text Document Classification Algorithm

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989592%3A15410%2F20%3A73606106" target="_blank" >RIV/61989592:15410/20:73606106 - isvavai.cz</a>
Result on the web
<a href="https://obd.upol.cz/id_publ/333185992" target="_blank" >https://obd.upol.cz/id_publ/333185992</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/SMAP49528.2020.9248454" target="_blank" >10.1109/SMAP49528.2020.9248454</a>

Alternative languages

Result language
English
Title in original language
Crowd Sourcing as an Improvement of N-Grams Text Document Classification Algorithm
Result description in original language
A common task in natural language processing is text classification, which is useful for e.g. spam filters, document sorting, classification of scientific articles, or plagiarism detection. This is still done best and most accurately by a human; on the other hand, we can often accept a certain error in the classification in exchange for speed. Here, a natural language processing mechanism transforms natural-language text into a form understandable by a classifier such as K-Nearest Neighbour, Decision Trees, Artificial Neural Networks or Support Vector Machines. We can also use this human element to improve the accuracy of automated classification by means of crowdsourcing. This work deals with the classification of text documents and its improvement through crowdsourcing. Its goal is to design and implement a prototype text document classifier based on document similarity, and to design a mechanism for evaluation and crowdsourcing-based improvement of the classification. The N-grams algorithm was chosen for classification and implemented in Java. The crowdsourcing interface was created using the WordPress CMS. In addition to data collection, the purpose of the interface is to evaluate classification accuracy, which leads to an extension of the classifier test data set and thus to more successful classification. We have tested our approach on two data sets, with promising preliminary results even across different languages. This led to a real-world implementation started at the beginning of 2019 in cooperation between two universities: VŠB-TUO and OSU.
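The description above names an N-grams classifier based on document similarity but gives no implementation details. The following is only a minimal sketch under stated assumptions: character trigrams, cosine similarity, and assigning the label of the single most similar labelled document are illustrative choices, not the authors' confirmed method. The class name NGramSketch and all method names are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the classifier described in the paper.
class NGramSketch {

    // Counts overlapping character n-grams in lower-cased,
    // whitespace-normalised text.
    static Map<String, Integer> profile(String text, int n) {
        String s = text.toLowerCase().replaceAll("\\s+", " ").trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two n-gram count vectors (assumed measure).
    // Returns 0.0 when either profile is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += v * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the label of the most similar labelled document,
    // or null when the labelled set is empty.
    // The map goes from document text to its label.
    static String classify(String text, Map<String, String> labelled, int n) {
        Map<String, Integer> query = profile(text, n);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> e : labelled.entrySet()) {
            double score = cosine(query, profile(e.getKey(), n));
            if (score > bestScore) {
                bestScore = score;
                best = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> labelled = new LinkedHashMap<>();
        labelled.put("win a free prize now, click here", "spam");
        labelled.put("meeting agenda for the project review", "ham");

        String doc = "click here to win your free prize";
        String predicted = classify(doc, labelled, 3);
        System.out.println(predicted);

        // Assumed crowdsourcing step: once a human confirms or corrects
        // the label, the document joins the labelled set.
        labelled.put(doc, predicted);
    }
}
```

In this sketch the crowdsourcing feedback is modelled simply as adding a human-verified document to the labelled set, which is one possible reading of the data set extension mentioned in the description.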

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
I - Institutional support for the long-term conceptual development of a research organisation

Other

Year of implementation
2020
Data confidentiality code
S - Complete and truthful project data are not subject to protection under special legal regulations

Data specific to the result type

Article name in the collection
SMAP 2020 - 15th International Workshop on Semantic and Social Media Adaptation and Personalization
ISBN
978-1-72815-919-5
ISSN
—
e-ISSN
—
Number of pages
5
Pages from-to
1-6
Publisher name
IEEE Computer Society Press
Place of publication
New York
Event location
Zakynthos
Event date
29. 10. 2020
Event type by nationality
EUR - European event
UT WoS article code
—

Similar results (10)

Vylepšení klasifikace textových dokumentů algoritmem N-Grams pomocí crowdsourcingu
Support of Informal Carers for People after a Stroke with Crowdsouurcing and Natural Language Processing
Support of Informal Carers for People After a Stroke with Crowdsourcing and Natural Language Processing
