Word-Graph vs. bag-of-words feature extraction for solving author identification problem
Result identifiers
Result code in IS VaVaI
RIV/61989100:27510/19:10243815 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27510%2F19%3A10243815)
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in original language
Word-Graph vs. bag-of-words feature extraction for solving author identification problem
Result description in original language
In this paper, we examine multiple methods for text vectorization in the context of text classification. We compare two variants of the traditional Bag-of-Words technique with a newly proposed Word-Graph approach, which represents a text document as a graph and measures similarities between graph structures. We further propose modifications to the Word-Graph method that may improve classification accuracy. Experiments on an author identification problem, using a dataset of speeches made during meetings of the Slovak National Parliament, show that the Word-Graph approach achieves accuracy similar to that of the traditional methods. The proposed modifications significantly improve performance when the number of documents per class in the training set is imbalanced. (C) 2019 VSB-Technical University of Ostrava. All rights reserved.
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
—
Linkages
I - Institutional support for the long-term conceptual development of a research organization
Other
Year of implementation
2019
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Title of the article in the proceedings
Proceedings of the 13th International Conference on Strategic Management and its Support by Information Systems: May 21th-22th, 2019, Ostrava, Czech Republic
ISBN
978-80-248-4305-6
ISSN
2570-5776
e-ISSN
—
Number of pages
8
Pages from-to
418-425
Publisher name
VŠB - Technical University of Ostrava
Place of publication
Ostrava
Event location
Ostrava
Event date
May 21, 2019
Event type by nationality of participants
WRD - Worldwide event
UT WoS code of the article
—