Study on Methods for Vector Representation of Text for Topic-based Clustering of News Articles
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F46747885%3A24220%2F15%3A%230003432" target="_blank" >RIV/46747885:24220/15:#0003432 - isvavai.cz</a>
Alternative codes found
RIV/46747885:24220/15:00002977
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in original language
Study on Methods for Vector Representation of Text for Topic-based Clustering of News Articles
Description of the result in original language
This paper deals with methods for automatic topic-based clustering of news articles. For this purpose, three approaches are evaluated experimentally on two different test sets, both compiled from Czech newspaper articles. The first clustering scheme uses a conventional approach, Latent Semantic Analysis (LSA). In contrast, the remaining two approaches were introduced recently and use Random Manhattan Indexing (RMI) or the Skip-gram model (SGM) to obtain vector representations of input words and/or documents. Our experimental results show that both new vectorization methods are well suited to topic-based clustering and that they lead to even better results than LSA. The best clustering accuracy was reached by SGM; it is 17 % lower than the accuracy of human annotators. On the other hand, RMI yielded only slightly worse results than SGM and has one important practical advantage: it can handle out-of-vocabulary (OOV) words without the need for t
Classification
Type
D - Article in proceedings
CEP field
JC - Computer hardware and software
OECD FORD field
—
Result linkages
Project
<a href="/cs/project/TA04010199" target="_blank" >TA04010199: MULTILINMEDIA - Multilinguální platforma pro monitoring a analýzu multimédií (Multilingual platform for monitoring and analysis of multimedia)</a>
Linkages
P - Research and development project financed from public funds (with a link to CEP)
I - Institutional support for the long-term conceptual development of a research organisation
Other
Year of publication
2015
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Title of the article's proceedings
7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics
ISBN
978-83-932640-8-7
ISSN
—
e-ISSN
—
Number of pages
5
Pages from-to
530-534
Publisher name
Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu
Place of publication
Poland
Event location
Poland, Poznań
Event date
1. 1. 2015
Event type by nationality of participants
WRD - Worldwide event
UT WoS code of the article
—