Study on Methods for Vector Representation of Text for Topic-based Clustering of News Articles
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F46747885%3A24220%2F15%3A%230003432" target="_blank" >RIV/46747885:24220/15:#0003432 - isvavai.cz</a>
Alternative codes found
RIV/46747885:24220/15:00002977
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Study on Methods for Vector Representation of Text for Topic-based Clustering of News Articles
Original language description
This paper deals with methods for automatic topic-based clustering of news articles. For this purpose, three approaches are evaluated experimentally using two different test sets, both compiled from Czech newspaper articles. The first clustering scheme uses a conventional approach: Latent Semantic Analysis (LSA). In contrast, the remaining two approaches were introduced recently and use Random Manhattan Indexing (RMI) or the Skip-gram model (SGM) to obtain vector representations of input words and/or documents. Our experimental results show that both new vectorization methods are well suited to topic-based clustering and that they lead to even better results than LSA. The best clustering accuracy was reached by SGM and is 17 % lower than the accuracy of human annotators. On the other hand, RMI yielded only slightly worse results than SGM and has one important advantage for practical use: it can handle out-of-vocabulary (OOV) words without the need for t
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
JC - Computer hardware and software
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/TA04010199" target="_blank" >TA04010199: MULTILINMEDIA - Multilingual Multimedia Monitoring and Analyzing Platform</a><br>
Continuities
P - Research and development project financed from public funds (with a link to CEP)<br>I - Institutional support for the long-term conceptual development of a research organisation
Others
Publication year
2015
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics
ISBN
978-83-932640-8-7
ISSN
—
e-ISSN
—
Number of pages
5
Pages from-to
530-534
Publisher name
Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu
Place of publication
Poland
Event location
Poland, Poznań
Event date
Jan 1, 2015
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—