Study on Methods for Vector Representation of Text for Topic-based Clustering of News Articles

Result description

—

Keywords

ASR, language processing, topic-based clustering

Result identifiers

Result code in IS VaVaI
RIV/46747885:24220/15:#0003432 - isvavai.cz
Alternative codes found
RIV/46747885:24220/15:00002977
Result on the web
—
DOI - Digital Object Identifier
—

Alternative languages

Result language
English
Title in original language
Study on Methods for Vector Representation of Text for Topic-based Clustering of News Articles
Result description in original language
This paper deals with methods for automatic topic-based clustering of news articles. For this purpose, three approaches are evaluated experimentally on two different test sets, both compiled from Czech newspaper articles. The first clustering scheme uses a conventional approach: Latent Semantic Analysis (LSA). In contrast, the remaining two approaches were introduced recently and use Random Manhattan Indexing (RMI) or the Skip-gram model (SGM) to obtain vector representations of input words and/or documents. Our experimental results show that both new vectorization methods are well suited to topic-based clustering and that they lead to even better results than LSA. The best clustering accuracy was reached by SGM and is 17 % lower than the accuracy of human annotators. On the other hand, RMI yielded only slightly worse results than SGM and has one important advantage for practical use: it can handle out-of-vocabulary (OOV) words without the need for t

Classification

Type
D - Article in proceedings
CEP field
JC - Computer hardware and software
OECD FORD field
—

Result linkages

Project
TA04010199: MULTILINMEDIA - Multilingual platform for monitoring and analysis of multimedia
Linkages
P - Research and development project financed from public funds (with a link to CEP)
I - Institutional support for the long-term conceptual development of a research organisation

Other

Year of application
2015
Data confidentiality code
S - Complete and truthful project data are not subject to protection under special legal regulations

Data specific to the result type

Article name in the proceedings
7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics
ISBN
978-83-932640-8-7
ISSN
—
e-ISSN
—
Number of pages
5
Pages from-to
530-534
Publisher name
Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu
Place of publication
Poland
Event location
Poznań, Poland
Event date
1. 1. 2015
Event type by nationality
WRD - Worldwide event
UT WoS article code
—
