Study on Methods for Vector Representation of Text for Topic-based Clustering of News Articles

The result's identifiers

  • Result code in IS VaVaI

    RIV/46747885:24220/15:#0003432 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F46747885%3A24220%2F15%3A%230003432)

  • Alternative codes found

    RIV/46747885:24220/15:00002977

  • Result on the web

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Study on Methods for Vector Representation of Text for Topic-based Clustering of News Articles

  • Original language description

    This paper deals with methods for automatic topic-based clustering of news articles. For this purpose, three approaches are evaluated experimentally using two different test sets, both compiled from Czech newspaper articles. The first clustering scheme uses a conventional approach, Latent Semantic Analysis (LSA). In contrast, the remaining two approaches were introduced recently and use Random Manhattan Indexing (RMI) or the Skip-gram model (SGM) to obtain vector representations of input words and/or documents. Our experimental results show that both new vectorization methods are well suited to topic-based clustering and that they lead to even better results than LSA. The best clustering accuracy was reached by SGM, and it is 17 % lower than the accuracy of human annotators. On the other hand, RMI yielded only slightly worse results than SGM and has one important advantage for practical use: it can handle out-of-vocabulary (OOV) words without the need for t

  • Czech name

  • Czech description

Classification

  • Type

    D - Article in proceedings

  • CEP classification

    JC - Computer hardware and software

  • OECD FORD branch

Result continuities

  • Project

    TA04010199: MULTILINMEDIA - Multilingual Multimedia Monitoring and Analyzing Platform

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)
    I - Institutional support for the long-term conceptual development of a research organization

Others

  • Publication year

    2015

  • Confidentiality

    S - The complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics

  • ISBN

    978-83-932640-8-7

  • ISSN

  • e-ISSN

  • Number of pages

    5

  • Pages from-to

    530-534

  • Publisher name

    Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu

  • Place of publication

    Poland

  • Event location

    Poland, Poznań

  • Event date

    Jan 1, 2015

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article