Unsupervised Document Classification and Topic Detection
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F17%3A43932650" target="_blank" >RIV/49777513:23520/17:43932650 - isvavai.cz</a>
Result on the web
<a href="https://link.springer.com/chapter/10.1007%2F978-3-319-66429-3_75" target="_blank" >https://link.springer.com/chapter/10.1007%2F978-3-319-66429-3_75</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-319-66429-3_75" target="_blank" >10.1007/978-3-319-66429-3_75</a>
Alternative languages
Result language
English
Original language name
Unsupervised Document Classification and Topic Detection
Original language description
This article presents a method for pre-processing the feature vectors that represent text documents before the documents are classified using unsupervised methods. The main goal is to show that state-of-the-art classification methods can be improved by a certain data preparation process. Two methods are considered: standard K-means clustering and Latent Dirichlet allocation (LDA). Both are widely used in text processing. The algorithms are applied to two data sets in two different languages. The first, 20NewsGroup, is a widely used benchmark for the classification of English documents. The second was selected from a large body of Czech news articles and was used mainly to compare the performance of the tested methods on a less frequently studied language. Furthermore, the unsupervised methods are compared with supervised ones in order to ascertain, in some sense, the upper bound of the task.
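As a rough illustration of the unsupervised setting described above, the following minimal sketch clusters a toy corpus with K-means over TF-IDF vectors. It is not the pre-processing method from the article, which this record does not specify: the TF-IDF weighting, the L2 normalisation, the farthest-first seeding, the helper names (`tfidf_matrix`, `sq_dist`, `kmeans`) and the six example sentences are all assumptions made for illustration. LDA is not covered.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows): one L2-normalised TF-IDF row per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenised for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenised for term in set(doc))
    rows = []
    for doc in tokenised:
        row = [0.0] * len(vocab)
        for term, count in Counter(doc).items():
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            row[index[term]] = count * idf
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100):
    """Plain K-means; returns one cluster label per row."""
    # Assumption: deterministic farthest-first seeding, chosen here only
    # so that the toy run is reproducible without a random seed.
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(
            max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows
        ]
        if new_labels == labels:
            break  # converged: no document changed cluster
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: three sports and three politics sentences.
DOCS = [
    "team wins football match",
    "football team loses match",
    "coach praises football team",
    "parliament passes budget vote",
    "government delays budget vote",
    "minister defends government budget",
]

_, rows = tfidf_matrix(DOCS)
labels = kmeans(rows, k=2)
print(labels)
```

On a real benchmark such as 20NewsGroup, one would more likely use an established library implementation and evaluate the clusters against the known newsgroup labels; the sketch only shows the assignment and update steps on data small enough to inspect by hand.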
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
20205 - Automation and control systems
Result continuities
Project
<a href="/en/project/DG16P02B048" target="_blank" >DG16P02B048: System for permanent preservation of documentation and presentation of historical sources from the period of totalitarian regimes</a><br>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2017
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Speech and Computer 19th International Conference, SPECOM 2017, Hatfield, UK, September 12-16, 2017, Proceedings
ISBN
978-3-319-66428-6
ISSN
0302-9743
e-ISSN
not stated
Number of pages
9
Pages from-to
748-756
Publisher name
Springer
Place of publication
Cham
Event location
Hatfield, Hertfordshire, United Kingdom
Event date
Sep 12, 2017
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—