An Approach for Textual Based Clustering Using Word Embedding
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F21%3A10442329" target="_blank" >RIV/00216208:11320/21:10442329 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
angličtina
Original language name
An Approach for Textual Based Clustering Using Word Embedding
Original language description
Numerous endeavors have been made to improve the retrieval procedure in Textual Case-Based Reasoning (TCBR) utilizing clustering and feature selection strategies. SOPHisticated Information Analysis (SOPHIA) approach is one of the most successful efforts which is characterized by its ability to work without the domain of knowledge or language dependency. SOPHIA is based on the conditional probability, which facilitates an advanced Knowledge Discovery (KD) framework for case-based retrieval. SOPHIA attracts clusters by themes which contain only one word in each. However, using one word is not sufficient to construct cluster attractors because the exclusion of the other words associated with that word in the same context could not give a full picture of the theme. The main contribution of this chapter is to introduce an enhanced clustering approach called GloSOPHIA (GloVe SOPHIA) that extends SOPHIA by integrating word embedding technique to enhance KD in TCBR. A new algorithm is proposed to feed SOPHIA with similar terms vector space gained from Global Vector (GloVe) embedding technique. The proposed approach is evaluated on two different language corpora and the results are compared with SOPHIA, K-means, and Self- Organizing Map (SOM) in several evaluation criteria. The results indicate that GloSOPHIA outperforms the other clustering methods in most of the evaluation criteria.
Czech name
—
Czech description
—
Classification
Type
C - Chapter in a specialist book
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2021
Confidentiality
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Data specific for result type
Book/collection name
Machine Learning and Big Data Analytics Paradigms: Analysis, Applications and Challenges
ISBN
978-3-030-59337-7
Number of pages of the result
20
Pages from-to
261-280
Number of pages of the book
323
Publisher name
Springer
Place of publication
Cham
UT code for WoS chapter
—