Analysis of the Semantic Vector Space Induced by a Neural Language Model and a Corpus
The result's identifiers
Result code in IS VaVaI
RIV/61988987:17610/22:A2302FNM - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61988987%3A17610%2F22%3AA2302FNM)
Result on the web
http://ceur-ws.org/Vol-3226/
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Analysis of the Semantic Vector Space Induced by a Neural Language Model and a Corpus
Original language description
Although contextual word representations produced by transformer-based language models (e.g., BERT) have proven very successful in many kinds of NLP tasks, little is still known about how these contextual embeddings relate to word meanings or semantic features. In this article, we provide a quantitative analysis of the semantic vector space induced by the XLM-RoBERTa model and the Wikicorpus. We study the geometric properties of the vector embeddings of selected words. We use the HDBSCAN clustering algorithm and propose a score, called the Cluster Dispersion Score, which reflects how dispersed the collection of clusters is. Our analysis shows that the number of meanings of a word is not directly correlated with the dispersion of that word's embeddings in the semantic vector space induced by the language model and the corpus. We also provide some observations about how the embeddings of several selected words divide into clusters.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10200 - Computer and information sciences
Result continuities
Project
—
Continuities
S - Specific university research
Others
Publication year
2022
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
ITAT 2022. Information Technologies - Applications and Theory 2022
ISBN
—
ISSN
1613-0073
e-ISSN
—
Number of pages
8
Pages from-to
103-110
Publisher name
CEUR-WS
Place of publication
Aachen
Event location
Zuberec
Event date
Sep 23, 2022
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—