Analysis of the Semantic Vector Space Induced by a Neural Language Model and a Corpus
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61988987%3A17610%2F22%3AA2302FNM" target="_blank" >RIV/61988987:17610/22:A2302FNM - isvavai.cz</a>
Result on the web
<a href="http://ceur-ws.org/Vol-3226/" target="_blank" >http://ceur-ws.org/Vol-3226/</a>
DOI - Digital Object Identifier
—
Alternative languages
Language of the result
English
Title in the original language
Analysis of the Semantic Vector Space Induced by a Neural Language Model and a Corpus
Description in the original language
Although contextual word representations produced by transformer-based language models (e.g., BERT) have proven very successful in many kinds of NLP tasks, little is still known about how these contextual embeddings relate to word meanings or semantic features. In this article, we provide a quantitative analysis of the semantic vector space induced by the XLM-RoBERTa model and the Wikicorpus. We study the geometric properties of the vector embeddings of selected words. We use the HDBSCAN clustering algorithm and propose a score, called the Cluster Dispersion Score, which reflects how dispersed a collection of clusters is. Our analysis shows that the number of meanings of a word is not directly correlated with the dispersion of that word's embeddings in the semantic vector space induced by the language model and a corpus. We also provide some observations about the division of the clusters of embeddings for several selected words.
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10200 - Computer and information sciences
Result continuities
Project
—
Continuities
S - Specific research at universities
Others
Year of implementation
2022
Data confidentiality code
S - Complete and truthful data on the project do not fall under protection pursuant to special legal regulations
Data specific to the result type
Article name in the proceedings
ITAT 2022. Information Technologies - Applications and Theory 2022
ISBN
—
ISSN
1613-0073
e-ISSN
—
Number of result pages
8
Pages from-to
103-110
Publisher name
CEUR-WS
Place of publication
Aachen
Event location
Zuberec
Event date
23. 9. 2022
Event type by nationality
WRD - Worldwide event
UT WoS article code
—