Multilingual Embeddings for Clustering Cultural Events
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AR8TWD2E9" target="_blank" >RIV/00216208:11320/22:R8TWD2E9 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1007/978-3-031-16500-9_8" target="_blank" >https://doi.org/10.1007/978-3-031-16500-9_8</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-16500-9_8" target="_blank" >10.1007/978-3-031-16500-9_8</a>
Alternative languages
Result language
English
Title in the original language
Multilingual Embeddings for Clustering Cultural Events
Result description in the original language
In the present paper we describe our approach to semi-automatic text annotation based on clustering. Given a large collection of announcements of cultural events from several websites, we group them based on their content and infer respective semantic categories that can be used for annotation (e.g. lecture, sports, food, music). We experiment with various models for vectorising the texts, including pretrained multilingual Sentence Transformers and multilingual ELMo models. The produced text embeddings are then clustered using K-means. We evaluate our clustering results using a stratified sample of texts with pre-existing categories (collected from websites listing the events) as well as intrinsic evaluation measures. The rationale behind this work is to produce a single categorisation covering texts from various sources and in two languages - English and Russian. The labelled collection of texts is intended for use in a Digital Humanities project aimed at describing cultural life in a selected location, for example, comparing types of events in Russian and British cities.
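The pipeline described above (vectorise texts, cluster with K-means, compare clusters with pre-existing categories) can be illustrated with a minimal sketch. This is not the authors' code. The two-dimensional vectors are toy stand-ins for real text embeddings, the category names are hypothetical, and cluster purity is used here only as one simple example of comparing clusters with existing categories; the record does not say which measures the paper reports.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain K-means (Lloyd's algorithm); returns one cluster index per vector."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [None] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged, nothing changed
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignments


def purity(assignments, gold_labels):
    """Share of items that carry the majority gold label of their cluster."""
    clusters = {}
    for cluster_id, label in zip(assignments, gold_labels):
        clusters.setdefault(cluster_id, []).append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in clusters.values()
    )
    return majority_total / len(gold_labels)


# Toy stand-ins for text embeddings (assumption: real ones would come from
# a multilingual sentence encoder and have hundreds of dimensions).
embeddings = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],   # e.g. music announcements
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # e.g. lecture announcements
]
# Hypothetical pre-existing categories taken from the source websites.
gold_categories = ["music", "music", "music", "lecture", "lecture", "lecture"]

assignments = kmeans(embeddings, k=2)
print("cluster assignments:", assignments)
print("purity:", purity(assignments, gold_categories))
```

In a real setting the number of clusters would have to be chosen, for instance by comparing an intrinsic measure across several values of k, and the clusters would then be inspected by hand to name the categories.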
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
—
Linkages
—
Other
Year of publication
2022
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Title of the article in the proceedings
Analysis of Images, Social Networks and Texts
ISBN
978-3-031-16500-9
ISSN
—
e-ISSN
—
Number of pages
13
Pages from-to
84-96
Publisher name
Springer International Publishing
Place of publication
—
Event location
Cham
Event date
1. 1. 2022
Event type by nationality
WRD - Worldwide event
UT WoS code of the article
—