Multilingual Embeddings for Clustering Cultural Events
The result's identifiers
Result code in IS VaVaI
RIV/00216208:11320/22:R8TWD2E9 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AR8TWD2E9)
Result on the web
https://doi.org/10.1007/978-3-031-16500-9_8
DOI - Digital Object Identifier
10.1007/978-3-031-16500-9_8 (http://dx.doi.org/10.1007/978-3-031-16500-9_8)
Alternative languages
Result language
English
Original language name
Multilingual Embeddings for Clustering Cultural Events
Original language description
In the present paper, we describe our approach to semi-automatic text annotation based on clustering. Given a large collection of announcements of cultural events from several websites, we group them based on their content and infer respective semantic categories that can be used for annotation (e.g., lecture, sports, food, music). We experiment with various models for vectorising the texts, including pretrained multilingual Sentence Transformers and multilingual ELMo models. The produced text embeddings are then clustered using K-means. We evaluate our clustering results using a stratified sample of texts with pre-existing categories (collected from websites listing the events) as well as intrinsic evaluation measures. The rationale behind this work is to produce a single categorisation covering texts from various sources and in two languages, English and Russian. The labelled collection of texts is intended for use in a Digital Humanities project aimed at describing cultural life in a selected location, for example, comparing types of events in Russian and British cities.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2022
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Analysis of Images, Social Networks and Texts
ISBN
978-3-031-16500-9
ISSN
—
e-ISSN
—
Number of pages
13
Pages from-to
84-96
Publisher name
Springer International Publishing
Place of publication
—
Event location
Cham
Event date
Jan 1, 2022
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—