Open dataset discovery using context-enhanced similarity search
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21240%2F22%3A00359555" target="_blank" >RIV/68407700:21240/22:00359555 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1007/s10115-022-01751-z" target="_blank" >https://doi.org/10.1007/s10115-022-01751-z</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s10115-022-01751-z" target="_blank" >10.1007/s10115-022-01751-z</a>
Alternative languages
Result language
English
Title in the original language
Open dataset discovery using context-enhanced similarity search
Result description in the original language
Today, open data catalogs enable users to search for datasets with full-text queries in metadata records combined with simple faceted filtering. Using this combination, a user is able to discover a significant number of the datasets relevant to the user’s search intent. However, there still remain relevant datasets that are hard to find because of the enormous sparsity of their metadata (e.g., only a few keywords). As an alternative, in this paper, we propose an approach to dataset discovery based on similarity search over metadata descriptions enhanced by various semantic contexts. In general, the semantic contexts enrich the dataset metadata in a way that enables the identification of additional datasets relevant to a query that could not be retrieved using just keyword or full-text search. In the experimental evaluation, we show that context-enhanced similarity retrieval methods increase the findability of relevant datasets, thus improving the retrieval recall that is critical in dataset discovery scenarios. As part of the evaluation, we created a catalog-like user interface for dataset discovery and recorded streams of user actions, which we used to create the ground truth. For the sake of reproducibility, we published the entire evaluation testbed.
Title in English
Open dataset discovery using context-enhanced similarity search
Result description in English
Today, open data catalogs enable users to search for datasets with full-text queries in metadata records combined with simple faceted filtering. Using this combination, a user is able to discover a significant number of the datasets relevant to the user’s search intent. However, there still remain relevant datasets that are hard to find because of the enormous sparsity of their metadata (e.g., only a few keywords). As an alternative, in this paper, we propose an approach to dataset discovery based on similarity search over metadata descriptions enhanced by various semantic contexts. In general, the semantic contexts enrich the dataset metadata in a way that enables the identification of additional datasets relevant to a query that could not be retrieved using just keyword or full-text search. In the experimental evaluation, we show that context-enhanced similarity retrieval methods increase the findability of relevant datasets, thus improving the retrieval recall that is critical in dataset discovery scenarios. As part of the evaluation, we created a catalog-like user interface for dataset discovery and recorded streams of user actions, which we used to create the ground truth. For the sake of reproducibility, we published the entire evaluation testbed.
Classification
Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
S - Specific research at universities
Others
Year of publication
2022
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Periodical name
Knowledge and Information Systems
ISSN
0219-1377
e-ISSN
0219-3116
Periodical volume
64
Issue number within the volume
12
Country of the periodical's publisher
DE - Federal Republic of Germany
Number of pages
27
Pages from-to
3265-3291
UT WoS code of the article
000849677000001
EID of the result in the Scopus database
2-s2.0-85137453544