Modular framework for similarity-based dataset discovery using external knowledge
The result's identifiers
Result code in IS VaVaI
RIV/68407700:21240/22:00356040 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21240%2F22%3A00356040)
Result on the web
https://doi.org/10.1108/DTA-09-2021-0261
DOI - Digital Object Identifier
10.1108/DTA-09-2021-0261 (http://dx.doi.org/10.1108/DTA-09-2021-0261)
Alternative languages
Result language
English
Original language name
Modular framework for similarity-based dataset discovery using external knowledge
Original language description
Purpose: Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models, and so forth.
Design/methodology/approach: In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
Findings: The study proposes several proof-of-concept pipelines, including experimental evaluation, which showcase the usage of the framework.
Originality/value: To the best of the authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
S - Specific university research
Others
Publication year
2022
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Data Technologies and Applications
ISSN
2514-9288
e-ISSN
2514-9318
Volume of the periodical
56
Issue of the periodical within the volume
4
Country of publishing house
GB - UNITED KINGDOM
Number of pages
30
Pages from-to
506-535
UT code for WoS article
000759634600001
EID of the result in the Scopus database
2-s2.0-85125073753