Pivot-based approximate k-NN similarity joins for big high-dimensional data

Identifikátory výsledku

Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F20%3A10401674" target="_blank" >RIV/00216208:11320/20:10401674 - isvavai.cz</a>
Výsledek na webu
<a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=n90S2QtvBi" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=n90S2QtvBi</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1016/j.is.2019.06.006" target="_blank" >10.1016/j.is.2019.06.006</a>

Alternativní jazyky

Jazyk výsledku
angličtina
Název v původním jazyce
Pivot-based approximate k-NN similarity joins for big high-dimensional data
Popis výsledku v původním jazyce
Given an appropriate similarity model, the k-nearest neighbor similarity join represents a useful yet costly operator for data mining, data analysis and data exploration applications. The time to evaluate the operator depends on the size of datasets, data distribution and the dimensionality of data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision and execution time. (C) 2019 Elsevier Ltd. All rights reserved.
Název v anglickém jazyce
Pivot-based approximate k-NN similarity joins for big high-dimensional data
Popis výsledku anglicky
Given an appropriate similarity model, the k-nearest neighbor similarity join represents a useful yet costly operator for data mining, data analysis and data exploration applications. The time to evaluate the operator depends on the size of datasets, data distribution and the dimensionality of data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision and execution time. (C) 2019 Elsevier Ltd. All rights reserved.

Klasifikace

Druh
J<sub>imp</sub> - Článek v periodiku v databázi Web of Science
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Návaznosti výsledku

Projekt
<a href="/cs/project/GA17-22224S" target="_blank" >GA17-22224S: Analytika uživatelských preferencí v modelech multimediální explorace</a><br>
Návaznosti
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)

Ostatní

Rok uplatnění
2020
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

Název periodika
Information Systems
ISSN
0306-4379
e-ISSN
—
Svazek periodika
87
Číslo periodika v rámci svazku
January 2020
Stát vydavatele periodika
GB - Spojené království Velké Británie a Severního Irska
Počet stran výsledku
18
Strana od-do
101410
Kód UT WoS článku
000495488600006
EID výsledku v databázi Scopus
2-s2.0-85070216906

Podobné výsledky(10)

Comparing MapReduce-Based k-NN Similarity Joins On Hadoop For High-dimensional Data Big geo data surface approximation using radial basis functions: A comparative study Incremental Meshfree Approximation of Real Geographic Data

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Pivot-based approximate k-NN similarity joins for big high-dimensional data

Identifikátory výsledku

Alternativní jazyky

Klasifikace

Návaznosti výsledku

Ostatní

Údaje specifické pro druh výsledku

Podobné výsledky(10)

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Popis výsledku

Identifikátory výsledku

Identifikátory výsledku

Alternativní jazyky

Alternativní jazyky

Klasifikace

Klasifikace

Návaznosti výsledku

Návaznosti výsledku

Ostatní

Ostatní

Údaje specifické pro druh výsledku

Údaje specifické pro druh výsledku

Podobné výsledky(10)