Pivot-based approximate k-NN similarity joins for big high-dimensional data

The result's identifiers

Result code in IS VaVaI
RIV/00216208:11320/20:10401674 (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F20%3A10401674)
Result on the web
https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=n90S2QtvBi
DOI - Digital Object Identifier
10.1016/j.is.2019.06.006 (http://dx.doi.org/10.1016/j.is.2019.06.006)

Alternative languages

Result language
English
Original language name
Pivot-based approximate k-NN similarity joins for big high-dimensional data
Original language description
Given an appropriate similarity model, the k-nearest neighbor similarity join represents a useful yet costly operator for data mining, data analysis and data exploration applications. The time to evaluate the operator depends on the size of datasets, data distribution and the dimensionality of data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems, Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision and execution time. © 2019 Elsevier Ltd. All rights reserved.
Czech name
—
Czech description
—

Classification

Type
Jimp - Article in a specialist periodical that is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
GA17-22224S: User preference analytics in multimedia exploration models
Continuities
P - Research and development project financed from public funds (with a link to CEP)

Others

Publication year
2020
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

Name of the periodical
Information Systems
ISSN
0306-4379
e-ISSN
—
Volume of the periodical
87
Issue of the periodical within the volume
January 2020
Country of publishing house
GB - UNITED KINGDOM
Number of pages
18
Pages from-to
101410
UT code for WoS article
000495488600006
EID of the result in the Scopus database
2-s2.0-85070216906

Similar results (10)

Comparing MapReduce-Based k-NN Similarity Joins On Hadoop For High-dimensional Data
Big geo data surface approximation using radial basis functions: A comparative study
Incremental Meshfree Approximation of Real Geographic Data
