Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F21%3A10248617" target="_blank" >RIV/61989100:27240/21:10248617 - isvavai.cz</a>
Result on the web
<a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9583298" target="_blank" >https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9583298</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ACCESS.2021.3121749" target="_blank" >10.1109/ACCESS.2021.3121749</a>

Alternative languages

Result language
angličtina
Original language name
Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment
Original language description
The most common index data structures used by whole genome aligners (WGA) are based on suffix trees (ST), suffix arrays, and FM-indexes. These data structures show good performance results as WGA works with sequences of letters over small alphabets; for example, four letters a, c, t, g for DNA alignment. A novel whole genome aligner, which we are developing, will work with distances between the label sites on DNA samples, which are represented as a sequence of positive integers. The size of alphabet sigma is theoretically unlimited. This has prompted us to investigate if there is a better structure that would improve search performance on large alphabets compared to the commonly used suffix-based structures. This paper presents the implementation of a highly optimized hybrid index data structure based on a ternary search tree (TST) and hash tables, which perform much better when working with strings on large alphabets compared to the ST. Single core parallelism was achieved using advanced vector extensions (AVX) single instruction multiple data (SIMD) instruction set. When searching for short k-mers over an alphabet of 25, 695 letters, our index search performance was up to 29 times better than the search performance of the reference ST. When the alphabet was compressed to approximately 1; 300 letters, our index search performance was still up to 2.6 times better than the ST. The source code is available free on http://olgen.cz/Resources/Upload/Home/public/software/hds.zip under the MIT license.
Czech name
—
Czech description
—

Classification

Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10200 - Computer and information sciences

Result continuities

Project
<a href="/en/project/NU20-06-00269" target="_blank" >NU20-06-00269: Utility of cellular profiles and proteomics of synovial fluid and periprosthetic tissues for clinical decision making in knee osteoarthritis</a><br>
Continuities
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)

Others

Publication year
2021
Confidentiality
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Data specific for result type

Name of the periodical
IEEE Access
ISSN
2169-3536
e-ISSN
—
Volume of the periodical
9
Issue of the periodical within the volume
2021
Country of publishing house
US - UNITED STATES
Number of pages
8
Pages from-to
161890-161897
UT code for WoS article
000730465200001
EID of the result in the Scopus database
2-s2.0-85121784198

Similar results(10)

Towards faster matching algorithm using ternary tree in the area of genome mapping Testing the impact of Whole Genome Amplification on genome comparison using the polyploid flagellated Giardia duodenalis as a model On-line Searching in IUPAC nucleotide sequences

What are you looking for?

Quick search

Smart search

Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment

The result's identifiers

Alternative languages

Classification

Result continuities

Others

Data specific for result type

Similar results(10)

What are you looking for?

Quick search

Smart search

Result description

The result's identifiers

The result's identifiers

Alternative languages

Alternative languages

Classification

Classification

Result continuities

Result continuities

Others

Others

Data specific for result type

Data specific for result type

Similar results(10)