On-line Searching in IUPAC nucleotide sequences
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21240%2F19%3A00338889" target="_blank" >RIV/68407700:21240/19:00338889 - isvavai.cz</a>
Result on the web
<a href="http://www.biostec.org/?y=2019" target="_blank" >http://www.biostec.org/?y=2019</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.5220/0007382900660077" target="_blank" >10.5220/0007382900660077</a>
Alternative languages
Result language
angličtina
Original language name
On-line Searching in IUPAC nucleotide sequences
Original language description
We propose a novel pattern matching algorithm for consensus nucleotide sequences over IUPAC alphabet, called BADPM (Byte-Aligned Degenerate Pattern Matching). The consensus nucleotide sequences represent a consensus obtained by sequencing a population of the same species and they are considered as so-called degenerate strings. BADPM works at the level of single bytes and it achieves sublinear search time on average. The algorithm is based on tabulating all possible factors of the searched pattern. It needs O (m + mα2 log m)-space data structure and O(mα2 ) time for preprocessing where m is a length of the pattern and α represents a maximum number of variants implied from a 4-gram over IUPAC alphabet. The worst-case locate time is bounded by O (nm2α4 ) for BADPM where n is the length of the input text. However, the experiments performed on real genomic data proved the sublinear search time. BADPM can easily cooperate with the block q-gram inverted index and so achieve still better locate time. We implemented two other pattern matching algorithms for IUPAC nucleotide sequences as a baseline: Boyer-Moore-Horspool (BMH) and Parallel Naive Search (PNS). Especially PNS proves its efficiency insensitive to the length of the searched pattern m. BADPM proved its strong superiority for searching middle and long patterns.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
<a href="/en/project/EF16_019%2F0000765" target="_blank" >EF16_019/0000765: Research Center for Informatics</a><br>
Continuities
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)
Others
Publication year
2019
Confidentiality
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Data specific for result type
Article name in the collection
Proceedings of 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019)
ISBN
978-989-758-353-7
ISSN
2184-4305
e-ISSN
—
Number of pages
12
Pages from-to
66-77
Publisher name
SCITEPRESS – Science and Technology Publications, Lda
Place of publication
Lisboa
Event location
Praha
Event date
Feb 22, 2019
Type of event by nationality
WRD - Celosvětová akce
UT code for WoS article
—