Regex Matching with Counting-Set Automata
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216305%3A26230%2F20%3APU138885" target="_blank" >RIV/00216305:26230/20:PU138885 - isvavai.cz</a>
Výsledek na webu
<a href="https://www.microsoft.com/en-us/research/uploads/prod/2020/09/MSR-TR-2020-31.pdf" target="_blank" >https://www.microsoft.com/en-us/research/uploads/prod/2020/09/MSR-TR-2020-31.pdf</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1145/3428286" target="_blank" >10.1145/3428286</a>
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Regex Matching with Counting-Set Automata
Popis výsledku v původním jazyce
We propose a solution to the problem of efficient matching regular expressions (regexes) with bounded repetition, such as (ab){1,100}, using deterministic automata. For this, we introduce novel counting-set automata (CsAs), automata with registers that can hold sets of bounded integers and can be manipulated by a limited portfolio of constant-time operations. We present an algorithm that compiles a large sub-class of regexes to deterministic CsAs. This includes (1) a novel Antimirov-style translation of regexes with counting to counting automata (CAs), nondeterministic automata with bounded counters, and (2) our main technical contribution, a determinization of CAs that outputs CsAs. The main advantage of this workflow is that the size of the produced CsAs does not depend on the repetition bounds used in the regex (while the size of the DFA is exponential to them). Our experimental results confirm that deterministic CsAs produced from practical regexes with repetition are indeed vastly smaller than the corresponding DFAs. More importantly, our prototype matcher based on CsA simulation handles practical regexes with repetition regardless of sizes of counter bounds. It easily copes with regexes with repetition where state-of-the-art matchers struggle.
Název v anglickém jazyce
Regex Matching with Counting-Set Automata
Popis výsledku anglicky
We propose a solution to the problem of efficient matching regular expressions (regexes) with bounded repetition, such as (ab){1,100}, using deterministic automata. For this, we introduce novel counting-set automata (CsAs), automata with registers that can hold sets of bounded integers and can be manipulated by a limited portfolio of constant-time operations. We present an algorithm that compiles a large sub-class of regexes to deterministic CsAs. This includes (1) a novel Antimirov-style translation of regexes with counting to counting automata (CAs), nondeterministic automata with bounded counters, and (2) our main technical contribution, a determinization of CAs that outputs CsAs. The main advantage of this workflow is that the size of the produced CsAs does not depend on the repetition bounds used in the regex (while the size of the DFA is exponential to them). Our experimental results confirm that deterministic CsAs produced from practical regexes with repetition are indeed vastly smaller than the corresponding DFAs. More importantly, our prototype matcher based on CsA simulation handles practical regexes with repetition regardless of sizes of counter bounds. It easily copes with regexes with repetition where state-of-the-art matchers struggle.
Klasifikace
Druh
J<sub>imp</sub> - Článek v periodiku v databázi Web of Science
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
<a href="/cs/project/LL1908" target="_blank" >LL1908: Efektivní konečné automaty pro automatické usuzování</a><br>
Návaznosti
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)<br>S - Specificky vyzkum na vysokych skolach
Ostatní
Rok uplatnění
2020
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název periodika
Proceedings of the ACM on Programming Languages
ISSN
2475-1421
e-ISSN
—
Svazek periodika
4
Číslo periodika v rámci svazku
11
Stát vydavatele periodika
US - Spojené státy americké
Počet stran výsledku
30
Strana od-do
1-30
Kód UT WoS článku
000685203900095
EID výsledku v databázi Scopus
2-s2.0-85097577301