Trigram-based Vietnamese text compression
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F16%3A86099110" target="_blank" >RIV/61989100:27240/16:86099110 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-319-31277-4_26" target="_blank" >http://dx.doi.org/10.1007/978-3-319-31277-4_26</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-319-31277-4_26" target="_blank" >10.1007/978-3-319-31277-4_26</a>
Alternative languages
Result language
English
Title in original language
Trigram-based Vietnamese text compression
Result description in original language
This paper presents a new and efficient method for text compression using a tri-gram dictionary. Many methods have been proposed for text compression, such as run-length coding, Huffman coding, and Lempel-Ziv-Welch (LZW) coding. Most of them are based on the frequency of occurrence of letters in the text. In this paper, we propose a method to compress text using a tri-gram dictionary. Our method first splits the text into tri-grams and then encodes each tri-gram using the tri-gram dictionary, spending 4 bytes per tri-gram. We use Vietnamese text to evaluate our method. We collected a text corpus from the internet to build the tri-gram dictionary. The corpus is around 2.15 GB in size, and the dictionary contains more than 74,400,000 tri-grams. To evaluate our method, we collected a test set of 10 text files of different sizes. Experimental results show that our method achieves a compression ratio of around 82%. Compared with WinZIP version 19.5 (http://www.winzip.com/win/en/index.htm), which combines LZ77 (Ziv and Lempel in IEEE Trans Inf Theory 24(5), 530-536, 1978 [20]) and Huffman coding, and WinRAR version 5.21 (http://www.rarlab.com/download.htm), which combines LZSS (Storer and Szymanski in J ACM 29(4), 928-951, 1982 [17]) and Prediction by Partial Matching [2], our method achieves a higher compression ratio for every text size in our test cases. (C) Springer International Publishing Switzerland 2016.
Title in English
Trigram-based Vietnamese text compression
Result description in English
This paper presents a new and efficient method for text compression using a tri-gram dictionary. Many methods have been proposed for text compression, such as run-length coding, Huffman coding, and Lempel-Ziv-Welch (LZW) coding. Most of them are based on the frequency of occurrence of letters in the text. In this paper, we propose a method to compress text using a tri-gram dictionary. Our method first splits the text into tri-grams and then encodes each tri-gram using the tri-gram dictionary, spending 4 bytes per tri-gram. We use Vietnamese text to evaluate our method. We collected a text corpus from the internet to build the tri-gram dictionary. The corpus is around 2.15 GB in size, and the dictionary contains more than 74,400,000 tri-grams. To evaluate our method, we collected a test set of 10 text files of different sizes. Experimental results show that our method achieves a compression ratio of around 82%. Compared with WinZIP version 19.5 (http://www.winzip.com/win/en/index.htm), which combines LZ77 (Ziv and Lempel in IEEE Trans Inf Theory 24(5), 530-536, 1978 [20]) and Huffman coding, and WinRAR version 5.21 (http://www.rarlab.com/download.htm), which combines LZSS (Storer and Szymanski in J ACM 29(4), 928-951, 1982 [17]) and Prediction by Partial Matching [2], our method achieves a higher compression ratio for every text size in our test cases. (C) Springer International Publishing Switzerland 2016.
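The encoding step described in the abstract (split the text into tri-grams, then replace each tri-gram with a 4-byte dictionary code) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that a tri-gram is three consecutive whitespace-separated tokens, that the dictionary is built from a sliding window over the corpus, and that every tri-gram in the input is present in the dictionary; the abstract specifies none of these details. The function names `build_dictionary`, `encode`, and `decode` are hypothetical. A 4-byte unsigned code can index up to 2^32 = 4,294,967,296 entries, which is enough for the reported 74,400,000 tri-grams.

```python
import struct


def split_trigrams(text):
    """Split text into non-overlapping groups of three tokens plus a leftover tail.

    Assumption: tokens are whitespace-separated; the abstract does not say so.
    """
    tokens = text.split()
    full = len(tokens) - len(tokens) % 3
    trigrams = [tuple(tokens[i:i + 3]) for i in range(0, full, 3)]
    return trigrams, tokens[full:]


def build_dictionary(corpus):
    """Map every tri-gram seen in a sliding window over the corpus to an integer ID."""
    tokens = corpus.split()
    ids = {}
    for i in range(len(tokens) - 2):
        ids.setdefault(tuple(tokens[i:i + 3]), len(ids))
    return ids


def encode(text, ids):
    """Encode each tri-gram as a 4-byte big-endian unsigned integer.

    Raises KeyError for a tri-gram missing from the dictionary; handling of
    unknown tri-grams is outside the scope of this sketch.
    """
    trigrams, tail = split_trigrams(text)
    payload = b"".join(struct.pack(">I", ids[tg]) for tg in trigrams)
    return payload, tail


def decode(payload, tail, ids):
    """Invert encode(): look up each 4-byte code and rejoin the tokens."""
    by_id = {code: tg for tg, code in ids.items()}
    tokens = []
    for (code,) in struct.iter_unpack(">I", payload):
        tokens.extend(by_id[code])
    return " ".join(tokens + tail)


corpus = "tôi đang học tiếng Việt ở Đà Nẵng"
ids = build_dictionary(corpus)
text = "đang học tiếng Việt ở Đà Nẵng"
payload, tail = encode(text, ids)
restored = decode(payload, tail, ids)
```

In this toy run, the seven tokens of `text` form two tri-grams plus one leftover token, so `payload` is 8 bytes long. The sketch keeps the leftover tokens uncompressed; a real system would also need a strategy for tri-grams missing from the dictionary.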
Classification
Type
D - Article in proceedings
CEP field
IN - Informatics
OECD FORD field
—
Result continuities
Project
—
Continuities
S - Specific research at universities
Others
Year of publication
2016
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Title of the article in the proceedings
Studies in Computational Intelligence. Volume 642
ISBN
978-3-319-31276-7
ISSN
1860-949X
e-ISSN
—
Number of pages
11
Pages from-to
297-307
Publisher name
Springer Verlag
Place of publication
London
Event location
Danang
Event date
14 March 2016
Event type by nationality
WRD - Worldwide event
UT WoS code of the article
—