Trigram-based Vietnamese text compression

Result identifiers

Result code in IS VaVaI
RIV/61989100:27240/16:86099110 (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F16%3A86099110)
Result on the web
http://dx.doi.org/10.1007/978-3-319-31277-4_26
DOI - Digital Object Identifier
10.1007/978-3-319-31277-4_26

Alternative languages

Result language
English
Title in original language
Trigram-based Vietnamese text compression
Result description in original language
This paper presents a new and efficient method for text compression using a trigram dictionary. Many text compression methods have been proposed, such as run-length coding, Huffman coding, and Lempel-Ziv-Welch (LZW) coding. Most of them are based on the frequency of occurrence of letters in the text. In this paper, we propose a method that compresses text using a trigram dictionary. Our method first splits the text into trigrams and then encodes each trigram against the trigram dictionary, using 4 bytes per trigram. We use Vietnamese text to evaluate our method. We collected a text corpus from the Internet to build the trigram dictionary. The corpus is around 2.15 GB in size, and the dictionary contains more than 74,400,000 trigrams. To evaluate our method, we collected a test set of 10 text files of different sizes. Experimental results show that our method achieves a compression ratio of around 82%. Compared with WinZIP version 19.5 (http://www.winzip.com/win/en/index.htm), which combines LZ77 (Ziv and Lempel in IEEE Trans Inf Theory 24(5), 530-536, 1978 [20]) and Huffman coding, and WinRAR version 5.21 (http://www.rarlab.com/download.htm), which combines LZSS (Storer and Szymanski in J ACM 29(4), 928-951, 1982 [17]) and Prediction by Partial Matching [2], our method achieves a higher compression ratio for every text size in our test cases. (C) Springer International Publishing Switzerland 2016.
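
The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a trigram is three consecutive whitespace-separated tokens, that the text is split into non-overlapping trigrams, and that every trigram in the input is already in the dictionary. The function names (`split_trigrams`, `build_dictionary`, `compress`, `decompress`, `space_saving`) are hypothetical, and the space-saving formula is only one common way to express a compression ratio; the abstract does not state which definition it uses. A 4-byte code can address 2^32 entries, which is enough for the reported dictionary of more than 74,400,000 trigrams.

```python
import struct


def split_trigrams(text):
    """Split text into non-overlapping groups of three whitespace-separated tokens (assumption)."""
    tokens = text.split()
    return [tuple(tokens[i:i + 3]) for i in range(0, len(tokens), 3)]


def build_dictionary(corpus):
    """Assign an integer index to each distinct full trigram found in the corpus (toy dictionary)."""
    index = {}
    for trigram in split_trigrams(corpus):
        if len(trigram) == 3 and trigram not in index:
            index[trigram] = len(index)
    return index


def compress(text, index):
    """Encode each trigram as a 4-byte big-endian unsigned integer.

    Raises KeyError for a trigram that is not in the dictionary; how such
    cases are handled is not described in the abstract.
    """
    out = bytearray()
    for trigram in split_trigrams(text):
        out += struct.pack(">I", index[trigram])
    return bytes(out)


def decompress(data, index):
    """Map each 4-byte code back to its trigram and join the tokens with single spaces."""
    reverse = {code: trigram for trigram, code in index.items()}
    words = []
    for (code,) in struct.iter_unpack(">I", data):
        words.extend(reverse[code])
    return " ".join(words)


def space_saving(original, compressed):
    """Fraction of bytes saved: 1 - compressed size / original size (assumed definition)."""
    return 1 - len(compressed) / len(original)


if __name__ == "__main__":
    corpus = "tôi đi học hôm nay trời đẹp"
    dictionary = build_dictionary(corpus)
    sample = "hôm nay trời tôi đi học"
    packed = compress(sample, dictionary)
    assert decompress(packed, dictionary) == sample
    print(len(sample.encode("utf-8")), "bytes ->", len(packed), "bytes")
    print("space saving:", space_saving(sample.encode("utf-8"), packed))
```

Fixed-width codes keep decoding trivial, since the decoder reads 4 bytes at a time and needs no delimiters. The saving comes from a trigram of Vietnamese text normally occupying more than 4 bytes in UTF-8.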

Classification

Type
D - Article in proceedings
CEP field
IN - Informatics
OECD FORD field
—

Result linkages

Project
—
Linkages
S - Specific research at universities

Other

Year of implementation
2016
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Article name in the collection
Studies in Computational Intelligence. Volume 642
ISBN
978-3-319-31276-7
ISSN
1860-949X
e-ISSN
—
Number of pages
11
Pages from-to
297-307
Publisher name
Springer Verlag
Place of publication
London
Event location
Danang
Event date
14 March 2016
Event type by nationality
WRD - Worldwide event
UT WoS article code
—

Similar results (10)

N-Gram-Based Text Compression
On the Method of Lossless Data Compression using Spans of varied Bit Widths
A syllable-based method for Vietnamese text compression
