Trigram-based Vietnamese text compression
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F16%3A86099110" target="_blank" >RIV/61989100:27240/16:86099110 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-319-31277-4_26" target="_blank" >http://dx.doi.org/10.1007/978-3-319-31277-4_26</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-319-31277-4_26" target="_blank" >10.1007/978-3-319-31277-4_26</a>
Alternative languages
Result language
English
Original language name
Trigram-based Vietnamese text compression
Original language description
This paper presents a new and efficient method for text compression using a tri-gram dictionary. Many text compression methods have been proposed, such as run-length coding, Huffman coding, and Lempel-Ziv-Welch (LZW) coding. Most of them are based on the frequency of occurrence of letters in the text. In this paper, we propose a method that compresses text using a tri-gram dictionary. Our method first splits the text into tri-grams and then encodes each tri-gram in 4 bytes based on the tri-gram dictionary. We use Vietnamese text to evaluate our method. We collected a text corpus from the Internet to build the tri-gram dictionary. The corpus is around 2.15 GB in size, and the dictionary contains more than 74,400,000 tri-grams. To evaluate our method, we collected a test set of 10 text files of different sizes. Experimental results show that our method achieves a compression ratio of around 82%. Compared with WinZIP version 19.5 (http://www.winzip.com/win/en/index.htm) (which combines LZ77 (Ziv and Lempel in IEEE Trans Inf Theory 24(5), 530-536, 1978 [20]) and Huffman coding) and WinRAR version 5.21 (http://www.rarlab.com/download.htm) (which combines LZSS (Storer and Szymanski in J ACM 29(4), 928-951, 1982 [17]) and Prediction by Partial Matching [2]), our method achieves a higher compression ratio for every text size in our test cases. © Springer International Publishing Switzerland 2016.
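The encoding step described in the abstract (split the text into tri-grams, then replace each tri-gram with a 4-byte dictionary code) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a tri-gram is three consecutive whitespace-separated tokens, that the dictionary is built from a sliding window over the corpus, and that every tri-gram in the input already exists in the dictionary. The abstract does not say how unseen tri-grams or leftover tokens are handled, so this sketch simply rejects them. All function names are hypothetical.

```python
import struct


def build_dictionary(corpus):
    """Map each distinct tri-gram in the corpus to an integer code.

    Assumption: tri-grams are taken with a sliding window over
    whitespace-separated tokens.
    """
    tokens = corpus.split()
    index = {}
    for i in range(len(tokens) - 2):
        tri = tuple(tokens[i:i + 3])
        if tri not in index:
            index[tri] = len(index)
    return index


def encode(text, index):
    """Encode text as a sequence of 4-byte big-endian tri-gram codes."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        # Assumption: leftover tokens are out of scope for this sketch.
        raise ValueError("token count must be a multiple of 3")
    out = bytearray()
    for i in range(0, len(tokens), 3):
        tri = tuple(tokens[i:i + 3])
        # Raises KeyError for a tri-gram that is not in the dictionary.
        out += struct.pack(">I", index[tri])
    return bytes(out)


def decode(data, index):
    """Invert encode(): look up each 4-byte code and rejoin the tokens."""
    reverse = {code: tri for tri, code in index.items()}
    tokens = []
    for (code,) in struct.iter_unpack(">I", data):
        tokens.extend(reverse[code])
    return " ".join(tokens)
```

A 4-byte unsigned code can address 2^32 entries, which is enough for a dictionary of more than 74,400,000 tri-grams. In this sketch, a six-token input is stored in 8 bytes regardless of how many bytes the original tokens occupy in UTF-8.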
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
IN - Informatics
OECD FORD branch
—
Result continuities
Project
—
Continuities
S - Specific university research
Others
Publication year
2016
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Studies in Computational Intelligence. Volume 642
ISBN
978-3-319-31276-7
ISSN
1860-949X
e-ISSN
—
Number of pages
11
Pages from-to
297-307
Publisher name
Springer Verlag
Place of publication
London
Event location
Danang
Event date
Mar 14, 2016
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—