Trigram-based Vietnamese text compression

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F16%3A86099110" target="_blank" >RIV/61989100:27240/16:86099110 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-319-31277-4_26" target="_blank" >http://dx.doi.org/10.1007/978-3-319-31277-4_26</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-319-31277-4_26" target="_blank" >10.1007/978-3-319-31277-4_26</a>

Alternative languages

Result language
English
Original language name
Trigram-based Vietnamese text compression
Original language description
This paper presents a new and efficient method for text compression using a tri-gram dictionary. Many methods have been proposed for text compression, such as run-length coding, Huffman coding, and Lempel-Ziv-Welch (LZW) coding. Most of them are based on the frequency of occurrence of letters in the text. In this paper, we propose a method that compresses text using a tri-gram dictionary. Our method first splits the text into tri-grams and then encodes each tri-gram in 4 bytes using the tri-gram dictionary. We use Vietnamese text to evaluate our method. We collected a text corpus from the Internet to build the tri-gram dictionary. The corpus is around 2.15 GB in size, and the dictionary contains more than 74,400,000 tri-grams. To evaluate our method, we collected a test set of 10 text files of different sizes. Experimental results show that our method achieves a compression ratio of around 82%. In comparison with WinZIP version 19.5 (http://www.winzip.com/win/en/index.htm), which combines LZ77 (Ziv and Lempel in IEEE Trans Inf Theory 24(5), 530-536, 1978 [20]) and Huffman coding, and WinRAR version 5.21 (http://www.rarlab.com/download.htm), which combines LZSS (Storer and Szymanski in J ACM 29(4), 928-951, 1982 [17]) and Prediction by Partial Matching [2], our method achieves a higher compression ratio for every file size in our test cases. (C) Springer International Publishing Switzerland 2016.
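The encoding step described above (split the text into tri-grams, then replace each tri-gram with a 4-byte dictionary code) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that tokens are whitespace-delimited, that tri-grams do not overlap, that the tail is padded with empty tokens, and that every tri-gram is present in the dictionary. The function names and the toy sentence are hypothetical. A 4-byte code can address 2^32 entries, which is comfortably more than the 74,400,000 tri-grams reported.

```python
import struct


def split_trigrams(text):
    """Split whitespace-delimited tokens into non-overlapping groups of three."""
    tokens = text.split()
    # Assumption of this sketch: pad the tail with empty tokens so that
    # every group contains exactly three items.
    tokens += [""] * (-len(tokens) % 3)
    return [tuple(tokens[i:i + 3]) for i in range(0, len(tokens), 3)]


def build_dictionary(corpus):
    """Assign each distinct tri-gram in the corpus a sequential integer ID."""
    ids = {}
    for tri in split_trigrams(corpus):
        ids.setdefault(tri, len(ids))
    return ids


def compress(text, ids):
    """Encode each tri-gram as a 4-byte big-endian ID.

    Raises KeyError for a tri-gram missing from the dictionary; handling
    unknown tri-grams is outside the scope of this sketch.
    """
    return b"".join(struct.pack(">I", ids[tri]) for tri in split_trigrams(text))


def decompress(data, ids):
    """Map each 4-byte ID back to its tri-gram and drop the padding tokens."""
    inverse = {code: tri for tri, code in ids.items()}
    tokens = []
    for (code,) in struct.iter_unpack(">I", data):
        tokens.extend(tok for tok in inverse[code] if tok)
    return " ".join(tokens)


# Hypothetical toy input; the paper builds its dictionary from a 2.15 GB corpus.
sample = "hôm nay trời đẹp quá tôi muốn đi chơi công viên"
ids = build_dictionary(sample)
packed = compress(sample, ids)
restored = decompress(packed, ids)
print(len(sample.encode("utf-8")), "bytes in,", len(packed), "bytes out")
```

Because the toy dictionary is built from the same sentence it compresses, every lookup succeeds. With a real corpus-derived dictionary, a fallback for unseen tri-grams would be needed.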
Czech name
—
Czech description
—

Classification

Type
D - Article in proceedings
CEP classification
IN - Informatics
OECD FORD branch
—

Result continuities

Project
—
Continuities
S - Specific research at universities

Others

Publication year
2016
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

Article name in the collection
Studies in Computational Intelligence. Volume 642
ISBN
978-3-319-31276-7
ISSN
1860-949X
e-ISSN
—
Number of pages
11
Pages from-to
297-307
Publisher name
Springer Verlag
Place of publication
London
Event location
Danang
Event date
Mar 14, 2016
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—

Similar results(10)

N-Gram-Based Text Compression
On the Method of Lossless Data Compression using Spans of varied Bit Widths
A syllable-based method for Vietnamese text compression
