Deep Learning Based Vietnamese Diacritics Restoration
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F19%3A10427050" target="_blank" >RIV/00216208:11320/19:10427050 - isvavai.cz</a>
Result on the web
<a href="https://ieeexplore.ieee.org/document/8958999" target="_blank" >https://ieeexplore.ieee.org/document/8958999</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in original language
Deep Learning Based Vietnamese Diacritics Restoration
Result description in original language
Diacritics are essential in languages that use them, because the meaning of a sentence can change depending on its diacritics. Writing without diacritics makes sentences ambiguous; nevertheless, people omit them for several reasons, such as fast typing, convenience, or texting on devices that do not support diacritics. As a result, such texts are very difficult to process in downstream natural language processing (NLP) tasks like machine translation, sentiment analysis, or question answering. Diacritics restoration is therefore critical for further use or processing in NLP-related tasks. In this study, we propose a method that combines a convolutional neural network (CNN) and a bidirectional gated recurrent unit (Bi-GRU) to restore diacritics. In addition, we use residual blocks to address the vanishing gradient problem of recurrent neural networks. We applied the model to restoring diacritics in Vietnamese, the language with the highest ratio of diacritics in words. The approach achieves a character accuracy of 98.63% and a word accuracy of 94.77%.
Title in English
Deep Learning Based Vietnamese Diacritics Restoration
Result description in English
Diacritics are essential in languages that use them, because the meaning of a sentence can change depending on its diacritics. Writing without diacritics makes sentences ambiguous; nevertheless, people omit them for several reasons, such as fast typing, convenience, or texting on devices that do not support diacritics. As a result, such texts are very difficult to process in downstream natural language processing (NLP) tasks like machine translation, sentiment analysis, or question answering. Diacritics restoration is therefore critical for further use or processing in NLP-related tasks. In this study, we propose a method that combines a convolutional neural network (CNN) and a bidirectional gated recurrent unit (Bi-GRU) to restore diacritics. In addition, we use residual blocks to address the vanishing gradient problem of recurrent neural networks. We applied the model to restoring diacritics in Vietnamese, the language with the highest ratio of diacritics in words. The approach achieves a character accuracy of 98.63% and a word accuracy of 94.77%.
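The task and the two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and the metric definitions (position-wise character matches including spaces, whitespace-delimited word matches) are assumptions, since the abstract does not state how accuracy was computed. The sketch shows how diacritic-free input can be derived from Vietnamese text and how a restored string could be scored against the reference.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, giving the kind of input a restoration model sees."""
    # "đ"/"Đ" have no canonical decomposition, so they are mapped explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (tone marks, circumflex, breve, horn).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_accuracy(predicted: str, reference: str) -> float:
    """Fraction of positions where the predicted character equals the reference.

    Assumed definition: both strings are NFC-normalized and must have the same
    length, which holds when restoration only adds diacritics to base letters.
    """
    predicted = unicodedata.normalize("NFC", predicted)
    reference = unicodedata.normalize("NFC", reference)
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of whitespace-delimited words restored exactly (assumed definition)."""
    pred_words = unicodedata.normalize("NFC", predicted).split()
    ref_words = unicodedata.normalize("NFC", reference).split()
    if len(pred_words) != len(ref_words):
        raise ValueError("predicted and reference must have the same word count")
    if not ref_words:
        return 1.0
    matches = sum(p == r for p, r in zip(pred_words, ref_words))
    return matches / len(ref_words)


if __name__ == "__main__":
    reference = "Tiếng Việt"
    model_input = strip_diacritics(reference)
    # A hypothetical model output that restores only the first word correctly.
    prediction = "Tiếng Viet"
    print(model_input)
    print(char_accuracy(prediction, reference))
    print(word_accuracy(prediction, reference))
```

Word accuracy is the stricter of the two measures, because a single wrong character makes the whole word count as an error. That is consistent with the reported word accuracy being lower than the character accuracy.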
Classification
Type
O - Other results
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Other
Publication year
2019
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations