Morphological and Language-Agnostic Word Segmentation for NMT

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F18%3A10390183" target="_blank" >RIV/00216208:11320/18:10390183 - isvavai.cz</a>
Result on the web
<a href="https://link.springer.com/book/10.1007/978-3-030-00794-2" target="_blank" >https://link.springer.com/book/10.1007/978-3-030-00794-2</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-030-00794-2_30" target="_blank" >10.1007/978-3-030-00794-2_30</a>

Alternative languages

Result language
English
Title in the original language
Morphological and Language-Agnostic Word Segmentation for NMT
Result description in the original language
The state of the art of handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) and two linguistically motivated methods: Morfessor and one novel method based on a derivational dictionary. Our experiments with German-to-Czech translation, where both languages are morphologically rich, show that so far the linguistically uninformed methods perform better. Furthermore, we identify a critical difference between BPE and STE and show a simple pre-processing step for BPE that considerably increases translation quality as evaluated by automatic measures.
Title in English
Morphological and Language-Agnostic Word Segmentation for NMT
Result description in English
The state of the art of handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) and two linguistically motivated methods: Morfessor and one novel method based on a derivational dictionary. Our experiments with German-to-Czech translation, where both languages are morphologically rich, show that so far the linguistically uninformed methods perform better. Furthermore, we identify a critical difference between BPE and STE and show a simple pre-processing step for BPE that considerably increases translation quality as evaluated by automatic measures.

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
The result was created during the realization of more than one project. More information in the Projects tab.
Continuities
P - Research and development project financed from public funds (with a link to CEP)

Others

Publication year
2018
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Article name in the proceedings
Proceedings of the 21st International Conference on Text, Speech and Dialogue—TSD 2018
ISBN
978-3-030-00794-2
ISSN
1611-3349
e-ISSN
not stated
Number of pages
8
Pages from-to
277-284
Publisher name
Springer-Verlag
Place of publication
Cham, Switzerland
Event location
Brno, Czechia
Event date
11 September 2018
Event type by nationality of participants
WRD - Worldwide event
UT WoS code of the article
—

Similar results (10)

On the Linguistic Representational Power of Neural Machine Translation Models
Tokenization with Factorized Subword Encoding
Subwords-Only Alternatives to fastText for Morphologically Rich Languages
