Morphological and Language-Agnostic Word Segmentation for NMT
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F18%3A10390183" target="_blank" >RIV/00216208:11320/18:10390183 - isvavai.cz</a>
Result on the web
<a href="https://link.springer.com/book/10.1007/978-3-030-00794-2" target="_blank" >https://link.springer.com/book/10.1007/978-3-030-00794-2</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-030-00794-2_30" target="_blank" >10.1007/978-3-030-00794-2_30</a>
Alternative languages
Result language
English
Original language name
Morphological and Language-Agnostic Word Segmentation for NMT
Original language description
The state of the art in handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) and two linguistically motivated methods: Morfessor and one novel method based on a derivational dictionary. Our experiments with German-to-Czech translation, where both languages are morphologically rich, document that so far the non-motivated methods perform better. Furthermore, we identify a critical difference between BPE and STE and show a simple pre-processing step for BPE that considerably increases translation quality as evaluated by automatic measures.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
The result was created during the realization of more than one project. More information is available in the Projects tab.
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2018
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the 21st International Conference on Text, Speech and Dialogue—TSD 2018
ISBN
978-3-030-00794-2
ISSN
1611-3349
e-ISSN
not stated
Number of pages
8
Pages from-to
277-284
Publisher name
Springer-Verlag
Place of publication
Cham, Switzerland
Event location
Brno, Czechia
Event date
Sep 11, 2018
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—