Lexically Grounded Subword Segmentation
The result's identifiers
Result code in IS VaVaI
RIV/00216208:11320/24:10492880 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F24%3A10492880)
Result on the web
https://aclanthology.org/2024.emnlp-main.421/
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Lexically Grounded Subword Segmentation
Original language description
We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space. Based on that, we design a novel subword segmentation algorithm that uses the embeddings, ensuring that the procedure considers lexical meaning. Third, we introduce an efficient segmentation algorithm based on a subword bigram model that can be initialized with the lexically aware segmentation method to avoid using Morfessor and large embedding tables at inference time. We evaluate the proposed approaches using two intrinsic metrics and measure their performance on two downstream tasks: part-of-speech tagging and machine translation. Our experiments show significant improvements in the morphological plausibility of the segmentation when evaluated using segmentation precision on morpheme boundaries and improved Rényi efficiency in
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
I - Institutional support for the long-term conceptual development of a research organisation
Others
Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)
ISBN
979-8-89176-164-3
ISSN
—
e-ISSN
—
Number of pages
18
Pages from-to
7403-7420
Publisher name
Association for Computational Linguistics
Place of publication
Kerrville, TX, USA
Event location
Miami, FL, USA
Event date
Nov 12, 2024
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—