Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation
The result's identifiers
Result code in IS VaVaI
RIV/00216208:11320/23:10475930 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A10475930)
Result on the web
http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-long/cdrom/pdf/2023.ijcnlp-long.57.pdf
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation
Original language description
We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer's vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profession names (e.g., Spanish "doctora" for "female doctor") tend to be split into multiple subword tokens. Our results indicate that the imbalance of gender forms in the model's training corpus is a major factor contributing to gender bias and has a greater impact than subword splitting. We show that analyzing subword splits provides good estimates of gender-form imbalance in the training data and can be used even when the corpus is not publicly available. We also demonstrate that fine-tuning just the token embedding layer can decrease the gap in gender prediction accuracy between female and male forms without impairing the translation quality.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
GA23-06912S: Identification and Prevention of Unwanted Gender Bias in Neural Language Models
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2023
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
ISBN
979-8-89176-014-1
ISSN
—
e-ISSN
—
Number of pages
12
Pages from-to
885-896
Publisher name
Association for Computational Linguistics
Place of publication
Stroudsburg, PA, USA
Event location
Nusa Dua, Bali, Indonesia
Event date
Nov 1, 2023
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—