OpusTools and Parallel Corpus Diagnostics
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F20%3A10427008" target="_blank" >RIV/00216208:11320/20:10427008 - isvavai.cz</a>
Result on the web
<a href="https://www.aclweb.org/anthology/2020.lrec-1.467" target="_blank" >https://www.aclweb.org/anthology/2020.lrec-1.467</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
OpusTools and Parallel Corpus Diagnostics
Original language description
This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for the identification of potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.
Czech name
—
Czech description
—
Classification
Type
O - Miscellaneous
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2020
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations