Downdating lexicon and language model for automatic transcription of Czech historical spoken documents
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F46747885%3A24220%2F13%3A%230002791" target="_blank" >RIV/46747885:24220/13:#0002791 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-642-40585-3_26" target="_blank" >http://dx.doi.org/10.1007/978-3-642-40585-3_26</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-642-40585-3_26" target="_blank" >10.1007/978-3-642-40585-3_26</a>
Alternative languages
Result language
English
Original language name
Downdating lexicon and language model for automatic transcription of Czech historical spoken documents
Original language description
This paper deals with the task of adapting an existing Czech large-vocabulary speech recognition (LVCSR) system to the language used in earlier historical periods (before 1990). The goal is to adapt its lexicon and language model (LM) so that the system can be employed for the automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of texts (in electronic form) from the 1945-1990 period. The only available source that is large enough consists of digitized copies of Rudé Právo, the newspaper of the former Communist Party of Czechoslovakia, the actual ruling body in the state. The newspaper has been scanned and converted into text via OCR software. However, the number of OCR errors is very high, so we have to apply several text pre-processing techniques to get a corpus suitable for lexicon and language model "downdating" (i.e. adaptation to the past). The proposed techniques helped us a) to reduce the number of out-of-vocabulary strings
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
JC - Computer hardware and software
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/DF11P01OVV013" target="_blank" >DF11P01OVV013: Disclosure of the Czech Radio archive for sophisticated search</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2013
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
16th International Conference, TSD 2013
ISBN
9783642405846
ISSN
0302-9743
e-ISSN
—
Number of pages
8
Pages from-to
201-208
Publisher name
Springer-Verlag Berlin Heidelberg
Place of publication
Germany, Berlin
Event location
Czech Republic, Pilsen
Event date
Sep 1, 2013
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—