Downdating lexicon and language model for automatic transcription of Czech historical spoken documents

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F46747885%3A24220%2F13%3A%230002791" target="_blank" >RIV/46747885:24220/13:#0002791 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-642-40585-3_26" target="_blank" >http://dx.doi.org/10.1007/978-3-642-40585-3_26</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-642-40585-3_26" target="_blank" >10.1007/978-3-642-40585-3_26</a>

Alternative languages

Result language
English
Title in original language
Downdating lexicon and language model for automatic transcription of Czech historical spoken documents
Result description in original language
This paper deals with the task of adaptation of an existing Czech large-vocabulary speech recognition (LVCSR) system to the language used in previous historical epochs (before 1990). The goal is to fit its lexicon and language model (LM) so that the system could be employed for the automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of texts (in electronic form) from the 1945-1990 period. The only available and large enough source is digitized copies of Rudé Právo, the newspaper of the former Communist Party of Czechoslovakia, the actual ruling body in the state. The newspaper has been scanned and converted into text via OCR software. However, the number of OCR errors is very high, and so we have to apply several text pre-processing techniques to get a corpus suitable for the lexicon and language model "downdating" (i.e. adaptation to the past). The proposed techniques helped us a) to reduce the number of out-of-vocabulary strings
Title in English
Downdating lexicon and language model for automatic transcription of Czech historical spoken documents
Result description in English
This paper deals with the task of adaptation of an existing Czech large-vocabulary speech recognition (LVCSR) system to the language used in previous historical epochs (before 1990). The goal is to fit its lexicon and language model (LM) so that the system could be employed for the automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of texts (in electronic form) from the 1945-1990 period. The only available and large enough source is digitized copies of Rudé Právo, the newspaper of the former Communist Party of Czechoslovakia, the actual ruling body in the state. The newspaper has been scanned and converted into text via OCR software. However, the number of OCR errors is very high, and so we have to apply several text pre-processing techniques to get a corpus suitable for the lexicon and language model "downdating" (i.e. adaptation to the past). The proposed techniques helped us a) to reduce the number of out-of-vocabulary strings

Classification

Type
D - Article in proceedings
CEP field
JC - Computer hardware and software
OECD FORD field
—

Result linkages

Project
<a href="/cs/project/DF11P01OVV013" target="_blank" >DF11P01OVV013: Zpřístupnění archivu Českého rozhlasu pro sofistikované vyhledávání (Making the Czech Radio archive accessible for sophisticated search)</a>
Linkages
P - Research and development project financed from public funds (with a link to CEP)

Other

Year of implementation
2013
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Title of the article in the proceedings
16th International Conference, TSD 2013
ISBN
9783642405846
ISSN
0302-9743
e-ISSN
—
Number of pages
8
Pages from-to
201-208
Publisher name
Springer-Verlag Berlin Heidelberg
Place of publication
Germany, Berlin
Event location
Czech Republic, Pilsen
Event date
1 September 2013
Event type by nationality
WRD - Worldwide event
UT WoS article code
—

Similar results (10)

Using Various Types of Multimedia Resources to Train System for Automatic Transcription of Czech Historical Oral Archives
Parts of speech as markers of the convergence of written and spoken Czech (based on the material of journalistic texts, 1990-2019)
Parts of speech as markers of the convergence of written and spoken Czech
