Structural metadata annotation of speech corpora: Comparing broadcast news and broadcast conversations

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F08%3A00500384" target="_blank" >RIV/49777513:23520/08:00500384 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—

Alternative languages

Result language
angličtina
Original language name
Structural metadata annotation of speech corpora: Comparing broadcast news and broadcast conversations
Original language description
Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output to forms that are more useful to humans and to downstream automatic processes. It may be achieved by inserting boundaries of syntactic/semantic units to the flow of speech, labeling non-content words like filled pauses and discourse markers for optional removal, and identifying sections of disfluent speech. This paper compares two Czech MDE speech corpora, one in the domainof broadcast news and the other in the domain of broadcast conversations. A variety of statistics about fillers, edit disfluencies, and syntactic/semantic units are presented. In addition, it is reported that disfluent portions of speech show differencesin the distribution of parts of speech (POS) of their content in comparison with the general POS distribution.
Czech name
Anotace strukturálních metadat v řečových korpusech: Srovnání rozhlasových zpráv a rozhlasových diskuzí
Czech description
V úlohách extrakce strukturálních metadat (MDE) je cílem vyvinout techniky pro automatickou konverzi nestrukturovaného výstupu z automatického rozpoznávače řeči do formy více čitelné a vhodnější pro následné zpracování. Toho může být dosaženo vložením hranic syntaktických celků a označením výplňkových a opravených slov pro jejich případné vymazání. Tento článek srovnává dva české řečové MDE korpusy, jeden v doméně zpráv a druhý v doméně živě přenášených diskuzí. Je zde prezentováno množství statistik ovýplňových slovech a frázích, editačních neplynulostech a syntakticko-sémantických jednotkách. Mimo jiné uvádíme statistiky ukazující, že neplynulé části řeči mají významně jiné rozdělení slovních druhů než celý korpus. Dva popisované české korpusy nejsou pouze srovnány mezi sebou, ale také s dostupnými anglickými korpusy.

Classification

Type
D - Article in proceedings
CEP classification
JD - Use of computers, robotics and its application
OECD FORD branch
—

Result continuities

Project
Result was created during the realization of more than one project. More information in the Projects tab.
Continuities
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)

Others

Publication year
2008
Confidentiality
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Data specific for result type

Article name in the collection
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
ISBN
2-9517408-4-0
ISSN
—
e-ISSN
—
Number of pages
6
Pages from-to
—
Publisher name
ELRA
Place of publication
Paris
Event location
Marrakech
Event date
Jun 1, 2008
Type of event by nationality
WRD - Celosvětová akce
UT code for WoS article
—

Similar results(10)

Czech Broadcast Conversation MDE Transcripts Czech Broadcast News MDE Transcripts The Czech Broadcast Conversation Corpus

What are you looking for?

Quick search

Smart search

Structural metadata annotation of speech corpora: Comparing broadcast news and broadcast conversations

The result's identifiers

Alternative languages

Classification

Result continuities

Others

Data specific for result type

Similar results(10)

What are you looking for?

Quick search

Smart search

Result description

The result's identifiers

The result's identifiers

Alternative languages

Alternative languages

Classification

Classification

Result continuities

Result continuities

Others

Others

Data specific for result type

Data specific for result type

Similar results(10)