Genre Annotation of Web Corpora: Scheme and Issues
The result's identifiers
Result code in IS VaVaI
RIV/00216224:14330/21:00118741 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F21%3A00118741)
Result on the web
https://link.springer.com/book/10.1007/978-3-030-63128-4
DOI - Digital Object Identifier
10.1007/978-3-030-63128-4_55 (http://dx.doi.org/10.1007/978-3-030-63128-4_55)
Alternative languages
Result language
English
Original language name
Genre Annotation of Web Corpora: Scheme and Issues
Original language description
Unlike traditional corpora built from printed media in past decades, the sources of web corpora are not well categorised or described, which makes it difficult to control the content of the corpus. This paper presents an attempt to classify genres in a large English web corpus through supervised learning. A set of genres suitable for web corpora users is defined based on a survey of related work. A genre annotation scheme with active learning rounds is introduced. A collection of web pages representing various genres, created for this task, and a scheme for the subsequent human annotation of the data set are described. Measuring the inter-annotator agreement revealed that either the problem may not be well defined or our expectations concerning the precision and recall of the classifier cannot be met. In the end, the project was postponed at that point. Possible solutions to the issue are discussed at the end of the paper.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
60203 - Linguistics
Result continuities
Project
GA18-23891S: Hyperintensional Reasoning over Natural Language Texts
Continuities
P - Research and development project financed from public funds (with a link to CEP)
S - Specific research at universities
Others
Publication year
2021
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1
ISBN
9783030631277
ISSN
2194-5357
e-ISSN
2194-5365
Number of pages
17
Pages from-to
738-754
Publisher name
Springer Nature Switzerland AG
Place of publication
Vancouver, Canada
Event location
Vancouver, Canada
Event date
Nov 5, 2020
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—