Victor: the Web-Page Cleaning Tool
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F08%3A10077973" target="_blank" >RIV/00216208:11320/08:10077973 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
angličtina
Original language name
Victor: the Web-Page Cleaning Tool
Original language description
In this paper we present a complete solution for automatic cleaning of arbitrary HTML pages with a goal of using web data as a corpus in the area of natural language processing and computational linguistics. We employ a sequence-labeling approach based on Conditional Random Fields (CRF). Every block of text in analyzed web page is assigned a set of features extracted from the textual content and HTML structure of the page. The blocks are automatically labeled either as content segments containing main web page content, which should be preserved, or as noisy segments not suitable for further linguistic processing, which should be eliminated. Our solution is based on the tool introduced at the CLEANEVAL 2007 shared task workshop. In this paper, we present new CRF features, a handy annotation tool, and new evaluation metrics. Evaluation itself is performed on a random sample of web pages automatically downloaded from the Czech web domain.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
AI - Linguistics
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/GD201%2F05%2FH014" target="_blank" >GD201/05/H014: Collegium Informaticum</a><br>
Continuities
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)<br>Z - Vyzkumny zamer (s odkazem do CEZ)
Others
Publication year
2008
Confidentiality
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Data specific for result type
Article name in the collection
Proceedings of the 4th Web as Corpus Workshop
ISBN
2-9517408-4-0
ISSN
—
e-ISSN
—
Number of pages
6
Pages from-to
—
Publisher name
ACL SIGWAC
Place of publication
Marrakech, Morocco
Event location
Marrakech, Morocco
Event date
Jun 1, 2008
Type of event by nationality
WRD - Celosvětová akce
UT code for WoS article
—