Information Extraction from the Web by Matching Visual Presentation Patterns

Result identifiers

Result code in IS VaVaI
RIV/00216305:26230/17:PU126373 (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216305%3A26230%2F17%3APU126373)
Result on the web
https://link.springer.com/chapter/10.1007/978-3-319-68723-0_2
DOI - Digital Object Identifier
10.1007/978-3-319-68723-0_2 (http://dx.doi.org/10.1007/978-3-319-68723-0_2)

Alternative languages

Result language
English
Title in original language
Information Extraction from the Web by Matching Visual Presentation Patterns
Result description in original language
Documents on the World Wide Web contain large amounts of information presented in tables, lists, or other visually regular structures. The published information is, however, usually not annotated explicitly or implicitly, and its interpretation is left to a human reader. This makes information extraction from web documents a challenging problem. Most existing approaches work top-down, proceeding from larger page regions to individual data records, and depend on various heuristics. We present an opposite, bottom-up approach. We first roughly identify the smallest data fields in the document and then refine this approximation by matching the discovered visual presentation patterns with the expected semantic structure of the extracted information. This approach allows efficient extraction of structured data from heterogeneous documents without any additional annotations, as we demonstrate experimentally on various application domains.

Classification

Type
D - Article in proceedings
CEP field
-
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result linkages

Project
LQ1602: IT4Innovations excellence in science
Linkages
P - Research and development project financed from public funds (with a link to CEP)

Other

Year of application
2017
Data confidentiality code
S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific to the result type

Article name in the collection
Knowledge Graphs and Language Technology: ISWC 2016 International Workshops: KEKI and NLP&DBpedia
ISBN
978-3-319-68722-3
ISSN
-
e-ISSN
-
Number of pages
17
Pages from-to
10-26
Publisher name
Springer International Publishing
Place of publication
Kobe
Event location
Kobe
Event date
17 October 2016
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
000535971000002

Similar results (10)

Information Extraction from Web Sources based on Multi-aspect Content Analysis
Deep Neural Networks for Web Page Information Extraction
Extracting Visually Presented Element Relationships from Web Documents
