Extracting Visually Presented Element Relationships from Web Documents
Result identifiers
Result code in IS VaVaI
RIV/00216305:26230/13:PU108091 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216305%3A26230%2F13%3APU108091)
Result on the web
http://www.fit.vutbr.cz/research/pubs/all.php?id=10468
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in original language
Extracting Visually Presented Element Relationships from Web Documents
Result description in original language
Many documents in the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. They are expressed by visual presentation of the document content that is expected to be interpreted by a human reader. In this paper, we propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns in the documents. The model describes the visually expressed relationships between individual parts of the contents independently of the document format and the particular way of presentation. Therefore, it can be used as an appropriate document model in many information retrieval or extraction applications. We formally define the model, we introduce a method of extracting the relationships between the content parts based on the visual presentation analysis and we discuss the expected applications. We also present a new dataset consisting of programmes of conferences and other scientific events and we discuss its suitability for the task in hand. Finally, we use the dataset to evaluate results of the implemented system.
Title in English
Extracting Visually Presented Element Relationships from Web Documents
Result description in English
Many documents in the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. They are expressed by visual presentation of the document content that is expected to be interpreted by a human reader. In this paper, we propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns in the documents. The model describes the visually expressed relationships between individual parts of the contents independently of the document format and the particular way of presentation. Therefore, it can be used as an appropriate document model in many information retrieval or extraction applications. We formally define the model, we introduce a method of extracting the relationships between the content parts based on the visual presentation analysis and we discuss the expected applications. We also present a new dataset consisting of programmes of conferences and other scientific events and we discuss its suitability for the task in hand. Finally, we use the dataset to evaluate results of the implemented system.
Classification
Type
J<sub>ost</sub> - Other articles in peer-reviewed periodicals
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
The result was created during the realization of multiple projects. More information is available in the Projects tab.
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Year of application
2013
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Name of the periodical
International Journal of Cognitive Informatics and Natural Intelligence
ISSN
1557-3958
e-ISSN
1557-3966
Volume of the periodical
2013
Issue of the periodical within the volume
2
Country of the periodical's publisher
US - United States of America
Number of pages of the result
17
Pages from-to
13-29
UT WoS code of the article
—
EID of the result in the Scopus database
—