WISMIR3 A Multi-Modal Dataset to Challenge Text-Image Retrieval Approaches
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AFJPJJFZM" target="_blank" >RIV/00216208:11320/25:FJPJJFZM - isvavai.cz</a>
Výsledek na webu
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85204447076&partnerID=40&md5=927e9eff0bfff03223d4074fcf995dcd" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85204447076&partnerID=40&md5=927e9eff0bfff03223d4074fcf995dcd</a>
DOI - Digital Object Identifier
—
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
WISMIR3 A Multi-Modal Dataset to Challenge Text-Image Retrieval Approaches
Popis výsledku v původním jazyce
This paper presents WISMIR3, a multi-modal dataset comprising roughly 300K text-image pairs from Wikipedia. With a sophisticated automatic ETL pipeline, we scraped, filtered, and transformed the data so that WISMIR3 intrinsically differs from other popular text-image datasets like COCO and Flickr30k. We prove this difference by comparing various linguistic statistics between the three datasets computed using the pipeline. The primary purpose of WISMIR3 is to use it as a benchmark to challenge state-of-the-art text-image retrieval approaches, which already reach around 90% Recall@5 scores on the mentioned popular datasets. Therefore, we ran several text-image retrieval experiments on our dataset using current models, which show that the models, in fact, perform significantly worse compared to evaluation results on COCO and Flickr30k. In addition, for each text-image pair, we release features computed by Faster-R-CNN and CLIP models. With this, we want to ease and motivate the use of the dataset for other researchers. © 2024 Association for Computational Linguistics.
Název v anglickém jazyce
WISMIR3 A Multi-Modal Dataset to Challenge Text-Image Retrieval Approaches
Popis výsledku anglicky
This paper presents WISMIR3, a multi-modal dataset comprising roughly 300K text-image pairs from Wikipedia. With a sophisticated automatic ETL pipeline, we scraped, filtered, and transformed the data so that WISMIR3 intrinsically differs from other popular text-image datasets like COCO and Flickr30k. We prove this difference by comparing various linguistic statistics between the three datasets computed using the pipeline. The primary purpose of WISMIR3 is to use it as a benchmark to challenge state-of-the-art text-image retrieval approaches, which already reach around 90% Recall@5 scores on the mentioned popular datasets. Therefore, we ran several text-image retrieval experiments on our dataset using current models, which show that the models, in fact, perform significantly worse compared to evaluation results on COCO and Flickr30k. In addition, for each text-image pair, we release features computed by Faster-R-CNN and CLIP models. With this, we want to ease and motivate the use of the dataset for other researchers. © 2024 Association for Computational Linguistics.
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
—
Návaznosti
—
Ostatní
Rok uplatnění
2024
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
ALVR - Workshop Adv. Lang. Vis. Res., Proc. Workshop
ISBN
979-889176153-7
ISSN
—
e-ISSN
—
Počet stran výsledku
6
Strana od-do
1-6
Název nakladatele
Association for Computational Linguistics (ACL)
Místo vydání
—
Místo konání akce
Bangkok
Datum konání akce
1. 1. 2025
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
—