Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F23%3A00372042" target="_blank" >RIV/68407700:21730/23:00372042 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1109/CVPR52729.2023.01833" target="_blank" >https://doi.org/10.1109/CVPR52729.2023.01833</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/CVPR52729.2023.01833" target="_blank" >10.1109/CVPR52729.2023.01833</a>
Alternative languages
Result language
English
Title in original language
Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Description in original language
Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as “My dog Biscuit” appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and Deep-Fashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.
Title in English
Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Description in English
Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as “My dog Biscuit” appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and Deep-Fashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.
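The instance-embedding idea in the abstract (a personal token represented as a combination of shared, learned global category features) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the softmax-weighted combination, and all names (`category_features`, `personalize_token`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d-dimensional embeddings, k shared category features.
d, k = 8, 4

# Shared global category features, assumed learned once during
# meta-personalization and reused across all instances.
category_features = rng.normal(size=(k, d))

def personalize_token(attn_logits, category_features):
    """Build one instance embedding as a softmax-weighted combination
    of the shared category features (illustrative sketch only)."""
    w = np.exp(attn_logits - attn_logits.max())
    w = w / w.sum()                 # softmax over the k shared features
    return w @ category_features    # (d,) new word embedding

# At test time only the k per-instance logits would need to be learned,
# e.g. for a new token standing for "My dog Biscuit".
logits_biscuit = rng.normal(size=k)
emb = personalize_token(logits_biscuit, category_features)
print(emb.shape)  # (8,)
```

The new embedding could then be appended to the VLM's token vocabulary so that text queries mentioning the instance map to it; constraining it to the span of shared category features is what keeps it instance-specific rather than re-learning category-level content.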
Classification
Type
D - Paper in conference proceedings
CEP classification
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
—
Linkages
I - Institutional support for the long-term conceptual development of a research organization
Others
Year of application
2023
Data confidentiality code
S - Complete and true data about the project are not subject to protection under special legal regulations
Data specific to the result type
Proceedings title
Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
ISBN
979-8-3503-0130-4
ISSN
1063-6919
e-ISSN
2575-7075
Number of pages
10
Pages from-to
19123-19132
Publisher
IEEE Computer Society
Place of publication
USA
Event venue
Vancouver
Event date
18 June 2023
Event type by nationality
WRD - Worldwide event
Article UT WoS code
001062531303042