About Vision and Language models: What grounded linguistic phenomena do they understand? How much do they use the image and text modality?

Letitia Parcalabescu, Department of Computational Linguistics, Heidelberg University

November 28th 2023 @ 4pm
Multimodal models are making headlines, with models like ChatGPT now able to interpret images.

We are excited to have Letitia Parcalabescu, a PhD student at Heidelberg University who has worked on projects with Aleph Alpha and also runs a machine learning YouTube channel, speak at the DKFZ. In her talk, she will present methodologies for evaluating vision and language models on fine-grained linguistic tasks, and for explaining their outputs to make them safe for human interaction.

We hope to see you there, learning with us about future multimodal models.

Abstract

In this talk, we will introduce Vision and Language (VL) models, which can reliably judge whether an image and a text are related and answer questions about images. While performance on these tasks is important, task-centered evaluation does not tell us why models are so good at them, e.g., which fine-grained linguistic capabilities VL models use when solving them. Therefore, we present our work on the VALSE💃 benchmark, which tests six specific linguistic phenomena grounded in images. Our zero-shot experiments with five widely used pretrained VL models suggest that current VL models have considerable difficulty with most phenomena. In the second part, we ask how much a VL model uses the image and text modality in each sample or dataset. To measure the contribution of each modality in a VL model, we developed MM-SHAP, which we apply in two ways: (1) to compare VL models in terms of their average degree of multimodality, and (2) to measure, for individual models, the contribution of individual modalities on different tasks and datasets. Experiments with six VL models on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided.
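To make the MM-SHAP idea concrete, here is a minimal sketch in Python, assuming per-token Shapley values for a single image-text pair have already been computed (e.g., with a SHAP explainer); the function name and the numbers in the example are illustrative and not taken from the official MM-SHAP code:

    def mm_shap(text_shapley_values, image_shapley_values):
        """Return (T-SHAP, V-SHAP) in percent for one image-text sample."""
        # Each modality's contribution is the sum of the absolute Shapley
        # values of its tokens (text tokens vs. image patches).
        phi_text = sum(abs(v) for v in text_shapley_values)
        phi_image = sum(abs(v) for v in image_shapley_values)
        t_shap = 100.0 * phi_text / (phi_text + phi_image)
        return t_shap, 100.0 - t_shap

    # Example: text tokens contribute most, so the model leans on text here.
    t, v = mm_shap([0.4, -0.3, 0.2], [0.1, -0.1])
    print(f"T-SHAP: {t:.1f}%, V-SHAP: {v:.1f}%")  # T-SHAP: 81.8%, V-SHAP: 18.2%

Averaging such per-sample scores over a dataset yields the average degree of multimodality mentioned above; a T-SHAP close to 100% or 0% would indicate unimodal collapse towards the text or the image modality, respectively.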

Biography

Letitia studied Physics and Computer Science and is now a PhD candidate at Heidelberg University in the Heidelberg Natural Language Processing Group. Her research focuses on vision and language integration in multimodal deep learning. Her side project is the "AI Coffee Break with Letitia" YouTube channel, where the animated Ms. Coffee Bean explains and visualizes concepts from the latest research in Artificial Intelligence.


Event Info

Please help us plan ahead by registering for the event on our Meetup event page.
After the event, there will be a social get-together with food and drinks courtesy of the Division of Medical Image Computing and Interactive Machine Learning Group at the DKFZ.

What? About Vision and Language models: What grounded linguistic phenomena do they understand? How much do they use the image and text modality?
Who? Letitia Parcalabescu, Department of Computational Linguistics, Heidelberg University
When? November 28th 2023 @ 4pm
Where? DKFZ Communication Center (seminar rooms K1+K2), Im Neuenheimer Feld 280
Registration? Meetup event page