In this talk, we will introduce Vision and Language (VL) models, which can judge quite well whether an image and a text are related and can answer questions about images. While performance on these tasks is important, task-centered evaluation does not tell us why the models are so good at them, for example which fine-grained linguistic capabilities VL models use when solving them. Therefore, we present our work on the VALSE💃 benchmark, which tests six specific linguistic phenomena grounded in images. Our zero-shot experiments with five widely used pretrained VL models suggest that current VL models have considerable difficulty addressing most phenomena. In the second part, we ask how much a VL model uses the image and text modality in each sample or dataset. To measure the contribution of each modality in a VL model, we developed MM-SHAP, which we applied in two ways: (1) to compare VL models for their average degree of multimodality, and (2) to measure, for individual models, the contribution of individual modalities for different tasks and datasets. Experiments with six VL models on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided.
Having studied Physics and Computer Science, Letitia is now a PhD candidate at Heidelberg University in the Heidelberg Natural Language Processing Group. Her research focuses on vision and language integration in multimodal deep learning. Her side project is the "AI Coffee Break with Letitia" YouTube channel, where the animated Ms. Coffee Bean explains and visualizes concepts from the latest research in Artificial Intelligence.
Please help us plan ahead by registering for the event at our
After the event, there will be a social get-together with food and drinks, courtesy of the Division of Medical Image Computing and the Interactive Machine Learning Group at the DKFZ.
|What?||About Vision and Language models: What grounded linguistic phenomena do they understand? How much do they use the image and text modality?|
|Who?||Letitia Parcalabescu, Department of Computational Linguistics, Heidelberg University|
|When?||November 28th 2023 @ 4pm|
|Where?||DKFZ Communication Center (seminar rooms K1+K2), Im Neuenheimer Feld 280|