Post-Pretraining in Vision and Language Foundation Models

Yuki M. Asano, University of Technology Nuremberg

upcoming: May 13th 2025 @ 5pm
Foundation Models are reshaping the landscape of artificial intelligence, offering unprecedented capabilities across vision, language, and multi-modal tasks. From understanding spatial structure in images to aligning language models with the visual world, these innovations are opening up new frontiers in AI research.

We are excited to welcome Yuki Asano, full Professor at the University of Technology Nuremberg and head of the Fundamental AI (FunAI) Lab, to our joint heidelberg.ai / NCT Data Science Seminar series. In this in-person event, Yuki Asano will present recent advances in building on top of pretrained Foundation Models to boost performance in dense prediction, efficient fine-tuning, and cross-modal understanding. The talk will cover novel methods such as NeCo for improving spatial perception in vision models, self-supervised time-tuning for video, and ultra-lightweight CLIP training. He will also explore surprising links between large language models and their visual grounding.

We look forward to your participation in this exciting seminar, where you will gain valuable insights into how cutting-edge methods are pushing the boundaries of what Foundation Models can achieve.

Abstract

This talk explores how pretrained Foundation Models can be further enhanced for vision, language, and multi-modal tasks. It begins with a key limitation of models like DINOv2: their lack of spatial understanding in images. To address this, the NeCo method [1] is introduced, a lightweight post-pretraining technique based on patch-nearest neighbors that significantly improves dense prediction performance while requiring only 16 GPU hours. The talk then highlights how video data can be used to further enhance dense understanding in pretrained image models such as DINO [2]. In the domain of language models, recent work on parameter-efficient finetuning (PEFT) [3] and instruction tuning [4] is presented, offering practical strategies for adapting large models effectively. The final part of the talk introduces a novel approach to training CLIP models in only 10 GPU hours by leveraging pretrained unimodal encoders. Intriguingly, the results reveal a strong correlation between the performance of language models and their visual alignment capabilities [5].
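
For readers curious about the last point, the sketch below illustrates the general recipe of aligning frozen, pretrained unimodal encoders with a lightweight CLIP-style contrastive objective. It is an assumption-based illustration rather than the authors' code from [5]: the frozen vision and language encoders are stood in for by placeholder feature tensors, and only small projection heads are trained.

    # Minimal sketch (not the authors' implementation): train small projection
    # heads on top of frozen unimodal encoders with a symmetric contrastive loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ProjectionHead(nn.Module):
        """Maps frozen unimodal features into a shared, L2-normalized embedding space."""
        def __init__(self, in_dim: int, out_dim: int = 512):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim)

        def forward(self, x):
            return F.normalize(self.proj(x), dim=-1)

    def clip_style_loss(img_emb, txt_emb, temperature: float = 0.07):
        """Symmetric InfoNCE over the in-batch image-text similarity matrix."""
        logits = img_emb @ txt_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy training step on placeholder features; in practice these would be
    # precomputed outputs of a frozen vision backbone and a frozen language model.
    image_head, text_head = ProjectionHead(768), ProjectionHead(4096)
    optimizer = torch.optim.AdamW(
        list(image_head.parameters()) + list(text_head.parameters()), lr=1e-4)

    image_feats = torch.randn(32, 768)    # placeholder for frozen image features
    text_feats = torch.randn(32, 4096)    # placeholder for frozen text features

    loss = clip_style_loss(image_head(image_feats), text_head(text_feats))
    loss.backward()
    optimizer.step()
    print(f"contrastive loss: {loss.item():.3f}")

Because the heavy encoders stay frozen and only the projection heads receive gradients, the compute budget stays tiny, which is the intuition behind the "10 GPU hours" figure mentioned above.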

Biography

Yuki Asano is the head of the Fundamental AI (FunAI) Lab and a full Professor at the University of Technology Nuremberg. Prior to this, he led the QUVA Lab at the University of Amsterdam, where he worked in close collaboration with Qualcomm AI Research. He completed his PhD at the Visual Geometry Group (VGG) at the University of Oxford, under the supervision of Andrea Vedaldi and Christian Rupprecht.


Event Info

Please help us plan ahead by registering for the event at our Meetup event site.

What? Post-Pretraining in Vision and Language Foundation Models
Who? Yuki M. Asano, University of Technology Nuremberg
When? May 13th 2025 @ 5pm
Where? DKFZ Communication Center (seminar rooms K1+K2), Im Neuenheimer Feld 280
Registration? Meetup event site

References

[1] Pariza, V., Salehi, M., Burghouts, G., Locatello, F., & Asano, Y. M. (2024). Near, far: Patch-ordering enhances vision foundation models’ scene understanding. In ICLR 2024.

[2] Salehi, M., Gavves, E., Snoek, C. G. M., & Asano, Y. M. (2023). Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations. In ICCV 2023. (+Ongoing work)

[3] Kopiczko, D., Blankevoort, T., & Asano, Y. M. (2024). VeRA: Vector-based Random Matrix Adaptation. In ICLR 2024.

[4] Kopiczko, D. J., Blankevoort, T., & Asano, Y. M. (2024). Bitune: Leveraging Bidirectional Attention to Improve Decoder-Only LLMs. arXiv preprint.

[5] Ruthardt, J., Burghouts, G. J., Belongie, S., & Asano, Y. M. (2024). Better Language Models Exhibit Higher Visual Alignment. arXiv preprint.