
Why Vision Language Models Ignore What They See [Munawar Hayat] - 758
In this episode, we’re joined by Munawar Hayat, researcher at Qualcomm AI Research, to discuss a series of papers presented at NeurIPS 2025 focusing on multimodal and generative AI. We dive into the persistent challenge of object hallucination in Vision-Language Models (VLMs), why models often discard visual information in favor of pre-trained language priors, and how his team used attention-guided alignment to enforce better visual grounding. We also explore a novel approach to generalized contrastive learning designed to solve complex, composed retrieval tasks—such as searching via combined text and image queries—without increasing inference costs. Finally, we cover the difficulties generative models face when rendering multiple human subjects, and the new "MultiHuman Testbench" his team created to measure and mitigate issues like identity leakage and attribute blending. Throughout the discussion, we examine how these innovations align with the need for efficient, on-device AI deployment.
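To make the composed-retrieval idea above concrete, here is a minimal, self-contained sketch of querying with a combined image + text embedding. Everything in it — the random-projection "encoders", the fusion weight, the toy gallery — is an illustrative stand-in, not the GCL architecture from the paper; it only shows why a composed query costs no more at inference time than a single-modality one.

```python
# Sketch of composed (image + text) retrieval in a shared embedding space.
# All encoders here are hypothetical random projections, NOT the GCL model.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64
W_IMG = rng.standard_normal((1024, DIM))  # stand-in "image encoder" weights
W_TXT = rng.standard_normal((300, DIM))   # stand-in "text encoder" weights

def l2norm(v):
    return v / np.linalg.norm(v)

def embed_image(feats):
    # feats: raw 1024-dim image features (placeholder input)
    return l2norm(feats @ W_IMG)

def embed_text(feats):
    # feats: raw 300-dim text features (placeholder input)
    return l2norm(feats @ W_TXT)

def compose_query(img_emb, txt_emb, alpha=0.5):
    # Late fusion: convex combination of the two unit vectors, re-normalized.
    # A trained model would learn this fusion, but retrieval still reduces to
    # a single vector lookup, so composed queries add no extra inference cost.
    return l2norm(alpha * img_emb + (1.0 - alpha) * txt_emb)

# Index a small gallery of candidates (here: random stand-in images).
gallery = np.stack([embed_image(rng.standard_normal(1024)) for _ in range(1000)])

# Composed query: "this reference image, but modified as the text describes".
query = compose_query(embed_image(rng.standard_normal(1024)),
                      embed_text(rng.standard_normal(300)))

# On unit vectors, cosine similarity is just a dot product.
scores = gallery @ query
print("top-5 gallery indices:", np.argsort(-scores)[:5])
```

In a trained system the fusion would be learned end-to-end with the contrastive objective, but the retrieval path is unchanged: one query vector against one index, which is what makes the approach attractive for on-device deployment.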
🗒️ For the full list of resources for this episode, visit the show notes page: https://twimlai.com/go/758.
🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1
🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/
📖 CHAPTERS
===============================
00:00 – Introduction
04:35 – Physics-aware generation
06:56 – Challenges in physics-based visual generation
10:26 – Attention Guided Alignment in Vision Language Models paper
15:30 – Injecting visual tokens with cross-attention
18:45 – Computational performance during training
20:06 – Cross-attention for reducing quadratic complexity with longer contexts
21:05 – Benchmarks
23:16 – Hallucination and failure modes in VLMs
29:45 – Generalized Contrastive Learning (GCL) paper
38:01 – Retrieval on mobile devices
39:56 – Composed retrieval with generalized contrastive learning
40:46 – Benchmarks
41:54 – MultiHuman Testbench paper
49:33 – Efficiency on the MultiHuman Testbench
51:33 – Qualcomm NeurIPS papers and demos
🔗 LINKS & RESOURCES
===============================
Attention Guided Alignment in Vision Language Models – https://openreview.net/forum?id=m5XcKijrTq
Generalized Contrastive Learning (GCL): Better Search Across Text and Images – https://arxiv.org/abs/2509.25638
MultiHuman Testbench: Raising the Bar for Multi Person Image Generation – https://arxiv.org/abs/2506.20879
Qualcomm at NeurIPS 2025: Pushing the boundaries of AI research – https://www.qualcomm.com/news/onq/2025/12/qualcomm-ai-research-neurips-2025
High-Efficiency Diffusion Models for On-Device Image Generation and Editing with Hung Bui – 753 – https://twimlai.com/podcast/twimlai/high-efficiency-diffusion-models-for-on-device-image-generation-and-editing/
📸 Camera: https://amzn.to/3TQ3zsg
🎙️Microphone: https://amzn.to/3t5zXeV
🚦Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5