
Achieving Long Context in Transformers with Jacob Buckman - 750
October 8, 2025
Today, we’re joined by Jacob Buckman, co-founder and CEO of Manifest AI, to discuss achieving long context in transformers. We cover the bottlenecks of scaling context length and recent techniques for overcoming them, including windowed attention, grouped query attention, and latent space attention. We explore the idea of weight-state balance and the weight-state FLOP ratio as a way of reasoning about the compute-optimality of model architectures, and we dig into the Power Retention architecture, which blends the parallelization of attention with the linear scaling of recurrence and promises speedups of over 10x during training and over 100x during inference. We also review Manifest AI’s recent open-source projects: Vidrial, a custom CUDA framework for building highly optimized GPU kernels in Python, and PowerCoder, a 3B-parameter coding model fine-tuned from StarCoder to use power retention. Our chat also covers the use of metrics like in-context learning curves and negative log likelihood to measure context utility, the implications of scaling laws, and the future of long context lengths in AI applications.
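For listeners who want a feel for the core idea before the episode, here is a rough, illustrative sketch (not Manifest AI’s implementation, which lives in custom Vidrial CUDA kernels) contrasting quadratic attention with a linear-scaling recurrent update in the spirit of retention. The feature map phi and the toy dimensions below are assumptions for illustration only; power retention uses a degree-p construction rather than the placeholder shown here.

import torch

T, d = 8, 4                      # toy sequence length and head dimension
q, k, v = (torch.randn(T, d) for _ in range(3))

# Standard causal attention: every position attends to all earlier positions,
# so compute grows with T^2 and the KV cache grows with T during inference.
scores = (q @ k.T) / d**0.5
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
attn_out = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

# Linear-scaling recurrence: keep a fixed-size state S (d x d) and update it
# once per token, so compute is O(T) and the state does not grow with context.
def phi(x):                      # placeholder feature map (illustrative only)
    return torch.relu(x)

S = torch.zeros(d, d)
rec_out = []
for t in range(T):
    S = S + torch.outer(phi(k[t]), v[t])   # fold token t into the state
    rec_out.append(phi(q[t]) @ S)          # read out with the current query
rec_out = torch.stack(rec_out)

The fixed-size state is also what makes the weight-state balance discussed in the episode meaningful: per-token compute splits between FLOPs spent on the weights and FLOPs spent updating and reading the state, and the in-context learning curve (per-token negative log likelihood as a function of position) is one way to check whether that state is actually making use of long context.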
🗒️ For the full list of resources for this episode, visit the show notes page: https://twimlai.com/go/750.
🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1
🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/
📖 CHAPTERS
===============================
00:00 – Introduction
3:43 – Why is context important?
5:28 – Metrics to define context utility
7:14 – In-context learning curves
10:23 – Long-context with Transformers
11:40 – Retention
14:25 – Chunking algorithm
17:43 – Techniques for overcoming quadratic scaling issues
22:58 – Balancing design levers
24:39 – Transformers versus state space models
27:05 – Importance of balancing state size and number of parameters
28:03 – Weight-flops and state-flops
33:33 – Power Retention paper
36:44 – Vidrial framework
41:14 – Core principles
44:50 – PowerCoder model
48:45 – Long context benchmark performance
50:44 – Model possibilities and limitations
55:17 – Trying out the model
🔗 LINKS & RESOURCES
===============================
Power Retention – Manifest AI – https://manifestai.com/articles/release-power-retention/
Scaling Context Requires Rethinking Attention – https://arxiv.org/abs/2507.04239
Vidrial (GitHub) – https://github.com/m-a-n-i-f-e-s-t/vidrial
PowerCoder-3B (Hugging Face) – https://huggingface.co/manifestai/powercoder-3b
Mamba, Mamba-2 and Post-Transformer Architectures for Generative AI with Albert Gu – 693 – https://twimlai.com/podcast/twimlai/mamba-mamba-2-and-post-transformer-architectures-for-generative-ai/
📸 Camera: https://amzn.to/3TQ3zsg
🎙️Microphone: https://amzn.to/3t5zXeV
🚦Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5