EP03 · A User-Centric Perspective on LLM Inference

Woosuk Kwon is the co-founder and CTO of Inferact and a PhD graduate of UC Berkeley whose work sits at the intersection of systems design and large language models. As co-creator of vLLM — the widely adopted open-source inference engine — Woosuk has shaped how organizations serve LLMs at scale.

In this episode, we explore the “user-centric” design choices that led to vLLM’s success, the technical breakthroughs of PagedAttention, and the future of streaming inference in human-AI interaction.

Key Takeaways

  • User-Centric Systems: Why developer experience (DX) was the secret weapon for vLLM’s rapid adoption.
  • PagedAttention: How borrowing principles from operating systems solved the KV cache memory bottleneck (see the sketch after this list).
  • Streaming Inference: Moving beyond “turn-taking” to real-time, interruptible AI conversations.
  • White-Box Inference: The case for open-source infrastructure in a world of black-box APIs.

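For readers curious what "borrowing principles from operating systems" looks like in practice, here is a minimal Python sketch of the paging idea: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks allocated on demand, much like virtual-memory pages. The block size, class names, and pool size below are illustrative assumptions for this sketch, not vLLM's actual implementation.

```python
# Toy sketch of the paging idea behind PagedAttention (illustrative only,
# not vLLM's actual code). Instead of reserving one contiguous KV-cache
# region per sequence up front, memory is split into fixed-size blocks
# and handed out on demand via a per-sequence block table.

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for the sketch)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool, like an OS page allocator."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache is full")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks a sequence's token count and which physical blocks hold its KV cache."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table: list[int] = []  # logical block index -> physical block id

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one is full,
        # so no memory is wasted on slots the sequence never reaches.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    seq = Sequence(allocator)
    for _ in range(40):  # generate 40 tokens
        seq.append_token()
    # 40 tokens at 16 per block -> 3 blocks in use
    print(seq.block_table)
```
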
Timestamps

  • 0:00 — Intro & guest welcome
  • 3:00 — From SkyPilot to vLLM: Woosuk’s PhD journey at Berkeley
  • 9:18 — The design philosophy of vLLM: User-centricity in systems
  • 18:18 — Streaming Inference & breaking fundamental assumptions
  • 27:03 — Training-inference co-design: The next frontier
  • 37:28 — Open-source sustainability & the founding of Inferact
  • 40:46 — White-Box vs. Black-Box: Why inference transparency matters
  • 43:46 — The future: Local models, agentic workloads, and model interruption

Subscribe and follow us for more episodes!