EP03 · A User-Centric Perspective on LLM Inference

Woosuk Kwon is the co-founder and CTO of Inferact and a PhD graduate of UC Berkeley whose work sits at the intersection of systems design and large language models. As co-creator of vLLM — the widely adopted open-source inference engine — Woosuk has shaped how organizations serve LLMs at scale.

In this episode, we explore the “user-centric” design choices that led to vLLM’s success, the technical breakthroughs of PagedAttention, and the future of streaming inference in human-AI interaction.

Key Takeaways

  • User-Centric Systems: Why developer experience (DX) was the secret weapon for vLLM’s rapid adoption.
  • PagedAttention: How borrowing principles from operating systems solved the KV cache memory bottleneck (see the sketch after this list).
  • Streaming Inference: Moving beyond “turn-taking” to real-time, interruptible AI conversations.
  • White-Box Inference: The case for open-source infrastructure in a world of black-box APIs.

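For readers curious what "borrowing principles from operating systems" looks like in practice, here is a minimal Python sketch of the paging idea: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks allocated on demand, much like virtual-memory pages. The block size, class names, and pool size below are illustrative assumptions for this sketch, not vLLM's actual implementation.

```python
# Toy sketch of the paging idea behind PagedAttention (illustrative only,
# not vLLM's actual code). Instead of reserving one contiguous KV-cache
# region per sequence up front, memory is split into fixed-size blocks
# and handed out on demand via a per-sequence block table.

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for the sketch)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool, like an OS page allocator."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache is full")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks a sequence's token count and which physical blocks hold its KV cache."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table: list[int] = []  # logical block index -> physical block id

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one is full,
        # so no memory is wasted on slots the sequence never reaches.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    seq = Sequence(allocator)
    for _ in range(40):  # generate 40 tokens
        seq.append_token()
    # 40 tokens at 16 per block -> 3 blocks in use
    print(seq.block_table)
```
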
Timestamps

  • 0:00 — Intro & guest welcome
  • 3:00 — From SkyPilot to vLLM: Woosuk’s PhD journey at Berkeley
  • 9:18 — The design philosophy of vLLM: User-centricity in systems
  • 18:18 — Streaming Inference & breaking fundamental assumptions
  • 27:03 — Training-inference co-design: The next frontier
  • 37:28 — Open-source sustainability & the founding of Inferact
  • 40:46 — White-Box vs. Black-Box: Why inference transparency matters
  • 43:46 — The future: Local models, agentic workloads, and model interruption

Subscribe and follow us for more episodes!