Episode 136: vLLM, LMD, and the Quest to Build the Linux of AI Inference

Jan Stomphorst
Ronald Kers
June 9, 2026
32 min

Summary

In this episode, hosts Ronald and Jan are joined at KubeCon by two guests from Red Hat: Brian Stevens, AI CTO and one of the original architects behind the creation of Kubernetes and the CNCF, and Rob Shaw, co-lead of the vLLM project and maintainer of LMD.

Brian shares the remarkable backstory of how Kubernetes came to be open source, including how Red Hat negotiated a single committer seat before agreeing to be a launch partner, and how he later pushed Google to contribute Kubernetes to the newly formed CNCF rather than keeping it proprietary like TensorFlow.

Rob explains what an inference runtime actually is: the critical piece of software that takes an abstract AI model and runs it as efficiently as possible on a GPU or other accelerator — handling everything from CUDA-level kernel optimization to memory management and concurrent request scheduling. vLLM serves as a "Rosetta Stone" between the ever-growing zoo of models (Llama, DeepSeek, Mistral, Qwen, Nvidia Nemotron) and accelerators (Nvidia, AMD, Intel, Google TPUs).

The conversation covers model compression and quantization: how techniques like 4-bit precision can deliver 2x hardware efficiency gains while preserving 99%+ model accuracy. Brian and Rob also address the "big model vs. many small models" debate, recommending that teams always start with the largest capable model to validate a use case before optimizing down.
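
To make the quantization idea concrete, here is a minimal sketch in plain Python of symmetric 4-bit quantization applied to a list of weights. It is an illustration only, not how vLLM or any tool mentioned in the episode implements compression. The function names, the random weights, and the 16-bit baseline are assumptions made for the example, and it does not try to reproduce the 2x or 99%+ figures quoted above.

```python
import random


def quantize_4bit(weights):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-7, 7].

    Hypothetical helper for illustration; real toolchains typically use
    per-group scales and calibration data.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round each weight to the nearest code, clamping to the 4-bit range.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


random.seed(0)
# Assumed toy data: 1024 small weights drawn from a normal distribution.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

# Storage comparison against an assumed 16-bit baseline
# (ignores the small overhead of storing the scale itself).
bits_fp16 = 16 * len(weights)
bits_int4 = 4 * len(weights)
compression = bits_fp16 / bits_int4

# Worst-case round-trip error is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"storage reduction vs 16-bit: {compression:.1f}x")
print(f"max round-trip error: {max_error:.6f} (step size {scale:.6f})")
```

The sketch shows the basic trade: each weight shrinks from 16 bits to 4, at the cost of a rounding error no larger than half the step size. How much model accuracy survives in practice depends on the model and the quantization method used.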

Looking ahead, both guests see inference as potentially the single largest workload ever run on Kubernetes, and position LMD (now contributed to the CNCF) as the distributed inference layer that will make this possible across heterogeneous accelerator environments, preventing enterprises from ending up with 42 incompatible AI stacks.
The episode closes with a discussion on AI slop, human-in-the-loop thinking, and the future of Kubernetes as the universal platform for running AI agents at scale.

Powered by @acc-ict

Send us a message.

ACC ICT: Specialist in IT continuity
Business-critical applications and data securely available, independent of third parties, anytime and anywhere

Support the show

Like and subscribe! It helps out a lot.

You can also find us on:
De Nederlandse Kubernetes Podcast - YouTube
Nederlandse Kubernetes Podcast (@k8spodcast.nl) | TikTok
De Nederlandse Kubernetes Podcast

Where can you meet us:
Events

This podcast is powered by:
ACC ICT - IT Continuity for Business-Critical Applications | ACC ICT