Episode 136: vLLM, LMD, and the Quest to Build the Linux of AI Inference

Jan Stomphorst

Ronald Kers

June 9, 2026

32 min

Summary

In this episode, hosts Ronald and Jan are joined at KubeCon by two guests from Red Hat: Brian Stevens, AI CTO and one of the original architects behind the creation of Kubernetes and the CNCF, and Rob Shaw, co-lead of the vLLM project and maintainer of LMD.

Brian shares the remarkable backstory of how Kubernetes came to be open source, including how Red Hat negotiated a single committer seat before agreeing to be a launch partner, and how he later pushed Google to contribute Kubernetes to the newly formed CNCF rather than keeping it under Google's own control, as with TensorFlow.

Rob explains what an inference runtime actually is: the critical piece of software that takes an abstract AI model and runs it as efficiently as possible on a GPU or other accelerator, handling everything from CUDA-level kernel optimization to memory management and concurrent request scheduling. vLLM serves as a "Rosetta Stone" between the ever-growing zoo of models (Llama, DeepSeek, Mistral, Qwen, Nvidia Nemotron) and accelerators (Nvidia, AMD, Intel, Google TPUs).

The conversation covers model compression and quantization, including how techniques like 4-bit precision can deliver 2x hardware efficiency gains while preserving 99%+ model accuracy. Brian and Rob also address the "big model vs. many small models" debate, recommending that teams always start with the largest capable model to validate a use case before optimizing down.
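
For readers who want a feel for what 4-bit quantization means in practice, here is a minimal sketch in plain Python. It is not how vLLM or any production compression library implements quantization; the function names, the sample weights, and the 7-billion-parameter model size are all hypothetical and chosen only for illustration. It shows symmetric per-tensor rounding to signed 4-bit integers and the resulting weight-only memory footprint. It does not attempt to reproduce the 2x efficiency or 99%+ accuracy figures mentioned in the episode, which depend on the model, the hardware, and the method used.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7].

    Illustrative only: real schemes typically use per-group scales and
    calibration data rather than one scale for the whole tensor.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto 7, the largest positive 4-bit value.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point weights from the 4-bit integers."""
    return [v * scale for v in quantized]


def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone (ignores scales and activations)."""
    return num_params * bits_per_weight // 8


# Hypothetical sample weights, not taken from any real model.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88]
quantized, scale = quantize_int4(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Hypothetical 7-billion-parameter model: 16-bit vs 4-bit weight storage.
num_params = 7_000_000_000
fp16_bytes = weight_bytes(num_params, 16)
int4_bytes = weight_bytes(num_params, 4)

print("quantized:", quantized)
print("max round-trip error:", max_error)
print("fp16 weights (GB):", fp16_bytes / 1e9)
print("int4 weights (GB):", int4_bytes / 1e9)
```

The round-trip error for each weight is bounded by half the scale, which is why a well-chosen scale (or many small groups, each with its own scale) matters so much for preserving accuracy.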

Looking ahead, both guests see inference as potentially the single largest workload ever run on Kubernetes, and position LMD (now contributed to the CNCF) as the distributed inference layer that will make this possible across heterogeneous accelerator environments, preventing enterprises from ending up with 42 incompatible AI stacks.

The episode closes with a discussion on AI slop, human-in-the-loop thinking, and the future of Kubernetes as the universal platform for running AI agents at scale.

Powered by @acc-ict