Episode 119: Your Web App Scaling Tricks Don’t Work for LLMs

Jan Stomphorst
Listen to this episode on your favorite platform!
November 18, 2025
35 min

Summary

In this episode, we talk with Abdel Sghiouar and Mofi Rahman, Developer Advocates at Google and (guest) hosts of the Kubernetes Podcast from Google.
Together, we dive into one central question: can you truly run LLMs reliably and at scale on Kubernetes?

It quickly becomes clear that LLM workloads behave nothing like traditional web applications:

  • GPUs are scarce, expensive, and difficult to schedule.
  • Models are massive — some reaching 700GB — making load times, storage throughput, and caching critical.
  • Containers become huge, making “build small containers” nearly impossible.
  • Autoscaling on CPU or RAM doesn’t work; new signals like GPU cache pressure, queue depth, and model latency take over (see the sketch after this list).
  • LLM inference doesn’t parallelize the way stateless web requests do, so batching and routing through the Inference Gateway API become essential.
  • Device Management and Dynamic Resource Allocation (DRA) are forming the new foundation for GPU/TPU orchestration.
  • Security assumptions shift, since rootless containers often no longer work with hardware accelerators.
  • Guardrails (input/output filtering) become a built-in part of the inference path.
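
To make the autoscaling point concrete, here is a minimal Python sketch of the idea behind scaling on queue depth rather than CPU or RAM. The metric names, the target queue depth, and the proportional rule are illustrative assumptions, not the actual Inference Gateway or Horizontal Pod Autoscaler implementation.

# Hypothetical sketch: scale LLM inference replicas on queue depth instead of
# CPU utilization. Names, thresholds, and the proportional rule are
# illustrative assumptions, not a real Kubernetes or Inference Gateway API.

from dataclasses import dataclass

@dataclass
class ReplicaMetrics:
    queue_depth: int             # requests waiting for a free batch slot
    kv_cache_utilization: float  # fraction of GPU KV-cache memory in use

def desired_replicas(current: int, metrics: list[ReplicaMetrics],
                     target_queue_depth: int = 4,
                     max_replicas: int = 8) -> int:
    """Scale on inference-specific signals rather than CPU or RAM."""
    if not metrics:
        return current
    avg_queue = sum(m.queue_depth for m in metrics) / len(metrics)
    # Same proportional rule a conventional autoscaler uses, but driven by queue depth.
    proposed = round(current * avg_queue / target_queue_depth)
    return max(1, min(max_replicas, proposed))

if __name__ == "__main__":
    # Two replicas, both with deep queues and a nearly full KV cache: scale out.
    metrics = [ReplicaMetrics(9, 0.92), ReplicaMetrics(7, 0.88)]
    print(desired_replicas(current=2, metrics=metrics))  # -> 4
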

And then there’s the occasional request from customers who want deterministic LLM output, to which Mofi dryly responds: “You don’t need a model — you need a database.”