
vLLM builds on PagedAttention, which aims to address these challenges by making more efficient use of the hardware needed to support AI workloads. PagedAttention, an algorithm inspired by virtual memory paging, manages the memory required for the KV cache, which stores intermediate attention states during inference.
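To make the virtual-memory analogy concrete, here is a minimal Python sketch of the paging idea. The names (BLOCK_SIZE, BlockAllocator, Sequence) are illustrative, not vLLM internals: KV memory is carved into fixed-size blocks, and each sequence maps logical token positions onto physical blocks only as tokens are actually generated.

```python
# Toy sketch of the paging idea behind PagedAttention (illustrative
# names, not vLLM internals): KV cache memory is split into fixed-size
# blocks, and each sequence keeps a block table mapping logical block
# indices to physical block ids -- like virtual-memory page tables.

BLOCK_SIZE = 16  # tokens per block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; a sequence must be preempted")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)

class Sequence:
    """Tracks a block table: logical block index -> physical block id."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is claimed only when the current one fills up, so
        # no memory is reserved for tokens that are never generated.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):         # generate 40 tokens
    seq.append_token()
print(seq.block_table)      # only ceil(40 / 16) = 3 blocks in use
```

Because blocks are allocated on demand and freed when a sequence finishes, many more concurrent sequences fit in the same GPU memory than with contiguous preallocation.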
PagedAttention is the primary algorithm to come out of the vLLM project, but it is not the only capability vLLM provides. vLLM is a full inference framework that offers additional performance optimisations such as continuous batching, which keeps the GPU busy by adding new requests to the running batch as earlier ones finish.
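For illustration, here is basic usage of vLLM's documented offline-inference Python API. The model id is just an example; any Hugging Face model supported by vLLM works. Continuous batching happens automatically when generate() receives multiple prompts.

```python
# Minimal vLLM usage via its offline-inference API. When generate() is
# given many prompts, vLLM schedules them together, continuously adding
# and removing sequences from the batch as they start and finish.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Explain paged virtual memory in one sentence:",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # example model id
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```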
llm-d is an open source framework that integrates and builds on vLLM. It provides a recipe for distributed inference and was built to support the increasing resource demands of LLMs.
Think of it this way: if vLLM helps with speed, llm-d helps with coordination. Working together, they intelligently route traffic across model replicas so that processing happens as quickly and efficiently as possible. A toy sketch of that routing idea follows.
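This is a minimal sketch of the coordination idea only, not llm-d's actual scheduler (llm-d runs on Kubernetes and uses far richer signals). The assumption here is a simple policy: send each request to the replica most likely to already hold its prompt prefix in KV cache, otherwise to the least-loaded replica.

```python
# Toy cache-aware router (illustrative only, not llm-d's implementation):
# prefer a replica with a warm prefix cache, since a cache hit skips
# recomputing the prompt's KV entries; fall back to load balancing.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    in_flight: int = 0
    cached_prefixes: set[str] = field(default_factory=set)

def route(prompt: str, replicas: list[Replica]) -> Replica:
    prefix = prompt[:32]  # crude stand-in for a prefix-cache key
    warm = [r for r in replicas if prefix in r.cached_prefixes]
    target = min(warm or replicas, key=lambda r: r.in_flight)
    target.in_flight += 1
    target.cached_prefixes.add(prefix)
    return target

replicas = [Replica("pod-a"), Replica("pod-b")]
print(route("Summarise this document: ...", replicas).name)  # pod-a
print(route("Summarise this document: ...", replicas).name)  # pod-a again: cache hit
```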