Inference scaling · Case Study 03

GPU Platform Modernization

The design separates ingress, scheduling, GPU slices, serving, and observability. Run:AI owns allocation policy. vLLM serves traffic. OTel shows saturation.


Status: ACTIVE
Environment: OpenShift / Run:AI
Ingress: Kube Ingress Controller
Runtime graph: 5 nodes / 5 edges


Problem: I worked on inference platform patterns where static GPU allocation slowed teams down. Production serving needed quota, priority, and predictable capacity.
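To make the quota and priority requirement concrete, here is a minimal Python sketch of admission under per-team GPU quotas, with higher-priority requests considered first. It is an illustration only, not Run:AI's scheduler: the `Request` type, the `admit` function, the team names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    team: str
    gpus: float      # requested GPUs; fractions are allowed
    priority: int    # larger value = more important


def admit(requests, quotas):
    """Admit requests in priority order while each team stays within quota.

    Returns (admitted, rejected) lists. Hypothetical policy for illustration.
    """
    used = {team: 0.0 for team in quotas}
    admitted, rejected = [], []
    # sorted() is stable, so equal-priority requests keep their arrival order.
    for req in sorted(requests, key=lambda r: -r.priority):
        quota = quotas.get(req.team, 0.0)   # unknown teams get no quota
        current = used.get(req.team, 0.0)
        if current + req.gpus <= quota:
            used[req.team] = current + req.gpus
            admitted.append(req)
        else:
            rejected.append(req)
    return admitted, rejected


if __name__ == "__main__":
    quotas = {"serving": 2.0, "research": 1.0}
    requests = [
        Request("research", 1.0, priority=1),
        Request("serving", 1.5, priority=10),
        Request("serving", 1.0, priority=5),
        Request("serving", 0.5, priority=1),
    ]
    admitted, rejected = admit(requests, quotas)
    print("admitted:", [(r.team, r.gpus) for r in admitted])
    print("rejected:", [(r.team, r.gpus) for r in rejected])
```

In this example the high-priority serving request is admitted first, the second serving request is rejected because it would exceed the team's quota, and the smaller low-priority request still fits in the remaining headroom. Capacity is predictable because the quota, not arrival order, bounds what each team can hold.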


Live path

workload class: Kube ingress -> Run:AI policy


Architecture Decision

Why I chose this design.

Short decision notes tied to the code or config that mattered.

Decision

gpu-allocation.yaml

I kept GPU allocation policy separate from model serving. Platform teams can change quota and priority without rewriting the runtime.

```yaml
apiVersion: scheduling.run.ai/v1
kind: PodGroup
metadata:
  name: vllm-serving
spec:
  gpuAllocation: 0.5
  gpuMemory: 12Gi
  priority: HighPriority
  schedulingStrategy: BinPacking
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.family
                operator: In
                values: ["H100"]
```
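The `schedulingStrategy: BinPacking` and `gpuAllocation: 0.5` fields suggest that fractional requests are packed onto as few GPUs as possible. A minimal first-fit sketch of that idea in Python follows. It is a generic bin-packing illustration, not Run:AI's actual placement algorithm; the `place` function, the workload names, and the GPU names are hypothetical.

```python
def place(requests, gpu_names, capacity=1.0):
    """First-fit bin packing: put each request on the first GPU with room.

    requests: list of (workload_name, gpu_fraction) pairs.
    Returns a dict mapping workload name -> GPU name, or None if nothing fit.
    """
    free = {name: capacity for name in gpu_names}
    placement = {}
    for workload, fraction in requests:
        placement[workload] = None
        for name in gpu_names:  # fixed order, so earlier GPUs fill up first
            # Small tolerance guards against floating-point rounding.
            if free[name] + 1e-9 >= fraction:
                free[name] -= fraction
                placement[workload] = name
                break
    return placement


if __name__ == "__main__":
    requests = [("vllm-a", 0.5), ("vllm-b", 0.5), ("vllm-c", 0.5), ("batch", 1.0)]
    for workload, gpu in place(requests, ["gpu-0", "gpu-1", "gpu-2"]).items():
        print(workload, "->", gpu)
```

Here the first two half-GPU workloads share `gpu-0`, the third lands on `gpu-1`, and `gpu-2` stays whole for the full-GPU job. A spreading strategy would have put one slice on each device and left no GPU free for it.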
