Python FastAPI gateway · Case Study 01

AI Home Lab API Gateway

The pattern is simple: Cloudflare handles ingress, FastAPI owns auth and request shape, and Ollama stays local. I keep qwen3.5 warm, cap the context window, and return 429 when the Mac is at capacity.
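A minimal sketch of what "keep the model warm and limit context" could look like at the request level, assuming the gateway forwards chat calls to Ollama's `/api/chat` endpoint. `keep_alive` and `options.num_ctx` are real Ollama request fields; the cap of 8192 tokens, the `30m` keep-alive, and the helper name `build_chat_payload` are illustrative assumptions, not values from the actual gateway.

```python
from typing import Any, Dict, List, Optional

# Assumed values for illustration only; the real gateway settings are not shown.
MODEL = "qwen3.5"
MAX_CONTEXT_TOKENS = 8192  # hypothetical ceiling on the context window
KEEP_ALIVE = "30m"         # hypothetical: keep the model loaded between calls


def build_chat_payload(
    messages: List[Dict[str, str]],
    requested_ctx: Optional[int] = None,
) -> Dict[str, Any]:
    """Build an Ollama /api/chat body with a clamped context window."""
    if requested_ctx is None:
        num_ctx = MAX_CONTEXT_TOKENS
    elif requested_ctx <= 0:
        raise ValueError("requested_ctx must be positive")
    else:
        # Never let a caller ask for more context than the host can afford.
        num_ctx = min(requested_ctx, MAX_CONTEXT_TOKENS)

    return {
        "model": MODEL,
        "messages": messages,
        "stream": False,
        "keep_alive": KEEP_ALIVE,
        "options": {"num_ctx": num_ctx},
    }
```

Clamping in the gateway rather than trusting the caller keeps memory use predictable on a single machine, which is the same reasoning behind returning 429 early.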

Status: Production

Environment: macOS Apple Silicon + Podman Compose

Ingress: Cloudflare Tunnel

Runtime graph: 8 nodes / 8 edges

System map


Problem: I built a local OpenAI-compatible gateway so my site, agents, and scripts could call one stable /v1 contract. Ollama, PostgreSQL, Redis, and Qdrant stay private on the Mac. The hard parts were auth, backpressure, model warmup, and zero public inbound ports.
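A minimal sketch of the auth step, assuming callers send an OpenAI-style `Authorization: Bearer <key>` header and the gateway compares it to a single configured key. The function names and the single-key model are assumptions for illustration; the source does not show how the real gateway stores or checks credentials.

```python
import hmac
from typing import Optional


def extract_bearer_token(authorization_header: Optional[str]) -> Optional[str]:
    """Return the token from a 'Bearer <token>' header, or None if malformed."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def is_authorized(authorization_header: Optional[str], expected_key: str) -> bool:
    """Check the presented key against the expected one in constant time."""
    token = extract_bearer_token(authorization_header)
    if token is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.encode("utf-8"), expected_key.encode("utf-8"))
```

In a FastAPI app a check like this would typically sit in a dependency that rejects the request with 401 before any model work starts.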


Live path

chat intent: Astra chat surface -> TS agent layer


Zones, edges, and logs come from the case-study data model.

Architecture Decision

Why I chose this design.

Short decision notes tied to the code or config that mattered.

Decision

apps/model-gateway/src/clients/ollama.py

I kept concurrency control in the gateway. When every chat slot is busy and none frees up within the acquire timeout, callers get a clear 429 instead of a hidden queue building up inside Ollama.

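apps/model-gateway/src/clients/ollama.py (Python)

import asyncio

# OllamaClientError is defined elsewhere in the gateway (not shown in this excerpt).


class OllamaClient:
    def __init__(self, settings):
        # Kept on the instance so _acquire_chat_slot can read the timeout.
        self._settings = settings
        # Caps the number of chat requests in flight to Ollama at once.
        self._chat_limiter = asyncio.BoundedSemaphore(
            settings.ollama_chat_concurrency_limit
        )

    async def _acquire_chat_slot(self):
        try:
            # Wait briefly for a free slot instead of queueing without bound.
            await asyncio.wait_for(
                self._chat_limiter.acquire(),
                timeout=self._settings.ollama_chat_acquire_timeout_seconds,
            )
        # asyncio.TimeoutError is the builtin TimeoutError on Python 3.11+,
        # and the class wait_for raises on earlier versions.
        except asyncio.TimeoutError as exc:
            raise OllamaClientError(
                message="Ollama chat concurrency limit reached.",
                status_code=429,
                code="chat_concurrency_limit",
            ) from exc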
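To see the backpressure behaviour in isolation, here is a self-contained sketch of the same semaphore-plus-timeout idea. `ChatLimiter`, `CapacityError`, and the timings are hypothetical stand-ins, not the gateway's real classes; the point is only that with one slot and two concurrent callers, the second caller is rejected with 429 rather than queued.

```python
import asyncio


class CapacityError(Exception):
    """Hypothetical stand-in for the gateway's own error type."""

    def __init__(self, message: str, status_code: int):
        super().__init__(message)
        self.status_code = status_code


class ChatLimiter:
    def __init__(self, limit: int, acquire_timeout: float):
        self._sem = asyncio.BoundedSemaphore(limit)
        self._timeout = acquire_timeout

    async def run(self, work):
        try:
            await asyncio.wait_for(self._sem.acquire(), timeout=self._timeout)
        except asyncio.TimeoutError as exc:
            raise CapacityError("chat concurrency limit reached", 429) from exc
        try:
            return await work()
        finally:
            # Always free the slot, even if the work raises.
            self._sem.release()


async def demo() -> list:
    limiter = ChatLimiter(limit=1, acquire_timeout=0.05)

    async def slow_chat() -> int:
        await asyncio.sleep(0.3)  # simulates a long model call
        return 200

    async def call() -> int:
        try:
            return await limiter.run(slow_chat)
        except CapacityError as exc:
            return exc.status_code

    # Two callers compete for a single slot.
    return await asyncio.gather(call(), call())


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [200, 429]
```

The first caller holds the only slot for 0.3 seconds, so the second gives up after its 0.05 second wait and surfaces a 429 that the client can retry against.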

Docs

Runbooks and specs.

Supporting docs for the system: architecture, diagrams, runbook, and development notes.

