Python FastAPI gateway · Case Study 01

AI Home Lab API Gateway

The pattern is simple: Cloudflare handles ingress, FastAPI owns auth and request shape, and Ollama stays local. I keep qwen3.5 warm, limit context, and return 429 when the Mac is full.

PRODUCTIONBack to case studies

status

PRODUCTION

environment

macOS Apple Silicon + Podman Compose

ingress

Cloudflare Tunnel

runtime graph

8 nodes / 8 edges

View system map Open docs

System map

AI Home Lab API Gateway

Env: macOS Apple Silicon + Podman ComposeIngress: Cloudflare Tunnel

Problem: I built a local OpenAI-compatible gateway so my site, agents, and scripts could call one stable /v1 contract. Ollama, PostgreSQL, Redis, and Qdrant stay private on the Mac. The hard parts were auth, backpressure, model warmup, and zero public inbound ports.

My engineering note

The pattern is simple: Cloudflare handles ingress, FastAPI owns auth and request shape, and Ollama stays local. I keep qwen3.5 warm, limit context, and return 429 when the Mac is full.

Live path

chat intent: Astra chat surface -> TS agent layer

Path running

Failure modes

Mode

Readout:Tunnel, FastAPI, and Ollama are online. The gateway is holding the 10-chat limit.

Signals

Ingress

Cloudflare

Tunnel targets only the localhost production gateway on port 8000.

Gateway limit

10 chats

Bounded async semaphore returns 429 when slots are saturated.

Model runtime

Ollama 0.30.6

Official Darwin release serving qwen3.5:9b through /v1.

Memory profile

q8 KV

One loaded model, 4096 context, flash attention, keep_alive forever.

Run logs

> WEBSITE: Astra request enters Next.js /api/chat on Vercel.

> AGENT_TS: Gateway baseURL normalized to the OpenAI-compatible /v1 contract.

> EDGE: Cloudflare Tunnel forwards HTTPS traffic to the private macOS gateway.

> FASTAPI: X-API-Key or bearer token validated against PostgreSQL key hash.

> LIMITER: Chat request acquired one of 10 bounded gateway slots.

> OLLAMA: AsyncOpenAI client forwards keep_alive=-1 to official Ollama 0.30.6.

Architecture Decision

Why I chose this design.

Short decision notes tied to the code or config that mattered.

Decision

apps/model-gateway/src/clients/ollama.py

I kept concurrency in the gateway. When the Mac is full, callers get a clear 429 instead of hidden queue buildup inside Ollama.

Code noteSee engineering notes

apps/model-gateway/src/clients/ollama.pypython

python

class OllamaClient:    def __init__(self, settings):        self._chat_limiter = asyncio.BoundedSemaphore(            settings.ollama_chat_concurrency_limit        )
    async def _acquire_chat_slot(self):        try:            await asyncio.wait_for(                self._chat_limiter.acquire(),                timeout=self._settings.ollama_chat_acquire_timeout_seconds,            )        except TimeoutError as exc:            raise OllamaClientError(                message="Ollama chat concurrency limit reached.",                status_code=429,                code="chat_concurrency_limit",            ) from exc

class OllamaClient:    def __init__(self, settings):        self._chat_limiter = asyncio.BoundedSemaphore(            settings.ollama_chat_concurrency_limit        )
    async def _acquire_chat_slot(self):        try:            await asyncio.wait_for(                self._chat_limiter.acquire(),                timeout=self._settings.ollama_chat_acquire_timeout_seconds,            )        except TimeoutError as exc:            raise OllamaClientError(                message="Ollama chat concurrency limit reached.",                status_code=429,                code="chat_concurrency_limit",            ) from exc

Docs

Runbooks and specs.

Supporting docs for the system: architecture, diagrams, runbook, and development notes.

Loading docs...

Next case study

Enterprise Agentic RAG

BFSI workloads - I designed an agentic RAG pattern for long financial filings. The system needed layout-aware parsing, hybrid retrieval, ...

Need this level of architecture review?

Bring the hard system constraint: retrieval quality, agent failure modes, latency, evaluation, deployment topology, or technical market education.

Start Advisory Intake View GitHub