Building a Private AI Home Lab API Gateway

Why this matters

A home lab becomes serious infrastructure when it has a stable API contract, private ingress, model backpressure, secrets discipline, and a repeatable production runbook.

Why I built it

I wanted my AI home lab to behave less like a weekend script and more like a small private platform. The website, agent workflows, scripts, and future tools needed one API shape they could trust. The model runtime could change, but clients should keep calling the same OpenAI-compatible /v1 interface.

The boundary matters. Client applications should not call Ollama directly, and the Mac should not open public inbound ports. The gateway owns authentication, request shape, usage logs, OpenAPI docs, and model backpressure. Ollama remains a private provider behind it.

This is the difference between a local model experiment and a reusable AI infrastructure lane. The same client can call the gateway with only base_url and api_key changes, while the backend keeps control over local hardware, model warmup, and failure behavior.

The production topology

The current topology is a macOS Apple Silicon server running a FastAPI model gateway through Podman Compose. Cloudflare Tunnel forwards public HTTPS traffic to 127.0.0.1:8000. Ollama listens privately on 127.0.0.1:11434. PostgreSQL stores API keys and usage logs, while Redis and Qdrant are wired for rate limiting, caching, and RAG expansion.

The model runtime is the official Ollama Darwin release 0.30.6 with qwen3.5:9b as the default model. The gateway passes OpenAI-compatible chat, completions, embeddings, responses, image, and model routes while preserving Ollama-specific options in the upstream request body.

Development and production are deliberately separated. Development runs on 127.0.0.1:8010 with infra/.env.dev and compose.dev.yaml. Production runs on 127.0.0.1:8000 with infra/.env and compose.yaml. That prevents local experiments from accidentally changing the public tunnel target.

Private AI Gateway Topology

Public traffic reaches only the FastAPI gateway. Ollama and the data plane stay private on the Mac.

Concurrency is a product decision

Local inference has a different failure mode from cloud inference. If ten clients hit the gateway at once, the API must decide whether to queue, reject, or overwhelm the local model runtime. I chose explicit backpressure at the gateway boundary.

The gateway uses a bounded async semaphore for chat calls. The configured API boundary is 10 concurrent chat requests. If a caller cannot acquire a slot within 0.25 seconds, it receives an OpenAI-style 429 with a chat_concurrency_limit code. That is much better than allowing unbounded request buildup inside Ollama.

One subtle lesson: gateway concurrency and model parallelism are not the same thing. qwen3.5:9b was observed exposing one active Ollama generation slot on this Mac, so the gateway can safely admit and protect up to 10 callers, while actual token generation may still serialize inside the model runtime.

apps/model-gateway/src/clients/ollama.pypython

python

class OllamaClient:    def __init__(self, settings):        self._chat_limiter = asyncio.BoundedSemaphore(            settings.ollama_chat_concurrency_limit        )
    async def _acquire_chat_slot(self):        try:            await asyncio.wait_for(                self._chat_limiter.acquire(),                timeout=self._settings.ollama_chat_acquire_timeout_seconds,            )        except TimeoutError as exc:            raise OllamaClientError(                message="Ollama chat concurrency limit reached.",                status_code=429,                code="chat_concurrency_limit",            ) from exc

class OllamaClient:    def __init__(self, settings):        self._chat_limiter = asyncio.BoundedSemaphore(            settings.ollama_chat_concurrency_limit        )
    async def _acquire_chat_slot(self):        try:            await asyncio.wait_for(                self._chat_limiter.acquire(),                timeout=self._settings.ollama_chat_acquire_timeout_seconds,            )        except TimeoutError as exc:            raise OllamaClientError(                message="Ollama chat concurrency limit reached.",                status_code=429,                code="chat_concurrency_limit",            ) from exc

Unified memory changes the rules

The server is an M1 Pro MacBook Pro with 32 GB unified memory. That is powerful enough for serious local AI, but it is not infinite. Every extra loaded model, long context, and parallel generation slot competes with the OS, containers, browser sessions, databases, and the model itself.

The stable profile keeps qwen3.5:9b warm with keep_alive=-1, limits Ollama to one loaded model, sets context length to 4096, enables flash attention, and uses q8_0 KV cache. That profile is designed to reduce cold-start delay without inviting swap pressure.

The operational lesson is that less can be faster. A smaller context window and one loaded model may produce better user-perceived latency than a more ambitious setup that starts swapping under real traffic.

Ollama LaunchAgent profileenv

env

OLLAMA_KEEP_ALIVE=-1OLLAMA_NUM_PARALLEL=10OLLAMA_MAX_QUEUE=10OLLAMA_MAX_LOADED_MODELS=1OLLAMA_CONTEXT_LENGTH=4096OLLAMA_FLASH_ATTENTION=1OLLAMA_KV_CACHE_TYPE=q8_0

OLLAMA_KEEP_ALIVE=-1OLLAMA_NUM_PARALLEL=10OLLAMA_MAX_QUEUE=10OLLAMA_MAX_LOADED_MODELS=1OLLAMA_CONTEXT_LENGTH=4096OLLAMA_FLASH_ATTENTION=1OLLAMA_KV_CACHE_TYPE=q8_0

Hardening the Mac behind Cloudflare

Cloudflare Tunnel makes the Mac safer because the public internet does not connect directly to local ports. But a tunnel is not a replacement for application security. The gateway still requires client API keys for /v1 routes and an admin secret for /admin routes.

The current production posture keeps Ollama, PostgreSQL, Redis, and Qdrant private. Cloudflare should target only http://127.0.0.1:8000. Admin routes should be protected with Cloudflare Access or a trusted allowlist, and public model routes should have Cloudflare WAF rate limits.

The next hardening layer is Redis-backed FastAPI rate limiting by API key. Cloudflare can limit abusive IP behavior at the edge, but API-key rate limits protect the system even when requests come from trusted networks or a leaked key is used from many IPs.

What changed in production

The production readiness work fixed practical problems, not theoretical ones. The Makefile merge conflict was resolved by keeping both the production commands and the macOS Ollama targets. Production Compose was corrected so PostgreSQL, Redis, and Qdrant have predictable localhost bindings and the gateway rebuilds before restart.

A Homebrew Ollama runtime produced a llama-server binary error, so the fix was to install the official Ollama 0.30.6 Darwin release and point the LaunchAgent to that binary. After tuning, the gateway warmed qwen3.5 successfully on startup.

The verification gate passed: Ruff format check, Ruff lint, mypy, and pytest. The production image built, the stack came up, /api/v1/health returned ok, /openapi.json loaded, and gateway logs showed Ollama warmup success with chat_concurrency_limit=10.

The final architecture is still a home lab, but it has the habits of production: repeatable commands, separate environments, health checks, OpenAPI docs, scoped auth, logs, model warmup, and backpressure.