Recipe 01 of 10 in the Agent Blueprint Recipes arc:
Foundation → Knowledge → Grounding → Orchestration → Thread Memory → User Memory → Observability → Guardrails → Actions → Simulation
You called OpenAI from a Python script and it worked. Now what?
This recipe is the bridge between that script and something you would actually deploy. It is a small FastAPI service that streams a Nebius chat completion as Server-Sent Events. The Nebius integration is one line. Everything else — the structured logs, the Prometheus metrics, the rate limit, the security headers, the Dockerfile, the tests — is the shape of a service you can hand to ops without blushing.
It is deliberately not an agent framework. There is no planner, no tool loop, no graph. The point is to get the boundaries right — config, transport, observability, lifecycle — so that the recipes that follow can add capability without re-litigating infrastructure.
What you'll build
A single-endpoint API:
| Endpoint | Description |
|---|---|
POST /agent/run | Streams status, token, done SSE events |
GET /healthz | Liveness probe |
GET /readyz | Readiness probe |
GET /metrics | Prometheus scrape endpoint |
GET /docs | OpenAPI / Swagger UI |
Everything is typed, every env var validated, every request tagged with a request ID, every Nebius call wrapped in tenacity-backed retries. This foundation recipe intentionally has no database dependency. It is also the deployment baseline that later recipes extend.
Prerequisites
- Python 3.12+
- uv —
curl -LsSf https://astral.sh/uv/install.sh | sh - A Nebius API key — get one from the Nebius console
- (optional) Docker, if you want to build the production image
No partner service account is needed until the knowledge and memory recipes.
Run it
git clone https://github.com/nebius/nebius-partner-cookbook.git
cd nebius-cookbook/cookbooks/01-first-agent-on-nebius
cp .env.example .env
# Open .env and fill NEBIUS_API_KEY
uv sync
make dev
In another terminal:
curl -N -X POST http://localhost:8000/agent/run \
-H 'content-type: application/json' \
-d '{"prompt":"Explain Nebius AgentKit in one paragraph."}'
You'll see SSE events arriving:
event: status
data: {"phase":"thinking"}
event: token
data: {"text":"Nebius"}
event: token
data: {"text":" Token"}
…
event: status
data: {"phase":"done"}
event: done
data: {}
Multi-turn requests
The server is stateless — it keeps no session. To hold a conversation, the client replays prior turns in the request body:
curl -N -X POST http://localhost:8000/agent/run \
-H 'content-type: application/json' \
-d '{
"prompt": "And who maintains it?",
"history": [
{"role": "user", "content": "What is Nebius AgentKit?"},
{"role": "assistant", "content": "It is ..."}
]
}'
history is optional and capped (40 turns, 8 KB per message) by the request schema. Statelessness is a deliberate architectural choice — see Design decisions below.
Cookbook #5 adds thread memory after this baseline is clear.
Walk-through
The directory
app/
├── main.py # FastAPI app, lifespan, middleware
├── config.py # Pydantic Settings, validated at boot
├── routes/
│ ├── agent.py # POST /agent/run with SSE
│ └── health.py # /healthz, /readyz
├── schemas/
│ └── agent.py # Pydantic I/O models
├── core/
│ ├── nebius_client.py # OpenAI SDK pointed at Nebius
│ └── agent.py # The agent logic
└── observability/
├── logging.py # structlog (JSON in prod)
├── metrics.py # Prometheus counters/histograms
└── middleware.py # Request IDs, security headers, body-size limit
The split is by boundary, not by layer: core/ is the only place that knows about Nebius, routes/ is the only place that knows about HTTP and SSE, observability/ is cross-cutting. You can unit-test core/agent.py without standing up a server, and you can change the transport (SSE → WebSocket) without touching core/.
The Nebius integration
The SDK swap is one line — base_url points at Nebius:
AsyncOpenAI(
api_key=settings.nebius_api_key,
base_url=str(settings.nebius_base_url), # https://api.studio.nebius.ai/v1/
timeout=httpx.Timeout(connect=5.0, read=60.0, write=10.0, pool=5.0),
max_retries=0, # retries are owned by tenacity, not the SDK
)
Nebius AgentKit is OpenAI-wire-compatible, so the OpenAI SDK works unchanged — the same chat.completions.create(...) call, the same streaming protocol. That compatibility is the reason every recipe in this cookbook can stay on one client library.
Later cookbooks keep this client boundary and add capabilities around it.
Two decisions worth calling out:
- Timeouts are split, not global. A 5 s connect timeout fails fast on a dead endpoint; a 60 s read timeout tolerates a slow generation. A single global timeout forces you to choose one or the other.
max_retries=0on the SDK. Retry policy lives intenacity(see below) so it is visible, testable, and swappable in one place. Two retry layers stacked silently is a production incident waiting to happen.
build_nebius_client() returns a process-wide singleton. The point is the underlying httpx connection pool: reusing it avoids a TLS handshake on every request. A fresh client per request is a common and expensive mistake.
Retries
@retry(
retry=retry_if_exception_type((httpx.HTTPError, httpx.TimeoutException)),
wait=wait_exponential(multiplier=0.5, min=0.5, max=8.0),
stop=stop_after_attempt(3),
reraise=True,
)
Retries fire on transport errors only. A 4xx (bad key, malformed request) or a content error is not retried — replaying it just burns quota and latency for the same failure. Exponential backoff with a cap keeps a struggling upstream from being hammered.
The observability
Every request gets a UUID request ID, returned in the x-request-id header and bound to the structlog context so every log line for that request carries it. In production (ENV=production) logs render as JSON; in dev they render as human-readable lines. One renderer swap, same call sites.
A sample log line in production:
{
"event": "agent_started",
"level": "info",
"request_id": "9a8f3c…",
"path": "/agent/run",
"model": "meta-llama/Llama-3.3-70B-Instruct",
"timestamp": "2026-06-04T12:31:42.103Z"
}
Prometheus metrics include:
http_requests_total{method,path,status}— request counterhttp_request_duration_seconds{method,path}— latency histogramnebius_tokens_total{model,type}— tokens streamed back from Nebiusnebius_request_duration_seconds{model}— Nebius call duration
The path label is the route template (/agent/run), not the raw URL — a deliberate choice, since high-cardinality labels (per-user, per-ID paths) are the fastest way to melt a Prometheus instance.
Later recipes keep these same metrics while adding memory, guardrails, and actions.
The streaming
The route handler returns a StreamingResponse with text/event-stream. Events are named — status, token, done, heartbeat, error — rather than raw text, so a client can switch on the event name instead of parsing prose. The agent itself is an async generator yielding typed Events; the route's only job is to serialise each one into the SSE wire format.
Two production details that are easy to miss:
- Heartbeats. A
heartbeatevent is emitted every 15 s of silence. Idle SSE streams are killed by load balancers and proxies; the heartbeat keeps the connection scored as live. The response also setsx-accel-buffering: noso nginx forwards chunks immediately instead of buffering the whole body. - Disconnect cancellation. A background task polls
request.is_disconnected()and sets acancel_event. The agent checks it between tokens and stops pulling from Nebius the moment the client goes away — otherwise a user closing a tab still costs you a full generation.
The hardening
- The daily per-IP rate limiter protects backend routes before they call Nebius; it uses Redis when
RATE_LIMIT_REDIS_URLis set and process-local memory otherwise (RATE_LIMIT_REQUESTS_PER_DAY, default 25). - CORS is allowlist-based —
CORS_ORIGINSis parsed and validated at boot, never*. - Security headers:
x-content-type-options,referrer-policy,strict-transport-security. - Request bodies above 64 KB are rejected with 413 before the body is read, via a
content-lengthcheck in middleware. - Lifespan handlers log startup and shutdown; on shutdown the HTTP client is closed and in-flight requests are allowed to drain.
Middleware order is load-bearing. ASGI wraps middleware last-added-runs-first, so the stack is arranged so CORS, size checks, security headers, request IDs, metrics, and rate limiting all happen before route work. Re-ordering add_middleware calls silently changes behaviour — there is a comment in main.py spelling out the request path.
Rate limiting
The limiter is enabled by default and allows 25 backend requests per IP per rolling 24-hour window.
Set these values in .env:
RATE_LIMIT_ENABLED=true
RATE_LIMIT_REQUESTS_PER_DAY=25
RATE_LIMIT_REDIS_URL=
RATE_LIMIT_TRUST_PROXY_HEADERS=false
Leave RATE_LIMIT_REDIS_URL empty for local in-memory counters, or point it at Redis for a shared counter across replicas:
RATE_LIMIT_REDIS_URL=redis://localhost:6379/0
When the quota is exceeded, the backend returns JSON 429 before opening an SSE stream:
{
"detail": "daily rate limit exceeded",
"limit": 25,
"window": "24h",
"retryAfterSeconds": 12345
}
See the full deployment notes in docs/rate-limiting.md.
Design decisions
Why a stateless server? No session store means any instance can serve any request — horizontal scaling is just "add a replica," and a crash loses nothing. The cost is bandwidth: the client resends history each turn. For a chat workload that is a few KB; when it stops being cheap (long transcripts, RAG context), the answer is a server-side store keyed by session ID, not in-process state. Recipe #5 (Short-Term Memory) takes that step deliberately.
Why no agent framework? A framework would hide exactly the boundaries this recipe exists to show. core/agent.py is ~50 lines and is explicitly designed to be subclassed — the later recipes add planning, knowledge, and tools on top of this same shape.
Why duplicate infrastructure per cookbook? Every recipe is autonomous: cloning one directory is enough to run it, with no shared base package. That costs some duplication and buys a reader the ability to study one cookbook in isolation. It is a documentation decision, not an architectural recommendation — in a real monorepo, factor the common middleware into a shared library.
Failure modes
| Symptom | Cause | Handling |
|---|---|---|
| 500 on first request | NEBIUS_API_KEY unset or invalid | Settings validate at boot; a bad key surfaces on first Nebius call as a non-retried 4xx |
| Stream stalls, no tokens | Upstream slow generation | 60 s read timeout, then a retried httpx.TimeoutException; client sees an error event |
| Stream ends early | Client disconnected | Expected — disconnect watcher cancels the upstream call |
| 429 responses | Daily per-IP rate limit hit | Back off; raise RATE_LIMIT_REQUESTS_PER_DAY only behind authentication |
| 413 responses | Body over 64 KB | Raise MAX_REQUEST_BYTES, or trim client-side history |
The except Exception in the route logs the full stack trace but emits only {"detail": "internal error"} to the client — internal details never cross the wire.
Test it
make test
Tests use respx to mock every Nebius call. No network is hit. CI passes on a laptop in airplane mode — which also means the test suite is a faithful, runnable spec of the SSE contract.
Ship it
Build the production image:
make docker
The Dockerfile is multi-stage, uses astral-sh/uv for the build, and ends in a slim Python 3.12 image running as a non-root user with a healthcheck wired to /healthz.
Deployment-specific instructions for Nebius Compute live in docs/deployment.md.
Going further
- Cookbook #2 — Domain Knowledge with Pinecone Nexus — give the agent a grounded, compiled knowledge base instead of a frozen training cutoff.
- Tracing. Feed an inbound
traceparentinto the request-ID middleware and emit OpenTelemetry spans around the Nebius call — the structlog context is already the right place to bind a trace ID. - Authentication. Drop a JWT verifier in as a FastAPI
Depends; it composes cleanly with the existing rate limiter and request-ID middleware. - Backpressure. Under heavy fan-out, cap concurrent Nebius calls with an
asyncio.Semaphorein the client so one traffic spike can't open unbounded upstream connections.
License
MIT — see LICENSE.