inference economics · serverless vs dedicated

when does owning the model pay off?

Today we rent inference per token from Scaleway's serverless GLM-5.2: cheap when quiet, and the bill scales with use. Owning a dedicated 8×H100 node flips that: a flat ~€22k/month whether busy or idle, but with prefix caching (our repeated prompt-core stops being recomputed), predictable cost, and more direct control over EU compute we run ourselves. The two cost curves cross at a break-even volume. Drag the knobs to find it. This is the detailed self-host crossover behind the unit economics and the business model.

per-token vs flat · prefix caching · glm-5.2 · 8×h100 · eu-sovereign compute

the demand

How many people build, and how hard.

monthly active users 300

50 · 2.5k · 5k

build-turns / user / month 300

A "turn" = one build or edit request.

the workload

How big each request is — the tokens we pay for.

context / turn (input tok) 50k

The resent document + prompt-core. Shrinking this lowers the serverless bill and lets one dedicated node hold more users.

output tokens / turn 5.0k

the two prices

Serverless is per-token; the node is flat per hour.

dedicated node

serverless €/M in

serverless €/M out

capacity model · advanced

How many users one node holds. These defaults are rough; the #213 benchmark will replace them.

in-flight reqs / node @50k ctx 35

KV-cache limited; scales automatically, inversely with context size.

user duty cycle 15%

Share of active time a user is mid-request.

peak concurrency (% of mau) 5%

verdict at these settings

—

serverless / month—

↳ €/turn × turns—

dedicated / month—

↳ nodes × €/node—

monthly difference—

break-even (1 node)—

↳ in turns / month—

capacity / node (active users)—

peak concurrent demand—

nodes needed—

chart legend: serverless (per-token) · dedicated (flat; steps as nodes added) · your mau

assumptions & sources

why it's shaped this way

Serverless price (firm)	Scaleway Generative APIs GLM-5.2: €1.80 / M input, €5.50 / M output (Jul 2026). Serverless has no prefix caching, so the repeated prompt-core is billed in full every turn. Sources: Scaleway pricing, caching request.
The node GLM-5.2 needs (est)	GLM (4.5/4.6-class) is a ~355B MoE (~32B active) → ~8×H100 in FP8. That's Scaleway `H100-SXM-8-80G` @ €30.06/hr ≈ €21,944/mo, always-on. An INT4 quant might fit a 4×H100 node (~€11k/mo), at some risk to quality. Sources: GLM MoE, GPU pricing.
Prefix caching is the prize (firm)	Dedicated deployments cache the shared prompt prefix per user: after the first hit the prompt-core isn't recomputed, cutting time-to-first-token and raising throughput per GPU. Not available on serverless. Source: Managed Inference.
Capacity is KV-cache-bound (est)	~640 GB VRAM − ~355 GB weights ≈ 285 GB for KV cache. Long contexts (our ~50k-token prefixes) are heavy, so in-flight requests scale inversely with context size; the real number comes from a benchmark.
Duty cycle & peak (est)	A builder is only mid-request a fraction of their active time (default 15%), and only a slice of monthly users are on at peak (default 5%). Both are tunable above.
Sovereignty (firm)	Serverless and dedicated both run on Scaleway in France, so the EU claim holds either way. Owning the box adds control, not sovereignty. Buying our own hardware adds neither over Scaleway dedicated, only ops burden.
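
The node-sizing arithmetic in the rows above can be checked with a few lines of Python. This is a rough sketch, not a sizing tool: it assumes FP8 stores one byte per parameter and INT4 half a byte, and it ignores activation memory and runtime overhead, so the real KV-cache headroom will be smaller. The function and constant names are illustrative.

```python
# Rough node-sizing arithmetic using the figures quoted in the table.
# Assumption: weight memory = parameter count x bytes per parameter,
# with no allowance for activations or runtime overhead.

PARAMS_BILLIONS = 355        # approximate total parameters (MoE)
VRAM_PER_GPU_GB = 80         # H100 80 GB
NODE_EUR_PER_HR = 30.06      # H100-SXM-8-80G hourly price from the table
HOURS_PER_MONTH = 730


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1e9 params x bytes = GB)."""
    return params_billions * bytes_per_param


def kv_headroom_gb(gpus: int, weight_gb: float) -> float:
    """VRAM left for the KV cache after loading the weights."""
    return gpus * VRAM_PER_GPU_GB - weight_gb


def monthly_cost(eur_per_hr: float) -> float:
    """Always-on monthly cost of a node billed by the hour."""
    return eur_per_hr * HOURS_PER_MONTH


fp8_headroom = kv_headroom_gb(8, weights_gb(PARAMS_BILLIONS, 1.0))
int4_headroom = kv_headroom_gb(4, weights_gb(PARAMS_BILLIONS, 0.5))

print(f"8xH100 FP8:  {fp8_headroom:.1f} GB for KV cache")
print(f"4xH100 INT4: {int4_headroom:.1f} GB for KV cache")
print(f"8xH100 node: {monthly_cost(NODE_EUR_PER_HR):,.0f} EUR/month")
```

Under these assumptions the 4×H100 INT4 option leaves about half the KV-cache headroom of the 8×H100 FP8 node (142.5 GB against 285 GB), so it would hold fewer in-flight requests on top of the quality risk.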

how the math works

Serverless / month	mau × turns/user × (in-tok × €/M-in + out-tok × €/M-out) ÷ 1M
Dedicated / month	node €/hr × 730 hrs × nodes-needed
Capacity / node	in-flight-reqs (scaled by 50k ÷ context) ÷ duty-cycle
Nodes needed	ceil( (mau × peak%) ÷ capacity-per-node ), min 1
Break-even	the mau where the serverless line crosses one node's flat cost: (node €/hr × 730 hrs) ÷ (turns/user × €/turn)
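
The formulas in this table can be written out as a minimal Python sketch. It is not the page's own implementation: the `Settings` class and function names are hypothetical, the defaults are the slider defaults and prices quoted on this page, and token counts are divided by one million because prices are per million tokens.

```python
import math
from dataclasses import dataclass


@dataclass
class Settings:
    mau: int = 300                  # monthly active users
    turns_per_user: int = 300       # build-turns / user / month
    in_tok: int = 50_000            # context / turn (input tokens)
    out_tok: int = 5_000            # output tokens / turn
    eur_per_m_in: float = 1.80      # serverless EUR per million input tokens
    eur_per_m_out: float = 5.50     # serverless EUR per million output tokens
    node_eur_per_hr: float = 30.06  # dedicated node, hourly
    hours_per_month: int = 730
    inflight_at_50k: float = 35     # in-flight requests / node at 50k context
    duty_cycle: float = 0.15        # share of active time spent mid-request
    peak_share: float = 0.05        # peak concurrency as a share of mau


def eur_per_turn(s: Settings) -> float:
    return (s.in_tok * s.eur_per_m_in + s.out_tok * s.eur_per_m_out) / 1e6


def serverless_month(s: Settings) -> float:
    return s.mau * s.turns_per_user * eur_per_turn(s)


def capacity_per_node(s: Settings) -> float:
    # In-flight requests scale by 50k / context, then divide by duty cycle.
    return s.inflight_at_50k * (50_000 / s.in_tok) / s.duty_cycle


def nodes_needed(s: Settings) -> int:
    return max(1, math.ceil(s.mau * s.peak_share / capacity_per_node(s)))


def dedicated_month(s: Settings) -> float:
    return s.node_eur_per_hr * s.hours_per_month * nodes_needed(s)


def break_even_mau(s: Settings) -> float:
    # mau at which the serverless bill equals the flat cost of one node.
    one_node = s.node_eur_per_hr * s.hours_per_month
    return one_node / (s.turns_per_user * eur_per_turn(s))


s = Settings()
print(f"serverless / month: {serverless_month(s):,.0f} EUR")
print(f"dedicated / month:  {dedicated_month(s):,.0f} EUR ({nodes_needed(s)} node)")
print(f"break-even:         {break_even_mau(s):.0f} mau")
```

At the default settings this sketch gives about €10,575/month serverless against about €21,944/month for one node, with break-even near 623 mau. The break-even figure only holds while one node covers peak demand; past that point the dedicated line steps up with each added node.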

Estimates, not quotes. The softest knobs are the capacity model and whether GLM-5.2 is deployable as a dedicated Managed Inference model at all (verify in the Scaleway console; otherwise import custom weights). The clean move today: shrink context first (it lowers the serverless bill and raises node capacity), instrument real usage, then benchmark a dedicated node for an afternoon. The node is billed hourly, so a real measurement costs ~€120 (about four hours at €30.06/hr), not a month's bill. Indicative, July 2026.
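
To see why shrinking context comes first, the sketch below sweeps the context size and reports the per-turn serverless cost alongside the modelled node capacity. It reuses the page's formulas with the default prices and capacity knobs. The context values in the sweep and the four-hour benchmark length are illustrative assumptions, and the function names are hypothetical.

```python
def eur_per_turn(in_tok, out_tok=5_000, eur_m_in=1.80, eur_m_out=5.50):
    """Serverless cost of one turn; prices are per million tokens."""
    return (in_tok * eur_m_in + out_tok * eur_m_out) / 1_000_000


def capacity_per_node(in_tok, inflight_at_50k=35, duty_cycle=0.15):
    """Active users one node holds; in-flight slots scale with 50k / context."""
    return inflight_at_50k * (50_000 / in_tok) / duty_cycle


def benchmark_cost(hours, eur_per_hr=30.06):
    """Cost of renting the hourly-billed node for a short benchmark."""
    return hours * eur_per_hr


# Illustrative sweep: the default 50k context, then two smaller contexts.
for ctx in (50_000, 25_000, 10_000):
    print(f"{ctx:>6} tok  {eur_per_turn(ctx):.4f} EUR/turn  "
          f"{capacity_per_node(ctx):.0f} users/node")

print(f"4-hour benchmark: {benchmark_cost(4):.2f} EUR")
```

Under this model, halving context from 50k to 25k tokens drops the per-turn cost from about €0.1175 to about €0.0725 and doubles the modelled capacity per node, so one change helps both sides of the comparison. The capacity side is still the estimated part, which is what the benchmark is meant to replace.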