makemode
inference economics · serverless vs dedicated

when does owning the model pay off?

Today we rent inference per token from Scaleway's serverless GLM-5.2: cheap when quiet, and the bill scales with use. Owning a dedicated 8×H100 node flips that: a flat ~€22k/month whether busy or idle, but with prefix caching (our repeated prompt-core stops being recomputed), predictable cost, and more control over the EU compute we run. Both options run on Scaleway in France, so the sovereignty claim holds either way. The two cost curves cross at some monthly volume. Drag the knobs to find it. This is the detailed self-host crossover behind the unit economics and the business model.

per-token vs flat · prefix caching · glm-5.2 · 8×h100 · eu-sovereign compute

the demand

How many people build, and how hard.
A "turn" = one build or edit request.

the workload

How big each request is — the tokens we pay for.
The resent document + prompt-core. Shrinking this lowers the serverless bill and lets one dedicated node hold more users.

the two prices

Serverless is per-token; the node is flat per hour.
dedicated node
serverless €/M in
serverless €/M out

capacity model · advanced

How many users one node holds. Rough — the #213 benchmark replaces these.
KV-cache limited; auto-scales inverse to context size.
Share of active time a user is mid-request.
verdict at these settings
serverless / month
↳ €/turn × turns
dedicated / month
↳ nodes × €/node
monthly difference
break-even (1 node)
↳ in turns / month
capacity / node (active users)
peak concurrent demand
nodes needed
chart legend: serverless (per-token) · dedicated (flat; steps as nodes added) · your mau
assumptions & sources
why it's shaped this way
Serverless price · firm
Scaleway Generative APIs GLM-5.2: €1.80 / M input, €5.50 / M output (Jul 2026). Serverless has no prefix caching, so the repeated prompt-core is billed in full every turn. Sources: Scaleway pricing, caching request
The node GLM-5.2 needs · est
GLM (4.5/4.6-class) is a ~355B MoE (~32B active) → ~8×H100 in FP8. That's Scaleway H100-SXM-8-80G @ €30.06/hr ≈ €21,944/mo, always-on. An INT4 quant might fit a 4×H100 node (~€11k/mo) with a quality risk. Sources: GLM MoE, GPU pricing
Prefix caching is the prize · firm
Dedicated deployments cache the shared prompt prefix per user: after the first hit the prompt-core isn't recomputed, cutting time-to-first-token and raising throughput per GPU. Not available on serverless. Source: Managed Inference
Capacity is KV-cache-bound · est
~640 GB VRAM − ~355 GB weights ≈ ~285 GB for KV cache. Long contexts (our ~50k-token prefixes) are heavy, so in-flight requests scale inversely with context size; the real number comes from a benchmark.
Duty cycle & peak · est
A builder is only mid-request a fraction of their active time (default 15%), and only a slice of monthly users are on at peak (default 5%). Both are tunable above.
Sovereignty · firm
Serverless and dedicated both run on Scaleway in France, so the EU claim holds either way. Owning the box adds control, not sovereignty. Buying our own hardware adds neither over Scaleway dedicated, only ops burden.
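The arithmetic behind the node estimate and the KV-cache headroom can be checked with a short sketch. Every figure comes from the cards above; the variable names are illustrative, the 730 hours/month convention is the one the cost formulas on this page use, and the weights figure assumes roughly one byte per parameter in FP8.

```python
# Figures quoted in the assumption cards; names are illustrative only.
GPUS_PER_NODE = 8
VRAM_PER_GPU_GB = 80        # H100-SXM-8-80G: 8 GPUs x 80 GB each
WEIGHTS_GB = 355            # estimate: ~355B parameters at ~1 byte each (FP8)
NODE_EUR_PER_HOUR = 30.06   # Scaleway hourly rate quoted above
HOURS_PER_MONTH = 730       # convention used by the cost formulas

total_vram_gb = GPUS_PER_NODE * VRAM_PER_GPU_GB
kv_headroom_gb = total_vram_gb - WEIGHTS_GB
node_eur_per_month = NODE_EUR_PER_HOUR * HOURS_PER_MONTH

print(f"total VRAM: {total_vram_gb} GB")
print(f"KV-cache headroom: ~{kv_headroom_gb} GB")
print(f"node cost: ~EUR {node_eur_per_month:,.0f}/month")
```

The headroom is an upper bound: activations and runtime overhead also need VRAM, which is one more reason the real capacity number has to come from a benchmark.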
how the math works
Serverless / month = mau × turns/user × (in-tok × €/M-in + out-tok × €/M-out) ÷ 1,000,000
Dedicated / month = node €/hr × 730 hrs × nodes-needed
Capacity / node = in-flight-reqs × (50k ÷ context) ÷ duty-cycle
Nodes needed = ceil( (mau × peak%) ÷ capacity-per-node ), min 1
Break-even = the mau where the serverless line crosses one node's flat cost
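The formulas above can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's actual code: the function names are made up for illustration, the default prices, duty cycle, peak share, 50k reference context and 730 hours come from this page, and the example workload (40 turns per user, 2,000 output tokens per turn, 8 in-flight requests per node) is purely hypothetical.

```python
import math

def serverless_monthly(mau, turns_per_user, in_tok, out_tok,
                       eur_per_m_in=1.80, eur_per_m_out=5.50):
    """Monthly serverless bill in EUR; prices are per million tokens."""
    eur_per_turn = (in_tok * eur_per_m_in + out_tok * eur_per_m_out) / 1_000_000
    return mau * turns_per_user * eur_per_turn

def capacity_per_node(in_flight_reqs, context_tok,
                      duty_cycle=0.15, ref_context=50_000):
    """Active users one node holds; in-flight requests scale inversely with context."""
    scaled_in_flight = in_flight_reqs * ref_context / context_tok
    return scaled_in_flight / duty_cycle

def nodes_needed(mau, capacity, peak_share=0.05):
    """Nodes required for peak concurrent demand, never fewer than one."""
    return max(1, math.ceil(mau * peak_share / capacity))

def dedicated_monthly(nodes, eur_per_hour=30.06, hours=730):
    """Flat monthly cost in EUR for always-on nodes."""
    return eur_per_hour * hours * nodes

def break_even_mau(turns_per_user, in_tok, out_tok, **prices):
    """MAU at which the serverless bill equals one node's flat cost."""
    per_user = serverless_monthly(1, turns_per_user, in_tok, out_tok, **prices)
    return dedicated_monthly(1) / per_user

# Hypothetical workload, NOT taken from the page: 2,000 MAU, 40 turns per user,
# 2,000 output tokens per turn, 8 in-flight requests per node at 50k context.
for context in (50_000, 25_000):
    bill = serverless_monthly(2_000, 40, context, 2_000)
    cap = capacity_per_node(8, context)
    nodes = nodes_needed(2_000, cap)
    print(f"context {context:>6}: serverless EUR {bill:,.0f}/mo, "
          f"capacity {cap:.0f} users/node, {nodes} node(s), "
          f"dedicated EUR {dedicated_monthly(nodes):,.0f}/mo")
print(f"break-even at 50k context: ~{break_even_mau(40, 50_000, 2_000):,.0f} MAU")
```

Under these formulas, halving the context lowers the serverless bill and doubles the capacity of a node, which is why context size moves both curves at once. Note that the break-even ignores capacity: at that MAU, one node may or may not be enough.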

Estimates, not quotes — the softest knobs are the capacity model and whether GLM-5.2 is deployable as a dedicated Managed Inference model at all (verify in the Scaleway console; else import custom weights). The clean move today: shrink context first (it lowers the serverless bill and raises node capacity), instrument real usage, then benchmark a dedicated node for an afternoon — it's hourly-billed, so a real measurement costs ~€120, not a month. Indicative, July 2026.
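The ~€120 benchmark figure follows from the hourly rate. A small sketch, assuming an "afternoon" means about four hours of node time (that duration is an assumption, not stated on the page) and using an illustrative helper name:

```python
NODE_EUR_PER_HOUR = 30.06   # Scaleway H100-SXM-8-80G rate quoted above
HOURS_PER_MONTH = 730       # convention used by the cost formulas

def benchmark_cost(hours, eur_per_hour=NODE_EUR_PER_HOUR):
    """Cost in EUR of keeping one hourly-billed node up for `hours`."""
    return hours * eur_per_hour

afternoon = benchmark_cost(4)                # assumption: roughly four hours
full_month = benchmark_cost(HOURS_PER_MONTH)

print(f"afternoon benchmark: ~EUR {afternoon:.0f}")
print(f"full month: ~EUR {full_month:,.0f}")
```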