when does owning the model pay off?
Today we rent inference per token from Scaleway's serverless GLM-5.2: cheap when quiet, with a bill that scales with use. Owning a dedicated 8×H100 node flips that: a flat ~€22k/month whether busy or idle, in exchange for prefix caching (our repeated prompt-core stops being recomputed), predictable cost, and direct control over EU compute we run ourselves. The two cost curves cross at some number of monthly active users. Drag the knobs to find it. This is the detailed self-host crossover behind the unit economics and the business model.
the demand
the workload
the two prices
capacity model · advanced
why it's shaped this way
| Serverless price (firm) | Scaleway Generative APIs GLM-5.2: €1.80 / M input, €5.50 / M output (Jul 2026). Serverless has no prefix caching, so the repeated prompt-core is billed in full every turn. Sources: Scaleway pricing, caching request |
| The node GLM-5.2 needs (est.) | GLM (4.5/4.6-class) is a ~355B MoE (~32B active), which needs ~8×H100 in FP8. That's Scaleway H100-SXM-8-80G @ €30.06/hr ≈ €21,944/mo, always-on. An INT4 quant might fit a 4×H100 node (~€11k/mo) with a quality risk. Sources: GLM MoE, GPU pricing |
| Prefix caching is the prize (firm) | Dedicated deployments cache the shared prompt prefix per user. After the first hit the prompt-core isn't recomputed, cutting time-to-first-token and raising throughput per GPU. Not available on serverless. Source: Managed Inference |
| Capacity is KV-cache-bound (est.) | ~640 GB VRAM − ~355 GB weights ≈ ~285 GB for KV cache. Long contexts (our ~50k-token prefixes) are heavy, so in-flight requests scale inversely with context size. The real number comes from a benchmark. |
| Duty cycle & peak (est.) | A builder is only mid-request a fraction of their active time (default 15%), and only a slice of monthly users are on at peak (default 5%). Both are tunable above. |
| Sovereignty (firm) | Serverless and dedicated both run on Scaleway in France, so the EU claim holds either way. Owning the box adds control, not sovereignty. Buying our own hardware adds neither over Scaleway dedicated, only ops burden. |
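The node arithmetic in the rows above can be sketched in a few lines of Python. The prices and memory figures come from the table; `kv_bytes_per_token` is a hypothetical parameter whose true value depends on the model architecture and serving stack, and the 100,000 bytes used in the usage note is a placeholder, not a measurement. The sketch also ignores activation memory and runtime overhead, so it overstates what a real node would carry.

```python
import math

HOURS_PER_MONTH = 730          # same month length the page uses for the node cost
NODE_EUR_PER_HOUR = 30.06      # Scaleway H100-SXM-8-80G, from the table
GPUS = 8
VRAM_GB_PER_GPU = 80           # H100 80 GB
WEIGHTS_GB = 355               # ~355B parameters at ~1 byte each in FP8


def node_monthly_eur(eur_per_hour: float = NODE_EUR_PER_HOUR) -> float:
    """Flat monthly cost of one always-on node."""
    return eur_per_hour * HOURS_PER_MONTH


def kv_budget_gb() -> float:
    """VRAM left for KV cache once the weights are loaded."""
    return GPUS * VRAM_GB_PER_GPU - WEIGHTS_GB


def in_flight_requests(context_tokens: int, kv_bytes_per_token: float) -> int:
    """Requests that fit in the KV budget at a given context length.

    kv_bytes_per_token is an ASSUMPTION supplied by the caller;
    replace it with a value measured on the real deployment.
    """
    per_request_gb = context_tokens * kv_bytes_per_token / 1e9
    return math.floor(kv_budget_gb() / per_request_gb)
```

Calling `in_flight_requests(50_000, 100_000)` and then `in_flight_requests(25_000, 100_000)` shows the inverse scaling the table describes: halving the context doubles the requests that fit, whatever the true per-token figure turns out to be.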
how the math works
| Serverless / month | mau × turns/user × (in-tok × €/M-in + out-tok × €/M-out) ÷ 1,000,000, with in-tok and out-tok counted in tokens per turn |
| Dedicated / month | node €/hr × 730 hrs × nodes-needed |
| Capacity / node | in-flight-reqs × (50k ÷ context) ÷ duty-cycle, i.e. the active users one node can carry at once |
| Nodes needed | ceil( (mau × peak%) ÷ capacity-per-node ), min 1 |
| Break-even | the mau at which the serverless monthly cost equals one node's flat cost |
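As a rough illustration, the five formulas above translate into Python as follows. Prices and the 15% / 5% defaults are taken from this page; the function and parameter names are invented for the sketch, and any workload numbers you pass in (turns per user, tokens per turn, in-flight requests at 50k context) are your own assumptions until measured.

```python
import math

HOURS_PER_MONTH = 730


def serverless_monthly_eur(mau, turns_per_user, in_tok, out_tok,
                           eur_per_m_in=1.80, eur_per_m_out=5.50):
    """Per-token bill: every turn pays for its full input and output."""
    per_turn = (in_tok * eur_per_m_in + out_tok * eur_per_m_out) / 1_000_000
    return mau * turns_per_user * per_turn


def capacity_per_node(in_flight_at_50k, context_tokens, duty_cycle=0.15):
    """Active users one node can carry; in_flight_at_50k must be benchmarked."""
    scaled = in_flight_at_50k * 50_000 / context_tokens
    return scaled / duty_cycle


def nodes_needed(mau, capacity, peak_share=0.05):
    """Nodes required to serve the peak slice of monthly users, at least one."""
    return max(1, math.ceil(mau * peak_share / capacity))


def dedicated_monthly_eur(nodes, eur_per_hour=30.06):
    """Flat cost: nodes are billed whether busy or idle."""
    return eur_per_hour * HOURS_PER_MONTH * nodes


def break_even_mau(turns_per_user, in_tok, out_tok, eur_per_hour=30.06):
    """Smallest mau whose serverless bill reaches one node's flat cost."""
    per_user = serverless_monthly_eur(1, turns_per_user, in_tok, out_tok)
    return math.ceil(dedicated_monthly_eur(1, eur_per_hour) / per_user)
```

For example, with a purely illustrative workload of 100 turns per user per month, 50,000 input tokens and 1,000 output tokens per turn, `break_even_mau(100, 50_000, 1_000)` gives the user count at which the serverless line meets one node. Whether a single node can actually carry that many users is a separate question, answered by `nodes_needed`.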
Estimates, not quotes. The softest knobs are the capacity model and whether GLM-5.2 is deployable as a dedicated Managed Inference model at all (verify in the Scaleway console; otherwise import custom weights). The clean move today: shrink context first (it lowers the serverless bill and raises node capacity), instrument real usage, then benchmark a dedicated node for an afternoon. The node is billed hourly, so a real measurement costs ~€120 (about four hours at €30.06/hr), not a month's rent. Indicative, July 2026.
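A small sketch of why shrinking context comes first, and of what the afternoon benchmark costs. It uses only the prices quoted on this page; the helper names are made up for illustration, and treating "an afternoon" as four hours is an assumption.

```python
EUR_PER_M_IN = 1.80         # serverless input price quoted on this page
NODE_EUR_PER_HOUR = 30.06   # dedicated node price quoted on this page


def input_cost_per_turn_eur(context_tokens: int) -> float:
    """Serverless input cost of one turn: linear in context size."""
    return context_tokens * EUR_PER_M_IN / 1_000_000


def capacity_multiplier(context_tokens: int, baseline_tokens: int = 50_000) -> float:
    """Relative node capacity versus the 50k-token baseline (inverse scaling)."""
    return baseline_tokens / context_tokens


def benchmark_cost_eur(hours: float) -> float:
    """Cost of renting one node for a short, hourly-billed benchmark."""
    return hours * NODE_EUR_PER_HOUR
```

Halving the context from 50,000 to 25,000 tokens halves `input_cost_per_turn_eur` and doubles `capacity_multiplier`, so the same change helps on both sides of the comparison. `benchmark_cost_eur(4)` lands near the ~€120 quoted above.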