Designing Systems for Ultra-High Scale

2025

Systems Infrastructure Scale

Scale is not one number. A system can be hard because it gets too many requests, holds too much data, spans too many regions, leans on too many dependencies, or has too many people operating it. The systems that actually hurt are several of these at once, and the fix for one usually makes another worse.

At Uber I worked on Rosetta, the internationalization service, which at peak served over a million requests per second. Nothing in it was clever. What made it work was a pile of boring decisions about what was allowed to fail and how. That part of high-scale work doesn't show up in architecture diagrams, so it's the part I want to write down.

Rare becomes routine

Small systems forgive you for blurring things together: read paths and write paths, control plane and data plane, hot data and cold data, synchronous and asynchronous work. Big systems don't. Small inefficiencies become real money. At a million requests per second, a bug that fires once in ten million requests fires about every ten seconds. "Rare" stops being a defense and becomes a schedule.
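The arithmetic behind that schedule is short enough to write down. This is a sketch only: the function names are mine, and the million requests per second is the peak figure quoted earlier, used here as an assumed steady rate.

```python
def seconds_between_occurrences(requests_per_second: int, one_in_n_requests: int) -> float:
    """Mean gap, in seconds, between hits of an event that affects one request in N."""
    return one_in_n_requests / requests_per_second


def occurrences_per_day(requests_per_second: int, one_in_n_requests: int) -> float:
    """Expected number of hits per day at a steady request rate."""
    seconds_per_day = 86_400
    return requests_per_second * seconds_per_day / one_in_n_requests
```

With 1,000,000 requests per second and a one-in-10,000,000 bug, the first function gives 10.0 seconds and the second gives 8,640 occurrences a day.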

That shifts the design question. It's no longer "how fast is this on the happy path?" It's "what does degradation look like, and is it predictable?" A system that gets gradually worse under load is operable. A system that's great until it falls off a cliff is an incident review that hasn't picked its date yet.
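One common way to get the gradual curve instead of the cliff is to cap concurrent work and reject the excess immediately, so accepted requests keep their latency and rejected ones fail fast. Below is a minimal sketch, assuming a single-process, thread-based Python service. LoadShedder, handle, and the 503 status are illustrative names of mine, not a description of how Rosetta did it.

```python
import threading
from typing import Any, Callable, Tuple


class LoadShedder:
    """Caps concurrent work; callers over the cap are turned away, not queued."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take a slot if one is free. Never blocks waiting for capacity."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give the slot back. Call exactly once per successful try_acquire."""
        with self._lock:
            self._in_flight -= 1


def handle(shedder: LoadShedder, work: Callable[[], Any]) -> Tuple[int, Any]:
    """Run work if there is capacity; otherwise answer 503 without doing it."""
    if not shedder.try_acquire():
        return 503, "overloaded"
    try:
        return 200, work()
    finally:
        shedder.release()
```

The design choice that matters is rejecting rather than queueing. A queue hides overload until latency collapses. A fast rejection makes overload visible and bounded.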

The question I ask every dependency

If I had to compress the discipline into one question, it's this: what happens to the hot path when this thing is down? Not slow — down. Config service, feature flags, experiment allocation, rate limits, routing policy. Each of these is harmless until someone makes it a synchronous call in the serving path, at which point a control-plane hiccup becomes a global outage.

The control plane is allowed to be slow and consistent. The data plane has to keep serving through stale config, partial failure, and dependency slowness. So you decide, explicitly, which decisions are made synchronously, which are cached, which are eventually consistent, and what the safe default is when the answer isn't available. If nobody on the team can say what the service does with a stale read, the real answer is usually "something surprising, at 3am."
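Those decisions can be made concrete in a few lines. The sketch below is one way to keep a control-plane read off the hot path: the serving code reads only a local snapshot, a refresh runs elsewhere, a failed refresh keeps the last good values, and a key that was never fetched falls back to an explicit safe default. ConfigSnapshot, the fetch callable, and the rate_limit key are hypothetical names for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional


class ConfigSnapshot:
    """Local copy of control-plane config that survives the control plane being down."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, Any]],
        safe_defaults: Dict[str, Any],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch                    # remote call; only refresh() uses it
        self._defaults = dict(safe_defaults)   # the answer when nothing was ever fetched
        self._clock = clock
        self._values: Dict[str, Any] = {}
        self._last_success: Optional[float] = None

    def refresh(self) -> bool:
        """Run off the hot path, e.g. from a background timer. Returns True on success."""
        try:
            fresh = self._fetch()
        except Exception:
            return False  # keep serving the last good snapshot
        self._values = dict(fresh)  # swap the whole snapshot in one assignment
        self._last_success = self._clock()
        return True

    def get(self, key: str) -> Any:
        """Hot path: a local dictionary read, never a network call."""
        if key in self._values:
            return self._values[key]
        return self._defaults[key]

    def staleness_seconds(self) -> Optional[float]:
        """Age of the snapshot, or None if no fetch has succeeded. Worth exporting as a metric."""
        if self._last_success is None:
            return None
        return self._clock() - self._last_success
```

With this shape, "what does the service do with a stale read" has a written answer: it serves the last good value, and the staleness is measurable.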

Ownership is part of the architecture

The lesson that took me longest: unclear ownership fails the same way an unclear interface does. At sufficient scale, the org chart is a runtime dependency. When a metric degrades and three teams each plausibly own the fix, the fix happens slowly, and slow fixes at scale compound into risk. Ownership boundaries deserve the same explicitness as service boundaries, and they should usually be the same boundaries.

Where I'd start

Observability and load shedding, before you need them desperately. By the time you need them, nobody is calm enough to build them well. Then test the failures you'll actually get — brownouts, regional degradation, one dependency at ten times its normal latency — not just the clean failover your runbook describes. Real failures are partial, and partial failures are where untested assumptions live.
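A brownout test does not need much machinery. The sketch below wraps a dependency so it answers at a multiple of its normal latency, then checks that the caller gives up at its deadline and serves a fallback. with_latency, call_with_deadline, and the specific timings are assumptions for illustration. A real setup would more likely inject the delay at the network or proxy layer.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_latency(dep: Callable[[], T], base_latency_s: float, multiplier: float = 1.0) -> Callable[[], T]:
    """Wrap a dependency so each call sleeps base_latency_s * multiplier before answering."""
    def slowed() -> T:
        time.sleep(base_latency_s * multiplier)
        return dep()
    return slowed


def call_with_deadline(
    dep: Callable[[], T],
    deadline_s: float,
    fallback: T,
    pool: concurrent.futures.ThreadPoolExecutor,
) -> T:
    """Return the dependency's answer if it arrives in time, else the fallback."""
    future = pool.submit(dep)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        # The slow call keeps running and still occupies a worker thread.
        return fallback
```

With a 0.05 second dependency and a 0.25 second deadline, the healthy call returns the real answer, and the same call with multiplier=10 returns the fallback. The comment in the except branch is the part worth testing next: abandoned calls still hold threads, so a long brownout can exhaust the pool even though every caller timed out politely.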

The skill is not making a system big. Plenty of big systems exist. The skill is making a big system simple enough to reason about on a bad day, because every system eventually has one.