Integrating Machine Learning in Large-Scale Products

2025

Machine Learning Systems Product

The model is rarely the product. I've shipped ML in a few different settings now — video understanding for ads at YouTube, content understanding and moderation AI at Roblox — and the same pattern keeps showing up: the model is one component in a much larger loop, and most of the value gets captured or lost in the parts around it.

The loop includes data contracts, feature generation, online serving, fallbacks, monitoring, human review, experimentation, and retraining. When something breaks in production, it's almost never inside the model architecture. It's at one of those boundaries — a feature went stale, a contract drifted, an upstream team renamed an event.

The boundary is the design problem

The mental model I use: ML integration is designing the boundary between statistical behavior and deterministic systems. Everything downstream of the model is deterministic and wants guarantees. The model can't give any. So the product has to know, explicitly, what it does when the model is wrong, stale, slow, unavailable, or staring at an input unlike anything it trained on.

That reframes the central question. Not "how good is the model?" but "how does the product behave when the model fails?" The first question has a leaderboard answer. The second one is the actual work: fallback behavior, confidence thresholds, segment-level monitoring, and clear ownership across the product, infra, data, and ML teams — because a failure at a boundary is, almost by definition, a failure between teams.
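One way to make that explicit is to put every model call behind a wrapper that names each failure path. What follows is a minimal sketch, not a description of any system I've worked on: `decide`, `Decision`, the thresholds, and the 50 ms budget are all illustrative assumptions, and "send to human review" stands in for whatever the product's safe default actually is.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    action: str                    # "allow", "block", or "review"
    source: str                    # which path produced the decision
    score: Optional[float] = None  # model score, when one was available


# Shared worker pool so a slow model call can be abandoned by the caller.
_pool = ThreadPoolExecutor(max_workers=4)


def decide(
    item: str,
    model: Callable[[str], float],
    *,
    timeout_s: float = 0.05,   # hypothetical latency budget
    block_above: float = 0.9,  # hypothetical thresholds
    allow_below: float = 0.1,
) -> Decision:
    """Return a product decision even when the model is slow, down, or unsure."""
    future = _pool.submit(model, item)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        return Decision("review", "fallback:timeout")
    except Exception:
        # Any model-side failure degrades to the safe default.
        return Decision("review", "fallback:error")

    # NaN or out-of-range scores fail this check and are treated as failures.
    if not (0.0 <= score <= 1.0):
        return Decision("review", "fallback:invalid_score")
    if score >= block_above:
        return Decision("block", "model", score)
    if score <= allow_below:
        return Decision("allow", "model", score)
    # The model answered, but not confidently enough to act on.
    return Decision("review", "fallback:low_confidence", score)
```

The `source` field is the part worth keeping whatever else changes: logging which path produced each decision lets you see how often the product is running on fallbacks rather than on the model.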

Offline numbers are an audition, not a verdict

Impressive offline performance collapses in production often enough that I've stopped being surprised by it. Real users produce inputs no eval set anticipated. Traffic mix shifts. Latency budgets bite, and the big accurate model loses to the small fast one. In moderation work this is especially stark: the inputs you most need to catch are the adversarial ones, designed specifically not to look like your training data.

So the rollout discipline matters more than the training run. Shadow mode first, then staged experiments with a rollback path you've actually exercised. Throughout, read two views together: product outcomes and model health. Either one alone will mislead you: model metrics can look fine while the product quietly degrades, and product metrics can move for reasons that have nothing to do with the model.
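Shadow mode is simple enough to sketch. The version below is a toy under stated assumptions: `serve_with_shadow` and `ShadowStats` are hypothetical names, both models are plain callables, and the candidate runs inline for readability. In a real service the shadow call would normally run off the request path so it cannot add latency.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ShadowStats:
    compared: int = 0          # requests where both models produced an answer
    disagreements: int = 0     # of those, how many answers differed
    candidate_errors: int = 0  # candidate failures, tracked separately

    @property
    def disagreement_rate(self) -> float:
        return self.disagreements / self.compared if self.compared else 0.0


def serve_with_shadow(
    item: Any,
    live: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    stats: ShadowStats,
) -> Any:
    """Serve the live model's answer; score the candidate without exposing it."""
    result = live(item)  # the only output the user ever sees
    try:
        shadow = candidate(item)
    except Exception:
        # A broken candidate must never affect the response.
        stats.candidate_errors += 1
        return result
    stats.compared += 1
    if shadow != result:
        stats.disagreements += 1
    return result
```

The disagreement rate is only a starting point. The interesting work is slicing those disagreements by segment and reading them next to product outcomes before any traffic moves to the candidate.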

Treat the pipeline as the product

Data pipelines are first-class infrastructure, not plumbing under the model team. Stable feature contracts, freshness monitoring, drift dashboards tied to product outcomes — none of it is glamorous, and all of it is where the reliability comes from. A mediocre model on a trustworthy pipeline beats a great model on a flaky one, because you can improve the first system and you can't even diagnose the second.
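A feature contract with a freshness check can start very small. This is a sketch, assuming features arrive as a dict alongside per-feature update timestamps; `FeatureSpec`, `check_features`, and the feature names are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    dtype: type
    max_age: timedelta  # how stale the value may be before it counts as a violation


def check_features(
    row: Dict[str, Any],
    updated_at: Dict[str, datetime],
    contract: List[FeatureSpec],
    now: datetime,
) -> List[str]:
    """Return human-readable contract violations; an empty list means the row is clean."""
    violations: List[str] = []
    for spec in contract:
        if spec.name not in row:
            violations.append(f"{spec.name}: missing")
            continue
        value = row[spec.name]
        if not isinstance(value, spec.dtype):
            violations.append(
                f"{spec.name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
        ts = updated_at.get(spec.name)
        # A feature with no recorded update time is treated as stale.
        if ts is None or now - ts > spec.max_age:
            violations.append(f"{spec.name}: stale")
    return violations


# Usage with hypothetical features: one has the wrong type, one is stale.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
contract = [
    FeatureSpec("watch_time_7d", float, timedelta(hours=1)),
    FeatureSpec("account_age_days", int, timedelta(days=1)),
]
row = {"watch_time_7d": "12.5", "account_age_days": 40}
updated_at = {
    "watch_time_7d": now - timedelta(minutes=5),
    "account_age_days": now - timedelta(days=3),
}
print(check_features(row, updated_at, contract, now))
```

Run at the serving boundary, a check like this turns "a feature went stale" from a postmortem finding into an alert with a feature name attached.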

Good production ML is systems engineering with probabilistic components. If that sounds deflating, it shouldn't. The integration is where the engineering judgment lives, and it's the part that compounds.