Rust Microservices for Deterministic Playout
Why we rewrote our core simulation services in Rust and how it changed our reliability story in regions with unstable networks.
Before Rust, our simulation services were a patchwork of languages and frameworks. Most of the time they behaved, but under stress — think launch day traffic from Seoul, São Paulo, and San Francisco all at once — subtle timing issues turned into hard-to-reproduce bugs.
We didn’t just want more performance; we wanted deterministic playout: given the same sequence of inputs, every region should simulate the same outcome, bit for bit. That’s what led us to rewrite our core in Rust.
Why determinism matters for geo-distribution
In a single data center, a minor race condition might show up as a blink-and-you-miss-it glitch. In a geo-distributed world, it can become a consistency split — players in Europe see a different outcome than players in North America for the same encounter.
Deterministic playout lets us confidently replicate state across regions, fail over between them, and even rewind specific shards for debugging. That’s only possible if our simulation stack is free of hidden sources of nondeterminism like data races and implicitly seeded randomness.
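One way to rule out implicitly seeded randomness is to make the seed an explicit input to the simulation. The sketch below is illustrative only: `SimRng` and `roll_damage` are hypothetical names, and SplitMix64 is swapped in as a small, well-known generator, not necessarily the one used in production. It uses integer arithmetic only, so results do not depend on floating-point behavior.

```rust
/// SplitMix64: a tiny PRNG whose entire state is a single u64.
/// Hypothetical stand-in for whatever generator the simulation really uses.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SimRng {
    state: u64,
}

impl SimRng {
    /// The seed is always supplied by the caller, never taken from the OS or clock.
    pub fn from_seed(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advance the state and return the next 64-bit value.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in 0..bound. Plain modulo has a slight bias, which is
    /// acceptable for a sketch and is still fully deterministic.
    pub fn next_below(&mut self, bound: u64) -> u64 {
        assert!(bound > 0, "bound must be positive");
        self.next_u64() % bound
    }
}

/// Hypothetical encounter helper: roll `rolls` twenty-sided dice from a seed.
pub fn roll_damage(seed: u64, rolls: usize) -> Vec<u64> {
    let mut rng = SimRng::from_seed(seed);
    (0..rolls).map(|_| 1 + rng.next_below(20)).collect()
}
```

Because the seed travels with the encounter's inputs, two regions calling `roll_damage` with the same seed and roll count produce the same sequence.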
Rust as our simulation lingua franca
Rust gives us low-level control comparable to C++ while enforcing strong guarantees at compile time. We express core simulation types — entities, components, events — in a shared crate that’s reused across services. The borrow checker forces us to model ownership and lifetimes explicitly.
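As a rough illustration of what such a shared crate might contain, here is a minimal sketch. Every name in it (`EntityId`, `Health`, `Event`, `World`) is hypothetical and not taken from our codebase. It assumes state is changed only by applying events, and it uses `BTreeMap` because its iteration order is defined by the keys, whereas the standard `HashMap` is randomly seeded per instance.

```rust
use std::collections::BTreeMap;

/// Hypothetical entity handle shared by all services.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct EntityId(pub u32);

/// Hypothetical component: integer health, no floats.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Health(pub i32);

/// Hypothetical events; the only way state is allowed to change.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Spawn { id: EntityId, health: Health },
    Damage { target: EntityId, amount: i32 },
    Despawn { id: EntityId },
}

#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct World {
    // BTreeMap iterates in key order, so traversal is reproducible.
    health: BTreeMap<EntityId, Health>,
}

impl World {
    /// `&mut self` is an exclusive borrow: the compiler rejects any
    /// attempt to apply events to the same World from two places at once.
    pub fn apply(&mut self, event: &Event) {
        match event {
            Event::Spawn { id, health } => {
                self.health.insert(*id, *health);
            }
            Event::Damage { target, amount } => {
                // Damage to an unknown entity is ignored in this sketch.
                if let Some(h) = self.health.get_mut(target) {
                    h.0 = h.0.saturating_sub(*amount);
                }
            }
            Event::Despawn { id } => {
                self.health.remove(id);
            }
        }
    }

    pub fn health_of(&self, id: EntityId) -> Option<Health> {
        self.health.get(&id).copied()
    }

    /// Entities in ascending id order.
    pub fn entities(&self) -> Vec<EntityId> {
        self.health.keys().copied().collect()
    }
}
```

Spawning entities 2 and 1 in that order still yields `entities()` in the order 1, 2, which is the property that matters when two regions compare state.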
For network I/O, we lean on async Rust and structured concurrency. Critical paths that must remain deterministic avoid shared mutable state entirely, relying instead on message passing and explicit queues.
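The message-passing idea can be sketched with the standard library alone. This example uses OS threads and `std::sync::mpsc` instead of an async runtime to stay dependency-free, which is an assumption made for illustration; `Input`, `collect_inputs`, and `simulate` are hypothetical names. Producers never touch simulation state. They only send messages, and the single consumer orders the queue by an explicit key before applying it.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical input message. The derived ordering compares
/// tick first, then player, then delta.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Input {
    pub tick: u64,
    pub player: u32,
    pub delta: i64,
}

/// One producer thread per player sends inputs over a channel.
/// Arrival order depends on thread scheduling, so the consumer
/// sorts the queue by (tick, player, delta) before using it.
pub fn collect_inputs(per_player: Vec<Vec<Input>>) -> Vec<Input> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for inputs in per_player {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for input in inputs {
                tx.send(input).expect("receiver should be alive");
            }
        }));
    }
    // Drop the original sender so the receive loop ends once
    // every producer has finished.
    drop(tx);
    let mut queue: Vec<Input> = rx.iter().collect();
    for handle in handles {
        handle.join().expect("producer thread panicked");
    }
    queue.sort();
    queue
}

/// An order-sensitive fold: a different input order gives a different
/// result, which is why the explicit sort above matters.
pub fn simulate(queue: &[Input]) -> i64 {
    queue
        .iter()
        .fold(0i64, |state, i| state.wrapping_mul(31).wrapping_add(i.delta))
}
```

Whatever order the threads happen to run in, the sorted queue is identical, so `simulate` sees the same sequence on every run.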
Surviving unreliable networks
Many of our partners operate in regions with highly variable network quality. Rather than assuming perfect links, we design our Rust microservices to tolerate packet loss, jitter, and intermittent disconnections without diverging state.
Every cross-region message is idempotent and versioned, so a duplicate delivery is harmless and a gap is detectable. If a replication stream between Johannesburg and Frankfurt drops for a few seconds, the receiving side can request the exact missing range of events and reapply them in order.
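A minimal sketch of the receiving side, assuming each event carries a monotonically increasing sequence number as its version. That detail, along with the names `Versioned`, `Outcome`, and `Replica`, is hypothetical and chosen for illustration. The receiver applies an event only when it is the next one expected, ignores anything it has already seen, and reports the exact missing range when it detects a gap.

```rust
use std::ops::Range;

/// Hypothetical replicated event with a sequence number as its version.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Versioned {
    pub seq: u64,
    pub delta: i64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The event was the next expected one and has been applied.
    Applied,
    /// The event was already applied; receiving it again changes nothing.
    Duplicate,
    /// Events in this half-open range are missing and should be re-requested.
    /// The event that revealed the gap is NOT applied.
    Gap(Range<u64>),
}

#[derive(Debug, Default)]
pub struct Replica {
    next_seq: u64,
    state: i64,
}

impl Replica {
    pub fn receive(&mut self, ev: &Versioned) -> Outcome {
        if ev.seq < self.next_seq {
            Outcome::Duplicate
        } else if ev.seq > self.next_seq {
            Outcome::Gap(self.next_seq..ev.seq)
        } else {
            self.state += ev.delta;
            self.next_seq += 1;
            Outcome::Applied
        }
    }

    pub fn state(&self) -> i64 {
        self.state
    }
}
```

After a dropped stream, the sender can safely retransmit more than was asked for: anything below `next_seq` comes back as `Duplicate` and leaves the state untouched.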
Operational wins
The migration to Rust wasn’t cheap, but it paid off quickly. We saw simulation CPU usage drop by a double-digit percentage in our busiest regions, freeing up headroom for more complex AI. More importantly, we saw a measurable reduction in cross-region desync incidents.
Our on-call runbooks also got simpler. Many classes of bugs — use-after-free, accidental mutation from multiple threads — simply stopped appearing. When issues do arise, we can reproduce them locally with confidence thanks to deterministic playout.
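Local reproduction can look roughly like the following sketch: replay a recorded input log tick by tick and compare a state checksum against the one recorded at the time. Everything here is an assumption for illustration, including the toy `Sim`, the log layout of `(input, checksum)` pairs, and the hand-written FNV-1a hash, which is used so the checksum does not depend on any library's hashing internals.

```rust
/// FNV-1a over bytes, written out by hand so the result is fixed
/// by this code alone.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash = 0xcbf2_9ce4_8422_2325u64;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Toy simulation state standing in for a real shard.
#[derive(Debug, Clone, Default)]
pub struct Sim {
    positions: [i32; 4],
}

impl Sim {
    /// Advance one tick using only integer arithmetic.
    pub fn step(&mut self, input: i32) {
        for (i, p) in self.positions.iter_mut().enumerate() {
            *p = p.wrapping_add(input.wrapping_mul(i as i32 + 1));
        }
    }

    /// Checksum over a fixed little-endian byte layout.
    pub fn checksum(&self) -> u64 {
        let mut bytes = Vec::with_capacity(self.positions.len() * 4);
        for p in &self.positions {
            bytes.extend_from_slice(&p.to_le_bytes());
        }
        fnv1a(&bytes)
    }
}

/// Run the simulation and record (input, checksum) for every tick.
pub fn record(inputs: &[i32]) -> Vec<(i32, u64)> {
    let mut sim = Sim::default();
    inputs
        .iter()
        .map(|&input| {
            sim.step(input);
            (input, sim.checksum())
        })
        .collect()
}

/// Replay a recorded log and return the first tick whose checksum
/// differs from the recorded one, if any.
pub fn first_divergence(log: &[(i32, u64)]) -> Option<usize> {
    let mut sim = Sim::default();
    for (tick, &(input, expected)) in log.iter().enumerate() {
        sim.step(input);
        if sim.checksum() != expected {
            return Some(tick);
        }
    }
    None
}
```

A log that replays cleanly returns `None`. If a checksum recorded in one region does not match the local replay, `first_divergence` points at the tick to start debugging from.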
Rewriting anything at the core of your stack is scary. For us, Rust unlocked the level of predictability we needed to treat geo-distribution as a feature, not a source of constant risk. It’s now the default for any new simulation service we build.