Using GNNs to Route Live Worlds

How we run graph neural networks in production to decide shard placement and migration in real time for thousands of interconnected regions.

Kumar Saurabh

Oct 27, 2025 • 10 min read

Routing players between shards used to be a simple rules engine. Check region, check ping, check capacity, assign a server. That worked until our worlds started behaving less like isolated arenas and more like interconnected cities with deep cross-region dependencies.

A decision to move one district from Frankfurt to Amsterdam doesn’t just affect the players in that district; it ripples across every connected neighborhood, AI system, and data pipeline. Classic heuristics started breaking down, so we turned to the tool purpose-built for learning over graphs: graph neural networks.

Encoding the world as a routing graph

Our world graph already represents spatial adjacency and authority. For routing, we derive a separate routing graph where nodes are shards and edges encode potential migration paths. Each node carries features like current load, historical churn, GPU headroom, and regional latency profiles.

Edges get their own attributes: bandwidth between regions, failure rates, and even legal constraints (for example, whether certain categories of data can cross from the EU to the US). This enriched graph becomes the input to our GNN.
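To make the shape of this graph concrete, here is a minimal sketch in plain Python. All class names, field names, and units (`ShardNode`, `MigrationEdge`, `blocked_data`, and so on) are hypothetical and chosen for illustration; the real schema is not described in this post.

```python
from dataclasses import dataclass, field


@dataclass
class ShardNode:
    shard_id: str
    region: str
    load: float                   # fraction of capacity in use (assumed 0.0 to 1.0)
    churn: float                  # historical churn (assumed unit: moves per hour)
    gpu_headroom: float           # fraction of GPU capacity still free
    latency_ms: dict[str, float]  # regional latency profile: player region -> ms


@dataclass
class MigrationEdge:
    src: str
    dst: str
    bandwidth_gbps: float
    failure_rate: float                         # fraction of failed transfers
    blocked_data: frozenset[str] = frozenset()  # data categories that may not cross


@dataclass
class RoutingGraph:
    nodes: dict[str, ShardNode] = field(default_factory=dict)
    edges: dict[tuple[str, str], MigrationEdge] = field(default_factory=dict)

    def add_node(self, node: ShardNode) -> None:
        self.nodes[node.shard_id] = node

    def add_edge(self, edge: MigrationEdge) -> None:
        # Edges are directed: a path from A to B says nothing about B to A.
        if edge.src not in self.nodes or edge.dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(edge.src, edge.dst)] = edge

    def can_migrate(self, src: str, dst: str, data_categories: set[str]) -> bool:
        # A move is legal only if the path exists and none of the shard's
        # data categories are blocked on that path.
        edge = self.edges.get((src, dst))
        return edge is not None and not (edge.blocked_data & data_categories)
```

Modeling the legal constraint as an edge attribute, rather than a node attribute, lets the same shard be movable in one direction and blocked in another.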

What the GNN actually predicts

We don’t ask the GNN to magically “solve routing.” Instead, we train it to estimate the future cost of candidate moves. Given the current graph state and a proposed migration (for example, moving a shard from Oregon to Iowa), the model predicts the impact on latency, stability, and capacity over the next few minutes.
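For readers new to GNNs, the core operation is message passing: each node updates its features using an aggregate of its neighbors' features. The sketch below shows one generic round with mean aggregation and a ReLU. It is not the architecture described in this post, which is not specified here. The scalar weights `w_self` and `w_nbr` stand in for the learned weight matrices a real model would use, and the function name is hypothetical.

```python
def message_passing_step(features, neighbors, w_self, w_nbr):
    """One round of mean-aggregation message passing.

    features:  dict of node id -> list of floats (all the same length)
    neighbors: dict of node id -> list of neighbor node ids
    w_self, w_nbr: scalar weights (a simplification of learned matrices)
    """
    updated = {}
    for node, h in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            # Average each feature dimension over the node's neighbors.
            agg = [sum(features[n][i] for n in nbrs) / len(nbrs)
                   for i in range(len(h))]
        else:
            # Isolated nodes receive no message.
            agg = [0.0] * len(h)
        # Combine own state with the neighbor message, then apply ReLU.
        updated[node] = [max(0.0, w_self * h[i] + w_nbr * agg[i])
                         for i in range(len(h))]
    return updated
```

Stacking several such rounds lets information from shards two or three hops away influence a node's representation, which is how a change in one region can show up in the predicted cost for another.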

A lightweight planner then searches over a small set of candidate moves and chooses the set that minimizes predicted cost while respecting constraints. The GNN is a fast, learned cost function; the planner remains explicit and debuggable.
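The split between a learned cost function and an explicit planner can be sketched as follows. This is a minimal illustration that assumes exhaustive search over small subsets of candidates; the actual search strategy is not described in this post. `Move`, `plan`, `predict_cost`, and `is_allowed` are hypothetical names, and `predict_cost` is where a trained model would be called.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Sequence


@dataclass(frozen=True)
class Move:
    shard_id: str
    src_region: str
    dst_region: str


def plan(
    candidates: Sequence[Move],
    predict_cost: Callable[[Sequence[Move]], float],  # stand-in for the GNN
    is_allowed: Callable[[Sequence[Move]], bool],     # explicit constraints
    max_moves: int = 2,
) -> tuple[list[Move], float]:
    """Return the allowed set of at most max_moves moves with the lowest cost."""
    best_moves: tuple[Move, ...] = ()
    best_cost = predict_cost(best_moves)  # baseline: do nothing
    for k in range(1, max_moves + 1):
        for subset in combinations(candidates, k):
            if not is_allowed(subset):
                continue  # constraints are checked in code, never learned
            cost = predict_cost(subset)
            if cost < best_cost:
                best_moves, best_cost = subset, cost
    return list(best_moves), best_cost
```

Because "do nothing" is always a candidate, the planner only migrates when the predicted cost actually improves. Exhaustive search grows combinatorially, so this shape only works when the candidate set and `max_moves` stay small.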

Training data from real traffic

We bootstrap training with historical routing decisions and their outcomes, covering everything from quiet weekday mornings in Sydney to overloaded Friday nights in New York. Each example encodes what we did, what the graph looked like, and how key metrics evolved.

Because infra incidents tend to be underrepresented in raw logs, we also run controlled chaos experiments (temporarily degrading links between certain regions) to generate additional examples of “what not to do.”
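One way to picture a single training example is as a record of the action, the graph snapshot, and the metrics before and after. The sketch below is an assumption about the shape of such a record, not the real schema. The field names, the metric names, and the weighted-sum label in `cost_label` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    move: tuple[str, str, str]        # (shard_id, src_region, dst_region): what we did
    graph_snapshot: dict              # node and edge features at decision time
    metrics_before: dict[str, float]  # key metrics when the decision was made
    metrics_after: dict[str, float]   # the same metrics at the end of the horizon
    source: str                       # "historical" or "chaos"


def metric_deltas(example: TrainingExample) -> dict[str, float]:
    """How each key metric changed after the move."""
    return {
        name: example.metrics_after[name] - example.metrics_before[name]
        for name in example.metrics_before
    }


def cost_label(example: TrainingExample, weights: dict[str, float]) -> float:
    """Collapse metric changes into one scalar target (assumed: weighted sum)."""
    deltas = metric_deltas(example)
    return sum(weights[name] * deltas[name] for name in weights)
```

Keeping a `source` tag on each record makes it possible to check how much of the training set comes from chaos experiments versus ordinary traffic.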

Running GNNs safely in production

A bad routing decision can page an entire on-call rotation. To keep things safe, we deploy new GNN models behind a shadow evaluation path. They score candidate moves alongside our existing heuristic system for days or weeks before we even consider giving them partial control.

Even once live, the models operate under strict guardrails: max migrations per minute, hard caps on cross-ocean moves, and explicit don’t-touch lists during sensitive events or partner launches.
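Guardrails like these are simple to express as a gate that every proposed migration must pass, regardless of what the model predicts. The sketch below assumes a 60-second sliding window and treats the cross-ocean cap as a per-minute limit. Both are assumptions, as are the class and field names.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Guardrails:
    max_migrations_per_minute: int
    max_cross_ocean_per_minute: int                     # assumed per-minute cap
    dont_touch: set[str] = field(default_factory=set)   # shard ids frozen for now
    _recent: deque = field(default_factory=deque, init=False, repr=False)

    def allow(self, shard_id: str, cross_ocean: bool, now: float) -> bool:
        """Return True and record the move if it passes every guardrail.

        now is a timestamp in seconds, such as time.monotonic().
        """
        # Forget migrations that left the 60-second window.
        while self._recent and now - self._recent[0][0] >= 60.0:
            self._recent.popleft()
        if shard_id in self.dont_touch:
            return False
        if len(self._recent) >= self.max_migrations_per_minute:
            return False
        if cross_ocean:
            used = sum(1 for _, was_cross in self._recent if was_cross)
            if used >= self.max_cross_ocean_per_minute:
                return False
        self._recent.append((now, cross_ocean))
        return True
```

Passing `now` in explicitly, rather than reading the clock inside the method, keeps the gate deterministic and easy to test.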

The result is a routing layer that adapts to patterns we didn’t explicitly code for: new peering arrangements, shifting regional player bases, and the constant churn of the real-world internet. GNNs aren’t a silver bullet, but in our stack they’ve become a powerful tool for keeping living worlds online and responsive.