Platform SRE

Observability for Living Worlds

From Lagos packet loss to Sydney GPU saturation, how we built a single pane of glass for truly global simulation health.

Koustav Mandal

Sep 22, 2025 • 8 min read

You can’t debug what you can’t see. In a single-region deployment, “see” usually means a handful of dashboards and a tracing system. In a geo-distributed simulation that spans dozens of data centers and thousands of nodes, “see” becomes existential.

Our goal at Ludotronics was to build one mental model of system health that worked equally well for an SRE in Dublin, a gameplay engineer in Vancouver, and a partner team in Seoul. That meant unifying telemetry across regions, stacks, and vendors without flattening away important detail.

Three layers of visibility

We organize observability into three layers:

Infrastructure — CPU, GPU, memory, disk, network per region.
Simulation — tick rates, queue depths, entity counts per shard.
Player experience — latency, error rates, disconnections per city and ISP.

Each layer feeds into a shared schema so we can correlate, for example, packet loss in Lagos with rubber-banding reports in nearby cities, or GPU saturation in Sydney with AI behavior changes.
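As a rough illustration of what a shared schema makes possible, here is a minimal Python sketch. The field names, the layer labels, the "af-west" region code, and the sample values are all assumptions made for this example, not our production schema. The idea is only that once every layer emits the same record shape, lining up signals by region and time is a simple grouping step.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryEvent:
    """One record shape shared by all three layers (hypothetical fields)."""
    layer: str      # "infrastructure", "simulation", or "player_experience"
    region: str     # illustrative region code
    city: str
    metric: str
    value: float
    minute: int     # timestamp truncated to the minute


def correlate_by_region(events):
    """Group events by (region, minute) so metrics from different layers line up."""
    buckets = defaultdict(dict)
    for event in events:
        key = (event.region, event.minute)
        buckets[key][f"{event.layer}.{event.metric}"] = event.value
    return dict(buckets)


# Made-up sample data: packet loss in Lagos next to rubber-banding reports nearby.
events = [
    TelemetryEvent("infrastructure", "af-west", "Lagos", "packet_loss_pct", 4.2, 100),
    TelemetryEvent("simulation", "af-west", "Lagos", "tick_rate_hz", 60.0, 100),
    TelemetryEvent("player_experience", "af-west", "Accra", "rubber_band_reports", 37.0, 100),
]
joined = correlate_by_region(events)
```

Each value in joined is a flat dictionary of layer-qualified metrics for one region and minute, which is the shape a correlation query or dashboard panel would read from.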

Geography as a first-class dimension

Most dashboards treat “region” as a drop-down. We treat it as a map. Our primary overview is a globe where each city glows according to a blend of latency, error rate, and active sessions. Hovering over Lagos, you might see healthy simulation tick rates but elevated packet loss; hovering over Frankfurt, you might see GPU pressure.

This visualization isn’t just cosmetic. It’s how we teach new team members to reason about the system: what does a healthy planet look like? Deviations become visually obvious long before a wall of numbers would.
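The glow blend described above could be computed along these lines. This is a sketch under stated assumptions: the equal weighting, the latency and error budgets, and the session scale are invented for illustration, and the function name city_glow is hypothetical.

```python
def clamp01(x):
    """Clamp a value into the range [0, 1]."""
    return max(0.0, min(1.0, x))


def city_glow(latency_ms, error_rate, active_sessions,
              latency_budget_ms=150.0, error_budget=0.05, session_scale=10_000):
    """Return (severity, brightness) for one city on the globe.

    severity:   0.0 is healthy, 1.0 is badly degraded (would drive the color).
    brightness: 0.0 is empty, 1.0 is very busy (would drive the intensity).
    All budgets and weights are illustrative assumptions.
    """
    latency_stress = clamp01(latency_ms / latency_budget_ms)
    error_stress = clamp01(error_rate / error_budget)
    severity = 0.5 * latency_stress + 0.5 * error_stress
    brightness = clamp01(active_sessions / session_scale)
    return severity, brightness


# A moderately busy city at half its latency budget with no errors.
severity, brightness = city_glow(latency_ms=75.0, error_rate=0.0, active_sessions=5_000)
```

Keeping severity and brightness separate means a quiet city with problems and a busy city with problems look different on the map, which is a design choice of this sketch rather than a description of the real renderer.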

Sampling the right things

Full-fidelity tracing across every entity and player would blow through our storage budget. Instead, we lean heavily on adaptive sampling. Quiet regions get higher sample rates because their volume is low enough to afford it; during big events we bias toward known hotspots and newly deployed code paths.

Crucially, we never down-sample signals tied to fairness or security, such as telemetry from latency-sensitive competitive modes and security alerts. Those streams stay lossless, regardless of traffic level.
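A minimal sketch of that policy in Python follows. The stream names, the target volume of 100 events per second, the 4x hotspot boost, and the 1% floor are assumptions chosen for the example. The sampler aims for a roughly constant kept volume per stream, so quiet streams end up with higher rates, and it short-circuits to lossless for protected streams.

```python
import hashlib

# Hypothetical stream names for signals that must never be down-sampled.
LOSSLESS_STREAMS = {"competitive_latency", "security_alert"}


def sample_rate(stream, events_per_sec, target_per_sec=100.0,
                hotspot=False, floor=0.01):
    """Pick a sampling rate in [floor, 1.0] for a stream at its current volume."""
    if stream in LOSSLESS_STREAMS:
        return 1.0                      # protected streams stay lossless
    if events_per_sec <= 0:
        return 1.0                      # nothing to shed in an idle stream
    rate = min(1.0, target_per_sec / events_per_sec)
    if hotspot:
        rate = min(1.0, rate * 4.0)     # bias toward hotspots and new code paths
    return max(floor, rate)


def should_keep(trace_id, rate):
    """Deterministic keep/drop decision so every node agrees on a given trace."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32   # uniform in [0, 1)
    return bucket < rate
```

Hashing the trace ID instead of drawing a random number keeps the decision consistent across services, so a sampled trace is kept or dropped as a whole rather than piecemeal.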

Closing the loop with automation

Observability without action is just a screensaver. Our telemetry feeds directly into autoscaling policies, routing decisions, and even content rollouts. If we see rising error rates for a new build in one region, we can automatically slow down the rollout elsewhere while on-call investigates.
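The rollout case can be sketched as a small decision function. Everything here is an assumption for illustration: the function name next_rollout_pct, the step sizes, and the 1.5x tolerance over baseline are not our actual rollout policy.

```python
def next_rollout_pct(current_pct, regional_error_rates, baseline_error_rate,
                     normal_step_pct=10.0, slow_step_pct=1.0, tolerance=1.5):
    """Return the next global rollout percentage for a new build.

    If any region's error rate exceeds tolerance x baseline, advance by the
    slow step instead of the normal one, giving on-call time to investigate.
    """
    worst = max(regional_error_rates.values(), default=0.0)
    degraded = worst > baseline_error_rate * tolerance
    step = slow_step_pct if degraded else normal_step_pct
    return min(100.0, current_pct + step)


# One region is running at twice the baseline error rate, so the rollout slows.
slowed = next_rollout_pct(20.0, {"lagos": 0.002, "sydney": 0.02}, baseline_error_rate=0.01)
healthy = next_rollout_pct(20.0, {"lagos": 0.002, "sydney": 0.008}, baseline_error_rate=0.01)
```

Note that the sketch slows the rollout rather than halting it, matching the behavior described above; a stricter policy could return current_pct unchanged when a region is degraded.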

For partners, we expose curated slices of the same data — per-title, per-region views that highlight health without leaking internal details. That shared visibility keeps conversations grounded: we’re all looking at the same world.

As our worlds grow more complex, observability becomes less about individual charts and more about shared intuition. The systems we’re building today are designed to scale that intuition to new genres, partners, and continents.