A quality and observability program

An abstract case study. Details are generalised; no employer or confidential figures are included.

The problem

Across several teams on a payments-adjacent platform, everyone defined “broken” differently. Alarm thresholds were inconsistent, dashboards didn’t agree, and a meaningful share of pages were false alarms. That is corrosive, because an on-call engineer who learns to distrust alerts is slower to respond to the one that’s real. Nobody could answer “is the platform healthy right now?” with data.

The approach

I took this on as the quality point-of-contact for the platform and made three changes:

Standardised alarm criteria across the business units so a P0 meant the same thing everywhere, and severity mapped to real user impact.
Built SLI and stability dashboards so service health was a glanceable, shared source of truth rather than tribal knowledge.
Rewrote the on-call SOP into something an engineer could actually follow at 3am: clear triage steps, ownership, and escalation paths.
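
To make the first two items concrete, here is a minimal sketch of how a single availability SLI could drive both a dashboard figure and a severity that means the same thing for every team. Everything specific in it is an assumption for illustration: the `Window` type, the function names, and the three failure-share thresholds are hypothetical and are not the criteria the program actually used.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"


@dataclass(frozen=True)
class Window:
    """Request counts for one service over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


# Hypothetical thresholds: the share of user requests failing in the window.
# Real values would be agreed across business units, not copied from here.
P0_FAILURE_SHARE = 0.05
P1_FAILURE_SHARE = 0.01
P2_FAILURE_SHARE = 0.001


def failure_share(window: Window) -> float:
    """Fraction of requests that failed; a window with no traffic counts as 0."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def availability_sli(window: Window) -> float:
    """Fraction of requests that succeeded: the number a dashboard would show."""
    return 1.0 - failure_share(window)


def classify(window: Window) -> Optional[Severity]:
    """Map user impact to one severity scale; None means no alarm should fire."""
    share = failure_share(window)
    if share >= P0_FAILURE_SHARE:
        return Severity.P0
    if share >= P1_FAILURE_SHARE:
        return Severity.P1
    if share >= P2_FAILURE_SHARE:
        return Severity.P2
    return None
```

The point of the design is that severity is derived from one impact measure through one shared function, so two teams looking at the same numbers cannot reach different conclusions about whether something is a P0.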

Outcome

Alarm response rates went up and false alarms went down, the two metrics that together indicate on-call alerting is trustworthy. More durably, the teams ended up with a shared, data-backed definition of health, which is the foundation that everything else (SLOs, error budgets, capacity planning) builds on.
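
For readers who want to track the same two metrics, the sketch below shows one way they could be computed from a log of pages. It assumes each page record carries two flags, `acknowledged` and `actionable`; those field names and the `Page` type are hypothetical, and the sample data is made up for illustration rather than taken from the program.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Page:
    """One page sent to on-call (hypothetical record shape)."""
    acknowledged: bool  # an on-call engineer responded to the page
    actionable: bool    # later review confirmed a real problem


def response_rate(pages: Sequence[Page]) -> Optional[float]:
    """Share of pages that were acknowledged; None when there were no pages."""
    if not pages:
        return None
    return sum(p.acknowledged for p in pages) / len(pages)


def false_alarm_rate(pages: Sequence[Page]) -> Optional[float]:
    """Share of pages that did not reflect a real problem; None when empty."""
    if not pages:
        return None
    return sum(not p.actionable for p in pages) / len(pages)


# Made-up sample: four pages, three acknowledged, two of them real problems.
sample = [
    Page(acknowledged=True, actionable=True),
    Page(acknowledged=True, actionable=False),
    Page(acknowledged=False, actionable=False),
    Page(acknowledged=True, actionable=True),
]
print(response_rate(sample))     # 0.75
print(false_alarm_rate(sample))  # 0.5
```

Reading the two numbers together matters: a high response rate alongside a high false alarm rate means engineers are spending attention on noise, while a falling false alarm rate is what makes a rising response rate sustainable.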

This is the operational side of frontend platform work: shipping features is only half the job; knowing they’re healthy in production is the other half.