
A quality and observability program

Drove a cross-team quality program: consistent alerting, SLI and stability dashboards, and a usable on-call SOP.

Role: Quality point-of-contact
Timeframe: Two quarters
Stack: SLIs / SLOs · Dashboards · Alerting · On-call process

  • Standardised alarm criteria across multiple business units that had each defined "broken" differently.
  • Built SLI and stability dashboards so health was observable, not anecdotal.
  • Rewrote the on-call SOP and raised alarm response rates while cutting false alarms.

An abstract case study. Details are generalised; no employer or confidential figures are included.

The problem

Across several teams on a payments-adjacent platform, everyone defined “broken” differently. Alarm thresholds were inconsistent, dashboards didn’t agree, and a meaningful share of pages were false alarms. That is corrosive: an on-call engineer who learns to distrust alerts is slower to respond to the one that’s real. Nobody could answer “is the platform healthy right now?” with data.

The approach

I took this on as the quality point-of-contact for the platform.

  • Standardised alarm criteria across the business units so a P0 meant the same thing everywhere, and severity mapped to real user impact.
  • Built SLI and stability dashboards so service health was a glanceable, shared source of truth rather than tribal knowledge.
  • Rewrote the on-call SOP into something an engineer could actually follow at 3am: clear triage steps, ownership, and escalation paths.
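
To make "a P0 meant the same thing everywhere" concrete, here is a minimal Python sketch of one shared severity table driven by an SLI. Everything specific in it is an assumption for illustration: the names (`Window`, `success_sli`, `classify`), the choice of request failure rate as the proxy for user impact, and the threshold values. The criteria the teams actually agreed on are not part of this case study.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P0 = "P0"  # widespread user impact
    P1 = "P1"  # partial user impact
    P2 = "P2"  # minor user impact
    OK = "OK"  # below every alerting threshold


# Hypothetical thresholds on the fraction of failed user requests,
# ordered from most to least severe. Illustrative values only.
SEVERITY_THRESHOLDS = [
    (0.05, Severity.P0),
    (0.01, Severity.P1),
    (0.001, Severity.P2),
]


@dataclass(frozen=True)
class Window:
    """Request counts for one service over one evaluation window."""
    total_requests: int
    failed_requests: int


def failure_rate(window: Window) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def success_sli(window: Window) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    return 1.0 - failure_rate(window)


def classify(window: Window) -> Severity:
    """Map user-facing failure rate to a severity using the shared table."""
    rate = failure_rate(window)
    for threshold, severity in SEVERITY_THRESHOLDS:
        if rate >= threshold:
            return severity
    return Severity.OK
```

The point of the design is that every team calls the same `classify` against the same table, so severity follows measured user impact rather than each team's local thresholds.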

Outcome

Alarm response rates went up and false alarms went down. Together, those two metrics indicate whether on-call can be trusted. More durably, the teams ended up with a shared, data-backed definition of health, which is the foundation everything else (SLOs, error budgets, capacity planning) builds on.
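
For readers who want the two metrics pinned down, the sketch below shows one way they could be computed from a page log. The definitions are assumptions, since this case study does not state how either metric was measured: a page counts as "responded" if it was acknowledged within whatever window the SOP sets, and as a "false alarm" if it did not correspond to a real problem. The `Page` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Page:
    """One page sent to on-call (hypothetical record shape)."""
    acknowledged: bool  # acknowledged within the SOP's response window
    actionable: bool    # corresponded to a real problem


def response_rate(pages: Sequence[Page]) -> Optional[float]:
    """Fraction of pages that were acknowledged; None if there were no pages."""
    if not pages:
        return None
    return sum(p.acknowledged for p in pages) / len(pages)


def false_alarm_rate(pages: Sequence[Page]) -> Optional[float]:
    """Fraction of pages that were not actionable; None if there were no pages."""
    if not pages:
        return None
    return sum(not p.actionable for p in pages) / len(pages)


# Made-up example data, not figures from the program.
example_pages = [
    Page(acknowledged=True, actionable=True),
    Page(acknowledged=True, actionable=False),
    Page(acknowledged=False, actionable=True),
    Page(acknowledged=True, actionable=True),
]
```

With the made-up `example_pages`, three of four pages are acknowledged and one of four is a false alarm. Tracking the two numbers side by side matters because either one can be improved on its own in a way that hurts the other.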

This is the operational side of frontend platform work: shipping features is only half the job; knowing they’re healthy in production is the other half.