Incident response
When something breaks, the path should be boring.
Operational incidents need a clear response path: notice, triage, contain, fix, verify, and learn.
The goal is calm recovery
When something breaks, the product needs a clear path: protect users, identify the failing layer, contain the risk, ship the fix, and verify production behavior. Heroics are less useful than a boring checklist that works.
Incident playbook
- 1
Confirm impact
Check whether the issue affects public pages, dashboard access, connector sync, recommendations, or account security.
- 2
Contain risk
Roll back, disable the risky route, pause a sync path, or add a temporary guard if needed.
- 3
Fix the cause
Patch the smallest safe surface and keep unrelated refactors out of the incident.
- 4
Verify production
Check the affected path in production or preview before declaring the incident resolved.
After the fix
The follow-up matters: add a test if the issue can recur, update the runbook if the response path changed, and record durable decisions in the knowledge base. A fix that teaches the system is better than a fix that only teaches the person who shipped it.
