SREya
AI Site Reliability Engineer

SLOs that reflect user pain.
Incidents resolved, not survived.

SREya designs error budgets before you need them, runs structured incident command that can cut MTTR roughly in half, and delivers quarterly reliability reviews that give your on-call team a roadmap instead of a burnout spiral.

Selected Work

📡
SLO Governance
Error budget burn-rate alert caught a degradation before users noticed.
FRAIM · slo-design-and-implementation · registry/jobs/ai-employee/dev-ops/
Problem
A B2B SaaS company's on-call team was paged constantly: alerts fired on CPU usage and response-time thresholds set years ago by someone who had since left the company. No SLOs existed. The team was alert-fatigued, and it first heard about real user impact from a support ticket.
What SREya Can Build
SREya ran the SLO design process: defined SLIs tied to user journeys (checkout availability, API latency P99, search success rate), set error budgets with a 28-day rolling window, and implemented burn-rate alerting with two windows (1h fast burn, 6h slow burn). It replaced 47 noisy threshold alerts with 6 user-facing SLO alerts.
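For readers who want to see the mechanics, the two-window burn-rate check described here can be sketched in a few lines of Python. This is a minimal illustration, not SREya's implementation: the names `BurnWindow`, `burn_rate`, and `firing_windows` are hypothetical, and the threshold multiples (14× and 6×) are placeholder assumptions to tune against your own paging tolerance.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    name: str
    hours: float
    threshold: float  # burn-rate multiple that pages (assumed value, not from the source)


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends the whole error budget exactly over the SLO window.
    """
    if total_events == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


# The 1h fast-burn and 6h slow-burn windows from the text; thresholds are placeholders.
WINDOWS = (
    BurnWindow("fast", hours=1, threshold=14.0),
    BurnWindow("slow", hours=6, threshold=6.0),
)


def firing_windows(counts: dict[str, tuple[int, int]], slo_target: float) -> list[str]:
    """Return the names of the windows whose burn rate exceeds their threshold.

    `counts` maps a window name to (bad_events, total_events) seen in that window.
    """
    fired = []
    for window in WINDOWS:
        bad, total = counts[window.name]
        if burn_rate(bad, total, slo_target) > window.threshold:
            fired.append(window.name)
    return fired


if __name__ == "__main__":
    # Illustrative counts only: 180 bad out of 10,000 in the last hour is an 18x burn
    # against a 99.9% target, so the fast window fires; the 6h window stays quiet.
    print(firing_windows({"fast": (180, 10_000), "slow": (300, 60_000)}, slo_target=0.999))
```

Pairing a short and a long window is what lets the alert react quickly to sharp spikes while still catching slower degradations that a 1h view would miss.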
Possible Outcome
Teams that move from threshold alerting to SLO-based burn-rate alerting typically see 60–80% alert volume reduction with higher signal quality. SREya's SLO design can give on-call engineers context — not just a number, but an error budget percentage and time-to-breach — so the first page carries enough information to triage without a call bridge.
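The "error budget percentage and time-to-breach" context can be derived from two numbers: the fraction of budget already used and the current burn rate. The sketch below is a hedged illustration; `hours_to_breach` and `page_summary` are hypothetical names, and the estimate assumes the burn rate holds steady and ignores older errors rolling out of the 28-day window.

```python
WINDOW_HOURS = 28 * 24  # 28-day rolling window, as in the SLO design


def hours_to_breach(budget_used: float, burn_rate: float,
                    window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the error budget reaches zero if the current burn rate holds.

    At burn rate 1.0 the full budget lasts exactly `window_hours`, so the
    remaining fraction lasts (remaining * window_hours / burn_rate) hours.
    """
    if not 0.0 <= budget_used <= 1.0:
        raise ValueError("budget_used must be a fraction between 0 and 1")
    if burn_rate <= 0:
        return float("inf")  # nothing is being spent, so no breach is projected
    remaining = 1.0 - budget_used
    return remaining * window_hours / burn_rate


def page_summary(slo_name: str, budget_used: float, burn_rate: float) -> str:
    """Build the one-line context an on-call engineer would see in the page."""
    eta = hours_to_breach(budget_used, burn_rate)
    return (f"{slo_name}: {budget_used:.0%} of budget used, "
            f"burn rate {burn_rate:g}x, ~{eta:.1f}h to breach")


if __name__ == "__main__":
    # Illustrative inputs: 71% of budget used and a 10x burn rate.
    print(page_summary("API Latency P99", 0.71, 10))
```

With those illustrative inputs the remaining 29% of budget lasts roughly 19.5 hours, which tells the responder at a glance whether this is a drop-everything page or a next-morning fix.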
Live Artifact — SLO Error Budget Dashboard · 28-Day Window
SLO Error Budget Status · May 2026
3 SLOs Defined · 1 Alert Fired
SLO | Target | Current | Budget Used | Status
Checkout Avail. | 99.9% | 99.93% | 28% | ● HEALTHY
API Latency P99 | < 400ms | 380ms | 71% | ● BURNING FAST
Search Success | 99.5% | 99.6% | 12% | ● HEALTHY
API latency burn-rate alert fired at 18:43 UTC: the 6h window exceeded the 10× burn threshold. IC paged before any user-visible impact. Resolved in 22 minutes. Burn rate back to baseline within 4 hours; on a 28-day rolling window, the spent budget returns only as the incident ages out.
🚨
Incident Command
SEV2 resolved in 28 minutes. Blameless postmortem same day.
FRAIM · incident-response · registry/jobs/ai-employee/dev-ops/
Problem
A startup's incident management was ad hoc: whoever noticed the problem became the IC by default, Slack filled with overlapping hypotheses, and status updates went out 40 minutes after the incident started. Customers were finding out from Twitter, not from the status page.
What SREya Can Build
SREya ran structured incident command: classified the incident as SEV2 within 2 minutes using the severity matrix, assigned roles (IC, Comms Lead, Tech Lead, Scribe), opened a dedicated incident channel, posted a status page update within 5 minutes, and drove timeboxed investigation rounds of 10 minutes per hypothesis. It closed the incident bridge once resolution was confirmed.
Possible Outcome
Structured incident command with SREya typically reduces MTTR by 40–60%, not because the debugging is different but because most of the coordination overhead goes away. SREya's same-day blameless postmortem policy means action items are captured while context is fresh, and the team can move forward without blame setting up the next incident.
Live Artifact — Incident Timeline · INC-2847 (SEV2)
INC-2847 · Payment Service Degradation · SEV2 · RESOLVED · 28 MIN
IC: SREya · Tech Lead: On-call Eng · Comms Lead: CX Lead · Impact: ~340 users
14:02
SLO alert fired — API P99 burn rate 18× over 1h window
Payment service latency spike detected. IC role assigned.
14:04
Severity classified: SEV2 · Roles assigned
Status page incident opened. Customer comms drafted.
14:07
Status page update published — "Investigating payment latency"
5 min from alert to customer-visible acknowledgment.
14:17
Root cause identified: DB connection pool exhausted
New deployment at 13:55 introduced a connection leak. Rollback initiated.
14:30
Rollback complete · Latency nominal · SLO healthy
Status page resolved. Blameless postmortem scheduled 16:00 same day.
✓ Resolved in 28 minutes · Blameless postmortem completed 4 hours later · 3 action items tracked to completion
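The headline numbers in this timeline (2 minutes to classify, 5 to acknowledge publicly, 28 to resolve) fall straight out of the timestamps. A minimal sketch of that arithmetic follows; the event keys and the helper names `minutes_between` and `incident_metrics` are hypothetical, and the code assumes all events occur on the same day.

```python
from datetime import datetime

# Timestamps copied from the INC-2847 timeline (same day, 24-hour clock).
TIMELINE = {
    "alert_fired": "14:02",
    "severity_classified": "14:04",
    "status_page_published": "14:07",
    "root_cause_identified": "14:17",
    "resolved": "14:30",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(timeline: dict) -> dict:
    """Minutes elapsed from the first alert to each later milestone."""
    start = timeline["alert_fired"]
    return {
        event: minutes_between(start, timestamp)
        for event, timestamp in timeline.items()
        if event != "alert_fired"
    }


if __name__ == "__main__":
    for event, minutes in incident_metrics(TIMELINE).items():
        print(f"{event}: +{minutes} min")
```

Tracking each milestone separately, rather than only total time to resolve, shows where an incident actually spent its minutes: here, 15 minutes to find the root cause and 13 more to finish the rollback.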
🔍
Reliability Audit
Three runbook gaps closed. On-call rotation stabilized within 6 weeks.
FRAIM · reliability-review · registry/jobs/ai-employee/dev-ops/
Problem
A 30-person engineering team had 4 engineers on the on-call rotation, 2 of whom were threatening to quit. Runbooks were outdated or missing for 40% of alert types. Every on-call shift involved at least one "this runbook doesn't match what I'm seeing" moment. The team was burning error budget faster than it was shipping features.
What SREya Can Build
SREya ran a quarterly reliability review: audited all SLO performance against error budgets, reviewed all runbooks against recent incident timelines to find gaps, scored each gap by frequency × impact, and produced a prioritized action plan. It identified 3 runbook gaps responsible for 64% of on-call escalations and recommended rotation changes to reduce per-engineer burden.
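The frequency × impact scoring can be illustrated with a short Python sketch. The units are assumptions made for this example (frequency as the share of recent incidents that hit the gap, impact as minutes added to MTTR), `Gap` and `prioritize` are hypothetical names, and only the DB failover figures (4 of 6 incidents, 18 minutes) come from the review; the other rows are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    title: str
    frequency: float       # share of recent incidents that hit this gap, 0..1 (assumed unit)
    impact_minutes: float  # average minutes added to MTTR when it is hit (assumed unit)

    @property
    def score(self) -> float:
        """Priority score: frequency multiplied by impact."""
        return self.frequency * self.impact_minutes


def prioritize(gaps: list) -> list:
    """Return gaps ordered from highest to lowest score."""
    return sorted(gaps, key=lambda gap: gap.score, reverse=True)


GAPS = [
    Gap("Rollback procedure not documented", frequency=1 / 6, impact_minutes=13),   # hypothetical figures
    Gap("No runbook for DB failover", frequency=4 / 6, impact_minutes=18),          # figures from the review
    Gap("No on-call owner for payment service", frequency=3 / 6, impact_minutes=10),  # hypothetical figures
]

if __name__ == "__main__":
    for gap in prioritize(GAPS):
        print(f"{gap.score:5.1f}  {gap.title}")
```

Multiplying the two factors keeps a rare but painful gap from outranking a frequent, moderately painful one, which matches how on-call load is actually felt.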
Possible Outcome
A quarterly reliability review with SREya gives engineering leadership a defensible view of on-call health rather than an anecdotal "it's bad." Teams that close the top 3 runbook gaps typically see escalation frequency drop by 50–70%, and rotation stabilization follows naturally when on-call is predictable rather than chaotic.
Live Artifact — Q2 2026 Reliability Review · Action Plan
Reliability Review · Q2 2026 · Priority Findings
8 findings Total · 3 HIGH Priority
HIGH
No runbook for DB failover
Triggered in 4 of last 6 incidents. Average escalation adds 18 minutes to MTTR.
→ Author failover runbook; test in staging by June 15
HIGH
SLO targets pre-date current traffic patterns
P99 latency SLO set at 400ms. Current P95 baseline is 310ms. Targets stale since Jan 2025.
→ Recalibrate SLOs against 90-day traffic baseline
HIGH
Payment service has no on-call owner documented
Last 3 incidents required cross-team escalation because ownership was unclear.
→ Assign explicit service ownership; update PagerDuty routing
MED
Deployment rollback procedure not documented
INC-2847 demonstrated the gap; rollback worked but took 13 minutes to complete once initiated.
→ Add rollback runbook with 5-minute SLA target
LOW
Only 40% of postmortem action items completed within 30 days
Action items are created but not tracked to completion. No owner assigned on 6 of 15 items.
→ Move postmortem tracking to Linear with weekly triage
Projected impact: Closing the top 3 findings is estimated to reduce on-call escalation frequency by 62% and cut MTTR by an average of 24 minutes · Full action plan with owners and due dates included