Every engineering team I talk to has SLOs. Most of them have SLOs that don’t do anything.

They sit in a Notion page somewhere, updated once a year when someone remembers they exist, and they have zero influence on day-to-day engineering decisions. Meanwhile, the team is running on instinct, deploying on Fridays, and finding out about reliability problems from their customers.

Here’s why this happens, and how to fix it.

The Confusion Between SLOs and SLAs

A Service Level Agreement (SLA) is a contract with external consequences. If you violate an SLA, customers get credits, contracts get reviewed, and lawyers get involved. SLAs are backward-looking: they measure whether you failed.

A Service Level Objective (SLO) is an internal engineering target. Its purpose is to drive decisions before you fail. SLOs are forward-looking: they tell you whether you’re on track.

Most teams get this backwards. They define their SLO as a slightly looser version of their SLA, then treat it exactly like an SLA — measuring it monthly, discussing it in quarterly reviews, and ignoring it the rest of the time.

That’s not an SLO. That’s an SLA with extra steps.

What an SLO Actually Does

A useful SLO does three things:

  1. It tells your on-call engineer when to act — and more importantly, when not to act
  2. It gives your team permission to take risks — when you have error budget, you can ship features; when you don’t, you slow down
  3. It surfaces the right conversation — between product and engineering, about where reliability investment should go

For an SLO to do these things, it has to be real-time, visible, and connected to your team’s actual workflow.

The Error Budget Is the Point

The concept that makes SLOs useful — and that most teams skip — is the error budget.

If your SLO is 99.9% availability over 30 days, your error budget is 0.1% of 30 days = 43.2 minutes. That’s the allowed downtime for the month.
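If you want to sanity-check that arithmetic, here’s a minimal sketch in Python (the helper is hypothetical, not from any particular SLO tool):

    # Allowed downtime for an availability SLO over a rolling window.
    def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
        """Minutes of downtime the budget allows for the window."""
        window_minutes = window_days * 24 * 60  # 43,200 for 30 days
        return (1.0 - slo_target) * window_minutes

    print(f"{error_budget_minutes(0.999):.1f} min")  # 43.2 at 99.9%
    print(f"{error_budget_minutes(0.995):.1f} min")  # 216.0 (3.6 hours) at 99.5%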

The question isn’t “did we meet our SLO?” The question is “how much of our error budget have we spent, and on what?”

When you frame it this way, the SLO becomes a resource management tool. You’re allocating a finite budget — reliability risk — and making conscious decisions about where to spend it. Do you want to spend it on a risky migration? A big feature deployment? Or do you want to save it as buffer against unknown failures?
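One way to make that framing concrete is to keep a ledger of where the budget went. A toy sketch, with hypothetical incident data:

    # Hypothetical incidents for this window: (cause, downtime minutes).
    incidents = [
        ("schema migration", 12.0),
        ("feature deploy rollback", 6.5),
        ("upstream DNS outage", 9.0),
    ]

    budget = 43.2  # 99.9% over 30 days, from the calculation above
    spent = sum(minutes for _, minutes in incidents)

    print(f"Spent {spent:.1f} of {budget:.1f} min ({spent / budget:.0%})")
    for cause, minutes in incidents:
        print(f"  {cause}: {minutes:.1f} min")

The output (“Spent 27.5 of 43.2 min (64%)”) is exactly the “how much, and on what” answer that a monthly pass/fail number hides.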

Making SLOs That Stick

Here’s what separates SLOs that teams actually use from ones that gather dust:

Define them close to the user. Measure what the user experiences, not what your infrastructure reports. p99 latency of the checkout endpoint from the load balancer is infrastructure data. p99 latency experienced by users completing a purchase is a user-facing SLO.
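To make the distinction concrete, here’s a sketch of computing the user-facing version. Everything here is hypothetical: the event shape, the field names, and the assumption that latency is measured in the client rather than at the load balancer:

    import math

    def p99(values: list[float]) -> float:
        """Nearest-rank p99; good enough for illustration."""
        ranked = sorted(values)
        return ranked[math.ceil(0.99 * len(ranked)) - 1]

    # One event per completed purchase, timed in the client, so it
    # includes CDN time, retries, and redirects -- everything the
    # user actually waited through.
    purchase_events = [
        {"user_waited_ms": 840.0, "completed": True},
        {"user_waited_ms": 1210.0, "completed": True},
        # ...thousands more in a real pipeline
    ]

    latencies = [e["user_waited_ms"] for e in purchase_events if e["completed"]]
    print(f"user-facing p99: {p99(latencies):.0f} ms")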

Make the window rolling, not calendar-based. A 30-day rolling window means your SLO is always measuring the last 30 days, not resetting at the start of each month. Calendar windows create perverse incentives (burn your budget at the start of the month, go conservative at the end).
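In code, the difference is just where the cutoff comes from. A minimal sketch, assuming your SLI events are (timestamp, is_good) pairs:

    from datetime import datetime, timedelta, timezone

    def rolling_slo(events: list[tuple[datetime, bool]],
                    window_days: int = 30) -> float:
        """Fraction of good events over the trailing window -- no monthly reset."""
        cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
        recent = [good for ts, good in events if ts >= cutoff]
        if not recent:
            return 1.0  # no data yet: treat the objective as met
        return sum(recent) / len(recent)

Every evaluation slides the cutoff forward, so yesterday’s outage counts against you for a full 30 days instead of vanishing on the first of the month.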

Put them on a dashboard everyone looks at. If your SLO lives in a document, it doesn’t exist. It needs to live next to your deployment pipeline, your incident dashboard, and your sprint planning tools.

Connect them to your deployment process. When your remaining error budget drops below 50%, your CI/CD pipeline should require additional approval for deploys. When the budget is exhausted, deployments should be blocked automatically. This makes the SLO real.
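Here’s a sketch of what that gate might look like as a CI step. The thresholds mirror the ones above; remaining_budget is a hypothetical value you’d pull from your monitoring system (1.0 means untouched, 0.0 means exhausted):

    import sys

    APPROVAL_THRESHOLD = 0.50  # below this, a human signs off on deploys

    def gate_deploy(remaining_budget: float, has_approval: bool) -> None:
        if remaining_budget <= 0.0:
            sys.exit("Deploy blocked: error budget exhausted.")
        if remaining_budget < APPROVAL_THRESHOLD and not has_approval:
            sys.exit("Deploy needs approval: error budget below 50%.")
        print("Deploy allowed.")

Calling sys.exit with a message prints it and fails the step with a nonzero exit code, which is all most CI systems need to halt a pipeline.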

The Right Conversation

Once you have SLOs that work, they unlock a conversation that’s very hard to have without them: how reliable does this service actually need to be?

The answer isn’t always “more reliable.” Sometimes 99.9% is the right target. Sometimes 99.5% is fine: that’s 3.6 hours of monthly error budget instead of 43 minutes, and the engineering cost of closing that last 0.4% is better spent elsewhere.

That’s a product decision as much as an engineering one. SLOs give you the vocabulary to have it.


The teams I’ve seen get the most value from SLOs are the ones that treat them as a communication tool, not a compliance exercise. Start there.