
Solving reliability fears with service level objectives (Google)


Published: July 12, 2018

What worries you?

Liz: Thank you so much for being here today.

It’s really great to talk to so many people who are passionate about monitoring.

So to start off, I wanted to kind of query people a little bit and maybe get some audience participation.

What things worry you about your reliability?

What worries you about being in a public cloud?

Shout a few things out.

Audience: Hackers.

Liz: What’d he say?

Audience: Hackers.

Liz: Okay, you’re worried about hackers?

What other things worry us about our reliability?

Audience: About clouds.

Liz: You’re worried about clouds?

Yes.

You’re worried about the reliability of your underlying public cloud.

What other things are people worried about?

Audience: Regulators.

Liz: Regulators?

Yeah.

How do you prove that your service is operating the way that it ought to be?

But a lot of the things that we get from customers are a lot more basic than that and they boil down to the question of, “How do we measure reliability?

How do we know…is our service working correctly? Is it making our users happy?"

Site reliability engineering at Google

So we’ve been struggling with these things for a while at Google and the approach that we came up with is called Site Reliability Engineering.

And it’s the idea that we engineer our services using data rather than guesswork, and that we specifically use software engineering approaches to handle the operations of our services.

And this means that we need to have people on our teams who understand measurement, who understand writing automation, and who understand making sure that our services are automatable and scalable and reliable.

And not only that, but that we’re designing things to be architected correctly from the start rather than bolting on reliability after the fact.

So Kristina and I are here today from the Customer Reliability Engineering team, but we have backgrounds working as site reliability engineers for various Google services.

In my particular case, I spent 10 years as a site reliability engineer at Google, working on eight different teams all up and down the stack, from low-level cloud storage such as Cloud Bigtable all the way up to Google Flights and other consumer-facing products.

Kristina: And I’ve been there for nine years, originally as a software engineer in display ads performance reporting, but for the last five years in SRE. For most of that time, I was working on data integrity across the company, but now I’ve turned my attention to customer reliability.

Liz: So why is Customer Reliability Engineering a thing that Google is interested in?

And the reason boils down to the fact that we want people to be successful and empowered when using public clouds and feel like they understand what’s going on with their services.

Are they getting an appropriate level of reliability?

So today we’re going to help perhaps assuage some of those fears by talking through four different things.

First of all, we’re going to tell you what a Service Level Objective is.

Second, we’re going to tell you how to define effective Service Level Objectives that measure whether your users are happy.

Third, we’ll talk about how you actually enforce Service Level Objectives and actually make it a working control loop.

And then finally, we’ll talk about how you can use error budgets and SLOs not just as a punitive measure, but instead as a way to empower your product developers to move as fast as possible.

So, Kristina, what’s a Service Level Objective?

Kristina: A Service Level Objective is a performance target for a service, and it’s the baseline for bounding an error budget.

The key realization that Ben Treynor Sloss (the founder of SRE at Google) had was that 100% is the wrong reliability target for pretty much everything.

Why?

Because increases in reliability are going to cost exponentially more as you attempt to approach 100%.

And those costs and resource expenditures are gonna slow you down and monopolize your resources.

And what are you getting for it but diminishing returns for your users?

And this is where error budgets come from.

These familiar numbers help us describe the consequences of setting different availability levels.

For example, three nines, or 99.9%, is going to allow us about 43 minutes of downtime per month.

But to use it as an error budget, first, product management and SRE need to establish the availability target they’re going to use.

Then the downtime allowance for that target as described on the table we just looked at is going to become your budget of unreliability or your error budget.

Then you need monitoring so you can tell how close you’re actually getting to that target and what your real availability level is.

And then the difference between that actual performance and the target performance represents how much of your budget remains to be spent.

Now, we have actual numbers, real data that we can use to drive our reliability control loop, throttling our velocity versus reliability investments to balance our budget.
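
As a rough illustration of that arithmetic (a sketch, not anything from the talk’s slides), converting an availability target and a window into a downtime allowance takes only a couple of lines of Python:

```python
# A minimal sketch, assuming only that "availability" means the fraction of
# time (or requests) that must be good over the window.
def downtime_allowance_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of 100% outage permitted by the target over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - availability_target)

print(round(downtime_allowance_minutes(0.999), 1))   # three nines -> 43.2 minutes
print(round(downtime_allowance_minutes(0.9999), 1))  # four nines  -> 4.3 minutes
```

Three nines over 30 days works out to roughly 43.2 minutes, which is the same error budget Liz uses in the case studies later in the talk.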

And where does SLO come in?

Well, we’re going to take our availability target and put it through a thesaurus and turn it into a Service Level Objective.

And then the availability metrics that the monitoring we needed is based on are going to become our Service Level Indicators, or SLIs.

Here, we also mentioned SLAs, which are Service Level Agreements, and those are an important business extension of providing predictable, reliable services to your users, but for this talk, we’re going to focus on SLIs and SLOs.

Setting user-focused SLOs

But how can we set happy SLOs for ourselves?

First, you need to be able to measure what makes an interaction successful enough.

It’s going to need to represent the user perspective.

For example, say the user is trying to load your portal page.

If they’re having trouble, the user is not interested in whether the trouble is because your load balancer is misdirecting your traffic or your cache has gone cold or your database is overloaded.

The only thing that matters from their perspective is, “I wanted to load your page, but it didn’t work fast enough, and now I’m sad.”

So we want SLIs that can quantify overall user happiness.

What kind of metric then can I use to build a good SLI?

So ones that look like the graph on the left here are not going to be good for us for this purpose.

They probably look familiar because most of the metrics that we use to represent the internal state of our service, like CPU usage or queue length, look like this, and they’re generally noisy and not actually representative of what the users are seeing.

The kind of metric that we’re looking for is going to be a lot smoother than that: it’s going to linearly correlate with user happiness, and it’s going to allow us to aggregate over enough of a window to smooth out any interfering noise.

So now let’s look at the signal from these two metrics during a hypothetical outage period, highlighted here in red.

The bad metric is showing a downward slope during an outage, but the high variance it displays is obscuring when the incident actually started.

It means that the value of the metric has a lot of overlap between what normal operation and outage conditions look like, and it’s not directly representative of what the user was seeing.

On the other hand, the metric that we want shows a dip that directly correlates with the timing of the outage and the much more stable value means that trends are visible and meaningful.

What happens though when I try to categorize it by setting a static threshold to tell me what is successful and what isn’t?

For the bad metric, there is no way to do that.

Any level that I set is going to incur a large risk of false positives or false negatives.

For the good metric, thresholds are meaningful and the placement of that threshold is not going to be artificially limited by the shortcomings of the metric itself.

Well, that was a very idyllic picture of what a good metric might be, but how can we translate ideals into real SLIs?

So, of course, the details of every service and the many ways it can fail are going to be complex. But luckily for us, there are only a few dimensions that users really tend to care about, and for each one that applies to our service, we’re going to need to measure the proportion of successful experiences versus total experiences.

For a request type of feature such as a synchronous interface or an interactive webpage, a successful enough interaction usually means the response was served without error, fast enough, and complete and correct.

For more of a data processing feature, such as a feed tag processor or a cumulative report generator, successful enough is going to mean providing data that is complete and correct enough.

And then depending on factors such as whether it’s stream processing or batch processing, it might also need to be fresh enough, or processed from end to end fast enough.

And then for anywhere we’re storing our users’ data, a successful enough interaction is going to mean that the user was able to retrieve their data.

And as we said up front, we’ve managed to describe all of those as proportions.

This is so useful because representation as a percentage, where zero is bad and 100% is good, is very intuitive.

It makes it very easy to reason about, especially when trying to set what my target is going to be and understand how that’s going to shape the size of my error budget.

And it gives us consistency across our SLIs so that you can understand them quickly and you can build tooling and consoles that display and handle them consistently and in an expected way.

And also, the simple type of calculation required for this is the kind of calculation that’s supported by many monitoring tools, so we can actually implement it.
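
As a minimal sketch of that calculation (assuming your monitoring can already count good and total events for a journey), the SLI is just a ratio:

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Proportion of successful interactions, expressed as a percentage."""
    if total_events == 0:
        return 100.0  # no traffic means nothing failed; a policy choice, not a rule
    return 100.0 * good_events / total_events

# e.g. 999,532 successful requests out of 1,000,000 total -> 99.9532%
print(sli_percent(999_532, 1_000_000))
```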

Focus on one to three SLIs

Now we have the SLIs that we needed for correlating what successful enough interactions are with user happiness.

So we can now measure everything, right?

Well, perhaps, but having too many SLIs is going to introduce distraction and complexity and it can lead to having confused or conflicting priorities when you’re trying to respond to an incident or decide where to invest your resources.

So we recommend just one to three SLIs per user journey.

And in this case, a user journey is a process the user experiences through a particular kind of interaction.

So you might have web interactions versus mobile interactions or consumption activities versus upload and data manipulation activities, or you might have different types of users.

So you can have end users and content providers and also downstream internal services.

Each of those journeys is likely to have different critical paths and need its own distinct SLIs, but for each one, we recommend limiting yourself to one to three.

“But my service is really complex.”

We hear that a lot and we experience it a lot ourselves, but that’s exactly why we recommend one to three SLIs.

So don’t be afraid to prioritize which user journeys are critical for your service and which aren’t important enough to warrant their own SLIs and SLOs.

Optimizing SLI coverage

Helpfully, there are a few ways that we can prune our SLIs and broaden them so that we have sufficient SLI coverage.

For instance, you can aggregate conceptually similar journeys.

In this case, for example, all of the activities here could be considered variants on a single browse-the-store journey.

But the caveat is that they all have to be about the same magnitude, to stop dominant activities from masking problems that might be seen in the less dominant ones. As long as that’s true, we should be able to put them into a single category together. Or we could try to bucket together similar thresholds.

For instance, if you have a lot of similar-sounding latency SLIs, you could consider grouping them into categories instead.

Often, several variations on loading a page or querying an API can be bucketed into a single “interactive” category, while data modification interactions could be grouped into a separate “write” category with a longer latency threshold.
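
Here’s a hypothetical sketch of that bucketing; the endpoint names and thresholds are invented for illustration, not taken from the talk:

```python
# Group many per-endpoint latency SLIs into two shared categories, each with
# one threshold, instead of maintaining a separate threshold per endpoint.
LATENCY_CATEGORIES = {
    "interactive": {"threshold_ms": 400,  "endpoints": ["/home", "/search", "/api/items"]},
    "write":       {"threshold_ms": 1500, "endpoints": ["/api/update", "/api/upload"]},
}

def latency_threshold_ms(endpoint: str) -> int:
    """Look up which category an endpoint belongs to and return its threshold."""
    for category in LATENCY_CATEGORIES.values():
        if endpoint in category["endpoints"]:
            return category["threshold_ms"]
    raise KeyError(f"no latency category defined for {endpoint}")

print(latency_threshold_ms("/search"))      # 400
print(latency_threshold_ms("/api/upload"))  # 1500
```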

Specifying SLIs

When we specify SLIs, we strongly recommend distinguishing between the specification and the implementation.

The specification is a simple statement of what interaction you were trying to measure.

For instance, the profile page should load successfully.

And these broad statements are useful for maintaining clarity and focus of intent, but they leave a lot of ambiguity as to how we’re going to define and measure this.

So what do we mean when we say “successful”? And at what point in the life of the request are we going to measure that?

That’s where the implementation comes in: it lets us take that specification and work out all the details of what it really means.

For this example, we’re saying it’s the percentage of HTTP requests for specific URLs that have specific statuses indicating success, measured at a specific point.

But we need to understand the tradeoffs regarding each of those choices.

For instance, why are we measuring at the load balancer?

What if we measured at the browser or at the server? Which of those is most practical for us, and why did we choose this one?

What about what counts as success?

Here, we’ve said that 400s are successful and the reason for that is that if a user types in a bad URL, then serving a 404 to them is the correct and successful behavior.

But there are certain other kinds of service behaviors where that might be considered a problematic error that you want to track, so you have to be careful when you decide what counts as success.
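
To make those choices concrete, here is a minimal sketch of one possible implementation, assuming load-balancer log entries that carry a URL path and an HTTP status; the decision to treat 404 as success is exactly the kind of tradeoff being described, not a universal rule:

```python
# Hypothetical availability SLI measured at the load balancer.
MEASURED_PATHS = ("/profile",)  # the URLs this SLI covers

def is_success(status: int) -> bool:
    # 2xx/3xx count as success; 404 is treated as correct behavior for a bad URL.
    # 5xx (and other errors caused by the service) count against the SLI.
    return status < 400 or status == 404

def availability_sli_percent(log_entries) -> float:
    """log_entries is an iterable of (path, status) tuples from the load balancer."""
    good = total = 0
    for path, status in log_entries:
        if path.startswith(MEASURED_PATHS):
            total += 1
            good += is_success(status)
    return 100.0 * good / total if total else 100.0

print(availability_sli_percent([("/profile", 200), ("/profile", 503), ("/profile", 404)]))
```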

In our latency example here, we’ve chosen one second as the response limit.

Why?

Do we know what makes that good or bad?

Is that the right limit?

And what about using a prober?

Is that the right thing to do or do we need to use some kind of sampling on actual user traffic?

Each of those choices is going to shape the efficacy and complexity of your SLI, and with so many tradeoffs to consider, it’s easy to get mired in the details.

But the specification helps remind us of what we were originally trying to measure, and that’s why we make sure to keep it separate and maintained.

Connecting SLOs to user happiness

So now we’ve got some good SLIs, but now we need to figure out how to draw the line that represents user happiness.

Our SLOs are going to need a target and a measurement window.

For example, for this availability SLI, an SLO you might set would be that, looking back over the past 30 days, 99.9% of requests must have been successful.

Well, how did we choose that 99.9%?

We’re going to need to set targets that the users and therefore the business actually need.

We need to make sure our proportion of successful interactions is high enough that our users won’t be leaving us in high numbers.

On the other hand, making the service too reliable will tie up too many of our resources, providing reliability that isn’t actually adding additional value.

Achievable vs. aspirational SLOs

So setting that initial SLO may seem intimidating, but the target you choose at first does not need to be, and in fact should not be, considered set in stone.

The user expectations are going to be strongly tied to your past performance.

So we recommend looking back at your service’s historical performance and basing your initial target on that data.
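
One simple way to do that, sketched here as an assumption rather than a prescription, is to look at a few recent months of measured availability and set the first target just below the worst of them:

```python
# A hedged sketch: pick an achievable starting target from historical data.
def initial_slo(monthly_availability: list[float], margin: float = 0.0005) -> float:
    """Set the first target slightly below the worst recent month."""
    return min(monthly_availability) - margin

# Three months of measured availability -> a starting target of roughly 99.86%.
print(round(initial_slo([0.9994, 0.9991, 0.9996]), 4))
```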

After you have some experience with your SLO or perhaps you’ve had time to run a deeper business analysis, you may find that your ideal SLO is actually higher or lower than your current performance level.

In that case, your current target becomes your achievable SLO, whereas your ideal target is your aspirational SLO.

If the aspirational SLO is currently out of reach, you’re going to need a technical plan for how you’re going to reach that goal eventually.

On the other hand, if you’re actually offering a higher level of reliability than your aspirational SLO, you’ll need a plan to gently transition from your current level of service to your ideal level to avoid shocking your users.

So eventually those plans should allow you to merge the aspirational and achievable SLOs.

But your service and your business are going to continue to evolve, so you’re going to need to revisit that aspirational SLO at least annually to adjust it to match your current status.

But, Liz, how can we enforce those SLOs?

Error budgets for SLOs

Liz: That’s a really great question.

So, I think that it’s really important to think about, “What is the performance in context?” and relate it back to the SLO.

So for instance, if I were to ask a question of, “Is it okay to serve 30 minutes of 100% errors?

Or is it okay to serve 1% errors for five days?"

Well, that depends upon my error budget and how much of it remains to be spent.

Let’s take this set of graphs which represent in varying ways our error budget and our actual performance to give you an idea of how we might think through these questions.

So, these graphs represent the same set of days, the same set of 30 days.

And here it shows two different events that happened.

In one event, we had a significant outage where, on that individual day, we were only available something like 99% of the time, when our target is to be at about three and a half nines.

The other outage that you can see, in the lower right-hand corner, is a long-term degradation, where we were supposed to achieve three and a half nines but were only achieving roughly three nines over that interval.

And we are slowly burning through our error budget over the course of multiple days and weeks until we could resolve the problem.

If you take a binary view of it, and look in the lower left-hand corner, it looks like on paper the slow burn was significantly worse than the short burst of outage.

And that may not actually be true when you look at the numbers.

So we wind up instead doing a burndown analysis.

So we wind up doing an integration over time to figure out, at any given moment, what the average performance was over the past 30 days up to that point.

And we can see from that large outage, that we had a major set of errors that resulted in us burning through our error budget.

However, you can see that even after we recovered from that, the slow burn prevented us from regaining that error budget back.

Because even as events were rolling off the end of the error budget window, we were spending our error budget faster than we were recovering it.

So this kind of burndown chart can really help you visualize how much of your error budget is left and how long it’s going to be until you’re back on track, and it gives you that integration over time that the SLO window requires us to do.
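
A rough sketch of that burndown calculation, assuming you can produce a per-day count of “bad minutes” (minutes of 100%-outage equivalent) from your monitoring:

```python
# Rolling 30-day burndown: at each day, sum the bad minutes over the trailing
# window and see how much of the budget remains.
BUDGET_MINUTES = 43.2  # e.g. 99.9% over 30 days

def remaining_budget(bad_minutes_per_day: list[float], window: int = 30) -> list[float]:
    remaining = []
    for day in range(len(bad_minutes_per_day)):
        spent = sum(bad_minutes_per_day[max(0, day - window + 1): day + 1])
        remaining.append(BUDGET_MINUTES - spent)
    return remaining

# A fast-burn outage (30 bad minutes on one day) followed by a slow burn
# (1.5 bad minutes every day) never lets the budget recover.
days = [0.0] * 5 + [30.0] + [1.5] * 24
print(round(remaining_budget(days)[-1], 1))  # negative: the budget is overspent
```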

Slow burn vs. fast burn events

So, as you can see, when we have these kinds of events, quantifying them in terms of numerical impact can really help us make these decisions about what is important to our service and what is not.

And this relates to the phenomenon of slow burn versus fast burn.

The idea of a fast burn event is that we’re consuming the effective equivalent of weeks of our error budget in a matter of hours.

Whereas in a slow burn event, we have the long-term rate of burn being too high.

That we’re consistently performing slightly below the performance that we expect from our service.

How should we react to these two different scenarios?

For a fast burn situation, it’s important to be very responsive because if you have overspent your error budget very rapidly and that pattern continues, you’re going to wind up blowing your entire error budget before your responders can actually do something about it.

However, the slow burn scenario is a lot more tricky to think about.

Why?

It’s because you need time to evaluate whether something that is slightly below the level of performance you expect is actually a real problem or just statistical noise.

The other facet of this relates to the urgency.

Because it takes days or weeks to measure, it might be okay to wait for a weekday for someone to respond and file a P1 JIRA ticket, right, instead of having to wake someone up. It would be okay for the problem to persist for another day or two, but it’s probably not okay to let that behavior continue for multiple months.

So how does this actually translate into setting numerical thresholds for our pages?

So this first graph that I’m showing you is how we think about fast burn alerting.

The idea is that we do this integration over time to look at how much of our error budget we spent in, for instance, the past hour.

And if we wind up spending seven times the error budget (the threshold could be anywhere between five, 10, or 20 times the spend you expect), you need to know right away so that you can start mitigating the problem.

And that’s a sufficiently high threshold that random noise is unlikely to trigger the condition.

On the flip side, though, for a slow burn alert, you need to gather data over multiple days to find things that are, say, 5% worse than you expect them to be, and to get a clear view of whether or not you’re going to meet your targets at the end of the month.
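
Expressed as code, those two alert conditions might look something like the sketch below; the burn-rate thresholds and windows are the tunable choices Liz describes, not fixed values:

```python
ERROR_BUDGET = 0.001  # for a 99.9% SLO, 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'just sustainable' the budget is being spent."""
    return error_ratio / ERROR_BUDGET

def fast_burn_alert(error_ratio_last_hour: float, threshold: float = 7.0) -> bool:
    # Page immediately: the budget is being consumed many times too fast.
    return burn_rate(error_ratio_last_hour) >= threshold

def slow_burn_alert(error_ratio_last_3_days: float, threshold: float = 1.05) -> bool:
    # Ticket rather than a page: consistently ~5% worse than sustainable over days.
    return burn_rate(error_ratio_last_3_days) >= threshold

print(fast_burn_alert(0.008))   # True: ~8x burn rate over the last hour
print(slow_burn_alert(0.0011))  # True: ~1.1x burn rate sustained over days
```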

Assessing one-off vs. chronic issues

So what I’m getting to is the idea that SREs need to have a control loop.

That we need to have an idea of what circumstances are ordinary incident response, where we get paged for a one-off issue and we fix it, and when do we decide that the SLO is unsustainable and that we need to take more dramatic measures.

So the key to all of this is communicating with all of the relevant stakeholders.

And this is where error budget policies come in.

They say, “What are you going to do when your SLO is in danger, when you are in danger of exhausting your error budget, or when you have spent all of it?”

So I’m going to talk through four different case studies of increasing severity.

In the first case, this is kind of operations as normal.

That we have a situation where we have a 99.9% SLO, and it’s measured on a rolling 30-day basis.

So let’s suppose that we push a new release and it reaches 50% of the fleet before we discover that it has been serving bad queries to all the users that are on that release.

So that effectively is equivalent to 30 bad minutes of 100% outage.

Is that okay or not?

Well, let’s go back and review the math.

So, as Kristina helpfully demonstrated earlier in the presentation, if I go from 99.9% and read across to 30 days, the table will helpfully say that I’m allowed 43.2 bad minutes of 100% outage.

Or potentially, if I have brownouts instead, that window could be longer.

But it’s easiest to think about your error budget in terms of the minutes of 100% outage when you’re converting things.

So in this scenario, what do I have?

I have spent 30 minutes of my error budget.

There are 13 minutes left for the entire rest of the month.

That’s probably fine, right?

As long as I’m cautious. So I might want to do things like implement automatic rollbacks to catch issues earlier, or canary-deploy things at less than 50% for a while before I declare that everything is good and we should push it out everywhere.

But let’s say that it happens again.

That this one-time bad push happens again.

The good news is we implemented longer canaries so the outage was caught when it was 20% rolled out to the fleet.

So we incurred 12 bad minutes.
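
Putting numbers on those two pushes (a sketch that assumes, for illustration, that each bad release was serving errors for about an hour before it was caught):

```python
BUDGET_MINUTES = 0.001 * 30 * 24 * 60   # 43.2 minutes for a 99.9% / 30-day SLO

first_push  = 0.50 * 60   # 50% of users affected for ~60 minutes -> 30 bad minutes
second_push = 0.20 * 60   # 20% of users affected for ~60 minutes -> 12 bad minutes

print(round(BUDGET_MINUTES - first_push, 1))                 # ~13.2 minutes left
print(round(BUDGET_MINUTES - first_push - second_push, 1))   # ~1.2 minutes left
```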

Now, what is going to happen?

What should the SREs and product developers do?

There are a few things that we need to do now that we’ve spent almost all of the error budget.

The first of these is that we need to understand what happened.

That it can’t just be the on-call person investigating, but instead we need a multidisciplinary team of SREs and product developers to implement more safeguards.

We also need to pause feature releases and eliminate other sources of risk to make sure that we don’t exceed our error budget and make our users unhappy, until the 30-minute spend that we incurred earlier rolls off the tail of the 30-day window.

And we can also prioritize deploying a fix, even if we have frozen feature releases, to make sure that we’re mitigating the risk going forward.

Mitigating repeated SLO violations

But let’s suppose that we have a chronic string of issues, and it’s happening now not because of pushes, but instead because our application has hit some kind of fundamental scaling limit.

Like, for instance, if we had a single dependency on a MySQL database, and we stopped being able to grow that VM, or it started becoming lock-contended, and more and more user queries are timing out and failing. Our SLO says 99.9% of journeys need to be happy.

But now we’re failing 0.2% of our journeys.

So there’s nothing to roll back.

What do we do?

In this case, we need to start realigning the team’s priorities.

There’s nothing that we can immediately do about it, but we can do things like change the quarterly objectives to allot a lot more time toward migrating, say, from MySQL to Spanner.

In the short term, we may be able to do things like throw more machines at the problem; if, for instance, we could go from a 32-core system to a 40-core system, that might help.

And we can also reduce the other sources of risk in our systems.

For instance, we might decide that in addition to not pushing new features, maybe we’re going to be a little bit more conservative in how we handle data deployments.

But in the event that you have a system that is wildly out of whack, it really doesn’t make sense to constantly be waking people up, right?

If you’re constantly being alerted that you’re failing your SLOs and there’s nothing you can do about it, that’s really frustrating as an SRE.

So instead, that involves having conversations.

Maybe it winds up being the case that we can prioritize rearchitecting, that we can focus on changing the application in the long term. But this is also a situation where we’ve discovered that our aspirational SLO and our achievable SLO are two different quantities.

And we may decide that we want to change the achievable SLO to a value that is going to detect severe breakage, but that we don’t care about being alerted day to day about only achieving 99.5%.

But if these discussions break down, we may wind up having to decide that the SRE or platform team would be better served supporting a different service that is more amenable to interventions if we can’t agree about the priority of the reliability work.

So in summary, we had to introduce four distinct policies here.

We need to say that if the SLO is endangered, we’re going to have slow burn and fast burn alerts that let us know, and that the on-call’s job is to mitigate.

That if the SLO actually winds up being violated or nearly violated over this rolling 30-day window, we need to deploy an interdisciplinary team to look at the problem and make sure that it’s thoroughly fixed.

But if there are repeated violations, that requires follow-up and reprioritization of the team’s goals, and if there are chronic violations, we may need to change our Service Level Objectives or negotiate about dropping the service.

Using error budgets to help teams move faster

So I’ve talked kind of about the stick aspect of error budgets, but I want to talk about the carrot as well.

Kind of how do we use error budgets as a tool to help product development teams move faster?

And I think that the best illustration of this is using that kind of left-to-right view, with the left being less reliable and the right being more reliable, and constructively building up what reliability looks like.

So we can kind of think about our services as being constructed from a bunch of components, right?

Like we build our services, targeting a certain level of reliability, and then we improve them over time to be even more reliable.

And then there’s some amount of work we can do in order to prop things up and keep them running as optimally as possible so that they fall within the Service Level Objective range that we talked about.

And then it doesn’t really matter whether we get lucky or not.

If we get lucky, we achieve higher reliability, but if not, we’ve appropriately engineered our systems so that we can meet our reliability targets without luck.

And sometimes that works, right?

Sometimes, even if there’s no luck, we still operate within the SLO.

And what’s left is a range of errors that would be acceptable if they happened.

But if not, we can do cool things like experiment, right?

Like we can push releases out more often.

We can run A/B tests, we can do chaos engineering.

These are all useful things that we can do assuming that we’ve appropriately engineered our service so that with some construction and some maintenance, it meets our targets.

But what happens if there’s a defect?

What happens if the service is less reliable than we designed it to be?

So in that case, that amount of reliability goes away.

And then no matter how much firefighting we do, we aren’t going to be able to meet our SLO.

So in that case, we have to do the punitive measures I talked about before, right?

That we have to freeze.

But then we improve the reliability based on what we’ve learned and we’re back to acceptable errors or being able to do experiments and move as quickly as possible within those constraints.

Sustainable Service Level Objectives

So that’s how we think about the discipline of SRE.

That we think about, “How do we execute reliability improvements? How do we define the appropriate level of reliability or the SLO? And how do we do the maintenance in a sustainable method that doesn’t burn people out?”

So to conclude what we talked about today, is the idea that it’s important to define a Service Level Objective.

That if you don’t have a Service Level Objective, you’re not measuring what matters to your users, and you’re going to be caught off guard when your company’s Twitter account blows up or when your customer support team’s phone is ringing off the hook.

That’s the worst possible way to discover an outage.

So it’s important to define some measure, even a simple one, of what matters to your users, and to measure it so that you’ll know whether or not you’re on track.

Once you have a Service Level Objective, you can make iterative improvements.

You can do things like add slow burn and fast burn alerts.

You can add policies and say, “What will you do when the error budget is in danger?”

And you can do additional things like measuring more user journeys or being more sophisticated.

And you can even develop nested SLOs that govern relationships between different teams at your company, rather than only talking about the end-to-end experience of customers.

This is something that we think is important for people to do, and the vast majority of customers that we’ve interacted with can do this and have done this work to implement SLOs.

It’s an iterative process.

You’re never completely done, but you always have something that is good enough that is going to meet your needs in the short term.

You don’t have to be in a situation where product developers and site reliability engineers are arguing with each other.

You don’t have to wind up in a situation where feature velocity and reliability are at odds with each other.

Because it turns out that the right thing to do is declare what’s good enough and then empower people to move as fast as they can within the window of keeping reliability good enough.

So having SLOs is really the critical element of having well-functioning operational teams and SRE teams.

Resources for SREs

So there are a few more resources in case you’re curious to learn more.

And we do have cards to distribute that link to the book on the left, which is the “Site Reliability Engineering” book.

It’s available to read for free online.

There also are two books coming out in the next few weeks.

“The Site Reliability Workbook” is a sequel to the SRE book that describes real case studies from GCP customers of how they implemented SRE.

And “Seeking SRE” is a volume that describes extensions of SRE to various fields, such as privacy engineering or ethics.

So last thing, we wanted to insert a little plug for our team.

So we’re from the Customer Reliability Engineering team.

And like I said earlier, our job is to help Google’s customers move as fast as possible on GCP.

So this means that we’re experienced SREs who help customers with setting SLOs.

We help people review whether their applications are built to the appropriate standards to meet their target SLOs.

And then we do things like implementing shared monitoring between our customers and Google so that the support teams are on the same page about what’s business critical to our customers to help them resolve outages faster.

And we are there for you when it matters.

We do things like help retailers prepare for Black Friday.

We do things like joint postmortems when there are problems.

Because frequently it turns out to be the case that our customers can mitigate issues rather than waiting for GCP to fix issues that are going on on our end.

So these are all things that we think are important for people to do and that we’re available to help our customers with.

So thank you very much.

I think we have time for five minutes of questions and then after that, you can find us downstairs at the GCP booth.

So thank you so much.