
Safer Deploys with Test-Driven Infrastructure (Starbucks)


Published: July 17, 2019

Thank you very much, and thank you, Jason, for the introduction. But also, thank you, Datadog, for giving me the opportunity to speak and, actually, to close out the conference.

I think it’s been a very enjoyable conference. How about a hand for Datadog and everyone there?

Background

So I’ve been introduced, and why am I up here? Because I consider myself an Ops professional and I’ve been in the industry for 20-some-odd years.

I started way back doing value-added reselling and selling Websense across South Carolina.

I spent some time at the state of South Carolina as a backup and recovery administrator, Unix administration.

That was probably the low point of my career, because as part of the State of South Carolina, we had disaster recovery testing.

How many have enjoyed the thrills of disaster recovery testing?

Exactly.

So 1,100 LTO4 tapes boxed up in a van driven to Philadelphia does not make for a good weekend. But since then I’ve spent some time at Oak Ridge National Lab, spent some time in the Bay Area working and consulting at a number of great companies, Airbnb, Pinterest.

The Ops life

But these days, I’ve landed at Starbucks. I’ve been there for about a year and a half.

I’m really enjoying the experience, and the common theme across most of these roles is that I’m jumping into new environments, getting started with environments that I’ve not seen before.

Everyone experiences that world as Ops professionals. We’re expected to just make it work, and what are we making work?

We don’t know, we haven’t seen it before and even now in my position at Starbucks, it’s kind of an internal consulting position because I’m helping out individual teams with their Ops challenges.

I don’t necessarily know their infrastructure.

If I’m lucky, I’m given a diagram.

More often than not, I’m sitting with that team for a week or two to figure out what’s going on, what really matters, and what’s important.

And so often, the experience is this.

It’s jumping out of a perfectly good airplane to go fight fires, to put things out, and hopefully bring some sanity, some coordination and control, to whatever that burning landscape looks like.

I think we’ve all been here.

I think the rest of the industry actually calls this imposter syndrome, but you operate in your environment, you learn it, you survive, you endure, and as time goes on, things get better.

So this is a lot of our Ops experience.

Should it be our Ops experience?

So from that 10,000-foot view, this is how I answered the question of “What should I be doing as Ops?”

What it takes to do this, though, is a very large question, because again, every environment that I’ve jumped into, every company that I’ve jumped to, has a completely different array of instrumentation, different systems.

They’ve got different applications they’re running, and so the mix of skills that it takes to bring all this together quickly balloons and becomes some Cartesian product of figuring out: “Okay, what am I doing here? What am I doing now? How am I applying all these Ops skills to make this thing better, to make this thing work?”

The one thing that it really captures is the fun part of our jobs, that is, pop quizzes. So at 3 a.m., we get some alert saying the site is down, and we’re debugging across an environment trying to figure out, well, what’s working and what’s not working?

How do I check this?

I thought this was working yesterday.

Was there a change that maybe upgraded the load balancers?

Was there a change of the application version?

I don’t know.

Let’s run some curls across the environment.

Let’s run some other commands just to figure out what’s going on.

Knowing the environment is absolutely critical.

So here’s a survey for you.

How many of you, seasoned professionals that you are, can draw your infrastructure stack from the top of your head if I were to pull up a whiteboard right now?

Draw your stack.

Oh, so some people are already struggling with this. I experience it every day.

No worries, but, separately, of those people who had your hands raised, how many of you could take me back to your documentation wiki and point at a diagram that actually reflects that infrastructure?

Yes, gold star right here.

Know yourself, know your environment

But for the rest of us, and this is a large-capacity room, the documentation isn’t there. The overall sitemap isn’t there. The orientation and the landscape of what we’re doing isn’t there.

We’re having to wing it.

We’re having to figure it out as we go along, and we don’t just need this at 3 a.m.

We need this every day, day in, day out, as we’re making controlled changes in our environment: knowing the current state, making sure everything’s A-okay as we’re working on this, because the worst thing we can do as professionals is to make a change and not know what it impacted.

So we test our current state. We ask ourselves what that change is gonna do to our environment.

We actually go ahead and make the change and then, we start to validate: “is that change actually working how I want it to?”

And then boom, we only close it out once we actually know. And so all these skills distill down to what I experience as an operations professional on a daily basis.

I started with the analogy of smoke jumping in, a 10,000-foot view jumping out of an airplane. I like starting off with that big-picture approach because you see the landscape of what’s happening.

You’re jumping into an area, you start working on it.

And having that 10,000-foot view gives me a good perspective on what systemically has to be done in this environment, and so I’m gonna take an aside just briefly and call out this good work.

This is from Chick-fil-A’s tech blog, Caleb Hurd and Laura Jauch, if I’m getting their names right.

Caleb, a sharp engineer, is actually a manager for Chick-fil-A SRE and created this distillation, and what I like about it is that it tries to answer that same question: as Ops, as SRE, what am I trying to do?

Let’s break it down to some goals and let’s break it down to some tactics of here’s day-to-day operational stuff we can be working and improving to make our infrastructure and environment better.

I really like this structure because it does that, but it also gives us a chance to step back, reflect, and ask those bigger questions. What excited me about first encountering this diagram was that I immediately started asking the question of what business measurements, what business metrics, are we capturing to reflect that our goals are actually being achieved?

So I sent off a note to Caleb and we’ll see if version two comes out.

The value of DevOps

And so we’ve taken a good look at Ops, and the buzzword du jour is DevOps. Devs are those wonderful engineers who create all the features that our sites use to attract customers, which is a good thing, because if you relied on me for those features, you’d be getting a 1995 HTML page with a burning flame across the bottom.

I don’t do Dev, but we’ve brought the two together, we have DevOps, and all is right with the world. Obviously, because we’re at a conference talking about this stuff.

No, my common experience is this, which is an improvement from last week because last week it was 47.

So things get better day by day, but we have this conversation, Dev and Ops, back and forth, sharing knowledge, tactics, concerns, and having this, again, fruitful conversation around “What can we do better? What can we make different?”

And so to take a moment to reflect, we now practice, on the Ops side, a software development life cycle for our infrastructure.

We use Jenkins to make sure that every time we run our Terraform, or every time we go to make a deploy, it’s done in a consistent and repeatable manner.

We know exactly where it’s happening from.

Terraform is tracking that state file for us, keeps it in a good place, takes care of it, and pushes it out, but it’s a life cycle.

So we’ve got these controlled changes we’re making through time.

We also have this idea called immutable infrastructure.

Let’s containerize our applications as much as possible.

We started with Packer and AMIs, and so our blue/green deployment said, “We’ll spin up a new batch of AMIs, switch the load balancer, move on.” And because we now have contained units of what our application server is doing, we can reason about it.

We can know that between the old version of the container and the new version of the container, this is the only difference, because we’ve got those version numbers. We know what it should be, we can expect that, and we can track it, and move forward.

Separately, version control was my favorite.

It’s a time machine and not only do I see what changed through time, if I need to go back to that previous version, it’s right there.

It’s waiting for me, and one of these days, one of these days, it would be so nice to be able to git bisect my infrastructure as code, but I’m not quite there yet. But infrastructure as code kind of pulls all this stuff together.

Let’s go ahead and make sure we have a declarative configuration, so that our tool takes our intent and goes out and applies it. Or, separately, if we’re using Ansible to deploy our infrastructure, we declare our imperative steps of “do this, do that.”

Either way, we’ve got repeatable, organized, understandable, readable deployments, and we know what’s involved.

We don’t have junk sitting under somebody’s desk.

What is test-driven deployment?

And so it’s been a very fruitful conversation, but our developer brethren have tons of great ideas and I don’t think we’ve exhausted the entire conversation.

I think there’s more to learn and one idea that I think is ripe for plucking is test-driven development.

How many are familiar with test-driven development?

A good amount.

For those who may not be as familiar, the mantra is red-green refactor.

Write a failing test. Stand that up, watch it go red. It’s not gonna pass. We haven’t written our code yet.

We’re good, but because we have a test, now when we write that first code and run it, the test passes.

It’s green, and we build ourselves a little bit of a safety net, a little bit of infrastructure, so that when we go back and refactor that change, we can watch the performance.

We know we’ve not regressed.

We know we’ve not lost any functionality because that test is still passing.

And in practice with Python, we write a test.

It’s gonna fail the first time we run it.

Secondly, maybe we write a stub function to build things out.

That’s still going to fail, but moving on, we write our first version of the function.

“Hello world,” great. It returns “Hello world.”

We’ve done the bare minimum necessary to make sure that our code is doing what it should be doing. Success.
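To make that concrete, here’s a minimal pytest-style sketch of that first red-green step; the module and function names (greeter, hello) are purely illustrative.

```python
# test_greeter.py -- written first, so the very first run fails (red)
from greeter import hello

def test_hello_returns_greeting():
    assert hello() == "Hello world"
```

```python
# greeter.py -- the bare minimum needed to turn the test green
def hello():
    return "Hello world"
```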

Let’s go have a beer. But sometimes we need to do a little bit more before we call it a day.

So you think, “Well, what am I gonna be doing in the future? What am I gonna care about? Let’s go ahead and build in a little bit more.”

So here, I extended the function to just be a little bit language-aware, so that when I come back the next day, boom.

I write my second test, test it in English.

It just happens to be the default language because I’m an English speaker by default, but then I also add some capability for Spanish.

Again, it’s still failing because we haven’t actually written the code yet.

Write some code to get it out there. Okay, great.

Now it’s green, and the test is working the whole time, making sure things keep working as we’re making changes.
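Continuing the same illustrative sketch, a second, initially failing test drives the language-aware version; the Spanish greeting and the keyword argument are assumptions for the example.

```python
# test_greeter.py -- the new test is red until the function learns Spanish
from greeter import hello

def test_hello_defaults_to_english():
    assert hello() == "Hello world"

def test_hello_speaks_spanish():
    assert hello(language="es") == "Hola mundo"
```

```python
# greeter.py -- extended just enough to make both tests pass
GREETINGS = {"en": "Hello world", "es": "Hola mundo"}

def hello(language="en"):
    return GREETINGS[language]
```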

So with this idea, adding this infrastructure, it’s an investment.

We have to take extra time to write the code, to set it up, to make sure everything’s working.

It’s a pain in the butt.

I just need the things to work because there’s beers going on and I wanna get out to the party.

What are we buying with all that investment?

Sanity, speed.

Without tests, you’ve got people going crossways across traffic.

You’ve got people mixed in with vehicles.

You’ve got different rates of speed.

You don’t have control.

In a tested environment, you’ve got guardrails.

You’ve got appropriate lanes for traffic to go in.

It’s much easier to organize, and traffic as a whole moves much smoother.

It’s a good idea for devs.

Is it a good idea for infrastructure?

Well, implicitly, I’m going to say yes.

I’m the speaker. I’m up here.

That’s what I chose to talk about, but how do we make this happen?

Well, our framework is still the same.

We write our test, write just enough infrastructure to make it pass, and then, we have the freedom to refactor if necessary.

What sort of tests do we need?

But what tests? How would we even begin to generate tests? What kinda tests are we looking for? What kind of code do we have to write?

We already know what kinds of tests, because we’re running these kinds of tests on a regular basis. As we’re making our controlled changes, we’re already doing this somewhat.

We don’t have the infrastructure in place, we don’t have the testing framework in place, but if we’re making disciplined changes, we know what the infrastructure is.

We measure that infrastructure before we make our change, we make our change, and then at the end of it, we test to see that the change is doing exactly what we need it to do.

So here, we’ve got curl statements peppered throughout it. Let’s apply that to our infrastructure.

So before we even write up our first Terraform deployment for our infrastructure, great, let’s see if we even get a result.

I make a curl call. It’s gonna return a 404. That’s good, very fast. That’s kinda nice.

I’ve got speedy infrastructure, but it’s ultimately failing because it’s not there.

Let’s set up an NGINX server, maybe do the deployment behind that, and then immediately I can see, “Hey, look, it’s now operating at 30 milliseconds.”

Oh, maybe that’s fast, maybe that’s not, but more importantly, it’s a 200.

Our infrastructure is doing exactly what it needs to do. I don’t have to worry about it as an Ops person.

The developers are now free to go off and create their own test environments to see it working for themselves. There’s bliss and sanity in the world, but when they need to come back and refactor that and say, “Well, users are never gonna put up with 30 milliseconds,” great.

They have the freedom to refactor, they can improve that scenario, they can improve that situation, and in the simulated example, they drop the response time to 12 milliseconds. That’s progress.
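Sketched in Python rather than raw curl, that same red-green loop for infrastructure might look like this; the URL and the latency budget are placeholders, not values from the talk.

```python
import requests  # assumes the requests package is installed

ENDPOINT = "https://myservice.example.com/"  # placeholder URL
LATENCY_BUDGET_S = 0.030  # 30 ms, per the example above

def test_endpoint_is_serving():
    resp = requests.get(ENDPOINT, timeout=5)
    # Red before the NGINX deployment exists (404), green once it's up (200).
    assert resp.status_code == 200
    # Refactoring target: keep responses inside the latency budget.
    assert resp.elapsed.total_seconds() <= LATENCY_BUDGET_S
```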

I would like to think I’m the first person to say that test-driven infrastructure is a good idea.

Obviously, I’m not.

We have this in the configuration management world.

Puppet has rspec; it does its thing.

It’s nice. Here in this example, we’re testing: is NGINX out there? Is it available? Is it doing what it needs to do?

And it works. It’s nice.

Chef does the same thing. It uses a language called InSpec, previously ServerSpec.

We have evolved.

InSpec, again, in this example, we’re doing much the same thing, but it’s at a very low level. We’re checking the package. We’re checking the port.

Well, not really, we’re just checking that it’s up and it’s running.

We’re actually not testing the content that’s happening at all, but I give the Chef example because, again, I do feel it’s progress.

I particularly appreciate in the Chef environment that Test Kitchen integrates everything very nicely and gives a more cohesive experience in building.

Using the GOSS daemon

Separately, moving on, a tool you may not have heard of that I’ll draw your attention to is GOSS.

It’s a small daemon, it’s written in Go, and it has a couple of features that I find particularly convenient.

In a base mode, it does GOSS validate.

It takes its configuration file and does a one-time check of “Is this infrastructure operating appropriately?”

Yes, it is. No, it’s not. So a one-time check.

Separately, you can use the GOSS daemon in a serve mode: it stands up an HTTP endpoint, constantly runs that validation check, and so you can point your monitors at it and get a long-term idea of what’s happening and how that’s working.

Separately, a convenience that I’ll note here: GOSS also has a feature where you can point it at a running process and it captures the parameters of that process, such as port numbers, PIDs, process names, and systemd statements like whether the service is enabled. It can actually auto-configure itself when you point it at something.

So getting started with GOSS is very easy to do.
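As a sketch of pointing a monitor at that serve mode: assuming goss serve is running with its defaults (an HTTP endpoint at /healthz on port 8080 that returns a non-200 status when checks fail), a poller can be as small as this.

```python
import requests

GOSS_URL = "http://localhost:8080/healthz"  # assumed goss serve defaults

def goss_is_healthy():
    # goss serve re-runs its validation on each request; anything other than
    # a 200 is treated here as "something in the goss file is failing."
    resp = requests.get(GOSS_URL, timeout=5)
    return resp.status_code == 200

if __name__ == "__main__":
    print("healthy" if goss_is_healthy() else "unhealthy")
```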

So we have tools to unit test configuration management and so this is good stuff. Our systems will provision more accurately.

I’ve got a framework so that when I go to refactor this stuff, I know it hasn’t regressed. I know I’ve still got the same functionality that I want, but let’s try this litmus test.

If I’ve got broken infrastructure at 3 a.m., can I run this tool to help me identify what’s broken and what I need to fix?

Sadly, no. These tools are great, they’re awesome, but again, they operate at the configuration management layer.

If I’m doing any sort of image building, then that testing is probably happening before my image even gets baked and goes into production.

So it’s a one-off kind of feel.

Test-driven production for Kubernetes

Separately, let’s look in another area, Kubernetes. How many people operate Kubernetes?

Decent size.

How many people actually have Kubernetes in production?

Wait a minute: there are more hands for Kubernetes in production than for Kubernetes.

Maybe I woke somebody up with a question, so thank you.

But if you’re using Kubernetes, hopefully, you’re familiar with liveness probes.

Okay.

Do you have one implemented?

Okay, a small number.

Readiness probes as well?

Okay.

So for the rest of the room, the idea is simple: Kubernetes is a container orchestration environment, and our containers are bundled up in pods.

As Kubernetes is putting those pods into production, a liveness check is a constant check that Kubernetes can make to ask, “Hey, is this thing functioning as it should be?”

Because of that, if it’s not functioning as it should be, the orchestration system says, “Well, you’re broken, and we’ll go ahead and spin up a replacement,” which is nice.

A readiness probe is a very similar idea.

It’s written very much the same way, but the mechanism is a little bit different.

Rather than killing the pod after it’s live, it gates traffic: the readiness probe waits until the check passes before it actually starts directing live traffic to the pod, so we’ve got a good deal of control around what’s healthy and what’s getting good traffic.

In this particular example, we’re using the healthz endpoint, we’re using just an HTTP check.

You can also use command-line checks, and so, wrapping this up with an earlier tool, you can go ahead and wrap it up with GOSS and then get a more full-featured check around “Is that application behaving as I want it to?”
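On the application side, the probe just needs something to hit. Here’s a minimal standard-library sketch of a /healthz handler that a liveness or readiness probe could poll; the port and the health logic are placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def app_is_healthy():
    # Placeholder: a real service would check its database, queues, etc.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz" and app_is_healthy():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(503)  # probe failure: the orchestrator reacts accordingly
            self.end_headers()

if __name__ == "__main__":
    # A liveness or readiness probe would point at http://<pod-ip>:8080/healthz
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```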

And I’m calling this out specifically because if you’re migrating old applications that existed as standalone infrastructure, you can capture that with GOSS because of its auto-discovery, containerize everything, and then run your GOSS check, and you still have a like-for-like comparison.

“Hey, it was working before, my tests were passing.

Now that I moved over here, my tests are still passing and I’m good."

So I like that safety net.

So we also have tools for functional testing and I’ll go back to the litmus test I used before.

If I’ve got broken infrastructure, can I use this to help identify what the problem is?

And now answering this question is a little bit tricky because to a limited degree, yes, it’s fixing errors before I even know I have them.

If I’ve got a failed pod, it’s replacing it, but it’s not fixing things in all cases.

I’ll still have whole classes of errors that aren’t gonna be addressed by this approach at all.

Pod restarts: if I’ve got a crash loop, it’s gonna keep on crashing, and the restart count is gonna keep climbing. I’m not putting bad pods into production, but depending on what’s happening, I can still starve myself of resources or just lose pods outright. And so as an approach, we’re maturing, but I think there’s still more mileage to get.

So stepping up, smoke test, right?

As part of our Jenkins deployment, relying on some of that continuous deployment that we’ve got going on, let’s write, as part of our Jenkinsfile, a validation test.

And in this particular example, I’m using the InSpec pod to check, at deployment time, is the application, now that it’s deployed, running as it should be running?

And here, using the HTTP resource, we go ahead and fetch that main page as it should be and capture that.

I’ll pass it some host parameters just to get a feel for whether this overall application is in a healthy state at deployment time.

I’ve got a check for that.
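The InSpec control itself isn’t reproduced here, but the same deploy-time smoke test sketched in plain Python (in the spirit of the more portable option discussed next) might look like this; the host variable and URL are placeholders.

```python
import os
import sys
import requests

# The deploy pipeline would pass the target host in, e.g. via an environment variable.
TARGET_HOST = os.environ.get("TARGET_HOST", "myservice.staging.example.com")

def smoke_test():
    resp = requests.get(f"https://{TARGET_HOST}/", timeout=10)
    assert resp.status_code == 200, f"expected 200, got {resp.status_code}"
    assert len(resp.text) > 0, "main page came back empty"

if __name__ == "__main__":
    try:
        smoke_test()
    except AssertionError as err:
        print(f"smoke test failed: {err}")
        sys.exit(1)  # fail the Jenkins stage
    print("smoke test passed")
```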

Now, the approach is going to be a little bit error-prone here, especially where I’ve written the test in InSpec for convenience.

If I ever need this check during a troubleshooting scenario, I’m probably not gonna run InSpec, because I was running it in a pod, setting up the environment.

It may or may not be convenient.

A little bit more portable is curl, but now, all of a sudden, as far as maintaining this stuff, I’ve gotta make sure my curl matches my InSpec in all cases. So maybe I should just write the thing in code to begin with, but I don’t know.

Can you use this test to troubleshoot?

It’s a good approach, but just because of tooling, it just doesn’t seem to come together quite how I want it to.

So again the question is, can I use this kind of test to troubleshoot an application if it’s broken?

And yes because I have the test encapsulated. I know what I’m trying to check for.

I’ve documented how the application should be checked. But it’s still not complete, because inherently, this check is only running at deployment time and, hopefully, or more likely, my bug hasn’t happened at deployment time.

If it has, then I’ll manage it, I’ll back it out, but at 3 a.m., hopefully, nobody’s doing a deployment, unless you have colleagues like some of mine.

But it’s not happening at 3 a.m. and so this test hasn’t run and it’s not returning useful information back to you.

If you know it’s there, you can go grab it, run the test manually.

But then you’re still testing one component of your entire application stack and so you’ve missed all that other stuff going on. And you want these tests to run continually.

You want them to be running so that errors are caught and that information is returning to you as quickly as possible.

So we have some tests and we know what we want them to look like, so let’s forget everything…

Well, not everything I’ve said.

Let’s forget about the past five minutes of review material. Imagine that we want this test-driven development thing.

We want this red-green refactor. We know we’ve got some workable tests. How do we make it happen?

So maybe in an ideal environment, what I’m doing is defining a 200-response count.

Just basic: is my site returning mostly healthy responses?

It’s what I want it to look like.

It’s a test. It’s gonna fail the first time. How do I write just enough infrastructure code to make it pass?

And here’s where Datadog comes in. It helps me out quite a bit. If you use the Datadog Python package, it’s got a component, dogshell, and I find it quite convenient. It gives me command-line access to most of my Datadog components.

So again, I had a 200-response-count monitor that I’d already created here. Dogshell will actually allow me to dump that.

Why is that useful?

Well, if I dump it, I can stash it in a file. In a file, I can put it in my version control repo, and I’ll have that history as it changes over time. I know how it’s evolving over time.
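Dogshell is a thin wrapper over the Datadog API, so the same dump can be sketched with the Datadog Python client directly; the monitor ID, file name, and environment variable names here are my own placeholders.

```python
import json
import os
from datadog import initialize, api  # the datadog Python package

# Credentials pulled from the environment; the variable names are just a convention.
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

MONITOR_ID = 1234567  # placeholder: the 200-response-count monitor's ID

# Fetch the monitor definition and stash it in a file we can commit to the repo.
monitor = api.Monitor.get(MONITOR_ID)
with open("monitor.json", "w") as f:
    json.dump(monitor, f, indent=2, sort_keys=True)
```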

More importantly though, what I really wanna do is put that automated check right alongside my microservice code. And then, every time I check in that repo, every time I make updates, let’s automatically deploy that check and make sure it’s consistent.

And so conveniently, there’s now a feature in dogshell to go ahead and update that from a recorded JSON snippet and so how would that work?

Well, let’s go ahead and take that JSON, dump it out, and put it in a monitor file as part of our repo.

So this is just, hopefully, a getting-started repo.

The developer said, “I needed a microservice,” and I said, “Yes, yes, you do. Go out and write it.”

And so the first thing he or she’s gonna do is write a Dockerfile. She’s gonna stub out our Jenkinsfile, and then I ask, “Hey, what do you need this microservice to do?”

And then we sit down and we write a monitor file to say, “Oh, okay. Well, let’s make a check, let’s look over the last five minutes at response volumes, and we want them to be over 95% successful.”

Maybe that’s a good number.

Maybe that’s a bad number.

I just wrote a test.

I don’t really care at this point.

I can come back. I can revisit it later, but I’ve got a definition of some amount of health for that microservice, and deploying that in my Jenkinsfile was also relatively straightforward.

I can go ahead and add a stanza to my Jenkinsfile that just says, “Hey, go ahead and run that dogshell command,” and get it there.
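The deploy side can be sketched the same way: a small script that the Jenkinsfile stanza calls, reading the checked-in monitor file and creating or updating the monitor. The query shown in the comment is an illustrative stand-in for the “95% successful responses over the last five minutes” check, not a real metric.

```python
import json
import os
from datadog import initialize, api

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

# monitor.json lives alongside the microservice code, roughly:
# {
#   "name": "hello-service success rate",
#   "type": "metric alert",
#   "query": "sum(last_5m):sum:hello.responses.success{*} / sum:hello.responses.total{*} < 0.95",
#   "message": "hello-service is returning fewer than 95% successful responses @ops-team"
# }
with open("monitor.json") as f:
    definition = json.load(f)

fields = {k: definition[k] for k in ("type", "query", "name", "message") if k in definition}
monitor_id = definition.get("id")

if monitor_id:
    api.Monitor.update(monitor_id, **fields)   # refresh an existing monitor
else:
    created = api.Monitor.create(**fields)     # first deploy: create it
    print("created monitor", created.get("id"))
```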

So, here, I focused on the test, meaning maybe it’s accurate, maybe it’s not.

What are good tests?

Again, I started with the observation that jumping out of an airplane, I’ve got a nice 10,000-foot view.

I would have said that, as an Ops guy, I wanna make sure the user’s happy.

So for me, when I started asking questions around what tests we should be writing, I come back to this: what do users notice?

SLIs and SLOs

And there’s some great material in the Site Reliability Workbook and the Site Reliability Engineering book from Google that starts talking about this, and they distill it down into SLIs and SLOs.

An SLI, if you’re not familiar, is just a metric or a monitor that represents the entire service well.

An SLO is the distillation of that into a policy.

It’s the goal we want for that individual SLI.

In my example, if I step back a couple of slides, you’ll see it, I was just checking response rate.

But as best practices, latency, saturation, traffic rate, and error rate are all good things to check, and so I’ve distilled these down into a file.
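As an illustration of what that distilled file might hold, here’s a sketch of those four checks as data that a deploy script like the one above could iterate over; every metric name and threshold below is made up for the example.

```python
# monitors.py -- an illustrative "monitor file" kept alongside the microservice.
MONITORS = [
    {  # latency
        "name": "hello-service latency",
        "type": "metric alert",
        "query": "avg(last_5m):avg:hello.request.duration{*} > 0.3",
        "message": "Average latency over 300 ms @ops-team",
    },
    {  # error rate
        "name": "hello-service error rate",
        "type": "metric alert",
        "query": "sum(last_5m):sum:hello.responses.error{*} / sum:hello.responses.total{*} > 0.05",
        "message": "More than 5% of responses are errors @ops-team",
    },
    {  # traffic
        "name": "hello-service traffic",
        "type": "metric alert",
        "query": "sum(last_5m):sum:hello.responses.total{*} < 100",
        "message": "Traffic dropped below the expected volume @ops-team",
    },
    {  # saturation
        "name": "hello-service saturation",
        "type": "metric alert",
        "query": "avg(last_5m):avg:hello.worker.utilization{*} > 0.9",
        "message": "Workers over 90% utilized @ops-team",
    },
]
```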

I put it with my microservice.

It’s getting deployed alongside, and the next feature Datadog comes in with: they’ve just released SLO monitoring in Datadog.

So now, because I’ve identified from the start of the application code what’s important about it, I wrap that up with a conversation with the business around: what do you care about?

How much investment do you really need for this microservice? Is it user-facing?

Are they gonna even notice when it goes down?

Are we justified in spending $10,000 an hour to keep it up?

Then they’ll say, “Ha, ha. No, that’s why you’re using open source.”

They’ll come back and they’ll say, “Oh, no, it’s probably appropriate at this level.”

It’s a healthy conversation at a good level, and we just codify what that appropriate level of investment is.

So from that, we create very nice dashboards that let us know if we’re inside that error budget or if we’re outside that error budget.

And that error budget is a very nice concept that takes that packaging of SLOs and then says, “Hey, if you’re operating in advance of where you think you should be, run wild. Have fun with it. You’ve got some time to burn.” Or alternatively, if you’ve had a lot of crashes over the past couple weeks, then maybe you’re burning through your error budget.

It reminds everyone that, “Hey, we’re burning through this stuff too fast. We’ve gone a little bit faster than we wanna be.

“Everyone needs to shift down, be more focused on stability over features, and just be more mindful of how we’re proceeding in our development, what features we’re choosing to deploy, and how we’re choosing to deploy them.”

So I’ve got a good deal of control, or at least more control, around the pace of development and the pace of deployment.
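To make the error-budget idea concrete, here’s a small worked sketch: given an SLO target and the traffic observed so far, how much of the budget has been burned? The numbers are invented for the example.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Report how much of the error budget implied by an SLO has been consumed."""
    allowed_failures = (1 - slo_target) * total_requests  # the error budget
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return allowed_failures, consumed

# Example: a 99.5% SLO over 1,000,000 requests allows 5,000 failed requests.
allowed, consumed = error_budget_report(0.995, 1_000_000, 3_200)
print(f"budget: {allowed:.0f} failed requests, consumed: {consumed:.0%}")
# -> budget: 5000 failed requests, consumed: 64%
```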

So in this monitor file approach, I’ve got those checks that I care about alongside the application code, the microservices that I care about.

And by deploying it in a consistent manner, I’ve now shifted left all that discussion around what we need to monitor and why we need to monitor it.

It’s happening from the start of the microservice creation and I don’t have to care as much about following this stuff all the way through.

I’m not building out the telemetry for the production system after it’s in production.

It can follow me through staging, through QA, and then into production.

And then at the start of go live, I’ve got pretty decent control around how that’s operating and what I’m seeing with that.

And so, should it go down, as in this case, I’ve got an error drop, which is kind of a bad thing.

But I’ve got an error drop on the application servers right here, and a quick glance between the different graphs says, “Well, yeah, I’ve got error drops right here in the application stack, because on AWS and then in the database, it recovers relatively strongly.”

After this minor glitch, let’s go and check out the applications first.

Operability as a part of the design and development process

So why should we care? If you’re not an Ops person in this room, I totally get you.

But I will advocate that this kind of approach is good for devs, it’s good for Ops, it’s good for interns, and it’s good for managers, because by making sure that we get that infrastructure built in from the start, a sense of scale, a sense of performance, and a sense of application architecture are all cohesively built in, and then as we move forward, the whole team understands it better.

So I would encourage you, as an Ops advocate, to treat operability as a design consideration and to build it in from the start.