
Changing the calculus of containers in the cloud (Deepak Singh, AWS)


Published: July 12, 2018

Not one destination but many

I know how difficult it is to organize a first conference—I was at the first re:Invent—so congratulations and thank you for asking me to be here.

It’s been fun sitting upstairs and watching all the fabulous announcements and all the interesting discussions about how people actually reason about their infrastructure and about their applications.

I wanted to start off with a quote by Ralph Waldo Emerson.

I’m pretty sure he wasn’t talking about technology, but it applies: we have a destination, but technology is not about one destination, it’s about many.

You are at a point in time, you’re trying to get to a place, and you really need to understand the journey and the mechanisms that you put in place to get where you want to go.

And when you get there, chances are you find out that you have to get somewhere else.

It’s something that applies really, really nicely to the journey that we’ve seen in the cloud over the last decade or so, and in containers, in particular, over the last few years.

Where the journey begins

So, let’s talk about the journey.

I’m going to simplify grossly, but, for many of us, the journey is—we started off on-premises.

You heard James talk about his dark days in managing mainframes.

But, for most of our customers, the journey starts in a data center.

For many of you in this room, it may have started off on a cloud provider already, but for the majority of our customers, it’s still in a data center.

Twelve years ago, I still remember when EC2 was launched.

I’ve been in Amazon for a little over 10.

And it was quite magical, the fact that I could use a simple CLI call to launch instances.

There was nothing: no fancy consoles, just a simple SDK. But the fact that I could sit in my living room (I was running bioinformatics code at the time) and launch a bunch of instances to power through a bunch of protein sequences was quite magical.

Today, that infrastructure is a lot broader.

We have virtual machines, we have containers, and we have functions with serverless.

And for many of our customers, it’s not picking one or the other, it’s actually a combination of the three of them.

And they have become pretty good, and they’re getting better, at figuring out which parts of their infrastructure run on virtual machines or bare metal (which we now have at AWS), which parts run in containers (which is what I’m going to talk about today), and which parts run in functions.

Containers and microservices

So, let’s talk a little bit more about containers.

Probably not a surprise to anyone in this room: containers (and when I say containers, I mean Docker, in some sense), which became popular a few years ago, were not necessarily something completely and radically new. But it became pretty easy to take your application (you may have had one, you may have had two, you may have had 10), package it up, simplify it, and get a very consistent toolchain that worked on your laptop. And you could take that same artifact and push it into production.

That was pretty nice.

It made things a lot simpler.

People also figured out that they could take multiple of these, package them up, and run them on the same machine.

Again, nothing new. There are customers who’ve been doing it for years.

But the fact that you had a toolchain that made it simple and didn’t have to build your own was, while not radical, really empowering, and you saw customers just jump on.

I’ve rarely seen a new technology that was so young being adopted by so many people so quickly.

But, you also had this trend.

We’ve heard a little bit about microservices and service-oriented architectures.

But the fun part about it now is that you’ve taken these big monolithic applications.

The application is still there, but underneath that application, you have a bunch of smaller components that live together, that work together, but are often deployed by different teams, built in different languages—or at least, you can think about doing it that way—each with their own scaling characteristics, allowing you to move much more quickly.

Netflix had been doing this for years before the word “container” became even something most of us talked about.

But, containers made this a lot easier to do.

But here’s a problem that you have.

You now have to do this for many, many apps and many, many machines, and you have to figure out how to wrap your head around them.

It sounds like a great thing to do, but when you actually try and do it, that’s where you end up.

It’s not a fun place to be.

And about three years ago, that’s where people were.

So, what happened?

You started getting a bunch of container orchestration tools coming onto the market.

Container orchestration

I’m going to talk about ECS because that’s the one that I designed and built, and the one I know best, but we’ll also talk about container orchestration in general.

What does a container orchestration tool do?

Container orchestration started off pretty simply. It wasn’t brand new.

It built on things like what had been learned from HPC systems, which had been managing clusters, applications, and processes for years, and on things like Mesos, which had been around well before Docker was a thing.

You have a system that allows you to take the infrastructure you have and keep tabs on all the individual processes and applications running inside it. But here’s where the fun part came.

In a VM/hardware/server world, you were focused on the overall infrastructure and on keeping tabs on whether an application executed and finished, but you were still looking at things at the broad, physical infrastructure level.

You didn’t really have a notion of applications.

And one of the best parts about container orchestration has been that the application, whether it’s an ECS task, or a Kubernetes pod, or a deployment, or a service, actually becomes the first-class object.

And how do you manage that?

How do you maintain that?

How do you make sure that deployments are happening successfully? All of that suddenly became part of this framework, instead of something you cobbled together from a bunch of tools on top of infrastructure management.

It sounds pretty obvious these days, but it wasn’t three years ago.

And so I think that’s probably the most interesting change that’s happened in how we think about infrastructure: the tools that we use actually have the notion, the semantic notion, of what an application is.

And that ties in really well into how you observe and reason about those applications.
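To make that concrete, here is a minimal sketch of treating the application, rather than its hosts, as the object you query and reason about. It uses boto3 with made-up cluster and service names; it is an illustration, not anything from the talk.

```python
# A minimal sketch: inspecting an ECS service as a first-class application object.
# The cluster and service names below are hypothetical.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

resp = ecs.describe_services(
    cluster="orders-cluster",      # hypothetical cluster name
    services=["order-intake"],     # hypothetical service name
)

for svc in resp["services"]:
    # The orchestrator tracks the application itself: desired vs. running copies,
    # the task definition (the app's declarative spec), and its deployments.
    print(svc["serviceName"], svc["desiredCount"], svc["runningCount"])
    for deployment in svc["deployments"]:
        print(" ", deployment["status"], deployment["taskDefinition"])
```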

The case of McDonald’s

So, what are people doing with this infrastructure?

I think somebody just talked about McDonald’s sandwiches. I kinda heard something.

So, who in this room has not had a McDonald’s burger?

I can’t see any hands, so if you raise them, I won’t be able to tell, but I’m assuming most of you have had one, or at least know what it is.

McDonald’s, for those of you who don’t know, is a place you go to eat food.

You get burgers, you get fries, you get shakes, you get other things.

I haven’t been to McDonald’s in a long time.

Anyway, they also are moving along with the times.

You don’t just go to McDonald’s anymore to get a sandwich, and you don’t just go to a drive-through. You can order McDonald’s from the comfort of your living room, and somebody comes and delivers it to you.

Awesome.

But, as a company that did not do this before and that operates globally, McDonald’s has some challenges.

One, food delivery is a fast-moving business.

The features, and capabilities, and the types of food delivery services you integrate with change all the time. So they had to build a system that you could add features to very quickly.

This is McDonald’s.

So, they take a lot of orders, lots of people like eating there, the burgers are cheap.

So, the system had to be scalable, it had to be reliable. When you order your fries, you want them to get there.

They don’t operate in one country, and they don’t operate with one provider.

They may be using Uber Eats over here, Grab over there, across pretty much every country on the planet, from what I can tell.

And food delivery isn’t the world’s most profitable business.

It has to be cheap.

That was the fundamental business problem they set out to solve.

And how did they do that?

They could have built it on EC2 instances. They could have done it many other ways. There are many ways to skin this cat.

But, today, when you’re building an environment like that, the chances are that you’re building it in containers.

There are very few people who probably won’t think about that as their first resort.

So, this is a classic architecture.

It works really well for our customers.

I’m using this as an illustration of how people are actually building systems.

So, they have their menus up there. This is people making the orders, and you have your third-party delivery platforms on the bottom left. That’s the people who go and deliver these.

So, as the orders come in, they go into a queue. You could choose to do it any other way, but a queue is a great way to put the data in.

As the queue starts getting filled up, they have a number of services that are different components of these applications.

Each of those services is essentially an Auto Scaling group running across multiple Availability Zones, running one or more ECS services that can scale up and down independently.

As orders come in, things scale. As orders dry up, people go home.

They’re not that hungry, it’s the middle of the day, you’re probably in your office, your Auto Scaling goes down.

But, you can scale your services independently.

Some services are relevant for one type of food delivery. Some are not.

At the backend, they’re using—and this is, again, a critical part of how, at least, we see people scaling their architectures—you’re using a set of managed data services.

It’s very rare these days, except for the few people who really, really love managing databases, to manage your own database. You’re using a managed system.

And they built an API that interacts with these microservices, interacts with the third-party deliveries, so when your Uber driver comes, they’re not delivering my burger to the person sitting in that corner.

That would not be nice.

So, in the end, they built this infrastructure, and they built it in just two or three months.

The fun part was that, as existing AWS customers, they were still using the same Auto Scaling, the same SQS queues, the same metrics that they were using earlier, the same databases, but they were building the applications as containers, as microservices, which is pretty nice.

The way they built it, they can add a new platform, a new third-party provider pretty easily.

They can sustain (this was in November last year) 20,000 transactions a second with less than 100 milliseconds of request latency. And it’s pretty cost-effective, even when they’re not getting a lot of orders.

It’s not an over-engineered system. They can scale it pretty nicely.

These kinds of systems, where you have individual services that scale relatively easily based on some kind of load metric and talk to some kind of data store, work really well.

And I’ll talk about this kind of architecture a little bit more in a slightly different context very soon.
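As a rough sketch of the queue-driven scaling pattern just described, here is one way to wire an ECS service to the depth of an SQS queue with Application Auto Scaling and boto3. The queue, cluster, service names, and target values are placeholders, not McDonald’s actual configuration.

```python
# Sketch: scale an ECS service on SQS backlog. All names are hypothetical.
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")

# Let Application Auto Scaling adjust the ECS service's desired count.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/orders-cluster/order-processor",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=50,
)

# Target tracking on queue depth: scale out as orders pile up,
# scale back in ("people go home") as the queue drains.
autoscaling.put_scaling_policy(
    PolicyName="scale-on-order-backlog",
    ServiceNamespace="ecs",
    ResourceId="service/orders-cluster/order-processor",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # keep the visible backlog around 100 messages
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateNumberOfMessagesVisible",
            "Namespace": "AWS/SQS",
            "Dimensions": [{"Name": "QueueName", "Value": "orders"}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 120,
        "ScaleOutCooldown": 60,
    },
)
```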

More agile monoliths

This is another, not an uncommon problem.

You have customers with many, many deployments, lots of services, a bunch of instances sitting underneath, and many containers. But you’re not running these small microservices; you’re a company that had these monoliths, and you’re trying to get more agile, as it were.

Well, one way of doing it is to spin up a cluster and actually use that cluster as a cost boundary.

You decide, “This is what I’m going to do.

Because I’m paying for the instances and the infrastructure, I’m going to fix that.

And within this, I’m going to scale these monoliths."

When Team A comes in, they get a copy of that, they can scale it as they want, and they can deploy it onto that cluster.

Team B comes in, I give them their copy of the monolith, and their scale is a little different.

They’re just going to use this little corner.

Team C comes in. You can do it that way as well.

For most of our customers, this is what they do.

Not everyone’s on microservices yet. They’d like to get there, but they’re probably starting off with a monolith that they start breaking apart over time, adding components to it.

Containerized infrastructure does make that much easier.

You have to be careful, you don’t want to make the same mistakes that you were making earlier, but it works pretty well.
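Here is a rough sketch of that “cluster as a cost boundary” idea: one shared cluster (a fixed pool of EC2 instances you pay for), with each team deploying its own copy of the monolith as a separately scaled ECS service. All names and revisions are hypothetical.

```python
# Sketch: several teams share one cluster, each with its own independently
# scaled copy of the monolith. Names and task definition revisions are made up.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

shared_cluster = "shared-monolith-cluster"

# Team A's copy of the monolith, scaled to its own traffic.
ecs.create_service(
    cluster=shared_cluster,
    serviceName="monolith-team-a",
    taskDefinition="monolith-team-a:7",
    desiredCount=10,
    launchType="EC2",
)

# Team B's copy, using just a little corner of the same cluster.
ecs.create_service(
    cluster=shared_cluster,
    serviceName="monolith-team-b",
    taskDefinition="monolith-team-b:3",
    desiredCount=2,
    launchType="EC2",
)
```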

Reactive architectures

Getting back to these microservices, some of the more common systems that we see these days are what we call reactive architectures, and there’s actually a whole manifesto about what a reactive architecture means.

I stole a slide from one of our SAs. You can probably go track it down at some point.

But, basically, what this architecture does is split your app into a real-time app and a non-real-time app.

As you can probably guess, the one at the bottom is your real-time app.

What you want, again, is a queue up front, then a load balancer pushing load to a cluster, to a bunch of containerized services that can scale.

You immediately take processed data, put it into Redis, because you want to tell all your subscribers that some event has happened.

This is where Lambda comes in, and it’s really, really sweet because you can tie all of this together, because it’s just responding to events and making sure things end up in the right place.

And for your data ingestion pipelines, which are not necessarily real-time, you can asynchronously start pushing things, updates, to a more persistent store like DynamoDB.

Again, the key part is using managed data stores, you’re using containers in the middle to scale all your services and APIs on any requests that are coming in, and you put your real-time data into a real-time data stream.

And it’s not that complicated an architecture. It works really nicely.

You can reason about it, you’ve got really nice dashboards and new features to look at over here, and things work pretty well.
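As a minimal sketch of that split, here is one way the two paths could look in Python: a containerized service handles the real-time path and publishes processed events to Redis, while a Lambda function handles the asynchronous path and persists updates to DynamoDB. The host names, table, channel, and the assumption of a queue-style Lambda trigger are all placeholders, not a specific customer’s design.

```python
# Sketch of the reactive split: real-time fan-out via Redis, async persistence
# via Lambda + DynamoDB. All names are hypothetical.
import json

import boto3
import redis

r = redis.Redis(host="orders-cache.example.internal", port=6379)
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("orders")  # hypothetical table


def handle_order(order: dict) -> None:
    """Real-time path, running inside a containerized service."""
    processed = {**order, "status": "accepted"}
    # Fan out immediately so subscribers (apps, dashboards) see the event.
    r.publish("order-events", json.dumps(processed))


def lambda_handler(event, context):
    """Non-real-time path: a Lambda tying events together and pushing updates
    into a persistent store. Assumes a queue-style trigger event shape."""
    records = event.get("Records", [])
    for record in records:
        order = json.loads(record["body"])
        table.put_item(Item=order)
    return {"persisted": len(records)}
```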

Enter Fargate

So, cool, we’ve made a lot of progress over the last two years.

Nobody’s ever asked me that question, ever.

Well, maybe one or two people.

I’ve actually had somebody ask me where my S3 object sits.

So, why do we still think about clusters?

This is a question—I come from the HPC world, back in the day, where clusters were the things you heard the most.

At my first job, I sat next to a 200-node cluster, which is where everything ran.

I could see it from a glass partition, very happy to see the shining lights.

But clusters, to me, don’t make sense in the cloud, just the way nobody asks you which rack your EC2 instance is running on.

And why do we care so much which host my containerized task or containerized pod is running in?

It’s what we do.

It doesn’t make sense.

So, even before we actually shipped ECS, the first thing we tried to figure out was what something like Fargate would look like.

It took us a while to figure out what people wanted.

We actually had to build ECS first. We had to get systems like Kubernetes out there and really have customers understand how they wanted to use those systems.

I’m grossly simplifying, but for those of you who don’t know what Fargate is, if you have a cluster of machines with a bunch of containerized tasks running in them, with Fargate, that cluster goes away.

You have a containerized app, you tell it to run, it runs.

You’re no longer paying for the machines, you’re paying for the resources that the containers are consuming by the second.

That’s where we started. There’s a long way to go, but from my perspective, the way I think about it is that Fargate is a new compute engine that’s designed to run containers.

So, from an AWS perspective, what it means is, you pick your orchestration tool: ECS or EKS, which is our managed Kubernetes service. Then you pick your compute engine. Most customers today still run on EC2.

Over time, we think most customers are going to switch and run on Fargate.

And how do they use it?

How does this evolve?

How do they think about their applications?

Because you no longer have to think about bin packing, you no longer have to think about what kind of host you want to provision.

You have a declarative or some form of high-level definition of your application, and you want to just run it.

You have to do a lot more than that, but at a minimum, that’s what you want to do.
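Here is a minimal sketch of that “declare it and run it” idea: a task definition plus a Fargate run_task call, with no instances to provision. Every name, ARN, and subnet ID below is a placeholder, not a real value from the talk.

```python
# Sketch: declarative task definition + Fargate run_task. Placeholder values only.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Declarative description of the app: image, CPU, memory. No hosts anywhere.
ecs.register_task_definition(
    family="hello-fargate",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",      # 0.25 vCPU
    memory="512",   # 512 MiB
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "web",
            "image": "nginx:latest",
            "essential": True,
            "portMappings": [{"containerPort": 80}],
        }
    ],
)

# Run it. You pay for the resources the task consumes, not for machines.
ecs.run_task(
    cluster="default",
    launchType="FARGATE",
    taskDefinition="hello-fargate",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
            "assignPublicIp": "ENABLED",
        }
    },
)
```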

And Fargate, at least as we think of it, and as our customers have told us, makes that process a lot simpler because, just cognitively, it’s a simpler process.

You can use a similar architecture. It doesn’t completely change the way you think about your applications, especially as people run them today, but it’s one less layer of infrastructure that you need to think about.

Again, you’re seeing these patterns of applications where the heart of your application is a bunch of load balancers or discovery engines sending traffic to a set of Auto Scaling services that talk to a managed database at the end.

You can keep repeating these patterns all day long, and they kind of work.

Monitoring Fargate

But, the Fargate journey is just getting started.

All of you who were running a log driver suddenly don’t have a host to run it on.

You can’t log into a host and start introspecting your Docker logs, et cetera; that all goes away.

So, how do we think about those?

So, when we launched Fargate, our first partner was Datadog.

And we continue to work with folks like them to understand what monitoring and observability look like in a world where your hosts go away and you’re talking strictly at the level of the containerized task, the containerized service, or the containerized deployment.

And we’re very interested in talking to everyone in this room who has started using Fargate and is moving into production: what would they like to see?

What does a sidecar even mean?

Questions like that.
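One way to read “sidecar” in a world without hosts is a second container in the same Fargate task that collects metrics or logs for the application container. Here is a generic sketch with placeholder names and images; it is not Datadog’s documented agent configuration.

```python
# Sketch: an app container plus a monitoring sidecar in one Fargate task.
# Names, images, and the ARN are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="web-with-monitoring-sidecar",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "web",
            "image": "example.com/orders-api:latest",  # placeholder app image
            "essential": True,
            "portMappings": [{"containerPort": 8080}],
        },
        {
            # The sidecar shares the task's network namespace, so the app can
            # send metrics to localhost even though there is no host to log into.
            "name": "monitoring-agent",
            "image": "example.com/monitoring-agent:latest",  # placeholder agent image
            "essential": False,
        },
    ],
)
```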

And I’m pretty excited to see where this ends up.

With that, that’s my email address. If you have any questions, any comments, any thoughts, let me know.

You can always find me on Twitter, DMs are open.

And, thank you very much.