Serverless From Scratch (OpenFit) | Datadog


Published: July 17, 2019

Everybody hear me fine?

There we go.

Now I can hear myself.

Everybody, welcome.

Thank you for joining.

So, Openfit, you’re gonna have to do some jumping jacks right now, so you guys ready for that?

Everybody: No.

No? Yeah, wake up a little bit.

I’m not gonna make you do jumping jacks.

So welcome, I was introduced as Reza, but I wanna share something with you that for the longest time, my nickname was Rez 300.

And I will tell you why.

And some people might know what I’m talking about if you’re in SRE or incident response.

All right, so a couple questions, actually one for now.

How many of you work out four to five times a week?

All right, right on, right on, okay.

Openfit’s typical customers

So, as it turns out, our user base is much like you guys, they’re very, very committed and dedicated to their fitness regimen.

And they do it pretty consistently.

Now, imagine it’s 6:00 in the morning, right, and you get up, you have your coffee, or you have your pre-workout drink.

And some of you if you had those, they wire you up pretty good, you go from zero to 100 pretty fast.

You put on your clothes, you get in front of whatever device that you have that’s going to stream our application or your makeshift gym in the garage.

And you basically turn on, you go to your favorite workout and click play, and boom, no stream.

So you know, you have an error that says “Stream not available.”

And being that it’s 2019, you get upset, you run to Facebook, and because you’re part of this fitness group that keeps each other accountable, which is actually a really good thing,

you start saying, “Oh, I can’t get my video to work, it’s not streaming, it’s not buffering correctly.”

And everybody’s like, “Yeah, I’m having the same problem.”

So you know, all the negative comments start flowing and it’s definitely not good for our business.

So social media, one of our challenges.

This was our reality for many, many… I would say, it’s a couple years.

If we ever had a marketing event or a new program coming out, we would blast out emails saying, “This is a new program.”

And usually our programs are pretty popular, so a lot of people would flood in that Monday, or whenever the program was launching, and it would break the site.

We couldn’t scale correctly, our stack was not very scale friendly.

And I can get into that a little bit later on.

And we would have all these issues all the time.

And we were always at the edge of our seats when marketing had to do this and announce a new program.

An introduction to Openfit

So as you might have figured out by now, we’re a digital fitness streaming platform, much like Netflix, but for fitness.

And we’re actually a sister company of Beachbody.

So Beachbody you guys know, Openfit is newer.

And we are right now currently on web, iOS, Roku, and Android, and those have been our very popular platforms.

Our mission is to make fitness easy, accessible, and affordable.

So you’re traveling, you’re out there, you wanna just pop our videos on your mobile device, and we wanna make it easy for you.

We wanna make it quick so that you can get your workout in and be done.

And, of course, provide a stable service, right?

So it really is paramount that when you go there to press play, that that thing loads up, because you’re ready to go, right.

And we all know that working out can sometimes be challenging, and you might not have the motivation for it, so anything that stops you from doing that is not a good thing.

So that is our mission to provide a stable service.

Crunching the numbers

Stats: since we launched on January 17th, we have about 45,000 users, not bad for a few months.

So marketing plays with acquiring users all the time, sometimes they turn it up sometimes they turn it down, and they’re trying to figure out the best way of acquiring.

We’ve had 400,000 qualified views, video plays, which is pretty impressive for a few months.

Basically, that’s start to end: we track that and make sure each one counts as one video played from start to end. That’s a qualified view.

In order to do that, we have about 65 million service calls per month on our serverless platform.

And of course, that increases every month as we get more users.

And that is supported by 206 Lambdas across 26 API Gateway endpoints.

So split between a blue and a green.

So we’re always running a blue and a green at all times.

How Openfit settled on serverless

All right, so obviously, we had scaling problems that we needed to solve.

We looked at different technologies, we looked at how are we gonna handle this.

Are we gonna rewrite our stack? At that time it was on EC2s, no containers; it was a LAMP stack.

It was put together pretty quickly, just like all the MVPs that get out there.

And we wanna see…you know, do we have to rewrite that to work with servers?

Or do we want to go into a container model and work with orchestration?

But we looked at serverless and we were like, “I think that’s like the best fit for us.”

You know, the more we looked at it, the more it appealed to us.

As with all transformations, you have to make sure that it fits your model, right?

You can’t just go out there and pick a technology and say, okay, that’s the latest and greatest so it’s gonna work for me, no.

So, my second question, is anybody running serverless right now?

Oh, a few, okay.

Are you guys 100%?

Partial?

No, okay.

So we’re 100% serverless, and I’ll get into all the different serverless stacks that we have.

Real quick, serverless is basically any component of the cloud offering where the management, scaling, and availability are all handled by the cloud provider.

And examples: Cognito, Dynamo, and of course, Lambda.

We’re in AWS, but you know, different cloud providers have their own flavor.

Of course, you’ve heard Lambda being referred to as function as a service, basically that’s what it does.

Runs the function and provides you that service.
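As a rough illustration of "function as a service", here is a minimal sketch of a Python Lambda handler behind an API Gateway proxy integration. The greeting logic and the `name` parameter are made up for the example; only the `handler(event, context)` shape and the status-code-plus-body response come from how Lambda and API Gateway work.

```python
import json


def handler(event, context):
    """Entry point the platform invokes once per request (hypothetical example)."""
    # With an API Gateway proxy integration, query string parameters arrive
    # in the event; the key can be present but set to None.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    # API Gateway expects a status code and a string body in the response.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

There is no server setup, port binding, or process management in the code; the platform calls the function when a request arrives.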

Now, serverless doesn’t mean server free, right, this is running somewhere, it’s not like it’s running in a black hole.

There’s microcontainers and VMs that are running it, but of course it’s all managed by the cloud providers.

And there are definitely pros and cons in using serverless.

You know, you have to make sure that your application fits the model.

And with serverless, you’re running code only and not servers and code.

So that’s the benefit of the serverless model.

Serverless vs servers vs containers

Now, let’s really quick compare servers against containers and serverless.

How many people are running… A lot of people are running containers right now.

Anybody running pure EC2 instances with a stack on them?

Okay, a few, awesome.

I’m sure it’s sometimes painful for you guys.

Servers, right, traditional, whether on-prem, off-prem, VMs, physical: they’re hard to maintain, they’re hard to update, and there’s a lot of security to deal with.

If you’re really serious about security, you have to make sure that they’re very compliant.

Everything that you need to worry about with servers, right.

So you have provisioning the instances…and I’m talking about EC2s, but this goes for pretty much all the servers.

Your scaling groups, your security, your VPC, your network configuration, server updates, framework updates, maintaining the code, monitoring storage, monitoring time sync, and security scans.

With enterprise-level companies, if you have a pretty robust security team, they always wanna probe your servers.

So that recently came around on the Beachbody level, that we need to probe all the servers and make sure that they’re secure.

Containers, great technology, I got nothing against containers.

They are fantastic, a step up obviously from servers, run your code anywhere.

A lot of, you know, tools and frameworks and orchestration stacks that handle containers.

But you still have to worry about the image, you have to worry about dependencies, frameworks, container security, have to worry about orchestration.

And still with your network, of course, maintaining code level security, all that stays the same.

Enter serverless.

So with serverless, this is pretty much all you have to worry about.

Maintaining code, maintaining your dependencies, and maintaining code level security.

So everything else as far as the networks and how they’re run, of course, you can run them within VPCs and outside of VPC and I’ll explain that.

But mostly, if you’re running outside of the VPC, this is all you really need to worry about, maintaining your code, maintaining your dependencies, and maintaining the code level of security.

Some notes on serverless workloads

All right, so what are compatible workloads?

That’s you know, a fundamental important note that you have to take as far as when you wanna choose a technology for your stack.

RESTful APIs that do CRUD actions, they’re perfect candidates for serverless.

Usually when you’re updating or creating something, it’s pretty fast; it’s not a long-running data job.

You know, actions like registering a customer, logging in, retrieving content, modifying account preferences, those are all pretty short transactions, and they work very well with Lambda.
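To make the "short CRUD transaction" idea concrete, here is a small sketch of one function handling create, read, update, and delete for account preferences. It is not Openfit's code: the `userId` path parameter and the in-memory `_PREFERENCES` dict are assumptions for illustration, and a real service would use a managed store such as DynamoDB instead.

```python
import json

# In-memory stand-in for a real table; data here would not be shared
# across function instances, so this is for illustration only.
_PREFERENCES = {}


def _response(status, payload):
    return {"statusCode": status, "body": json.dumps(payload)}


def handler(event, context):
    """Handle one short create/read/update/delete action per invocation."""
    method = event["httpMethod"]
    user_id = (event.get("pathParameters") or {}).get("userId")
    if not user_id:
        return _response(400, {"error": "userId is required"})

    if method == "GET":
        if user_id not in _PREFERENCES:
            return _response(404, {"error": "not found"})
        return _response(200, _PREFERENCES[user_id])
    if method in ("POST", "PUT"):
        _PREFERENCES[user_id] = json.loads(event.get("body") or "{}")
        return _response(201 if method == "POST" else 200, _PREFERENCES[user_id])
    if method == "DELETE":
        _PREFERENCES.pop(user_id, None)
        return _response(200, {"deleted": True})
    return _response(405, {"error": "method not allowed"})
```

Each branch does a single lookup or write and returns, which is the kind of workload that finishes in milliseconds rather than minutes.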

As it turns out, Openfit, that’s the majority of our transactions.

Our business model is actually fairly simple.

We’re going to definitely get more complicated with our services and what we offer, but it’s literally like Netflix.

You know, you go in there and you register a user, the user buys a subscription, you give them entitlement to the programs, they log in, they stream, they look at the programs, that’s it.

So short transactions.

Now, let’s define what short means.

It’s not like it has to be in milliseconds, or a couple of seconds.

Lambda has a 15-minute timeout, so you still can do a lot of work within that, that 15 minutes.

If you have a data job that you need to run at night, you need to transfer some records over from a service to a service, why not use Lambda, it’s pay-per-use.

And you wouldn’t have to have a server sitting there idle costing money.

So data job running in a serverless, you save a lot of money, just using a function versus having a server that you have to worry about.

Concurrency in serverless

Concurrency, very important for when you wanna scale, correct?

I mean, that’s basically the meat and potatoes of scaling.

You know, you should be able to handle lots of concurrent connections and requests at the same time.

So Lambda is extremely good at handling that.

And basically, there’s a limit, a soft limit on Lambda, in AWS at least, of 1,000 concurrent connections.

But you can always increase that once your application reaches that level of traffic.

With concurrency, the way it works is that a function at any time will go and grab however many connections that it needs, from that pool of 1,000, and then run it and then release it back to the pool.
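One way to reason about how much of that pool an application needs is Little's law: the number of executions in flight is roughly the request rate multiplied by the average duration. The sketch below applies that estimate; the traffic figures in the usage note are hypothetical, and the pool size of 1,000 is the soft limit mentioned in the talk.

```python
import math


def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Little's law: in-flight executions = arrival rate x time in system."""
    return requests_per_second * avg_duration_seconds


def fits_in_pool(requests_per_second, avg_duration_seconds, pool_size=1000):
    """True if the estimated concurrency stays within the account pool."""
    needed = math.ceil(estimated_concurrency(requests_per_second, avg_duration_seconds))
    return needed <= pool_size
```

For example, 200 requests per second at 250 ms each works out to about 50 concurrent executions, while 5,000 requests per second at 500 ms each would need about 2,500 and would not fit in a pool of 1,000 without a limit increase.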

So Lambda, very good for that.

We are running at around maybe 50 concurrent connections top right now.

Of course, our parent company Beachbody is running at somewhere around 1,000 to 2,000, which is pretty high, because they have 1.6 million users.

But we’ll get there one day.

Concept memory and time balance

Oh, another concept: memory and time balance.

So Lambda functions are…basically the cost is calculated based on the memory that you give it and the time of execution, in increments of 100 milliseconds.

So every 100 milliseconds, you’re getting charged for more.

Meaning, if you run for 50 milliseconds, you’re getting charged for 100; for 100, you’re charged for 100; for 150, you’re charged for 200; and so on.
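The rounding described here can be written as a one-line calculation. This is a sketch of the arithmetic only; the 100 ms increment is the figure given in the talk and is kept as a parameter, since billing granularity is set by the provider.

```python
import math


def billed_duration_ms(duration_ms, increment_ms=100):
    """Round an execution time up to the next billing increment."""
    return math.ceil(duration_ms / increment_ms) * increment_ms
```

With this, 50 ms and 100 ms are both billed as 100 ms, and 150 ms is billed as 200 ms.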

If your whole application is serverless, you have to be careful that you don’t have transactions that are running too long, right?

Again, if you have one execution that’s running at night and doing a data job, perfect.

But if all your transactions are taking 7 seconds or 10 seconds, then you would be having a problem anyways, because your site will be pretty slow.

But you’re also gonna be incurring a lot of costs.

So you have to be very careful with the memory and time balance.

And I’ll talk more about that later on.

Security for serverless

Security, so by no means am I a security expert, there’s way smarter people that do that at our company.

But what has happened with Lambda is you can run them…if you’re familiar with this, you can run them within a VPC or outside of a VPC.

So basically we’re running everything outside of VPCs, and you know, everything is public-facing.

That doesn’t mean that we’re not secure, and I’ll get into that.

But that simplifies a lot.

We don’t have to worry about VPC configuration, NAT configurations, network settings.

You know, so when a developer, let’s say, launches something into the wild opens up a security group, we don’t have to worry about that.

So what the serverless model allows you to do is, is basically run outside of that.

All of our APIs are secured in different ways, which are with Cognito authorizers.

If people use Cognito we log the users through the Cognito, and then all of our APIs basically have to be authorized from that point on.

Obviously, API keys, and, you know, we use SSL. Lambda, because it’s ephemeral, makes it very hard for attackers to grab your process and do damage.

They still can, but you know, it’s much easier to do it on a server.

Especially since the number one security risk with servers is the operating system: unpatched servers.

You know, once the attackers get in there, then they can wreak havoc.

With serverless, it’s always patched by the cloud provider.

You know, all the dependencies, all the OSs and what have you.

So serverless makes it very hard for an attacker to do a lot of damage.

And that’s also very important when it comes to timeouts.

So you have to make sure that you use a good timeout model on your Lambdas, right.

You want the timeout to be, let’s say, 10 seconds or 5 seconds for something that takes a couple of seconds to execute.

You don’t want it to be a couple of minutes. With a short timeout the function shuts down quickly, and it’s hard for somebody to get a hold of that process.

I talked about our Cognito authorizer.

And then, of course, other security concerns, like the OWASP Top 10, those always stay relevant.

And also code-level security: sanitizing your inputs, validating your data, those still remain the same.
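As an illustration of what code-level validation can look like inside a function, here is a minimal sketch that checks a registration payload before anything touches a data store. The field names (`email`, `name`), the length limit, and the email pattern are all assumptions made for the example, not Openfit's actual rules.

```python
import re

# Deliberately simple pattern for illustration; not a full email validator.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_registration(payload):
    """Return (cleaned, errors) for a hypothetical registration payload."""
    errors = []
    cleaned = {}

    email = payload.get("email")
    if not isinstance(email, str) or not _EMAIL_RE.match(email.strip()):
        errors.append("email is missing or malformed")
    else:
        cleaned["email"] = email.strip().lower()

    name = payload.get("name")
    if not isinstance(name, str) or not 1 <= len(name.strip()) <= 100:
        errors.append("name must be 1-100 characters")
    else:
        cleaned["name"] = name.strip()

    # Reject unexpected fields instead of silently passing them downstream.
    unknown = set(payload) - {"email", "name"}
    if unknown:
        errors.append("unexpected fields: " + ", ".join(sorted(unknown)))

    return cleaned, errors
```

A handler would return a 400 response when `errors` is non-empty and only pass `cleaned` on to the rest of the code.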

So you have to do that to be secure.

The Zero trust model, I talked about it: basically, if you’re running outside of a VPC…even if you’re within a VPC, you want all of your services to be authorized when they’re calling each other.

So don’t leave them open even if you’re behind a VPC.

It’s good practice to have them talk to each other, and say, “Hey, are you authorized to call me?”

So when you go outside of VPC with Lambda, that forces you into a Zero trust model.

So you have to make sure that everybody is getting authenticated against each other.
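The talk's setup relies on Cognito authorizers and API keys for this. As a generic illustration of the "are you authorized to call me?" check, the sketch below swaps in a different, simpler technique: a shared-secret HMAC signature on each service-to-service request. The function names and the per-caller secrets table are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, caller: str, body: bytes) -> str:
    """Signature a calling service attaches to its request."""
    message = caller.encode("utf-8") + b"\n" + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_authorized(secrets: dict, caller: str, body: bytes, signature: str) -> bool:
    """Deny by default: unknown callers and bad signatures are both rejected."""
    secret = secrets.get(caller)
    if secret is None:
        return False
    expected = sign(secret, caller, body)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)
```

The point is the default: a request is rejected unless the caller proves who it is, regardless of which network it came from. A production scheme would also need replay protection (for example a timestamp inside the signed message), which this sketch leaves out.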

Obviously, least privilege principle when it comes to roles.

You know, beware of third-party packages, when you include them in your functions, those are another security risk.

But it’s much more manageable with serverless and Lambda.

And then, of course, protecting the user data at all levels with SSL and other security layers.

Serverless costs

So cost.

Let’s imagine you have an EC2 server or a container and it’s running, and it’s fielding your requests and sometimes it’ll go idle, it’s not doing anything.

And you know, you’re paying for that resource.

With Lambda, you’re pay-per-use, and it’s a very nominal fee.

Sometimes I wonder why AWS doesn’t hike up the price.

But AWS, if you’re here don’t listen to that.

But you know, it’s very, very, very affordable.

If your functions are short and very optimized, it’s a lot of cost savings.

I really have not come to a model where the amount of requests run through an EC2 cost less than a Lambda.

Of course, if you get to millions of requests per hour, where your EC2s are always getting traffic, then it makes sense to put that on EC2s.

But then what happens with that instance? You have to create more EC2s so that you’re not running at 90% CPU.

So it manages to balance itself out when you’re using Lambda.

That’s at least what we’ve found with our model or our transaction times.

So be careful, there’s a breaking point, you wanna make sure, again, your transactions are not too long.

The indirect costs of not running serverless are, again, patching servers and security. If you have thousands of servers or containers, you still have to have people manage those and make sure that they’re patched up.

So most likely, you’re gonna have to hire some people to take a look at that on a continuous basis.

So cost, so we talked about this, this is my favorite slide over here.

So we run, as I mentioned, about 65 million calls per month.

Let’s say hypothetically, that we’re giving 128 megabytes of memory per function, over all of the functions.

If that were the case we’d be running at around 10 seconds per function, so it would take 10,000 milliseconds to execute.

And we would still pay pretty low for 65 million calls, $76.

But if you double the memory, now you’re paying $43 because your execution time is actually dropping to 3 seconds.

So this is what I’m talking about tuning your functions.

You would think that, oh more memory is more costly, but actually, as it turns out, the execution time is the big cost factor.

Once you get into the bigger memory sizes, as you can see, there’s a negative exponential effect: there’s a point where the more memory you throw at it, it’s not gonna do any good.

Lambda does provision CPU based on memory.

So you’re actually getting more CPU.

So I highly recommend if you are running heavy workloads in serverless that you take advantage of…there are frameworks out there, that you can run your Lambda function against, and they’ll tell you what is the optimal memory and time point.
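The trade-off can be seen by comparing billable compute, which is memory multiplied by billed time, across a few measured configurations. The sketch below does only that comparison in GB-seconds. It does not try to reproduce the dollar figures from the slide, since those depend on per-GB-second and per-request prices that are not stated in the talk.

```python
import math


def gb_seconds(memory_mb, duration_ms, increment_ms=100):
    """Billable compute for one invocation: memory (GB) x billed time (s)."""
    billed_ms = math.ceil(duration_ms / increment_ms) * increment_ms
    return (memory_mb / 1024) * (billed_ms / 1000)


def cheapest(measurements):
    """Pick the (memory_mb, duration_ms) pair with the least GB-seconds."""
    return min(measurements, key=lambda m: gb_seconds(*m))
```

Using the two points from the talk, 128 MB at 10,000 ms is 1.25 GB-seconds per call and 256 MB at 3,000 ms is 0.75, so doubling the memory is the cheaper option. If a hypothetical third measurement of 512 MB only brought the time down to 2,500 ms, it would be back at 1.25 GB-seconds, which is the diminishing return described above.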

Datadog + serverless

All right, so this is our serverless view in Datadog.

And they actually…this is my favorite view because I’m always in here.

It’s the serverless view, it used to be called Cloud Functions but now serverless.

And as you can see you have everything from performance and all the way to cost metrics, right there.

We basically look at these functions…from a cost perspective, we’ll look at these, we’ll keep an eye on them, and see if the memory is running hot, or it’s being underutilized.

So we can always use this view and of course drill down to the function to optimize the function costs.

And of course, performance as well, invocations, durations, if there’s any errors, and we use this extensively.

And I’m really happy with the way that they’re going with serverless, they seem to be the most forward-thinking when it comes to serverless.

One little story: when this wasn’t available, like about a year ago, I started building all of these metrics through logs.

So I was taking milliseconds and memory and trying to come up with the same thing, and then one day I open it up, and it’s right there, and I’m like, oh, all that work that I did.
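Building those metrics from logs usually means parsing the REPORT line that Lambda writes at the end of each invocation. The sketch below shows one way to do that; it is not the speaker's implementation, and the regular expression assumes the field order Duration, Billed Duration, Memory Size, Max Memory Used, which should be checked against real log output.

```python
import re

# Assumed layout of a Lambda REPORT line; fields are separated by tabs.
_REPORT_RE = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>\d+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_used_mb>\d+) MB"
)


def parse_report(line):
    """Extract duration and memory figures from a REPORT log line, or None."""
    match = _REPORT_RE.search(line)
    if match is None:
        return None
    memory_mb = int(match["memory_mb"])
    max_used_mb = int(match["max_used_mb"])
    return {
        "request_id": match["request_id"],
        "duration_ms": float(match["duration_ms"]),
        "billed_ms": int(match["billed_ms"]),
        "memory_mb": memory_mb,
        "max_used_mb": max_used_mb,
        # Fraction of configured memory actually used: low values suggest
        # over-provisioning, values near 1.0 suggest the function runs hot.
        "memory_utilization": max_used_mb / memory_mb,
    }
```

Aggregating `memory_utilization` and `billed_ms` per function gives roughly the "running hot or underutilized" signal described for the serverless view.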

But anyways, it’s still great that they put it out, and they’re gonna put out more.

Life after serverless

All right, ever since we went serverless, our users are very happy, right?

So they’re not waking up and getting all juiced up in the morning and then see a blank screen when it comes to their videos, they’re much happier.

So our users are crazy, they run these makeshift live workouts where they’ll actually FaceTime each other, put the FaceTime, they’ll point it to themselves, and everybody’s like looking at each other.

And they run this, this live kind of workout, of course, to motivate each other.

And that’s actually pretty good, I love that.

And we hear about those stories that when our platform is stable, they have a lot of fun doing those sorts of stuff.

And of course, with happy users, we have a better social media presence.

And they’re happier they’re actually…they create groups.

They create workout groups, and they talk about the specific workouts, and, “I did this today, I did that today.”

“It was fun, yeah, did you have fun?

Did you think the exercises were hard or easy?”

So they’re very involved in that.

So we’re getting a lot of positive feedback on social media.

And we’re very happy about that.

And, of course, with increased happiness, comes increased user retention. This is a very, very competitive field, there’s hundreds of apps out there for fitness now.

Especially in the last three, four years, they’ve been popping up left and right.

So when your users are happy, and they can tell other users about it, it’s all about the experience that you’re having, doing that workout.

And we’re actually in the process of merging with another company.

It’s official now, so I can talk about it. They run live workouts.

It’s very interesting model where a coach sets up your exercise routine, and then watches you and then other users can watch it.

Sort of what the users are doing anyways but now officially.

If you want to take a run, a lot of people want to run or walk, and you could do it as a group.

It’s actually really cool because you can see your progress, you can see your pace, and the coach is telling you, “Hey, pick up the pace.”

So a lot of user happiness around that, and we’re happy that we’re helping create happy users.

Higher reviews on the app stores, of course.

And as I mentioned, word of mouth is very, very important because it’s a very, very competitive field when it comes to fitness.

All right, impact on business, stakeholders are happy, management is happy.

You know, they don’t have to worry about constantly being down, not having trust in the technology, always having to worry about scaling and loading issues.

Obviously, marketing, a new program was always an issue for us.

We would have to have capacity planning, we would have to talk to different teams, and make sure that they’re on deck and ready for incident response because we knew that we’re gonna have issues.

But we don’t even talk to those teams anymore.

It’s been like a year and some, we don’t talk to them.

We don’t care what they do, they do their marketing, it hits our servers, we’re good.

As a collective our minds, the minds of QA, the minds of product people, even the minds of the creative team, everybody’s just happier with not having to deal with issues in production and scale issues.

And, of course, as I mentioned, the cost savings that’s a definite big plus for us.

How serverless impacted operations

Operations, that’s my team, exponentially more stable.

We’ve had zero incidents to date, since launching on January 17th, that had to do with our infrastructure.

So we’ve DDoSed ourselves a couple of times, it was a lot of fun seeing the graphs shoot up.

And it was able to handle it, Lambda handled it, Dynamo handled it, so it was probably expensive, but we didn’t have any outage due to our infrastructure.

It’s always been…We do use a lot of third party systems, which we’re simplifying now.

Shopify, Recurly, if it’s been down, it’s always been on them, or if AWS or any cloud provider goes down themselves, that we can’t do anything about.

But no incidents due to increased traffic.

And we’ve had a couple of announcements and couple of programs come out within the last few months, and we didn’t see any problems.

NOC and SRE obviously, very happy people.

And security model has become a lot simpler, we don’t have to worry about the servers.

Just recently, just recently, we had our big initiative for discovery of all the servers across Beachbody and Openfit, and believe me, there’s a lot of servers there.

And basically, I had to just go in and check off nope, nope, nope, because we don’t have any servers.

All those guys had to install roles in their accounts, give privilege to the probing servers that were looking for security holes, and it was just great to not have to worry about that.

So, simplification, but be careful, you still…it doesn’t mean that you don’t have to worry about security, there’s other stuff that you have to worry about.

Ease of monitoring, obviously, when you go all serverless and you’re in one technology, it’s much easier.

But even if you’re just on a server or a container model, you still have to monitor the infrastructure when it comes to those servers, and the code as well.

But right now, all we really do is we monitor just our codebase.

Of course, we monitor the Lambdas and the latencies, and everything, but it’s become much easier to do that, versus having to worry about servers themselves.

Much better performance with continuous delivery.

We have immutable artifacts that we ship around: when code gets merged, it gets built, and that goes through the Serverless Framework and gets deployed everywhere.

Again, it’s simplified because of using serverless and not having to worry about provisioning servers.

And you know, always there’s something with server building, a package going wrong, or some security thing we need to configure when it comes to servers.

So, remember, I told you that my nickname was Rez 300? That was because I was waking up at 3:00 in the morning, pretty much twice a week.

Not anymore, no more 3:00 a.m. wake up calls, I haven’t had any.

And let’s not jinx it.

I am happy to have given up on that nickname, it was kind of a badge of honor thing for a while.

But it gets pretty old pretty quick as some of the SRE people would know.

So yeah, that’s the story behind that.

Monitoring serverless

Monitoring: we follow the three pillars of observability, in keeping with industry standards and best practices.

Our logs are the most important metric that we have or pillar that we have.

We use structured logging, and we log the heck out of everything.

With Lambda, obviously because the function executes and it’s done, you wanna capture as much information as possible.

So we rely heavily on logs.
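For a sense of what structured logging in a function can look like, here is a minimal sketch that writes one JSON object per log line. The field names (`program_id`, `request_id`) and the `log_event` helper are hypothetical; `aws_request_id` is the attribute Lambda's Python context object exposes.

```python
import json
import time


def log_event(level, message, **fields):
    """Emit one JSON object per line so log tooling can index every field."""
    record = {"timestamp": time.time(), "level": level, "message": message}
    record.update(fields)
    line = json.dumps(record, default=str)
    print(line)  # stdout from a Lambda function ends up in CloudWatch Logs
    return line


def handler(event, context):
    # Fall back to "unknown" so the sketch also runs outside Lambda.
    request_id = getattr(context, "aws_request_id", "unknown")
    log_event("INFO", "stream requested", request_id=request_id,
              program_id=event.get("programId"))
    return {"statusCode": 200}
```

Because every line carries the request ID and the relevant business fields, a single invocation can be reconstructed from logs after the function has already exited.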

We do have X-Ray on standby to do tracing as far as APM, but we really haven’t had to turn it on.

We’ve been able to debug everything with logs.

But in case we need to we can always turn that on and get that information to Datadog and do that.

And I know for a fact that they’re working heavily on the next generation APM for serverless.

So I’m really excited about that.

Metrics on everything, of course those are default, they come in through CloudWatch.

And we have alerts built on all of the metrics, pretty much everything we can think of we have.

And we have actually aggregation from different status pages, different RSS feeds, different endpoints for different users.

So when something goes wrong, I literally find out within seconds that a third party system has gone down.

We ping everything with synthetics, so it’s been really, really good.

How am I on time? Because I have a zero over here. Am I good? Sorry.

Woman: Two more minutes.

Reza: Okay, excellent, all right I don’t wanna ramble on too much.

So, that’s when it comes to monitoring.

With Lambda it is a challenge because it’s ephemeral, and it’s short-lived to…it’s not your traditional server and having agents in there.

But they’re coming up with really cool stuff, as far as APM and tracing.

And happy to announce that our front end is also serverless-ish.

So, we were on a React and Redux stack, and we were running Node in containers.

So that was the only bits of containers and servers that was left.

So we use… I don’t know if anybody heard of Gatsby?

Yeah, so it basically takes your React application and it creates static files, flat files across.

It’s really cool, I’m not an expert at it, don’t ask me any questions about it, that’s the front end team.

But what happens is that now we don’t need to run Node, we don’t need to run Express.

And we basically have put that in S3 and behind CloudFront.

So what better way right? 30, 40 millisecond response time, very fast, scalable, talked about Node.js.

And then if we need to do any routing, any special routing, we can do it at the edge.

We actually use Lambda@Edge extensively in our ecosystem, to make sure that our origins are getting hit as little as possible, so that we can be pretty fast and efficient.
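A common use of edge routing with a static site in S3 is mapping page-style URLs to the generated `index.html` files. The sketch below shows that pattern as a CloudFront origin-request function; it is an assumed example, not Openfit's routing logic, and the rewrite rules are the simplest version that works for a directory-per-page static build.

```python
def handler(event, context):
    """Origin-request hook: map directory-style URLs to static index files."""
    request = event["Records"][0]["cf"]["request"]
    uri = request["uri"]
    if uri.endswith("/"):
        request["uri"] = uri + "index.html"
    elif "." not in uri.rsplit("/", 1)[-1]:
        # No file extension in the last path segment: treat it as a page.
        request["uri"] = uri + "/index.html"
    # Returning the request tells CloudFront to continue on to the origin.
    return request
```

Requests for assets such as `/app.js` pass through unchanged, while `/programs` and `/programs/` both resolve to `/programs/index.html` in the bucket.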

So yeah, so that’s the front end and I’m happy with that little bit, it basically put us completely into 100% serverless mode.

And thank you.