Securing Datadog’s cloud infrastructure: Our playbook and methodology

Tim Gonda

At Datadog, we build and operate a complex, self-managed infrastructure that spans multiple cloud providers and serves many customers in regulated environments. We need to secure this large, distributed infrastructure while meeting strict uptime requirements and scaling the impact of a finite team.

In this post, I'll detail the playbook that we use on Datadog’s Cloud Security team for securing our infrastructure, including:

The challenges we faced in securing our cloud infrastructure
Our Find, Fix, Remediate, Prevent (FFRP) methodology for cloud security
How we use Datadog Cloud Security Management (CSM) to help us implement our approach
How we partner with Engineering and other teams to carry out our cloud security strategy
How we ensure the right data gets to the right stakeholders

Challenges in securing our cloud infrastructure

On the Cloud Security team, we faced numerous challenges in securing Datadog’s complex infrastructure—and like all organizations, we had finite resources and time with which to achieve our goals. Specifically, we found that:

Misconfigurations that had already been remediated would frequently re-enter our environment through infrastructure-as-code (IaC) updates.
We needed more stringent governance in our development and experimentation (i.e., non-production) environments to ensure misconfigurations did not occur via manual mistakes.
Our operating system patching process required complex manual intervention and overly long time frames, since we weren’t able to rebuild and redeploy our virtual machine images automatically.
We had technical debt that was hampering our ability to scale and govern our cloud infrastructure as quickly and effectively as we wanted.

As a result of these challenges, our teams couldn't make as much progress with their security work as they wanted. It was like running on a treadmill: The more security issues we fixed, the more they seemed to return.

Our FFRP methodology for cloud security

In response to these challenges, we started to approach our findings differently using a methodology we call Find, Fix, Remediate, Prevent (FFRP). This approach helps us on the Cloud Security team more effectively tackle risks and avoid the aforementioned security treadmill.

To understand this approach, think about what would happen if you were on a boat far away from shore and discovered the boat was leaking. First, you would locate the source of the leaks and patch those areas to reduce the intake of water. Then, you would bail out the standing water and steer the boat toward the nearest shipyard to permanently repair the sources of the leaks. Once you were safe ashore with the boat repaired, you would work to reinforce the boat to ensure those leaks—and any potential leak sources—were less likely to occur again in the future.

When it comes to securing cloud infrastructure, we can think about FFRP in the following steps:

Find the most important or widespread problems, and locate the systemic root causes of those issues.
Fix the root causes so that these security issues do not appear again.
Remediate the remaining impacts or downstream effects of security issues, so the risk is reduced or eliminated.
Prevent the problem from occurring or being as severe in the future by establishing guardrails to avoid the situation and contain its effects if something similar happens again.
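As a rough illustration of the "find" step, the sketch below counts findings per rule and per source to show which root causes are most widespread. The finding records, field names, and rule names are hypothetical; they are not a CSM export format, and this is not how CSM prioritizes findings.

```python
from collections import Counter

# Hypothetical findings. The field names and values are illustrative only
# and do not reflect any CSM export format.
findings = [
    {"rule": "public-storage-bucket", "resource": "bucket-a", "source": "storage-module"},
    {"rule": "public-storage-bucket", "resource": "bucket-b", "source": "storage-module"},
    {"rule": "public-storage-bucket", "resource": "bucket-c", "source": "storage-module"},
    {"rule": "unencrypted-disk", "resource": "vm-1", "source": "manual-change"},
    {"rule": "unencrypted-disk", "resource": "vm-2", "source": "vm-image-v1"},
]


def rank_root_causes(findings):
    """Count findings per (rule, source) pair, most widespread first."""
    counts = Counter((f["rule"], f["source"]) for f in findings)
    return counts.most_common()


for (rule, source), count in rank_root_causes(findings):
    print(f"{count} finding(s): {rule} introduced via {source}")
```

In this made-up data, one IaC module accounts for three of the five findings. Fixing that module addresses the root cause, while the three existing buckets still need to be remediated individually.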

What we do in the steps of the FFRP methodology

This methodology allows us to focus on what matters most first, without jeopardizing our long-term security posture.

How we use Datadog CSM to implement our approach

To help us put our FFRP methodology into action, we rely on Datadog Cloud Security Management (CSM) to identify security issues, prioritize findings, and track progress toward remediation.

We partnered internally with the engineering team working on CSM to help them shape the custom frameworks feature—which allows organizations to benchmark their security posture against a customized set of rules—so that we could organize CSM findings by what we consider most important. We also drew on these priorities to help our CSM engineering team develop and roll out the Essential Cloud Security Controls Version 2 framework, available for all CSM users. With this set up, we can filter findings in the CSM Explorer to show us only issues that violate our custom framework, so we have a more prioritized list of findings.

The screenshot below shows an example of this from our demo environment, where we’ve scoped CSM to show us findings that violate our custom security framework Shopist Cloud Security Baseline, which we created specifically for our demo application Shopist.

CSM findings against a custom security framework

From here, we can move on to the “fix” and “remediate” portions of FFRP by using team tags to show us the large sources of vulnerabilities in our infrastructure and to partner with the core stakeholders responsible for those technologies. We work hands-on with them to fix these problems, to ensure issues aren't left unfixed due to low team bandwidth or lack of awareness.

We use the Compliance view in CSM to track progress in adherence to our custom frameworks and to inform our efforts in the “prevent” stage of FFRP. Here again is an example from our demo environment, based on the custom framework for our demo application.

Compliance view in CSM with custom security framework

How we partner with Engineering and other teams to enact our cloud security strategy

The Cloud Security team at Datadog works closely with our engineering teams on requested features for Datadog CSM. We also partner with our infrastructure teams to offer direct help where bandwidth is limited, so they can make progress on issues that may not be their highest priority at the moment. We call this silo-less responsibility, where multiple teams can be responsible for intersectional duties.

In the silo-less responsibility model, teams acknowledge that there is more work to be done than either team has time for, and each team volunteers to work on security issues in chunks based on their skillsets and areas of focus. To make this model work, our teams have to:

Establish mature communication channels
Be willing to be flexible with the work they take on
Avoid ego-driven decisions on who works on what
Focus on the end goals rather than fixating on RACI charts

We’ve also found that it is essential to embed security into our platform components and workflows. A company the size of Datadog can’t rely on controls that remediate issues after they’ve already happened. For us to move at the speed our business requires, we need to scale our security knowledge by embedding it as business logic in the platforms that manage our infrastructure and cloud environments. This means providing configurable controls and filters that block risky infrastructure configurations, or warn administrators about them, before they are ever put in place.

On the Cloud Security team, we leverage Datadog IaC Security to identify dangerous conditions in our engineers’ infrastructure configurations. We can surface this information to them using Datadog itself, or by commenting on the related pull requests. Warning engineers of these risks in a consumable way enables them to fix these configurations before they introduce them to our environment, the same way they fix code quality issues.

Misconfiguration finding in Datadog IaC Security
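To make the idea of warning engineers before a change lands more concrete, here is a minimal sketch of a pre-merge check over parsed resource definitions. The resource shapes, rule logic, and function names are assumptions made for illustration. They are not the rules or the API of Datadog IaC Security.

```python
# Hypothetical guardrail rules. These are illustrative and are not the rules
# or the API of Datadog IaC Security.
def check_resource(resource):
    """Return a list of warnings for one parsed resource definition."""
    warnings = []
    if resource.get("type") == "storage_bucket" and resource.get("public_access", False):
        warnings.append("bucket allows public access")
    if resource.get("type") == "firewall_rule":
        open_to_world = "0.0.0.0/0" in resource.get("source_ranges", [])
        if open_to_world and 22 in resource.get("ports", []):
            warnings.append("SSH (port 22) is open to the internet")
    return warnings


def review_change(resources):
    """Map resource names to their warnings, skipping clean resources."""
    report = {}
    for resource in resources:
        warnings = check_resource(resource)
        if warnings:
            report[resource["name"]] = warnings
    return report


# Resources as a parser might extract them from a proposed IaC change (made up).
proposed_change = [
    {"name": "logs", "type": "storage_bucket", "public_access": False},
    {"name": "assets", "type": "storage_bucket", "public_access": True},
    {"name": "admin-ssh", "type": "firewall_rule",
     "source_ranges": ["0.0.0.0/0"], "ports": [22]},
]

for name, warnings in review_change(proposed_change).items():
    for warning in warnings:
        print(f"WARNING {name}: {warning}")
```

A check like this could either block a merge or only post a comment on the pull request. Which behavior fits depends on how confident the team is in each rule.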

Serving the right data to the right stakeholders

Often, engineering leaders leave vulnerable resources unfixed simply because they don’t know the issue exists, or because they have many issues and don’t know which to prioritize. To avoid this pain point, we needed a prioritized list of issues that could be sent to engineering managers and leaders on a recurring basis, so they could stay on top of these issues without feeling overwhelmed.

In addition to using custom frameworks in CSM to help us define our security goals, we use the Security Inbox to pinpoint the most critical issues that require remediation across our infrastructure. We also define Security Contacts in Datadog Teams to route a prioritized list of findings to the right team members. This enables us to prevent our engineering teams from becoming overwhelmed—instead, they can address the top five to 10 most severe or widespread issues, then focus on the next five to 10 once the top priorities are fixed.
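The "top five to 10" idea can be sketched as a small digest builder that groups findings by team, sorts them by severity and spread, and keeps only the first few. The severity scale, field names, contact mapping, and email address below are hypothetical. This is not how Security Inbox or Datadog Teams route findings internally.

```python
from collections import defaultdict

# Assumed severity scale; lower rank means more urgent.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def build_digests(findings, contacts, limit=5):
    """Return, per team, its security contact and its top `limit` findings."""
    by_team = defaultdict(list)
    for finding in findings:
        by_team[finding["team"]].append(finding)

    digests = {}
    for team, items in by_team.items():
        # Most severe first; among equal severities, most widespread first.
        ranked = sorted(
            items,
            key=lambda f: (SEVERITY_RANK[f["severity"]], -f["affected_resources"]),
        )
        digests[team] = {
            "contact": contacts.get(team, "unassigned"),
            "top_findings": ranked[:limit],
        }
    return digests


# Made-up findings and contacts for illustration.
sample_findings = [
    {"team": "storage", "title": "Public bucket", "severity": "critical", "affected_resources": 3},
    {"team": "storage", "title": "Bucket without versioning", "severity": "low", "affected_resources": 40},
    {"team": "storage", "title": "Unencrypted bucket", "severity": "high", "affected_resources": 12},
    {"team": "compute", "title": "Outdated VM image", "severity": "high", "affected_resources": 25},
]
sample_contacts = {"storage": "storage-security@example.com"}

for team, digest in build_digests(sample_findings, sample_contacts, limit=2).items():
    titles = [f["title"] for f in digest["top_findings"]]
    print(f"{team} -> {digest['contact']}: {titles}")
```

Capping the list per team is the point of the design: a team sees a short queue it can finish, and the next batch only appears once the first is fixed.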

In addition to communicating the right data to the right stakeholders, it’s also important to provide the right granularity of information. On one end of the spectrum, we had security leaders who needed a condensed summary of the security issues in their areas of responsibility to help them prioritize upcoming work. On the other end, we had engineers who needed less structured, more comprehensive datasets to make sense of the issues they knew about and understand how far a particular problem had spread.

We use a combination of features within the Datadog platform to help us provide the right types of data for our different users:

The CSM Security Inbox provides a summarized list of the top risks. This is a more filtered view that is most applicable for security leaders looking to understand priorities for remediation at a high level.
The Explorer Views in CSM—Misconfigurations, Vulnerabilities, and Identity Risks—provide context around key issues that helps engineers understand what they should fix first and track progress.
The Datadog Resource Catalog is our least filtered view, providing a unified inventory of all resources across our infrastructure. The Resource Catalog’s Security view enables engineers to spot issues in specific types of resources and easily pivot to granular data on individual instances for deep investigation.

The results of our approach

By implementing the FFRP approach and focusing on fixing security issues systemically rather than piecemeal, the Datadog Cloud Security team was eventually able to reduce the number of open vulnerabilities in CSM to zero and maintain that posture. Because we partner closely with our colleagues in Engineering, we can more easily address new vulnerabilities before they are deployed onto our infrastructure. And by providing engineers and engineering leadership with the right data in a consumable manner, we empower stakeholders to remediate cloud security risks themselves, without having to send Cloud Security personnel to interface with countless teams to address vulnerabilities.

We are always working alongside Engineering to map out new features for CSM, which we adopt internally. Our eventual goal is to add new rules to our internal cloud security frameworks over time as new rules are added to the product. By focusing on solving the systemic issues in our own environment, we can better prevent issues from recurring, focus on the most complex security issues we face, and help our engineers build features that enable the broader cloud security community to defend their environments more effectively.

Check out our CSM documentation to get started using these features. If you’re not yet a Datadog customer, sign up for a 14-day free trial.
