In complex cloud environments where development moves quickly, managing infrastructure and resource configurations can be overwhelming, particularly when certifications and compliance frameworks like PCI, HIPAA, and SOC 2 present a lengthy list of requirements. DevOps and engineering teams need to ship code updates at a rapid pace, which makes it easy to overlook misconfigurations. Meanwhile, security teams often don’t have the context they need to understand resource behavior and ownership, identify the highest-priority vulnerabilities in the environment, and communicate the necessary changes to developers.
Having a comprehensive strategy to manage these challenges becomes even more critical with a large-scale environment like Datadog’s, which includes hundreds of Kubernetes clusters and tens of thousands of nodes, and which supports our own infrastructure as well as that of our customers. Adopting Datadog Cloud Security Management (CSM) has given our internal security and risk management teams increased visibility into where risks lie across our infrastructure, which helps streamline remediation and simplifies the process of collaborating with engineering to maintain a healthy security and compliance posture.
In this post, we’ll show you how our internal teams use Datadog CSM to effectively monitor our vast, complex infrastructure for misconfiguration risks by:
- Obtaining a unified security view through shared dashboards
- Understanding which teams own misconfigured resources so they can be notified
- Streamlining the process of prioritizing and remediating misconfigurations
Obtaining a unified security view through shared dashboards
At Datadog, we monitor more than 400,000 infrastructure resources—including more than 300,000 hosts—that support nearly 7,000 interrelated services. Even with an infrastructure this massive, we’ve successfully used CSM to create a high-level, shared understanding of our security posture—specifically, the most critical misconfigurations present in our environment and how we are progressing with remediation.
Our teams involved in risk assessment and remediation rely on the CSM Overview page for a high-level view of what’s happening with risk across our cloud infrastructure and resources. Security team members use the Security Inbox section at the top of this page, which identifies the most important security issues requiring immediate action, to understand which issues should be prioritized for remediation. They also use the page to track our overall security posture score and performance against the requirements of major compliance frameworks. Engineering and leadership teams can access the same page to quickly comprehend risk at a high level across the organization.
The following screenshot provides a snapshot of the information available on the CSM Overview page—note that here and throughout this post, screenshots have been taken in a Datadog demo account, where we introduce failures intentionally for testing purposes, and do not reflect production resources or customer data.
In addition to using the Overview page, we’ve customized CSM’s out-of-the-box dashboards for our specific needs. Alongside total misconfigurations, misconfigurations by provider, and other high-level metrics, our customized dashboard shows total misconfigurations prioritized by severity and aligned to specific teams. We’ve also added a widget that displays which rule violations have the highest number of misconfigurations associated with them.
This dashboard provides a consolidated view that helps us understand what the most significant risks are in our environment and communicate about our security posture with multiple teams across the organization.
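To make the dashboard breakdowns above concrete, here is a minimal Python sketch of the same aggregations computed over a list of findings. It is an illustration only: the record fields (`rule`, `severity`, `team`), the team names, and the `example-rule-*` names are hypothetical and do not reflect a CSM export schema or how the dashboards are actually built.

```python
from collections import Counter

# Hypothetical findings. Field names and values are illustrative only,
# not a CSM export format.
FINDINGS = [
    {"rule": "S3 buckets are not publicly exposed via bucket policy",
     "severity": "critical", "team": "storage"},
    {"rule": "S3 buckets are not publicly exposed via bucket policy",
     "severity": "critical", "team": "web-platform"},
    {"rule": "example-rule-b", "severity": "high", "team": "storage"},
    {"rule": "example-rule-c", "severity": "low", "team": "storage"},
]


def summarize(findings, top_n=3):
    """Return counts by severity, by (team, severity), and the top rules."""
    by_severity = Counter(f["severity"] for f in findings)
    by_team_severity = Counter((f["team"], f["severity"]) for f in findings)
    top_rules = Counter(f["rule"] for f in findings).most_common(top_n)
    return by_severity, by_team_severity, top_rules


by_severity, by_team_severity, top_rules = summarize(FINDINGS)
```

Each returned value corresponds to one dashboard view described above: totals by severity, totals by team and severity, and the rules with the most associated misconfigurations.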
Understanding which teams own misconfigured resources so they can be notified
With more than 100 different engineering teams responsible for maintaining and managing infrastructure resources at Datadog, identifying the right contacts to address a security risk can be challenging, especially when remediation is time-sensitive. In addition to providing a centralized hub to monitor our security posture, CSM also enables us to map misconfigurations to the specific engineering teams that own the affected resources. This capability helps our security and risk teams quickly determine who they need to contact for mitigation or remediation.
To achieve this, we use the Teams feature in CSM to create team groups and assign users to them. These teams can then be associated with specific resources that are being monitored by Cloud Security Posture Management (CSPM).
With access to the list of team members responsible for each cloud resource, our security analysts can easily get in touch with the relevant contacts for a misconfiguration—including by starting a Slack conversation directly from Datadog. From there, the engineers who own the resource can start tackling the steps needed for remediation, speeding time to resolution and improving our security posture.
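The ownership mapping described above can be pictured as a lookup from resource to team, followed by grouping findings per team so each group can be sent to the right contacts. The sketch below assumes a hypothetical in-memory ownership map and finding records; the resource names, team names, and the `unowned` fallback are all invented for illustration and are not part of CSM or the Teams feature.

```python
from collections import defaultdict

# Hypothetical resource-to-team ownership map (illustrative names only).
OWNERS = {
    "bucket-static-assets": "web-platform",
    "bucket-logs": "storage",
}

# Hypothetical findings keyed by the affected resource.
OWNED_FINDINGS = [
    {"id": "f1", "resource": "bucket-logs"},
    {"id": "f2", "resource": "bucket-static-assets"},
    {"id": "f3", "resource": "bucket-logs"},
    {"id": "f4", "resource": "bucket-unknown"},
]


def group_by_owner(findings, owners, fallback="unowned"):
    """Group finding IDs by owning team; unmapped resources go to `fallback`."""
    grouped = defaultdict(list)
    for finding in findings:
        team = owners.get(finding["resource"], fallback)
        grouped[team].append(finding["id"])
    return dict(grouped)


by_team = group_by_owner(OWNED_FINDINGS, OWNERS)
```

Keeping an explicit fallback bucket makes resources without a known owner visible instead of silently dropping them.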
Streamlining the process of prioritizing and remediating misconfigurations
An infrastructure of Datadog’s size will typically generate hundreds of misconfiguration findings in CSM each day—but only a small portion of these may represent pressing issues that need to be addressed immediately. When the security team reviews misconfigurations in CSM, we identify which issues need to be remediated most quickly based on their severity and the risk of breach that they present, typically focusing on violations ranked “critical” or “high.” We then share these high-priority misconfiguration findings with the engineering teams that own the affected resources.
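The triage step described above amounts to filtering findings down to the most severe ones and ordering them so the worst come first. A minimal Python sketch follows, assuming hypothetical finding records with a `severity` field; the severity ranking table and record shape are assumptions made for the example, not CSM's data model.

```python
# Assumed severity ordering for the example (lower rank = more urgent).
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3, "info": 4}

# Hypothetical findings (illustrative only).
DAILY_FINDINGS = [
    {"id": "f1", "severity": "high"},
    {"id": "f2", "severity": "low"},
    {"id": "f3", "severity": "critical"},
    {"id": "f4", "severity": "medium"},
    {"id": "f5", "severity": "critical"},
]


def triage(findings, urgent=("critical", "high")):
    """Keep only urgent findings, ordered from most to least severe."""
    selected = [f for f in findings if f["severity"] in urgent]
    # sorted() is stable, so findings of equal severity keep their input order.
    return sorted(selected, key=lambda f: SEVERITY_RANK[f["severity"]])


urgent_ids = [f["id"] for f in triage(DAILY_FINDINGS)]
```

In this sample, only three of the five findings pass the filter, which mirrors the idea that a small portion of each day's findings needs immediate attention.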
However, many engineers may not immediately know where to start with remediation, as developers typically don’t have expertise as security practitioners. With Datadog CSM, engineers can find steps for remediation directly through the platform, which includes detailed side panels with information on how to configure assets to more secure states. This allows engineering teams to self-serve when deploying security fixes, helping remediate issues faster and easing the burden on security engineers. The video below provides an example of this workflow, for a hypothetical S3 bucket misconfiguration in our demo environment.
CSM also allows teams to automate their response with one-click remediation through Datadog Workflow Automation. For example, the rule “S3 buckets are not publicly exposed via bucket policy” includes an out-of-the-box workflow that enables engineers to automatically restrict the bucket from public access.
However, not all misconfigurations need to be fixed, as there may be valid reasons for certain types of rule violations. For example, some S3 buckets may need to be publicly exposed because they contain static web content that is intended for public access. For any misconfiguration that does not represent a true risk, we use the Mute function in CSM to risk-accept those findings. This allows our teams to focus only on misconfigurations that require remediation action.
The Mute function also enables us to enhance our reporting capabilities. We can filter for muted misconfigurations based on the reason specified when the action was taken, or create shareable timelines of who muted what, when, and why for audit purposes.
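As a rough illustration of the reporting described above, the sketch below filters mute records by reason and renders a chronological "who muted what, when, and why" timeline. The record fields, reason values, user names, and dates are all hypothetical; this is not the format CSM stores or exports.

```python
from datetime import datetime

# Hypothetical mute records (all names, reasons, and dates are invented).
MUTES = [
    {"finding_id": "finding-1", "muted_by": "bob",
     "muted_at": datetime(2024, 2, 10), "reason": "accepted_risk"},
    {"finding_id": "finding-2", "muted_by": "alice",
     "muted_at": datetime(2024, 1, 5), "reason": "false_positive"},
    {"finding_id": "finding-3", "muted_by": "alice",
     "muted_at": datetime(2024, 3, 1), "reason": "accepted_risk"},
]


def filter_by_reason(mutes, reason):
    """Return only the mute records recorded with the given reason."""
    return [m for m in mutes if m["reason"] == reason]


def audit_timeline(mutes):
    """Render mute records as chronological who/what/when/why lines."""
    ordered = sorted(mutes, key=lambda m: m["muted_at"])
    return [
        f"{m['muted_at']:%Y-%m-%d} {m['muted_by']} muted "
        f"{m['finding_id']} ({m['reason']})"
        for m in ordered
    ]


accepted = filter_by_reason(MUTES, "accepted_risk")
timeline = audit_timeline(MUTES)
```

Recording a reason with every mute is what makes both views possible: the filter depends on it, and the timeline uses it to explain each decision to an auditor.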
A consolidated view to identify and remediate risks
By letting us share custom dashboards, tie misconfigurations to specific teams, and quickly prioritize and remediate issues, Datadog CSM has helped our internal security, risk, and engineering teams collaborate more effectively on improving Datadog’s security posture. Check out our documentation to get started with CSM and try out these capabilities for yourself. If you’re new to Datadog, sign up for a 14-day free trial.