Security is essential to cloud-based development, but integrating it into dynamic, distributed environments is difficult. Factors like complex architectures and operational constraints often create roadblocks that make it harder to enforce security policies and mitigate threats. These challenges are especially apparent within security organizations, where silos limit a security team's ability to keep pace with the larger organization as it scales.
At Datadog, we are working to change this pattern by adopting workflows and best practices that are rooted in Site Reliability Engineering (SRE). To accomplish this, we’ve merged our SRE and security groups into a single organization, unifying all aspects of our operational and security posture. This approach enables us to apply practical, vetted SRE solutions to security challenges (and vice versa).
In this post, we’ll look at essential elements of SRE and security, the benefits we’ve realized by combining the two disciplines, and what that approach looks like for us.
Essential elements of SRE and security
At its core, SRE serves as a bridge between development and operations, offering engineering best practices and in-depth architectural knowledge to make systems less prone to failure. A key aspect of the SRE discipline is its broader influence on risk management, governance, and incident response. SRE looks at these areas from the perspective of an engineer, bringing in automation, observability, and continuous improvement to solve challenges. For risk management, this involvement may look like introducing efficient SLOs and game days to improve application reliability. For governance, practices like automated change management and audit trails reflect SRE’s focus on automation and observability.
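To make the SLO idea concrete, here is a minimal Python sketch of error-budget arithmetic. The function name, the 99.9% target, and the event counts are illustrative assumptions, not a description of Datadog's internal tooling.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the unspent fraction of the error budget.

    slo_target is the fraction of events that must be good (for example 0.999).
    A negative result means the budget has been overspent.
    All names and numbers here are hypothetical.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        # No traffic yet, so none of the budget has been spent.
        return 1.0
    # Number of bad events the SLO tolerates over this window.
    allowed_bad = (1 - slo_target) * total_events
    return 1 - bad_events / allowed_bad


if __name__ == "__main__":
    # 250 failures out of 1,000,000 requests against a 99.9% target.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(round(remaining, 3))  # about 0.75 of the budget is left
```

A team might use a value like this to decide when to slow down risky changes: a healthy budget leaves room for a game day, while a nearly exhausted one argues for stabilizing first.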
Security also plays a critical role in development and operations. More broadly, security teams are responsible for protecting the entire delivery pipeline as well as the integrity of individual entities (such as applications, services, and resources). In practice, this means inserting security into daily development workflows and extending it to cloud infrastructure. Security’s domain also includes operations, ensuring that the internal systems that support the organization are secure.
Because of their related roles in maintaining systems, there is a lot of overlap between the SRE and security disciplines. That’s why we decided to combine the two realms in order to strengthen our organization and the products we’re building for customers.
Benefits of the combined approach
By integrating SRE methodologies with security, we can inject our in-house security expertise into well-established SRE practices (and vice versa), enhance incident response, and continue to foster a culture of continuous improvement. Combining these disciplines doesn’t just enhance operational resilience and efficiency; it enriches our security culture by breaking down silos between development, operational, and security teams. Additionally, this approach enables security teams to participate in many types of incidents, giving them the hands-on experience they need to respond efficiently to security breaches, which tend to occur less frequently.
The combined approach has also made an impact in several areas of our work. First, it has encouraged proactive risk management, specifically by integrating security functions into the workflows of our Core Observability team, which provides governance policies and observability best practices. This partnership has enabled us to improve log governance, auditing, and security control rollout, which have in turn strengthened our ability to identify and manage risks within our environment.
Second, using established SRE best practices within our Trust and Safety team has improved our ability to efficiently triage risks to our customers and support our organization’s security posture. Third, SRE has significantly enhanced our security response by creating a single incident process, instead of isolating a Security Incident Response Team (SIRT) in a separate space where it would only engage in specific scenarios. This approach led us to create detailed security incident playbooks and ensure that on-call leads and engineers are trained on how to apply the general practice of engineering incident response to security.
Structuring our organization for SRE and security
We’ve talked about some of the high-level benefits of combining SRE and security into one organization, but what does that look like in our day-to-day operations at Datadog? We merged these two disciplines to create a Security and Reliability organization, which develops the necessary tools and processes that ensure the Datadog platform is built with our customers’ trust and safety in mind. To accomplish this, we have teams within the organization that focus on security in three key areas: product, internal cloud infrastructure, and operations.
The Product team ensures the secure design of Datadog features, which includes steps like conducting design reviews, enforcing secure supply chains, and establishing efficient secrets management. For example, they encourage best practices like scanning code and dependencies at each stage of the software development life cycle (SDLC).
The Internal Cloud Infrastructure team maintains the security and compliance of the infrastructure that supports the Datadog platform. They accomplish this via Infrastructure-as-Code policies, which embed security into the way we deploy services and resources.
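As a rough illustration of what a policy check over declared infrastructure can look like, here is a small Python sketch. The resource schema, the policy names, and the `evaluate` helper are all hypothetical and are not taken from Datadog's tooling or from any specific Infrastructure-as-Code product.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Policy:
    name: str                      # hypothetical policy identifier
    applies_to: str                # resource "type" this policy covers
    check: Callable[[dict], bool]  # returns True when the resource complies


# Illustrative policies over a made-up resource schema.
POLICIES = [
    Policy("bucket-not-public", "storage_bucket",
           lambda r: not r.get("public", False)),
    Policy("bucket-encrypted", "storage_bucket",
           lambda r: r.get("encrypted", False)),
    Policy("no-open-ssh", "firewall_rule",
           lambda r: not (r.get("port") == 22 and r.get("source") == "0.0.0.0/0")),
]


def evaluate(resources: Iterable[dict], policies=POLICIES) -> list:
    """Return (resource name, policy name) pairs for every violation found."""
    violations = []
    for res in resources:
        for policy in policies:
            if res.get("type") == policy.applies_to and not policy.check(res):
                violations.append((res.get("name", "<unnamed>"), policy.name))
    return violations


if __name__ == "__main__":
    planned = [
        {"type": "storage_bucket", "name": "logs", "public": False, "encrypted": False},
        {"type": "firewall_rule", "name": "ssh-any", "port": 22, "source": "0.0.0.0/0"},
        {"type": "firewall_rule", "name": "https", "port": 443, "source": "0.0.0.0/0"},
    ]
    for resource_name, policy_name in evaluate(planned):
        print(f"{resource_name}: violates {policy_name}")
```

Running a check like this before a deployment is applied is one way to embed security into the deployment path rather than reviewing resources after the fact.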
The Security Operations team focuses on threat detection and response, actively improves our security posture, and manages security-related incidents. They use Datadog’s unified observability to develop security-focused SLOs and correlate security events with other telemetry. They also run chaos engineering tests that verify the effectiveness of existing system safeguards. For incident response management, the team uses existing response capabilities and tools—such as runbooks, postmortems, and automation—to treat security vulnerabilities like reliability issues.
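To sketch what correlating security events with other telemetry might involve, the Python example below pairs each security event with telemetry from the same service inside a time window. The event fields (`id`, `service`, `time`), the five-minute window, and the `correlate` function are assumptions made for illustration; this does not use or represent any Datadog API.

```python
from datetime import datetime, timedelta


def correlate(security_events, telemetry_events, window=timedelta(minutes=5)):
    """Map each security event id to ids of nearby telemetry events.

    Two events are considered related when they share a service and their
    timestamps differ by at most `window`. The event shape is hypothetical.
    """
    matches = {}
    for sec in security_events:
        matches[sec["id"]] = [
            tel["id"]
            for tel in telemetry_events
            if tel["service"] == sec["service"]
            and abs(tel["time"] - sec["time"]) <= window
        ]
    return matches


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0)
    security = [
        {"id": "sec-1", "service": "auth", "time": base},
    ]
    telemetry = [
        {"id": "deploy-7", "service": "auth", "time": base - timedelta(minutes=3)},
        {"id": "deploy-8", "service": "auth", "time": base - timedelta(minutes=30)},
        {"id": "latency-2", "service": "billing", "time": base},
    ]
    print(correlate(security, telemetry))  # {'sec-1': ['deploy-7']}
```

A simple time-and-service join like this is only a starting point, but it shows why keeping security signals alongside deploys, latency, and error data helps responders see what changed around the time of an alert.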
Security and SRE: better together
Embracing the symbiotic relationship between security and SRE is key to achieving a resilient and secure infrastructure in today’s rapidly evolving cloud landscape. By combining the benefits of both disciplines at Datadog, we can focus on improving risk prioritization and mitigation, enhancing system reliability, and safeguarding our systems against potential threats. Check out our documentation to learn more about Datadog’s security and incident management offerings. If you don’t already have a Datadog account, you can sign up for a free 14-day trial.