In today’s evolving cloud landscape, balancing security and compliance is becoming increasingly challenging. Security is essential for protecting an organization’s applications, resources, and data from threats, while compliance ensures a commitment to building services that align with industry standards. Although these goals overlap as key components of a strong security posture, they require distinct approaches that can be difficult to integrate. The difficulty lies in detecting and responding to threats efficiently while also tracking that work for reporting and auditing purposes.
In Part 1 of this series, we’ll look at some of the data that is helpful for tracking your organization’s security posture in the following categories:
- Response and remediation
- Incidents and threats
- Governance, compliance, and preparedness
In Part 2, we’ll show you how Datadog bridges the gap between security and compliance by enabling you to track each of the metrics we discuss in this post.
A note on the metrics discussed in this post
Some of the metrics we’ll discuss are widely used as indicators for measuring an organization’s effectiveness in responding to security incidents. However, there is an ongoing conversation within the industry about whether approaches like Service Level Objectives (SLOs) offer a more actionable framework for gauging success and connecting security to broader operational goals. In this post, we’ll reference both traditional metrics and approaches like SLOs, but it’s important to note that there’s no one-size-fits-all solution. Organizations should tailor approaches to meet their unique goals.
Response and remediation
The first category of metrics we’ll look at focuses on how well a team responds to and remediates security incidents, such as those resulting from denial-of-service (DoS) attacks, data breaches, unauthorized access, and vulnerability exploits. Traditional, time-based operational metrics, like the ones described in this section, provide a starting point for measuring a team’s efficiency during these events. They also help teams adjust their detection systems to better differentiate anomalous from typical activity in the future.
Mean Time to Detect (MTTD)
The mean time to detect (MTTD) metric measures the average time it takes your threat detection systems—we’ll focus primarily on cloud SIEMs in this post—to identify an issue within your environment. This metric can help establish a baseline for how well your threat detection systems respond to threats. The end goal for a threat detection system is a consistently low false positive rate, which indicates that it is configured to accurately distinguish legitimately malicious activity from benign activity.
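To make this concrete, MTTD can be calculated as the average gap between when suspicious activity first occurs (for example, the earliest related log event) and when a detection system raises a signal for it. The following is a minimal sketch in Python; the record shape and field names are illustrative and not tied to any particular product:

```python
from datetime import datetime, timedelta

# Each record pairs the time the suspicious activity began (e.g., the first
# related log event) with the time the detection system raised a signal.
# These field names are illustrative, not tied to any specific product.
detections = [
    {"activity_started": datetime(2024, 5, 1, 9, 0), "signal_raised": datetime(2024, 5, 1, 9, 12)},
    {"activity_started": datetime(2024, 5, 2, 14, 30), "signal_raised": datetime(2024, 5, 2, 15, 5)},
    {"activity_started": datetime(2024, 5, 3, 22, 10), "signal_raised": datetime(2024, 5, 3, 22, 18)},
]

# MTTD is the average of the per-detection delays.
delays = [d["signal_raised"] - d["activity_started"] for d in detections]
mttd = sum(delays, timedelta()) / len(delays)

print(f"MTTD over {len(delays)} detections: {mttd}")
```

Tracking this average on a weekly or monthly basis makes gradual increases easier to spot than reviewing individual signals.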
A consistent increase in MTTD could indicate a few issues. Misconfigured detection signals, for example, can lead to a high false positive rate and increase the time it takes to detect legitimate threats. Out-of-date (or a lack of) threat intelligence lists can affect how well your systems detect emerging or sophisticated threats. Additionally, systems that are not fully integrated with your environment can overlook critical activity captured in logs.
It can also be helpful to monitor anomalies in MTTD to get a complete picture of what affects your systems’ ability to detect threats. A single spike, as seen in the following screenshot, can be the result of a particularly challenging or uncommon incident. For example, a threat actor may successfully evade initial detection via covert methods that your threat detection systems don’t account for, such as password spraying.
Mean Time to Acknowledge (MTTA)
The mean time to acknowledge (MTTA) metric focuses on the time between a threat detection system’s initial detection of an issue and when it is reviewed by your security team. Similar to MTTD, MTTA can also show a need to fine-tune your threat detection systems. For example, a consistently high MTTA could indicate an issue in one of the following areas:
- Difficulty with prioritizing high-risk signals
- Understaffed security teams, which leads to delays in analyzing signals
- Threat detection systems that are generating a high number of false positives
Mean Time to Resolve (MTTR)
The mean time to resolve (MTTR) metric tracks the average amount of time it takes for a team to fully resolve a security incident after it is detected by their threat detection system. A consistently high MTTR could indicate a need to reassess your team’s incident response plan. For example, issues like unclear roles or a lack of training can cause confusion during an incident and extend MTTR.
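MTTA and MTTR can be derived from the same kind of incident records by capturing acknowledgment and resolution timestamps alongside detection times. Here is a brief sketch that assumes those three timestamps are available per incident; the field names are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical incident records with the three lifecycle timestamps discussed above.
incidents = [
    {
        "detected": datetime(2024, 6, 3, 10, 0),
        "acknowledged": datetime(2024, 6, 3, 10, 20),
        "resolved": datetime(2024, 6, 3, 16, 0),
    },
    {
        "detected": datetime(2024, 6, 10, 2, 15),
        "acknowledged": datetime(2024, 6, 10, 3, 0),
        "resolved": datetime(2024, 6, 11, 1, 15),
    },
]

def mean(deltas):
    """Average a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

# MTTA: detection signal raised -> reviewed by the security team.
mtta = mean([i["acknowledged"] - i["detected"] for i in incidents])
# MTTR: detection signal raised -> incident fully resolved.
mttr = mean([i["resolved"] - i["detected"] for i in incidents])

print(f"MTTA: {mtta}")
print(f"MTTR: {mttr}")
```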
A sudden but temporary increase in MTTR, as seen in the following screenshot, can be the result of a particularly complex incident that required more time to investigate. It can also shed light on weaknesses in your incident management process that are worth looking into, such as poor coordination between responding teams.
SLO considerations
While these time-based metrics offer baselines for understanding how long it takes for your team to detect, acknowledge, and resolve an incident, they do not always offer insight into why it took that amount of time, how long it should take to resolve, or how to improve resolution times. For example, these metrics alone often do not account for circumstances like having only one security engineer on call, or your team becoming proficient at resolving the same recurring incident without looking into why it keeps happening.
To address these gaps, setting SLOs can help you work with proactive, measurable targets for handling incidents efficiently. Traditional, time-based metrics are still valuable for reporting purposes after a security incident, but SLOs can complement that data by giving your teams a more complete picture of their efficiency and goals. Asking certain questions about security-related SLO expectations and users, such as the following examples, can serve as a good starting point:
- Is there an expectation to resolve a certain percentage of critical vulnerabilities within a specific time?
- Do we identify our users as internal IT teams, company leadership, employees, or customers (or all of the above)?
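To make the first question above concrete, here is a minimal sketch of how you might evaluate an SLO such as “resolve at least 95 percent of critical vulnerabilities within 15 days.” The target, window, and data shape are assumptions for illustration only:

```python
from datetime import date

# Hypothetical records of critical vulnerabilities found during the SLO window.
critical_vulns = [
    {"opened": date(2024, 7, 1), "resolved": date(2024, 7, 9)},
    {"opened": date(2024, 7, 3), "resolved": date(2024, 7, 25)},   # missed the target
    {"opened": date(2024, 7, 10), "resolved": date(2024, 7, 18)},
    {"opened": date(2024, 7, 12), "resolved": None},               # still open
]

TARGET_DAYS = 15      # resolve within 15 days...
TARGET_PERCENT = 95   # ...for at least 95% of critical vulnerabilities

def met_target(vuln):
    """A vulnerability meets the SLO only if it was resolved within the target window."""
    return vuln["resolved"] is not None and (vuln["resolved"] - vuln["opened"]).days <= TARGET_DAYS

attainment = 100 * sum(met_target(v) for v in critical_vulns) / len(critical_vulns)
print(f"SLO attainment: {attainment:.1f}% (target: {TARGET_PERCENT}%)")
print("SLO met" if attainment >= TARGET_PERCENT else "SLO missed")
```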
In this section, we looked at important time-based metrics for assessing the effectiveness of your organization’s security incident response and threat detection systems. We also briefly looked at how SLOs can take this data a step further by enabling your organization to set realistic goals for incident response. Next, we’ll look at metrics that provide a high-level overview of your security posture and how it affects your organization.
Incidents and threats
The second category of metrics applies to the overall state of your environment and threat detection systems. This information is especially helpful for security audits, which rely on quantifiable data to evaluate performance and make informed decisions about system and process improvements. In addition to audits, these metrics can help you identify gaps in your security posture and determine their cause.
Intrusion attempts
Intrusion attempts reflect the number and frequency of attempts to compromise your systems. They encompass the various tactics and techniques that attackers use to gain (or attempt to gain) access to a system, including events like unusual login attempts, unauthorized changes to valid accounts, resource modifications, and data exfiltration.
Monitoring the total number of intrusion attempts over a period of time can help you compare trends with changes to your systems, while tracking the frequency of attempts could highlight specific misconfigurations or vulnerabilities. For example, a sudden spike in intrusion attempts could be the result of an attacker taking advantage of a new vulnerability, while a steady increase could simply reflect your organization’s heightened profile.
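As a simple illustration of trend tracking, you could bucket intrusion-related detections by week and compare the counts over time. The sketch below assumes you can export a timestamp for each attempt; the data is made up:

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps of detected intrusion attempts (e.g., unusual logins,
# unauthorized account changes) exported from your detection systems.
attempts = [
    datetime(2024, 8, 1, 4, 12),
    datetime(2024, 8, 2, 13, 40),
    datetime(2024, 8, 9, 6, 5),
    datetime(2024, 8, 9, 6, 7),
    datetime(2024, 8, 9, 6, 9),   # a burst like this may point to a new exploit attempt
]

# Count attempts per ISO week to make spikes and steady growth easier to spot.
weekly_counts = Counter(ts.isocalendar()[:2] for ts in attempts)
for (year, week), count in sorted(weekly_counts.items()):
    print(f"{year}-W{week:02d}: {count} attempts")
```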
In either case, monitoring intrusion attempts can provide valuable insights into which vulnerabilities to prioritize and where to implement better safeguards. In Part 2 of this series, we’ll look at how Datadog can help you monitor intrusion attempts and explore their causes.
False positive rates (FPR)
False positive rate (FPR) measures the percentage of benign signals that were incorrectly classified as malicious by a threat detection system. Monitoring FPR enables you to assess the effectiveness of your threat detection systems, with the goal being a low percentage of false positives. For example, a consistently high FPR for a cloud SIEM indicates poorly tuned signals, which can create alert fatigue and lead to overlooking serious threats.
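Operationally, many teams approximate FPR as the share of reviewed signals that turn out to be benign over a given period. A minimal sketch with illustrative numbers:

```python
# Hypothetical triage results for one month of cloud SIEM signals.
signals_reviewed = 480          # total signals your team triaged
false_positives = 132           # signals triaged as benign activity

# Operational false positive rate: share of reviewed signals that were benign.
fpr = 100 * false_positives / signals_reviewed
print(f"FPR: {fpr:.1f}%")       # ~27.5% here, which would likely warrant tuning
```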
Investigating the causes of a high FPR can give you insights into how to create high-fidelity signals for your cloud SIEM. A signal that detects a single event, such as a failed login attempt, may not indicate anything malicious. But combining that single attempt with other behavior, such as multiple failed attempts followed by a successful login and lateral movement, can surface a genuine threat. Building sufficient coverage for your cloud SIEM through these assessments helps ensure that you are building the right detection signals for your environment.
You can also compare FPR with other metrics, like the number of security incidents and MTTA, to determine if your signals and other preventative measures are improving your organization’s security posture. For example, if the number of severe incidents, MTTA, and FPR are consistently low, that can indicate an effective security incident response.
Security incidents
Security incidents are events that endanger (or could endanger) the availability, integrity, and confidentiality of your systems and their data, or that violate security policies. It’s important to note the difference between a security incident and a security event because one requires more attention than the other. Events are occurrences that indicate a threat, risk, or vulnerability to a system, while incidents are events that compromise a system or violate a policy. Many security events are negligible, so understanding the difference can help you refine the data you track.
Monitoring the number, frequency, and severity of security incidents (not security events) provides a better understanding of gaps in your organization’s security posture. For example, a steady increase in the frequency of critical security incidents could indicate a flaw in your organization’s security processes, such as not being aware of recommended security best practices like properly configuring metadata services.
On the other hand, a noticeable decline in the frequency of significant security incidents could be the result of a well-configured environment and efficient threat detection systems. You can compare this trend with those of other metrics, like response times, and your SLOs to confirm that it is a result of improved incident management. For example, satisfactory response times and low false positive rates can account for a decline in significant security incidents. However, it’s important to note other factors that could contribute to a decline, such as emerging threats that create blind spots in your threat detection systems. In Part 2, we’ll look at how Datadog can help you track your security incidents and determine their root cause.
Governance, compliance, and preparedness
The final category measures your organization’s governance effectiveness, compliance, and level of preparedness. Overlooking a governance policy or compliance requirement can be costly, so having this information for routine audits and reporting purposes is essential. Tracking this data can also help you determine which improvements need to be made to your services and resources, as well as your threat detection systems.
Level of preparedness
This assessment looks at an organization’s ability to detect, contain, and recover from security incidents. While not a standard metric like the response and remediation metrics we talked about earlier, the level of preparedness can be measured by looking at a few factors, including:
- The frequency of incident management training sessions that highlight security response
- The frequency of security-focused exercises, such as red team and purple team exercises
- The quality of incident response plans, such as well-defined roles and steps for communicating with customers
- The ability to maintain a history of events, such as those captured in audit logs
Comparing this information with other metrics, such as trends in the number of incidents, intrusion attempts, and FPR, can give you a better understanding of your organization’s overall level of preparedness. It can also help you develop a grading system that supports your goals. For example, a moderate-to-high level of preparedness could mean that all responders have completed incident management training, your organization conducts 1–2 regular audits a year, and your cloud SIEM has a consistently low FPR.
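As one way to formalize a grading system like the example above, you could score each preparedness factor and map the total to a level. The criteria, weights, and thresholds below are made-up examples; your own rubric will likely differ:

```python
# Illustrative preparedness criteria; each is scored 0-2 by whoever runs the review.
scores = {
    "incident_management_training_completed": 2,  # all responders trained this year
    "tabletop_or_red_team_exercises": 1,          # one exercise held, more planned
    "incident_response_plan_quality": 2,          # roles and comms steps documented
    "audit_log_retention": 2,                     # audit trails retained and reviewed
    "siem_false_positive_rate": 1,                # FPR trending down but still noisy
}

total = sum(scores.values())
maximum = 2 * len(scores)

# Map the score to a coarse level; the thresholds here are arbitrary examples.
if total >= 0.8 * maximum:
    level = "high"
elif total >= 0.5 * maximum:
    level = "moderate"
else:
    level = "low"

print(f"Preparedness score: {total}/{maximum} ({level})")
```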
Policy for security and compliance
In addition to evaluating your organization’s level of preparedness, it’s important to track your adherence to established security policies and industry standards, such as CIS, GDPR, HIPAA, and PCI-DSS. Tracking this information not only gives you a baseline for reducing security risks in your environment but also protects your organization from costly lawsuits and compliance violations. Compliance metrics can be customized to fit your organization’s goals, but we’ll look at a couple of examples to consider.
First, you can look at the severity levels of flagged misconfigurations. Not all misconfigurations need to be addressed immediately, so knowing their severity, as seen in the following screenshot, will help you prioritize the most pressing violations. Critical misconfigurations would require an immediate, concentrated effort to resolve before the high-priority ones.
Second, tracking which resources either pass or fail recommended configurations gives you an overview of how well your environment complies with a particular framework or benchmark. This can serve as a good starting point for keeping your resources secure. A high number of resources that pass benchmarks indicates efficient deployment, change management, and configuration processes.
A high percentage of resources that fail these checks, on the other hand, could indicate one of the following issues:
- An overreliance on default settings when deploying new resources
- Changes made during incidents that were not documented
- Poor visibility into resource changes
- A lack of training on compliance policies
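Whichever issues you uncover, the underlying pass rate is straightforward to compute once you have per-resource findings. The sketch below assumes a flat list of findings, each marked pass or fail against a framework; the data shape is illustrative:

```python
from collections import defaultdict

# Hypothetical compliance findings: one entry per resource, per benchmark rule.
findings = [
    {"framework": "CIS AWS", "resource": "s3-bucket-logs", "passed": True},
    {"framework": "CIS AWS", "resource": "s3-bucket-public", "passed": False},
    {"framework": "CIS AWS", "resource": "iam-root-account", "passed": True},
    {"framework": "PCI-DSS", "resource": "payments-db", "passed": True},
    {"framework": "PCI-DSS", "resource": "payments-cache", "passed": False},
]

# Tally passes and totals per framework, then report a pass rate for each.
totals = defaultdict(lambda: {"passed": 0, "total": 0})
for f in findings:
    totals[f["framework"]]["total"] += 1
    totals[f["framework"]]["passed"] += f["passed"]

for framework, t in sorted(totals.items()):
    rate = 100 * t["passed"] / t["total"]
    print(f"{framework}: {t['passed']}/{t['total']} resources pass ({rate:.0f}%)")
```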
You can take these elements of monitoring policy compliance—severity levels and pass/fail status—a step further by establishing security baselines, which are the sets of configuration settings required for your environment. These baselines can either be customized for your resources or based on industry-standard recommendations.
For example, you can create a security baseline that applies the minimum compliance requirements across all of your Kubernetes environments. With it, you can assign severity levels to configurations that you always want to enforce, such as requiring HTTPS connections to the API server, as well as Service Level Agreements (SLAs) for remediation.
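One lightweight way to express such a baseline is as structured data that pairs each required setting with a severity and a remediation SLA, which your tooling can then evaluate against live clusters. The rules below are illustrative examples, not an official benchmark:

```python
# A hypothetical security baseline for Kubernetes environments: each rule pairs a
# required configuration with a severity and a remediation SLA in days.
kubernetes_baseline = [
    {
        "rule": "API server only accepts HTTPS connections",
        "severity": "critical",
        "remediation_sla_days": 1,
    },
    {
        "rule": "Anonymous authentication to the kubelet is disabled",
        "severity": "high",
        "remediation_sla_days": 7,
    },
    {
        "rule": "Audit logging is enabled on the API server",
        "severity": "medium",
        "remediation_sla_days": 30,
    },
]

# Example: list the rules that must be fixed within a week of a failed check.
urgent = [r["rule"] for r in kubernetes_baseline if r["remediation_sla_days"] <= 7]
print("Remediate within 7 days:", urgent)
```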
Understand your security posture with these key metrics
In this post, we looked at the key metrics for monitoring your organization’s security posture. We also looked at how SLOs can supplement these traditional security metrics to create a comprehensive strategy for handling incidents efficiently. In Part 2, we’ll discuss how you can use Datadog to monitor these data points and strengthen your organization’s security posture.
Acknowledgments
We’d like to thank Kendra Ash and Elise Burke for their invaluable feedback on this article.