In Part 1 of this series, we looked at metrics that offer insight into the effectiveness of your threat detection systems and team response during a security incident. With this information, you have a starting point for identifying gaps in your organization’s security posture and the ability to respond to threats. The benefit of having this visibility is that it not only enables your teams to continually monitor their services for security issues but it also minimizes the potentially costly risks of not adhering to industry regulations, frameworks, and benchmarks.
In this post, we’ll look at how Datadog provides visibility into your organization’s security posture across three key areas: response and remediation, incidents and threats, and governance, compliance, and preparedness. With capabilities for log-based threat detection, incident management and response, and compliance monitoring, Datadog enables you to prioritize the security-focused metrics, goals, and events that matter the most to your organization.
Response and remediation
As mentioned in Part 1, this first category looks at how quickly your threat detection system, such as a cloud SIEM, identifies a legitimate security event as well as how your teams respond to it. Cloud SIEMs are critical for security monitoring because they aggregate logs from your applications, resources, cloud services, and more to surface threats. Tracking metrics like mean time to detect (MTTD), mean time to acknowledge (MTTA), and mean time to resolve (MTTR) can help you assess your cloud SIEM’s effectiveness.
Mean time to detect, acknowledge, and resolve
Datadog Cloud SIEM automatically tracks MTTD, MTTA, and MTTR, so you can monitor how quickly signals are generated, reviewed, and archived for resolution. These metrics are also included in a built-in overview dashboard, enabling you to review each one alongside other data, such as signal trends.
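To make the three metrics concrete, here is a minimal Python sketch of how they could be computed from raw timestamps. The records and field names are hypothetical and are not a Datadog schema. We also assume that MTTD runs from the first malicious event to signal generation, and that MTTA and MTTR both start at signal generation; your team may define the windows differently.

```python
from datetime import datetime
from statistics import mean

# Hypothetical signal timeline records. The field names are illustrative
# and do not reflect a Datadog export schema.
signals = [
    {
        "event_at": datetime(2024, 1, 1, 12, 0),
        "detected_at": datetime(2024, 1, 1, 12, 2),
        "acknowledged_at": datetime(2024, 1, 1, 12, 10),
        "resolved_at": datetime(2024, 1, 1, 13, 2),
    },
    {
        "event_at": datetime(2024, 1, 2, 9, 0),
        "detected_at": datetime(2024, 1, 2, 9, 4),
        "acknowledged_at": datetime(2024, 1, 2, 9, 8),
        "resolved_at": datetime(2024, 1, 2, 9, 34),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Assumed definitions: detection is measured from the underlying event,
# acknowledgement and resolution from the moment the signal was generated.
mttd = mean_minutes(signals, "event_at", "detected_at")
mtta = mean_minutes(signals, "detected_at", "acknowledged_at")
mttr = mean_minutes(signals, "detected_at", "resolved_at")

print(f"MTTD={mttd:.1f} min, MTTA={mtta:.1f} min, MTTR={mttr:.1f} min")
```

Keeping the start and end of each window explicit makes it easier to compare numbers across teams that may otherwise measure from different points.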
Because Cloud SIEM relies on logs, collecting and querying the right ones is important for accurate metrics and signals. However, fine-tuning log collection and querying can involve a tradeoff between cost and visibility. Datadog Flex Logs decouples the cost of log storage from the cost of querying, providing you with both short- and long-term log retention without sacrificing visibility. Flex Logs is ideal for collecting high-volume logs (such as the ones you need for high-fidelity detection signals) and retaining them for security audits and investigations during incidents.
Tracking SLOs
In Part 1, we also discussed how SLOs can complement standard response metrics by helping you set goals for your teams. You can then compare these goals with your time-based metrics to determine how well your team is hitting SLO targets. Based on factors like the fidelity of your signals and the impact of the detected issue, you can create reasonable response time goals.
For example, you can create an SLO in Datadog that focuses on triaging critical security events within a specific timeframe. With high-fidelity signals that indicate immediate risks to your customers, detections should be near-instantaneous thanks to Datadog Cloud SIEM and its integrations. For acknowledging a generated signal, you can set a goal that ensures that responders review and begin investigating it within a reasonable timeframe, such as 10 minutes.
The following example illustrates that goal using the datadog.security.siem_signal.time_to_acknowledge metric, filtered by security events marked as critical by Cloud SIEM (those with severity:critical):
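As a rough illustration of the arithmetic behind an acknowledgement SLO, the Python sketch below checks what share of critical signals were acknowledged within 10 minutes. The sample durations and the 95% target are hypothetical, and the function is not part of any Datadog API; it only shows how the goal translates into a pass/fail ratio.

```python
from datetime import timedelta

ACK_TARGET = timedelta(minutes=10)  # acknowledgement goal from the text
SLO_TARGET = 0.95                   # hypothetical target, chosen for illustration


def ack_slo_attainment(ack_times, target=ACK_TARGET):
    """Return the fraction of signals acknowledged within the target."""
    if not ack_times:
        return 1.0  # no signals in the window means nothing was missed
    good = sum(1 for t in ack_times if t <= target)
    return good / len(ack_times)


# Hypothetical time-to-acknowledge values for four critical signals.
ack_times = [timedelta(minutes=m) for m in (3, 7, 12, 9)]

attainment = ack_slo_attainment(ack_times)
meets_slo = attainment >= SLO_TARGET

print(f"attainment={attainment:.0%}, meets SLO: {meets_slo}")
```

Treating an empty window as fully compliant is a design choice; some teams prefer to report "no data" instead so that quiet periods do not inflate the result.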
While a simple example, this SLO demonstrates how these factors work together to encourage faster response times and reliable detection signals. High-fidelity signals minimize noise and false positives, which enables faster detection of legitimate threats. Basing an SLO on those signals not only sets a clear target but also encourages your team to continually improve signal accuracy. Building suppression lists, for example, is one way to improve signal fidelity.
This practice also applies to other metrics, such as MTTR. Incorporating accurate signals as part of a “mean time to resolve” SLO, for example, ensures that the time to mitigate a security event won’t be stalled by the need to verify the signal’s validity.
Incidents and threats
In Part 1, we also covered the importance of tracking the number of intrusion attempts and security incidents, as well as your threat detection systems’ false positive rate (FPR). These metrics provide you with a better understanding of the different types of incidents that are affecting your environment, as well as how your threat detection systems play a role in identifying the events that can lead to one.
Intrusion attempts
An intrusion attempt includes any malicious activity that could potentially compromise a system, from gaining access to a valid account to changing resource configurations. You can track this information via generated detection signals, which will flag suspicious activity within your environment. Check out our documentation to learn about the different types of detection signals Datadog generates for logs, application activity, workload activity, misconfigurations, and more.
Cloud SIEM detection rules are mapped to the MITRE ATT&CK® framework to give you a better understanding of each of the tactics and techniques attackers use. This means that you can drill down to the specific types of intrusion attempts you want to track across detection rule types, such as Log Detection and Signal Correlation for Cloud SIEM. For example, with the Cloud SIEM ATT&CK Map, you can map coverage for detection rules against the Valid Accounts technique:
To take it a step further, you can also build custom dashboard widgets based on your detection coverage. This step will help you determine the initial steps that attackers take in their attempt to compromise your systems, such as taking advantage of public-facing applications or abusing account credentials. The following dashboard widget query shows the number of generated signals mapped to the various techniques related to the Initial Access tactic:
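The same breakdown can be sketched offline in Python. The records below are hypothetical, and the tactic and technique field names are assumptions rather than the exact tags on Datadog signals. The identifiers themselves come from MITRE ATT&CK (TA0001 is Initial Access).

```python
from collections import Counter

# Hypothetical signals; field names and value formats are illustrative.
signals = [
    {"tactic": "TA0001-initial-access", "technique": "T1078-valid-accounts"},
    {"tactic": "TA0001-initial-access", "technique": "T1190-exploit-public-facing-application"},
    {"tactic": "TA0001-initial-access", "technique": "T1078-valid-accounts"},
    {"tactic": "TA0006-credential-access", "technique": "T1110-brute-force"},
]


def signals_by_technique(records, tactic):
    """Count signals per technique, restricted to a single tactic."""
    return Counter(r["technique"] for r in records if r["tactic"] == tactic)


initial_access = signals_by_technique(signals, "TA0001-initial-access")

# List techniques from most to least frequently observed.
for technique, count in initial_access.most_common():
    print(f"{technique}: {count}")
```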
With this information, you gain a better understanding of your security coverage and the specific ways attackers are targeting your systems.
False positive rate (FPR)
To ensure that your Cloud SIEM detection signals accurately identify intrusion attempts and other types of threats, it’s important to track your threat detection systems’ FPR. This step requires accurately updating each signal with the “open”, “under review”, or “archive” status, which you can do in Datadog as part of the triage process. In the example screenshot below, we are archiving a generated signal that we found to be a false positive:
You can then query the number of false positive signals in order to determine the rate. For example, the following dashboard widget query uses @workflow.triage.archiveReason:false_positive to calculate FPR, filtering out all lower-priority signals so you have only the data that is meaningful and impactful:
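To show the calculation itself, here is a minimal Python sketch. The signal records and field names are hypothetical, and we assume the rate is the number of signals archived as false positives divided by all signals at the selected severities. Some teams divide by archived signals only, so adjust the denominator to match your own definition.

```python
def false_positive_rate(signals, severities=("high", "critical")):
    """Share of signals at the given severities archived as false positives."""
    relevant = [s for s in signals if s["severity"] in severities]
    if not relevant:
        return 0.0  # avoid dividing by zero when nothing matches the filter
    false_positives = sum(
        1 for s in relevant if s.get("archive_reason") == "false_positive"
    )
    return false_positives / len(relevant)


# Hypothetical triaged signals; "archive_reason" is None for anything
# that is still open or under review.
signals = [
    {"severity": "critical", "archive_reason": "false_positive"},
    {"severity": "critical", "archive_reason": "true_positive"},
    {"severity": "high", "archive_reason": None},
    {"severity": "high", "archive_reason": "true_positive"},
    {"severity": "low", "archive_reason": "false_positive"},  # filtered out
]

fpr = false_positive_rate(signals)
print(f"FPR for high and critical signals: {fpr:.0%}")
```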
Security incidents
One of the challenges in evaluating your organization’s security posture is knowing how to use incident data—such as how a security incident was detected, how long it took to resolve it, and which attacks were involved during that time—to gain a better understanding of how threats affect your environment. Some threats are easier to detect than others, some happen more frequently, and some are more severe, so they all require different levels of response. In Part 1, we discussed the importance of tracking this kind of data with metrics like the number, frequency, and severity of security incidents.
Datadog Incident Management automatically includes these metrics and more for declaring, investigating, and resolving security incidents. Incident Management is integrated with detection signals, so you can easily declare a security incident directly from a Cloud SIEM signal, for example. This enables you to easily transition from investigating a security event to coordinating an incident, bringing in responders to quickly investigate and resolve it before it affects customers.
Datadog also provides additional data for drilling down into specific aspects of an incident. For example, the following query looks at the total number of incidents with incident_type:security, broken down by severity level:
Information like this gives you a better understanding of the different types of security incidents that affect your environment. A consistently high number of SEV-1 security incidents, for example, could indicate significant gaps in an organization’s overall security posture, such as a lack of defense-in-depth controls or inadequate visibility into resource configurations, which we’ll look at next.
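One way to spot a consistently high number of SEV-1 security incidents is to break the counts down by month as well as by severity. The Python sketch below does this for a handful of hypothetical incident records; the field names are assumptions and do not mirror the Incident Management data model.

```python
from collections import Counter, defaultdict

# Hypothetical incident records; field names are illustrative.
incidents = [
    {"month": "2024-01", "incident_type": "security", "severity": "SEV-1"},
    {"month": "2024-01", "incident_type": "security", "severity": "SEV-3"},
    {"month": "2024-01", "incident_type": "availability", "severity": "SEV-1"},
    {"month": "2024-02", "incident_type": "security", "severity": "SEV-1"},
    {"month": "2024-02", "incident_type": "security", "severity": "SEV-1"},
]


def security_incidents_by_severity(records):
    """Group security incidents by month, then count them per severity."""
    by_month = defaultdict(Counter)
    for incident in records:
        if incident["incident_type"] == "security":
            by_month[incident["month"]][incident["severity"]] += 1
    return by_month


by_month = security_incidents_by_severity(incidents)

for month in sorted(by_month):
    print(month, dict(by_month[month]))
```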
To learn more about using data like this to efficiently respond to and mitigate issues, check out this post about how Datadog manages incidents.
Governance, compliance, and preparedness
This final category looks at several factors that can help you measure your organization’s level of preparedness and overall compliance, such as the frequency of training and security exercises and the completeness of compliance reports. Compliance and preparedness can apply to different areas of your environment. For example, compliance often focuses on configurations for individual resources, like EC2 instances. Preparedness, on the other hand, may look at the service level of your environment, which groups together related endpoints, workloads, queries, and teams.
In this section, we’ll look at how you can monitor compliance at the resource level and preparedness at the service level of your environment.
Security and compliance
Given the complexity of cloud environments, it’s challenging to get a high-level view of all of your resources and their configurations. Without this visibility, you risk overlooking misconfigurations that could leave your environment vulnerable to threats or in violation of a compliance standard.
Datadog CSM Misconfigurations automatically checks your environment resources against industry frameworks and benchmarks and offers several built-in compliance reports for quickly identifying gaps in your environment. The following report illustrates how Datadog checks your resource configurations against the CIS Benchmarks for AWS:
Each report includes a posture score, which takes into account the number of pass/fail misconfigurations in your environment and their severity. With this data, you can quickly glance at different areas of your environment to determine where to prioritize improvements.
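To build intuition for how pass/fail results and severity can combine into a single number, here is a simplified Python sketch of a severity-weighted score. This is not Datadog's posture score formula; the weights and check results are hypothetical and chosen only for illustration.

```python
# Hypothetical severity weights; Datadog's actual weighting may differ.
SEVERITY_WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1}


def weighted_posture_score(checks):
    """Percentage of severity-weighted checks that passed (0 to 100)."""
    total = sum(SEVERITY_WEIGHTS[c["severity"]] for c in checks)
    if total == 0:
        return 100.0  # nothing evaluated, so nothing failed
    passed = sum(SEVERITY_WEIGHTS[c["severity"]] for c in checks if c["passed"])
    return 100 * passed / total


# Hypothetical results from four configuration checks.
checks = [
    {"rule": "root-account-mfa", "severity": "critical", "passed": True},
    {"rule": "public-s3-bucket", "severity": "high", "passed": False},
    {"rule": "cloudtrail-enabled", "severity": "medium", "passed": True},
    {"rule": "unused-security-group", "severity": "low", "passed": True},
]

score = weighted_posture_score(checks)
print(f"posture score: {score:.1f}")
```

Weighting by severity means a single failing high-severity check lowers the score more than several failing low-severity ones, which matches the goal of prioritizing the most impactful fixes.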
The checks included in these built-in reports can be used in custom frameworks, where you can create security baselines for your environment. As described in Part 1, you can create a security baseline that’s designed to apply custom compliance requirements across specific areas of your environments, such as Kubernetes clusters.
Preparedness
Rather than looking at a specific metric, measuring preparedness requires evaluating the status of several different checkpoints. For example, your organization’s level of preparedness may include teams conducting regular security exercises and ensuring that their services adhere to specific security or compliance requirements. But like trying to monitor all of the individual resources within your environment, efficiently scaling these security checks is nearly impossible in cloud environments.
Addressing this challenge requires the ability to easily apply the same preparedness checks across your services and monitor their status. Datadog Scorecards helps simplify this process as you scale your environments. For example, you can use scorecards to track steps of a necessary but complex security review.
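The idea of applying the same checks to every service can be sketched in a few lines of Python. The checks, thresholds, and service records here are hypothetical and do not represent the Scorecards API; they only show the shape of the problem that a scorecard solves.

```python
# Hypothetical preparedness checks: each maps a name to a pass/fail predicate.
CHECKS = {
    "has_owner": lambda svc: bool(svc.get("team")),
    "recent_security_exercise": lambda svc: svc.get("days_since_exercise", 10**6) <= 90,
    "security_review_complete": lambda svc: svc.get("review_complete", False),
}


def evaluate(services):
    """Run every check against every service and record pass/fail."""
    return {
        svc["name"]: {name: check(svc) for name, check in CHECKS.items()}
        for svc in services
    }


# Hypothetical service metadata.
services = [
    {"name": "checkout", "team": "payments", "days_since_exercise": 30, "review_complete": True},
    {"name": "search", "team": "search", "days_since_exercise": 200},
]

results = evaluate(services)

# Summarize which checks each service still needs to address.
failing = {
    name: [check for check, ok in checks.items() if not ok]
    for name, checks in results.items()
}
print(failing)
```

Because the checks live in one place, adding a new requirement automatically applies it to every service on the next evaluation.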
Together, Datadog CSM Misconfigurations and Scorecards enable you to create a baseline for governance, compliance, and preparedness while identifying areas of improvement for both your resources and services. This visibility not only strengthens your organization’s security posture but also enhances other areas of your security response, including:
- Minimizing the number of intrusion attempts, especially those that lead to serious incidents
- Informing the types of Cloud SIEM detection signals you need to ensure coverage and reduce FPR
- Revealing ways to reduce the mean time to detect, acknowledge, and resolve a security event
- Supporting the creation of reasonable SLOs
Putting it all together with Datadog
In this post, we looked at several ways Datadog connects each of the pieces necessary for tracking an organization’s security posture, from metrics that measure your existing state to data that informs goals for improvement. Check out our documentation to learn more about Cloud SIEM, CSM, Incident Management, SLOs, and Scorecards. If you don’t already have a Datadog account, you can sign up for a free 14-day trial.
Acknowledgments
We’d like to thank Kendra Ash and Elise Burke for their invaluable feedback on this article.