
How we use Datadog for detection as code

By Christine Le and Christopher Camacho

Published: October 11, 2024

Detection as code (DaC) is a methodology that treats threat detection logic and security operations processes as code. It involves applying software engineering best practices to implement and manage detection rules and response runbooks. This approach addresses many of the pain points associated with traditional security operations, such as:

  • Version control: As the number of detection rules increases and creators change over time, it is vital to keep a source of truth, have transparency in rule changes, and understand motivations behind the changes. Defining detection rules as code allows teams to use version control systems like Git, so that multiple team members can contribute, review, and improve detection logic in a centralized and coordinated manner.
  • Consistency of review and approval: DaC standardizes detection logic, ensuring consistency and reducing the chances of gaps in security coverage. Passing tests and quality checks before releasing to production saves security responders from floods of faulty alerts or inactionable signals.
  • Maintenance at scale: As organizations grow, so do the volume and complexity of their security data. Rules defined as code support multi-file edits and large redeployments (e.g., pushing the same set of rules to multiple environments) in a single step, making it easier to scale detection efforts in line with organizational growth.

At Datadog, we’ve also experienced these same pain points, which has led us to adopt the DaC methodology. Our Threat Detection team dogfoods our own products—including Cloud SIEM, Application Security Management (ASM), and Cloud Security Management (CSM)—to implement DaC across our complex, microservice-based infrastructure. In this post, we’ll walk through how we use the Datadog platform to implement and maintain DaC, including how we write detection rules, our repository structure, our CI/CD pipeline, and our detection development flow.

How we write detection rules

Our Threat Detection team uses Datadog log query and Agent expression syntax to write detection rules that run across Cloud SIEM, ASM, and CSM. These rules detect indicators of potential security issues in our logs (Cloud SIEM), application runtimes (ASM), and workload runtimes (CSM), respectively.

We write these detection rules in Terraform using Datadog’s Terraform provider. Terraform creates the detection rules as resources, similar to software infrastructure. Adopting this infrastructure-as-code approach for detection rules allows the team to:

  • Push a rule once and see the rule deployed across multiple organizations
  • Define rules in a single place and selectively choose which organizations rules are deployed to
  • Prevent drift and guarantee a single source of truth
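
To target a given Datadog org, the Terraform configuration needs the Datadog provider initialized with that org’s credentials. The provider block itself isn’t shown in this post, but a minimal sketch, assuming API and application keys are supplied as input variables (the file name and variable names here are illustrative), might look like this:

// providers.tf (illustrative sketch — not taken from our repository)

variable "datadog_api_key" {
  type      = string
  sensitive = true
}

variable "datadog_app_key" {
  type      = string
  sensitive = true
}

# Each Datadog org is targeted with its own API and application key pair
provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}

Supplying a different key pair per organization (for example, through CI/CD variables) is what lets the same rule definitions be applied to each Datadog org.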

Our repository structure

Our repository houses the Terraform resources and tooling to manage detection rules at Datadog. The repository is divided into the following directories: rules, organizations, and tests.

├── rules
├── organizations
├── tests
├── .gitlab-ci.yml
└── README.md

Rules

All rules are defined as individual Terraform files, and each rule declares its product type (i.e., Cloud SIEM, ASM, or CSM). The rules are also grouped by data source. We define data source as the origin of the logs or events that the rules are based on. Here’s a glimpse of what that looks like:

rules
├── asm				// Application Security Management
│   ├── credential_stuffing.tf
│   └── config.tf
├── azure				// Azure control plane
│   ├── permission_elevation.tf
│   └── config.tf
├── cws				// Cloud Security Management (runtime)
│   ├── modify_authorized_keys.tf
│   └── config.tf
├── github				// GitHub activity
│   ├── clone_repos.tf
│   └── config.tf
└── k8s				// Kubernetes control plane
    ├── access_secrets.tf
    └── config.tf

Each data source’s directory is a Terraform module; its config.tf declares the required Datadog provider so that we can reuse the same rule definitions and deploy them to multiple Datadog instances.

// config.tf

terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }

  required_version = ">= 1.0.3"
}

Within a rule’s Terraform file, the same query used in Datadog log search or an Agent expression is embedded within a resource definition, along with the conditions for generating signals.

For example, the rule k8s/access_secrets.tf runs within Cloud SIEM and catches failed attempts to retrieve secrets within Kubernetes. The detection logic lives within the query block. The case block sets the threshold for generating a signal (secrets_forbidden_access > 1) and severity (low) for any generated signals. The options block establishes the evaluation window (300 seconds), keep alive (600 seconds), and max lifetime of a signal (900 seconds).

// k8s/access_secrets.tf

resource "datadog_security_monitoring_rule" "access_secrets" {
  name    = "Kubernetes - Failed attempts to access secrets"
  type    = "log_detection"
  enabled = true

  message = <<EOT
  User {{@usr.id}} attempted to list secrets in {{kube_namespace}}.{{k8s_cluster_fqdn}} repeatedly.

  EOT

  query {
    name            = "secrets_forbidden_access"
    query           = "source:kubernetes.audit @objectRef.resource:secrets @http.method:(list OR get) @http.status_code:403"
    aggregation     = "count"
    group_by_fields = ["kube_namespace", "k8s_cluster_fqdn", "@usr.id"]
  }

  case {
    name      = "secrets_forbidden_access"
    status    = "low"
    condition = "secrets_forbidden_access > 1"
  }

  options {
    evaluation_window   = 300 	// 5 minutes
    keep_alive          = 600 	// 10 minutes
    max_signal_duration = 900 	// 15 minutes
  }

}

We can also create suppression rules that filter out known legitimate activity to reduce noise in our signals. These can be created either within the same file as the detection they correspond with or as a separate file.

For example, we’ve defined the following suppression block within k8s/access_secrets.tf as a second resource. In this case, the suppression is specific to a single rule, so we keep the detection and its suppression in the same file to make it easier for our security engineers to understand. The suppression prevents a signal from being generated when the activity is triggered by a service account called redacted (@usr.id:system:serviceaccount:redacted).

locals {
  suppressions_access_secrets = [
    "@usr.id:system:serviceaccount:redacted"
  ]
}

resource "datadog_security_monitoring_suppression" "access_secrets" {
  count                = length(local.suppressions_access_secrets)
  enabled              = true
  name                 = "Kubernetes - Failed attempts to access secrets ${count.index + 1}"
  rule_query           = "ruleId:${datadog_security_monitoring_rule.access_secrets.id}"
  data_exclusion_query = local.suppressions_access_secrets[count.index]
}

Note: We create long-lived or permanent suppressions using Terraform. However, we recognize there are times when convenience or urgency takes precedence. For this reason, we create temporary or short-lived suppressions (e.g., ones that will be in place for just four hours, or during an expected maintenance window) directly in the Datadog UI. Similarly, responders occasionally use the one-click option within the UI to prevent a problematic rule from creating floods of alerts.

Organizations

We monitor multiple environments and forward data to a number of different Datadog instances, which we refer to as organizations or orgs. All of our Terraform backend configurations live in the repository’s organizations directory. The configurations specify the Terraform backend and dictate which rules are deployed within each Datadog organization.

Starting out, we had a single Terraform state file for all of our detection rules. We observed a repeated pattern as our environments and number of detections grew: A failing Terraform deployment would block all subsequent deployments. In other words, if Terraform runs into an error while applying a rule change, all subsequent changes to any rule will be blocked until the error is fixed.

In addition to quality checks that catch errors early, we chose to create a separate Terraform state for each set of rules within an org. This layer of isolation ensures that failing rule deployments in one org do not affect rule deployments occurring in other orgs.
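
The post doesn’t include the contents of a backend.tf, but as a rough sketch, each org-and-data-source directory could point Terraform at its own remote state, for example with an S3 backend (the backend type, bucket name, and key layout below are assumptions):

// organizations/prod/k8s/backend.tf (illustrative sketch)

terraform {
  backend "s3" {
    bucket = "example-detection-rules-tfstate" # hypothetical bucket name
    key    = "prod/k8s/terraform.tfstate"      # one state per org and data source
    region = "us-east-1"
  }
}

With a layout like this, a failed apply only locks the prod/k8s state, leaving every other org-and-data-source combination free to deploy.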

organizations
├── prod
│   ├── asm
│   │   ├── backend.tf
│   │   └── main.tf
│   ├── azure
│   │   ├── backend.tf
│   │   └── main.tf
│   ├── cws
│   │   ├── backend.tf
│   │   └── main.tf
│   └── k8s
│       ├── backend.tf
│       └── main.tf
└── staging
    ├── asm
    │   ├── backend.tf
    │   └── main.tf
    ├── azure
    │   ├── backend.tf
    │   └── main.tf
    ├── cws
    │   ├── backend.tf
    │   └── main.tf
    └── k8s
        ├── backend.tf
        └── main.tf

We opted to use main.tf to tell Terraform where to find the respective rules within the repository. Individual rules that we do not want deployed to a particular org are configured within main.tf. backend.tf specifies the remote backend location.

// main.tf

module "azure" {
  source = "../../../rules/azure"			# Path is relative to main.tf
}

# Individual rules that are not to be deployed within this org
# rules map is optional and does not have to be populated
rules = {
    permission_elevation = false
}
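
The post doesn’t show how the data source modules consume this map. One way to honor it, sketched below with illustrative names, is to declare a rules variable in each module and guard every rule resource with a count expression:

// rules/azure/config.tf (illustrative addition)

variable "rules" {
  description = "Optional per-rule toggles; a rule is deployed unless explicitly set to false"
  type        = map(bool)
  default     = {}
}

// rules/azure/permission_elevation.tf (abridged sketch)

resource "datadog_security_monitoring_rule" "permission_elevation" {
  count   = lookup(var.rules, "permission_elevation", true) ? 1 : 0
  name    = "Azure - Permission elevation"
  type    = "log_detection"
  enabled = true
  # ... message, query, case, and options blocks as in the earlier Kubernetes example
}

With this pattern, omitting a rule from the map leaves it deployed, while setting it to false removes it from that org on the next apply.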

Tests

For supported use cases, we define end-to-end tests for our rules and group them by data source in the tests directory. These tests are triggered by our internal testing service, which is built on top of Stratus Red Team and Threatest.

The following is an example of an end-to-end test file. The test file tells our testing service to schedule a job each hour (schedule: "0 * * * *") and run a Linux command to change the timestamp of the authorized_keys file (cd ~/.ssh && touch authorized_keys). The testing service should then expect a signal called SSH Authorized Keys Modified (OOTB clone) to be generated within our dedicated Datadog testing org (123456). If the testing service observes the signal, then the detection rule works as intended. However, if the testing service does not observe the signal, an alert is raised that prompts a detection engineer to follow up. The lack of signal indicates an issue in either logging or the detection rule itself.

// tests/cws/ssh-authorized-keys.yaml

schedule: "0 * * * *" 	# every hour
workerType: cws
datadogOrgs:
  - 123456			# example org ID
threatest:
  expectedRuleName: "SSH Authorized Keys Modified (OOTB clone)"
  timeout: 10m
  detonators:
    - type: local-command
      command: "cd ~/.ssh && touch authorized_keys"

Our CI/CD pipeline

We use GitLab for our continuous integration and continuous delivery (CI/CD) pipelines. All of our GitLab pipelines and rules are defined in .gitlab-ci.yml. The following is a breakdown of the various stages of our GitLab pipelines that define how we lint, test, and deploy our detection rules (a simplified sketch of the pipeline file follows the breakdown):

  • .pre
    • Configures our GitLab pipeline with Datadog tracing via CI Visibility so that we can view our pipelines’ performance and trends in the Datadog UI.
  • lint
    • Checks for expected syntax from the Terraform provider and enforces tagging.
  • test
    • Spins up and tears down a sandbox environment to test detection rules.
    • synthetic-logs
      • Job that triggers detection rules against sample audit log data, which is provided manually via a JSON file path.
    • apply
      • Job that sets up the sandbox environment with the rule(s) being tested.
    • destroy
      • Job that tears down the sandbox environment and the rule(s) previously deployed for testing.
  • check
    • Detects which organizations will be impacted by a rule change. If changes are detected, a GitLab artifact with a list of modified Terraform modules (i.e., detection rules) will be generated and utilized by downstream jobs, so that we can apply changes only in Datadog organizations where the affected modules are deployed.
  • plan
    • Generates a Terraform plan when rules are modified on a feature branch. This stage has a job defined per Datadog organization we deploy rules to. terraform apply takes place only when merging to the main branch of our repository.
  • deploy
    • Runs terraform apply on the impacted organizations when the changes are merged to our main branch. Similar to the plan stage, this stage has jobs defined per Datadog organization. Each of these jobs also has an equivalent scheduled job that periodically runs terraform apply to prevent drift and to detect any tampering with our detection rules in the Datadog UI.
    • This stage also generates a new JSON file, capturing the MITRE ATT&CK technique coverage for our detection ruleset. The JSON file powers our internal deployment of MITRE ATT&CK Navigator.
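
Tying these stages together, an abridged sketch of the pipeline file might look like the following. The job names, scripts, and branch rules below are illustrative; our actual .gitlab-ci.yml defines many more jobs, including per-organization plan and deploy jobs driven by the check stage’s artifact.

// .gitlab-ci.yml (abridged, illustrative sketch)

stages:
  - lint
  - test
  - check
  - plan
  - deploy

lint:
  stage: lint
  script:
    - terraform fmt -check -recursive rules/

plan-prod-k8s:
  stage: plan
  script:
    - cd organizations/prod/k8s
    - terraform init
    - terraform plan
  rules:
    - if: '$CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH'

deploy-prod-k8s:
  stage: deploy
  script:
    - cd organizations/prod/k8s
    - terraform init
    - terraform apply -auto-approve
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'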

Our detection development flow

Now that we’ve covered the general structure of our repository and the different stages involved in our CI/CD pipeline, let’s take a look at how it all comes together when developing a detection rule on our team.

Development starts with a Log Search query, where the engineer filters and narrows down the search results to the action(s) we’d like to detect.

Log search query in Datadog

To test our query, we craft a JSON blob or utilize existing test data to forward to the logs intake endpoint. If supported, we also test with Stratus Red Team to validate the detection will fire on real actions occurring in our environment.

Test in Stratus Red Team

Once we have run the initial tests and are satisfied with the results of the query, we create a detection rule directly in the Datadog UI.

Add detection rule from log query in Datadog
Detection rule creation step in Datadog

We then export the rule itself as a Terraform file directly from the UI, which is the expected format in our repository.

Export detection rule from Datadog as a Terraform file
Terraform file created from Datadog detection rule

Finally, we create a pull request in our rules repository, which triggers a set of checks, including linting, testing, and generating a Terraform plan. Upon approval during code review, we then merge our new rule into the repository, which triggers another set of CI/CD jobs to deploy our rule to the specified Datadog orgs.

Detection rule finding in Datadog Cloud SIEM

Simplifying and centralizing detection rule creation with Datadog

In this post, we discussed the key benefits that led us to adopt a detection as code methodology at Datadog. We walked you through how we put DaC concepts into practice from repository structure to CI/CD and our detection development flow. Along the way, we highlighted the critical role that Datadog Cloud SIEM, ASM, and CSM play in helping us organize and simplify the rule creation process so we can create the right detections and implement them at scale across our organization.

If you want to get started building detection rules in Datadog using Cloud SIEM, ASM, or CSM, check out our documentation. If you’re new to Datadog, sign up for a free trial.