Alerting 101: Timeseries metric checks

By John Matson (@jmtsn)

Published: October 2, 2017

In the previous article, we covered four types of alerts that evaluate the instantaneous status of different infrastructure components. In this post we’ll explore alerts that trigger on timeseries metrics and discrete events. The great benefit of these alerts is that they can evaluate not just instantaneous values, but also trends over time:

Threshold alerts

Threshold metric checks

Along with basic availability checks, a threshold alert is one of the simplest kinds of alerts. These alerts trigger whenever the monitored metric goes above (or below) a user-defined threshold. An alert may have multiple thresholds with different responses—for example, a warning threshold that posts a message in an ops channel, and a critical threshold that pages an on-call responder.

Importantly, a threshold alert need not be an absolute red line—you can include a time component in your alert evaluation to avoid false positives caused by momentary blips. For instance, you might wish to alert only if a metric’s value hovers above a given threshold for five minutes or more, or if the average value of the metric over a 15-minute window exceeds a set threshold.
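
As a rough sketch of that windowed evaluation, the Python below applies warning and critical thresholds to the average of the last 15 datapoints. The threshold values, window size, and metric are hypothetical; in practice the evaluation window and responses would be defined in your monitoring tool's alert configuration rather than in application code.

```python
from collections import deque
from statistics import mean

# Hypothetical thresholds for a normalized metric (e.g., fraction of disk used)
WARNING_THRESHOLD = 0.75   # post a message in the ops channel
CRITICAL_THRESHOLD = 0.90  # page the on-call responder
WINDOW_SIZE = 15           # number of datapoints in the evaluation window

window = deque(maxlen=WINDOW_SIZE)

def evaluate(datapoint):
    """Ingest one datapoint and return the alert state for the window average."""
    window.append(datapoint)
    if len(window) < WINDOW_SIZE:
        return "ok"  # not enough history yet to evaluate the full window
    avg = mean(window)
    if avg >= CRITICAL_THRESHOLD:
        return "critical"
    if avg >= WARNING_THRESHOLD:
        return "warning"
    return "ok"
```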

When to use threshold alerts

What | Why | Example
Metrics with SLAs | To surface unacceptable performance immediately | p95 application response time
Normalized metrics (percentages or fractions) | To identify critical resource constraints | Percent disk available

When to use something else

What | Example | Instead use…
Metrics with a variable or trending baseline | Web app requests per second | Change alert or anomaly detection to account for expected fluctuations

Change alerts

Percent or absolute change metric checks

Change alerts evaluate the delta or percentage change in a metric over a certain time interval. Change alerts can notify you of issues such as a large-magnitude drop in database queries processed, as compared to recent values. These alerts are useful for identifying sudden, unexpected changes in metrics where the baseline is highly variable over longer timespans, making it difficult to define a “normal” range for the metric.
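
A minimal sketch of that evaluation, assuming a hypothetical ten-datapoint lookback and a 50 percent drop threshold, might look like this:

```python
from collections import deque

LOOKBACK = 10            # how far back to compare, in datapoints (hypothetical)
DROP_THRESHOLD = -50.0   # fire on a drop of 50% or more vs. the earlier value

history = deque(maxlen=LOOKBACK + 1)

def evaluate(datapoint):
    """Ingest one datapoint and return True if the change alert should fire."""
    history.append(datapoint)
    if len(history) <= LOOKBACK:
        return False                 # not enough history to compare yet
    previous = history[0]            # value observed LOOKBACK datapoints ago
    if previous == 0:
        return False                 # avoid dividing by a zero baseline
    percent_change = (datapoint - previous) / previous * 100
    return percent_change <= DROP_THRESHOLD
```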

When to use change alerts

What | Why | Example
Metrics with variable or slowly shifting baselines | To isolate unexpected spikes or drops from normal changes | User count for a growing web app

When to use something else

What | Example | Instead use…
Metrics with a wide range of acceptable values | CPU usage on app servers | Threshold alert to ensure that CPU does not remain elevated for extended periods
Metrics with small or zero baseline values | 5xx error count in HTTP server | Threshold alert to avoid triggering on trivial increases that are large on a percentage basis (e.g., 1 error/second to 3 errors/second)

Outlier alerts

Outlier detection on timeseries metrics

Outlier detection (not to be confused with anomaly detection) tracks deviations from expected group behavior, whether that group comprises hosts, containers, or other units of infrastructure. An outlier alert evaluates metric values from each member of the group, triggering if one or more members deviate significantly from the rest.
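
One simple way to score deviations from the group is the median absolute deviation (MAD). The sketch below applies it to per-host values; the host names, sample values, and tolerance are hypothetical and chosen purely for illustration.

```python
from statistics import median

TOLERANCE = 3.0  # how many MADs from the group median counts as an outlier

def find_outliers(values_by_host):
    """Return the hosts whose latest value deviates strongly from the group."""
    values = list(values_by_host.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9  # guard against zero spread
    return [
        host for host, value in values_by_host.items()
        if abs(value - med) / mad > TOLERANCE
    ]

# Example: data stored (GB) per Cassandra node; node-4 is flagged as the outlier
print(find_outliers({"node-1": 410, "node-2": 395, "node-3": 402, "node-4": 690}))
```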

When to use outlier alerts

What | Why | Example
Metric values that should be relatively level across the members of the group | To identify imbalances caused by an internal failure | Data stored per Cassandra database node
Work metrics for distributed systems | To isolate individual components that are failing to process work effectively | Error rate per application server

When to use something else

What | Example | Instead use…
Metrics reported by heterogeneous groups | Resource metrics (e.g. free memory) across disparate instance types | Threshold alerts on normalized values (e.g. percent memory available)
Metrics that can easily be skewed by random distributions | p99 latency for individual app servers | Threshold alerts on aggregate p99 latency at the service level
Metrics from ephemeral infrastructure components | Throughput metrics for short-lived app containers | Change alerts on aggregated, service-level metrics

Anomaly alerts

Anomaly detection on timeseries metrics

Whereas outlier detection looks for deviations from group behavior in the moment, anomaly detection looks for deviations from recent historical trends. An anomaly alert can account for seasonality (such as daily traffic patterns), allowing you to get notified only when metric peaks or drops cannot be explained by normal, periodic fluctuations.
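
The sketch below is not Datadog's anomaly detection algorithm; it only illustrates the idea of a seasonal baseline by comparing the current value against readings taken at the same hour on previous days and flagging values that fall outside an expected band.

```python
from statistics import mean, stdev

BOUND = 3.0  # hypothetical width of the expected band, in standard deviations

def is_anomalous(current, same_hour_history):
    """same_hour_history holds values from the same hour of day on previous days."""
    if len(same_hour_history) < 2:
        return False  # not enough history to build a seasonal baseline
    baseline = mean(same_hour_history)
    spread = stdev(same_hour_history) or 1e-9  # guard against zero spread
    return abs(current - baseline) / spread > BOUND

# Example: today's 9 a.m. throughput vs. 9 a.m. readings from the past week
print(is_anomalous(1200, [2050, 1980, 2110, 2040, 1995, 2075, 2020]))  # True
```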

When to use anomaly alerts

What | Why | Example
Metrics with expected temporal patterns | To isolate problematic changes from normal fluctuations | Load balancer throughput for a user-facing application
Metrics with a long-term directional trend | To set robust alerts on metrics with a steadily shifting baseline | Transactions processed on an expanding e-commerce platform

When to use something else

What | Example | Instead use…
Metrics with unpredictable spikes or dips | Throughput for intermittent data-processing jobs | Event alert for absence of job completion

Event alerts

Alert evaluating multiple events over time

Unlike metric alerts, which evaluate a continuous stream of timeseries data, event alerts trigger on discrete occurrences, which may be quite rare. For instance, you might wish to trigger an alert whenever a nightly batch data-processing job fails to complete. Importantly, events can carry much more context than a single timeseries datapoint can. For example, an event generated by a failed run of a configuration management tool can include the actual error message returned, as well as a link to related events in the past.

Even though individual events are discrete, you can still monitor their occurrence over time by counting instances of an event over a particular timespan. For example, you can set an alert to fire whenever a particular service restarts more than three times in a five-minute interval.
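
A minimal sketch of that counting logic, using the hypothetical "more than three restarts in five minutes" rule as its parameters:

```python
import time
from collections import deque

MAX_EVENTS = 3        # hypothetical limit on restarts
WINDOW_SECONDS = 300  # five-minute evaluation window

event_times = deque()  # timestamps of recent events

def record_event(timestamp=None):
    """Record one event (e.g., a service restart); return True if the alert should fire."""
    now = time.time() if timestamp is None else timestamp
    event_times.append(now)
    # Drop events that have aged out of the evaluation window.
    while event_times and now - event_times[0] > WINDOW_SECONDS:
        event_times.popleft()
    return len(event_times) > MAX_EVENTS
```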

When to use event alerts

What | Why | Example
Completion of critical actions | To ensure that scheduled or intermittent work is carried out | Nightly build did not complete
Unexpected or forbidden activity | To monitor for possible breaches or abuses | Repeated login failures

Composite alerts

Composite alerts for more complex alerting logic

Composite alerts allow you to build more complex evaluation logic into your alert definitions. With a composite alert, you can specify that an alert fires if and only if a number of specific conditions are met. For example, you might want to alert on multiple indicators of service health, such as the locking rate in your database and the latency of your query service, where correlated trends can point to recurring issues. Or you might want to set a threshold alert on the length of your messaging queue, but withhold an alert shortly after a service restart, when a brief surge in queue length is expected.
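
Conceptually, a composite alert is just a boolean expression over the states of its underlying monitors. The sketch below uses hypothetical monitor names to express both patterns described above: correlated symptoms that must appear together, and an alert withheld when a routine explanation applies.

```python
def service_health_alert(states):
    """Fire only when both correlated symptoms appear together."""
    return states["db_lock_rate_high"] and states["query_latency_high"]

def queue_length_alert(states):
    """Fire on high queue length, unless a recent restart explains the surge."""
    return states["queue_length_high"] and not states["recent_service_restart"]

# Example evaluation against the current state of each underlying monitor
states = {
    "db_lock_rate_high": False,
    "query_latency_high": True,
    "queue_length_high": True,
    "recent_service_restart": True,   # the restart explains the queue spike
}
print(service_health_alert(states))  # False: only one symptom is present
print(queue_length_alert(states))    # False: withheld because of the restart
```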

When to use composite alerts

What | Why | Example
Multiple indicators that together point to a particular problem | To alert on known issues that manifest with multiple symptoms | Increased 404 error rate combined with surge in web requests per second
Issues that have a routine explanation | To withhold alerts on deviations that are expected under certain circumstances | Spike in traffic to e-commerce site after flash sale announcement

Stay alert!

In this post we have covered several alert types that you can set on metrics or events from your infrastructure. By understanding the ideal use cases for each of these alert types, as well as the status checks covered in the companion post, you can ensure that you are notified of critical issues in your environment as quickly as possible.