Monitor System Performance Across Longer Time Frames With Historical Metrics | Datadog

Monitor system performance across longer time frames with historical metrics

Author Mitheysh Asokan

Published: November 27, 2024

In Datadog, custom metrics enable you to monitor any data that is vital to your business. These may include application and infrastructure KPIs like latency and error rates, as well as user behavior and usage metrics such as dollars spent or trips taken.

Datadog now accepts historical metric values for your custom metrics, that is, metric values whose timestamp is more than one hour older than the time at which you submit them. Enabling historical metrics ingestion can help support long-term data analysis for use cases such as:

  • Outage recovery: A network outage may temporarily impede your environment’s ability to submit metrics. It can be beneficial to store this data locally and backfill it once connectivity returns, so that monitoring and trend analysis remain complete.
  • Data science: Machine learning (ML) models often generate predictions or analytics based on large amounts of historical data. If you’re monitoring an ML system, you may need to ingest training data with values that have historical timestamps.
  • Data correction: Whether because of simple human error or incorrect assumptions in your ML model that have biased your data, you may need to backfill correct, appropriately timestamped values in place of erroneous ones.
  • IoT: IoT architectures rely on devices with limited, periodic data collection intervals. This may require backfilling event-based telemetry to avoid large gaps in metrics.
  • Transit: Transport vessels that emit data may lose connectivity during their voyage. Metrics emitted from these vessels may need to be backfilled.
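
The outage-recovery case above can be sketched as a small local buffer: points are held with their original timestamps while the network is down, and on recovery each one is classified using the one-hour definition of a historical value. This is a minimal sketch, not Datadog client code. `Point`, `OutageBuffer`, `is_historical`, and the metric name `trips.taken` are hypothetical names introduced for illustration.

```python
import time
from dataclasses import dataclass, field

# "More than one hour old", per the definition of a historical value above.
HISTORICAL_THRESHOLD_SECONDS = 3600


@dataclass
class Point:
    metric: str
    timestamp: int  # Unix epoch seconds at which the value was measured
    value: float


def is_historical(point, now=None):
    """Return True if the point is more than one hour older than `now`."""
    now = time.time() if now is None else now
    return now - point.timestamp > HISTORICAL_THRESHOLD_SECONDS


@dataclass
class OutageBuffer:
    """Hypothetical in-memory buffer for points that could not be sent."""
    points: list = field(default_factory=list)

    def record(self, metric, value, timestamp=None):
        # Keep the measurement time, not the eventual submission time.
        ts = int(time.time()) if timestamp is None else int(timestamp)
        self.points.append(Point(metric, ts, value))

    def drain(self, now=None):
        """Empty the buffer and split it into (recent, historical) lists."""
        recent = [p for p in self.points if not is_historical(p, now)]
        historical = [p for p in self.points if is_historical(p, now)]
        self.points = []
        return recent, historical
```

After connectivity returns, `drain()` yields the points that can be sent as usual and the points that would only be accepted if historical metrics ingestion is enabled for their metric. A production version would likely persist the buffer to disk rather than hold it in memory.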

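In this post, we’ll show you:

  • How to enable historical metrics ingestion
  • How to track and understand historical metrics usage and volumes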

How to enable historical metrics ingestion

Let’s say you are a DevOps engineer tasked with ensuring your service is prepared to backfill metrics in the event of a network outage. Configuring historical metrics in Datadog enables you to submit data values that were emitted during the outage period, with timestamps that accurately represent the intended time of submission.

To ensure you are able to backfill this data, you can enable historical metrics ingestion for a single custom metric, as shown below:

Enable historical metrics from the Metrics Explorer in Datadog

Alternatively, you can bulk-enable historical metrics ingestion for multiple metrics that share a common metric namespace, saving you configuration overhead.

Bulk-enable historical metrics

With historical metrics ingestion enabled, you can submit and backfill data with timestamps that are as old as your account’s metric retention period (the default is 15 months). So if there’s a network outage that prevents you from submitting data to Datadog, you can still submit metric values hours or even days after the metrics were supposed to be emitted. With historical metrics ingestion, you can monitor the totality of your metrics in Datadog, without any gaps over time.
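
To make the backfill step concrete, here is a hedged sketch of submitting points with past timestamps through Datadog's v2 metrics series endpoint, using only the Python standard library. The helper names `build_backfill_payload` and `submit` and the metric name `checkout.latency` are illustrative assumptions. The URL assumes the US1 site, and gauge is assumed as the metric type.

```python
import json
import urllib.request

# Assumption: US1 site; other Datadog sites use a different hostname.
SERIES_URL = "https://api.datadoghq.com/api/v2/series"
GAUGE = 3  # metric type value for a gauge in the v2 series payload


def build_backfill_payload(metric, points, tags=None):
    """Build a v2 series payload from (timestamp, value) pairs.

    Timestamps are Unix epoch seconds and may lie in the past, provided
    historical metrics ingestion is enabled for `metric`.
    """
    return {
        "series": [
            {
                "metric": metric,
                "type": GAUGE,
                "points": [
                    {"timestamp": int(ts), "value": float(v)}
                    for ts, v in sorted(points)  # oldest point first
                ],
                "tags": list(tags or []),
            }
        ]
    }


def submit(payload, api_key):
    """POST the payload to Datadog and return the HTTP status code."""
    request = urllib.request.Request(
        SERIES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, `submit(build_backfill_payload("checkout.latency", [(outage_ts, 212.5)]), api_key)` would send one value measured at `outage_ts`. Building the payload separately from sending it makes it easy to inspect or batch the points before any network call is made.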

How to track and understand historical metrics usage and volumes

Your business may rely on hundreds or thousands of custom metrics, so it’s important to be able to easily see which of them have historical metrics ingestion enabled. For example, as you configure historical metrics to strengthen outage readiness, you may want to quickly identify which metrics don’t yet have historical data ingestion enabled, so that you can enable it and minimize information loss.

The Metrics Summary page in Datadog now includes a Historical Metrics facet box, which lets you easily filter metrics by their historical data ingestion status. This shows you quickly whether historical data is not yet available for a metric you are interested in, so you can enable historical data ingestion on that metric and speed up future investigations.

Historical Metrics facet box in Metrics Summary

Additionally, Datadog automatically generates a metric, datadog.metrics.tracking.historical_metrics_ingestion, which provides key insights into your historical metrics configuration and volumes. You can group by metric_name when querying this metric to list which metrics do and don’t have historical data ingestion enabled.

You can filter the metric by enabled:false to see names and volumes of metrics that are sending historical metrics to Datadog but have not had historical metrics ingestion turned on.

See which metrics are not sending historical data to Datadog
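
The grouping and filtering described above can be expressed as a metric query string. The helper below is a small sketch that assembles such a string. The function name `usage_query` is hypothetical, and the `sum` aggregator is an assumption rather than something this post specifies.

```python
USAGE_METRIC = "datadog.metrics.tracking.historical_metrics_ingestion"


def usage_query(enabled=None, group_by=("metric_name",), aggregator="sum"):
    """Build a query string for the historical-ingestion usage metric.

    enabled=None leaves the metric unfiltered; True or False adds an
    `enabled:` tag filter. The default aggregator is an assumption.
    """
    scope = "*" if enabled is None else f"enabled:{str(enabled).lower()}"
    query = f"{aggregator}:{USAGE_METRIC}{{{scope}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    return query
```

Calling `usage_query(enabled=False)` produces a query scoped to `enabled:false` and grouped by `metric_name`, which yields one series per metric that is sending late data without ingestion turned on.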

Tracking this metric can be important, because late data volumes could indicate that metrics emitted from your environment are experiencing delays and need to be reported as historical metric values.

For example, let’s say your organization adheres to strict real-time analytics (RTA) requirements. You want to know quickly if an infrastructure issue or application performance degradation is causing metrics to fall out of compliance with your RTA standards. An unexpected increase in late data volumes for metrics that do not have historical metrics ingestion turned on in Datadog could indicate that a network outage, IoT device connection error, or other issue is delaying metrics emission. You can create a monitor for the query string datadog.metrics.tracking.historical_metrics_ingestion{enabled:false} that alerts you to a significant increase in late data volumes for metrics without historical metrics ingestion enabled. Once alerted, you can investigate and remediate the underlying issue in your environment that is causing delays.
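
As a rough illustration of such a monitor, the sketch below builds a monitor definition as a plain dictionary in the shape accepted by Datadog's monitor creation API. A static threshold stands in for "significant increase" here, and a change or anomaly monitor may fit better in practice. The function name `late_data_monitor`, the monitor name and message, the `sum` aggregation, the one-hour window, and the threshold value are all assumptions to tune for your environment. The filter uses the lowercase `enabled:false` form.

```python
import json

USAGE_METRIC = "datadog.metrics.tracking.historical_metrics_ingestion"


def late_data_monitor(threshold, window="last_1h"):
    """Return a monitor definition that alerts when late-data volume for
    metrics without historical ingestion exceeds `threshold` in `window`.

    Both arguments are placeholders, not recommended values.
    """
    query = f"sum({window}):sum:{USAGE_METRIC}{{enabled:false}} > {threshold}"
    return {
        "name": "Late data on metrics without historical ingestion",
        "type": "query alert",
        "query": query,
        "message": "Late metric volume is rising; check for network or device issues.",
        "options": {"thresholds": {"critical": threshold}},
    }


# Serialize the definition, for example to review it before creating the monitor.
monitor_json = json.dumps(late_data_monitor(1000), indent=2)
```

The resulting JSON could be reviewed and then submitted through the monitor API or recreated by hand in the Datadog UI.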

Start using historical metrics ingestion today

Historical metrics ingestion is now available in Datadog. You can get started by choosing which custom metrics should accept historical values on the Metrics Summary page. Once the feature is enabled for a metric, you can immediately submit values with older timestamps for it.

To get started using historical metrics, check out our documentation. If you’re new to Datadog and want to monitor your logs, metrics, distributed request traces, and other telemetry in a unified platform, you can start a 14-day free trial.