
Introducing ARFBench: A time series question-answering benchmark based on real incidents
ARFBench is a time series question-answering benchmark built from real Datadog incidents to evaluate how well AI models can reason about anomalies.

Learn how Datadog verifies AI-generated systems at scale using deterministic testing, formal methods, and observability-driven feedback loops.

Learn how Datadog achieves fully autonomous, verified code optimization in production using LLM-driven evolution, formal verification, and live traffic validation.