vLLM Observability & Monitoring

Gain comprehensive visibility into the performance and resource usage of your LLM workloads.

A unified monitoring platform provides full visibility into the health and performance of every layer of your environment at a glance. Datadog lets you tailor this insight to your stack by collecting and correlating data from more than 800 vendor-backed technologies, all in a single pane of glass. Easily monitor your underlying infrastructure, supporting services, and applications alongside security data in one centralized platform.

Ensure Fast, Reliable Responses to Prompts

  • Visualize critical performance metrics like end-to-end request latency, token generation throughput, and time to first token (TTFT) with an intuitive out-of-the-box (OOTB) dashboard (see the sketch after this list for one way these metrics can flow into Datadog)
  • Identify and resolve infrastructure issues or resource constraints to ensure your LLM application remains fast and reliable, even under heavy load
  • Adjust resource allocation to meet demand and keep your LLMs performing at their best with end-to-end visibility
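
For teams curious how these numbers reach Datadog, here is a minimal Python sketch that scrapes vLLM's Prometheus-style /metrics endpoint and forwards a few series through DogStatsD. It is not the out-of-the-box integration, which collects these metrics automatically; the endpoint URL, the vLLM metric names (such as vllm:time_to_first_token_seconds), and the vllm.custom.* Datadog names used here are assumptions that may differ in your deployment.

# forward_vllm_metrics.py -- minimal sketch, not the official vLLM integration.
# Assumes: a vLLM server exposing metrics at http://localhost:8000/metrics, a local
# Datadog Agent accepting DogStatsD on 127.0.0.1:8125, and the `requests`,
# `prometheus-client`, and `datadog` Python packages installed.
import time

import requests
from datadog import initialize, statsd
from prometheus_client.parser import text_string_to_metric_families

VLLM_METRICS_URL = "http://localhost:8000/metrics"  # assumption: default vLLM serve port

# Prometheus series to forward -> Datadog metric names (vLLM metric names vary by version).
FORWARD = {
    "vllm:time_to_first_token_seconds_sum": "vllm.custom.ttft_seconds.sum",
    "vllm:e2e_request_latency_seconds_sum": "vllm.custom.e2e_latency_seconds.sum",
    "vllm:generation_tokens_total": "vllm.custom.generation_tokens.total",
    "vllm:num_requests_waiting": "vllm.custom.requests_waiting",
}

def forward_once():
    """Scrape the vLLM /metrics endpoint and emit selected series as DogStatsD gauges."""
    body = requests.get(VLLM_METRICS_URL, timeout=5).text
    for family in text_string_to_metric_families(body):
        for sample in family.samples:
            target = FORWARD.get(sample.name)
            if target is not None:
                tags = [f"{key}:{value}" for key, value in sample.labels.items()]
                statsd.gauge(target, sample.value, tags=tags)

if __name__ == "__main__":
    initialize(statsd_host="127.0.0.1", statsd_port=8125)
    while True:
        forward_once()
        time.sleep(15)  # scrape interval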

Optimize Resource Usage and Reduce Cloud Costs

  • Prevent over-provisioning by monitoring key LLM serving metrics like GPU/CPU utilization and cache usage
  • Reduce idle cloud spend while ensuring LLM workloads maintain high performance by tracking real-time resource consumption
  • Balance performance and cost-efficiency by rightsizing infrastructure and avoiding unnecessary scaling events (see the query sketch below)
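
To make the rightsizing check concrete, the brief sketch below pulls a week of KV-cache usage through the Datadog API with the datadogpy client. The metric name vllm.gpu_cache_usage_perc, the one-week window, and the environment variables are illustrative assumptions; verify the exact metric names your integration reports before relying on a query like this.

# rightsizing_check.py -- minimal sketch using the datadogpy API client.
import os
import time

from datadog import initialize, api

# Assumes DD_API_KEY and DD_APP_KEY are set in the environment.
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

now = int(time.time())
one_week_ago = now - 7 * 24 * 3600

# Average KV-cache usage over the past week; consistently low values can be a
# sign that the serving fleet is over-provisioned for its traffic.
result = api.Metric.query(
    start=one_week_ago,
    end=now,
    query="avg:vllm.gpu_cache_usage_perc{*}",  # assumed metric name -- verify in Metrics Explorer
)

for series in result.get("series", []):
    points = [point[1] for point in series["pointlist"] if point[1] is not None]
    if points:
        average = sum(points) / len(points)
        print(f"{series['metric']} weekly average: {average:.3f}")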

Detect and Address Critical Issues Before They Impact Production

  • Detect issues early by proactively monitoring key LLM application performance metrics with preconfigured Recommended Monitors
  • Prevent delays or interruptions by tracking metrics like queue size, preemptions, and requests waiting in real time
  • Resolve potential problems before they impact performance with actionable alerts on predefined thresholds (see the monitor-as-code sketch below)
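
Recommended Monitors cover these alerts out of the box, but the same thresholds can also be managed as code. The sketch below creates a metric monitor on waiting requests with the datadogpy client; the metric name vllm.num_requests.waiting, the thresholds, and the notification handle are placeholder assumptions rather than values taken from this page.

# create_queue_monitor.py -- minimal monitor-as-code sketch with datadogpy.
import os

from datadog import initialize, api

# Assumes DD_API_KEY and DD_APP_KEY are set in the environment.
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

# Alert when more than 20 requests have been waiting for five minutes.
# Metric name, thresholds, and notification handle are placeholders.
monitor = api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:vllm.num_requests.waiting{*} > 20",
    name="[vLLM] Request queue is backing up",
    message=(
        "More than 20 requests have been waiting for 5 minutes. "
        "Check GPU saturation and consider scaling the serving fleet. "
        "@slack-llm-oncall"
    ),
    tags=["service:vllm", "team:ml-platform"],
    options={"thresholds": {"critical": 20, "warning": 10}, "notify_no_data": False},
)

print("Created monitor:", monitor.get("id"))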

The Essential Monitoring and Security Platform for the Cloud Age

Datadog brings together end-to-end traces, metrics, and logs to make your applications, infrastructure, and third-party services entirely observable.

Platform Diagram

Next-generation ML Monitoring

Monitor your entire machine learning stack with Datadog.

AWS Trainium & Inferentia

Monitor and optimize deep learning workloads running on AWS AI chips.

OpenAI

Monitor token consumption, API performance, and more.

NVIDIA DCGM Exporter

Gather metrics from NVIDIA’s discrete GPUs, essential to parallel computing.

Loved & Trusted by Thousands

Washington Post · 21st Century Fox Home Entertainment · Peloton · Samsung · Comcast · Nginx

ML Monitoring Resources

Learn how Datadog can help you monitor your entire AI stack.

Datadog AI Monitoring Starter Kit