Real-Time NVIDIA GPU Monitoring | Datadog

Real-Time NVIDIA GPU Monitoring

Track the performance of all your GPU workloads, whether they are containerized, hosted locally, or deployed in the cloud. Correlate GPU performance and usage with the other technologies that support AI, including large language model use cases.


Visualize the health of NVIDIA GPUs

  • Monitor key GPU metrics, including temperature, power consumption, and framebuffer usage, with out-of-the-box dashboards to better understand the state of your AI stack
  • Enable end-to-end visibility into GPU-powered environments with GPU utilization and performance metrics, as well as process-specific metrics
  • Collect, visualize, and alert on metrics in minutes from widely used NVIDIA hardware, such as Tesla, A100, Kepler-series, and Maxwell GPUs and NVSwitch, with support for CUDA 7.5+ and NVIDIA Driver R450+
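
To make the metrics above concrete, here is a minimal sketch of turning raw GPU readings into structured values. It assumes the readings come from `nvidia-smi --query-gpu=index,temperature.gpu,power.draw,memory.used,memory.total --format=csv,noheader,nounits`; the function name `parse_gpu_csv` and the output keys are hypothetical, and this is an illustration only, not how the Datadog integration collects data.

```python
import csv
import io

# Column order assumed to match the --query-gpu list in the lead-in.
FIELDS = ("index", "temperature.gpu", "power.draw", "memory.used", "memory.total")


def _to_float(raw):
    """Return a float, or None when the value is not numeric (e.g. "[N/A]")."""
    try:
        return float(raw)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse CSV text (no header, no units) into one dict per GPU.

    Hypothetical helper: lines with an unexpected column count are skipped.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue
        values = [cell.strip() for cell in record]
        rows.append(
            {
                "gpu_index": int(values[0]),
                "temperature_c": _to_float(values[1]),
                "power_w": _to_float(values[2]),
                "memory_used_mib": _to_float(values[3]),
                "memory_total_mib": _to_float(values[4]),
            }
        )
    return rows
```

Each resulting dict could then be forwarded to whatever metrics pipeline is in use; unsupported readings stay as `None` rather than being reported as zero.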

Rapidly Pinpoint the Source of Bottlenecks in GPU Resources

  • Identify GPU temperature issues with customizable monitors and dashboards to determine if issues are an isolated spike or a gradual increase in hardware temperature over time
  • Prevent performance throttling and hardware burnout
  • Optimize inefficient AI workloads with automatic notifications from recommended monitors alerting you to increased memory utilization or a high number of XID errors
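
The difference between an isolated spike and a gradual temperature increase can be illustrated with a short sketch. It assumes evenly spaced temperature samples in degrees Celsius; the function names and both thresholds are hypothetical values chosen for illustration, not Datadog monitor defaults.

```python
from statistics import median


def linear_slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify_temperature_trend(temps_c, spike_delta_c=10.0, slope_c_per_sample=0.5):
    """Label a window as 'gradual_increase', 'isolated_spike', or 'stable'.

    Both thresholds are illustrative assumptions and would need tuning
    for real hardware and sampling intervals.
    """
    slope = linear_slope(temps_c)
    if slope >= slope_c_per_sample:
        return "gradual_increase"
    baseline = median(temps_c)
    # Samples far above the median baseline count as spike candidates.
    outliers = [t for t in temps_c if t - baseline >= spike_delta_c]
    # A spike is "isolated" when only a small share of samples are outliers.
    if outliers and len(outliers) <= max(1, len(temps_c) // 10):
        return "isolated_spike"
    return "stable"
```

A steady climb is classified by its slope, while a single hot reading leaves the slope near zero and is caught by comparing samples against the median.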

Get Insights into Model Server Latency and Corresponding Performance and Usage

  • See metrics like count of inference requests, inference failure counts, and batch execution to troubleshoot issues related to model serving performance.
  • Correlate model serving metrics like inference latency with GPU metrics.
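
One simple way to quantify the correlation described above is the Pearson coefficient between two time-aligned series, for example inference latency and GPU utilization. The sketch below assumes both series were sampled at the same timestamps; the function name `pearson` is hypothetical, and this is an illustration rather than the method the platform uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value near 1 suggests latency rises along with GPU load, a value near 0 suggests the bottleneck lies elsewhere, and in either case correlation alone does not establish the cause.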

The Essential Monitoring and Security Platform for the Cloud Age

Datadog brings together end-to-end traces, metrics, and logs to make your applications, infrastructure, and third-party services entirely observable.


Next-generation infrastructure monitoring

Monitor and troubleshoot infrastructure performance issues rapidly.


Watchdog

Automatically detect application performance issues without manual setup or configuration.


App Analytics

Search, filter, and analyze stack traces at infinite cardinality.


Service Map

Map applications and their supporting architecture in real time.

Loved & Trusted by Thousands

Washington Post, 21st Century Fox Home Entertainment, Peloton, Samsung, Comcast, Nginx

NVIDIA GPU Monitoring Resources

Learn about monitoring GPUs, your AI tech stack, and other infrastructure.

Datadog APM Starter Kit