The rapidly growing interest in AI has driven a corresponding demand for specialized cloud compute built to run training and inference workloads cost-efficiently and performantly. Google Cloud Tensor Processing Units (TPUs) have become a popular accelerated compute solution for AI/ML workloads. TPUs are specialized circuits designed to perform the matrix multiplication operations at the heart of neural networks, processing tensor-based workloads orders of magnitude faster than CPU- and GPU-based cloud resources. However, to optimize performance, detect resource bottlenecks, and control cloud spend, organizations that adopt this solution need granular visibility into their hardware’s utilization, performance, and resource consumption.
Datadog’s Google Cloud integration collects metrics and logs from Google Cloud TPUs. This enables you to monitor utilization, total resource usage, and other performance metrics at the container, node, and worker levels, so you can rightsize your TPU infrastructure and respond to changes in usage to balance training costs with performance. In this post, we’ll discuss how you can:
- Quickly visualize TPU metrics with our out-of-the-box dashboard to help you optimize performance
- Be alerted to possible rightsizing opportunities using our recommended monitors
Visualize TPU metrics with Datadog’s preconfigured dashboard to help optimize performance
After you’ve set up Datadog to collect telemetry from your Google Cloud services, Datadog will collect logs and metrics from all your Google Cloud TPUs. Whether you plan to orchestrate TPUs with Google Kubernetes Engine (GKE), fine-tune LLMs using your own training data, or apply TPUs to other large-dataset training, our integration provides a centralized view into your TPU usage across virtual machines, GKE, custom job runners, and more, so you can make informed changes to your model parameters to achieve stronger performance at lower costs.
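Once the integration is reporting data, you can also pull TPU metrics programmatically, for example to feed a capacity report or a custom analysis. Here's a minimal sketch using Datadog's Python API client to query an hour of utilization data; the metric name below is a hypothetical placeholder for illustration, so substitute the TPU metric names your integration actually reports (you can find them in your account's metric summary).

```python
import time

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.metrics_api import MetricsApi

# Configuration() reads DD_API_KEY and DD_APP_KEY from the environment.
configuration = Configuration()

with ApiClient(configuration) as api_client:
    api = MetricsApi(api_client)
    now = int(time.time())
    response = api.query_metrics(
        _from=now - 3600,  # last hour
        to=now,
        # Hypothetical metric name -- replace with the TPU metric
        # your integration reports.
        query="avg:gcp.tpu.node.tensorcore.utilization{*} by {node_name}",
    )
    for series in response.series or []:
        print(series.metric, series.scope)
```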
Datadog’s Google Cloud TPU dashboard enables GKE customers to gain insights into the utilization of their TPU clusters by tracking the TensorCore utilization and duty cycle metrics. Duty cycle refers to the percentage of time your TPUs are actively computing any workload over a given time frame, while TensorCore utilization refers to the percentage of time your TPUs are leveraging their tensor cores, specialized hardware units for compute-intensive tasks such as matrix operations. Our recommended monitors can notify you when your TensorCore utilization falls below your target utilization. When notified, you may want to consider increasing your batch sizes to take greater advantage of your TPUs’ TensorCore capabilities and improve your model’s throughput.
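To see how batch size affects throughput before you commit to a configuration, you can benchmark a representative step at several batch sizes. The following is a minimal JAX sketch with a single stand-in matmul rather than your real training step; the shapes and batch sizes are arbitrary assumptions for illustration.

```python
import time

import jax
import jax.numpy as jnp

@jax.jit
def train_step(x, w):
    # Stand-in for a real training step: one matmul-heavy op.
    return jnp.tanh(x @ w)

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)

for batch_size in (256, 512, 1024, 2048):
    x = jax.random.normal(key, (batch_size, 4096), dtype=jnp.bfloat16)
    train_step(x, w).block_until_ready()  # compile and warm up
    start = time.perf_counter()
    for _ in range(10):
        train_step(x, w).block_until_ready()
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size}: {batch_size * 10 / elapsed:,.0f} samples/s")
```

Throughput typically climbs with batch size until you hit a memory or bandwidth ceiling, which the metrics discussed next help you spot.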
Operating with large batch sizes can be critical to cost-efficient training. However, larger batch sizes also create additional pressure on memory resources. As you increase your batch size, you’ll need to monitor your TPUs’ memory usage to avoid out-of-memory (OOM) errors and other resource constraints. If you do encounter OOM errors, you may need to restructure your input data to avoid padding.
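Alongside the dashboard's memory panels, you can check headroom directly from inside a training job. Here's a minimal sketch, assuming a JAX workload on a TPU backend, where per-device memory statistics are exposed; on some backends `memory_stats()` returns `None`, so the code guards for that.

```python
import jax

# memory_stats() is populated on TPU (and most GPU) backends;
# it can return None elsewhere.
for device in jax.local_devices():
    stats = device.memory_stats()
    if not stats:
        continue
    in_use = stats.get("bytes_in_use", 0)
    limit = stats.get("bytes_limit", 0)
    if limit:
        pct = 100 * in_use / limit
        print(f"{device}: {in_use / 2**30:.2f} GiB of "
              f"{limit / 2**30:.2f} GiB HBM in use ({pct:.0f}%)")
```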
The speed of an LLM is often bottlenecked by its memory bandwidth: when operating at smaller batch sizes, the model is bound by how quickly it can load parameters from the TPU’s high-bandwidth memory (HBM) into local caches rather than by how quickly it can compute on that data once it’s loaded. For this reason, if your model uses smaller batch sizes, memory bandwidth utilization (how much of the model’s peak bandwidth is being used at any given time) can be a better indicator of your model’s inference speed than its TensorCore utilization. You’ll want to monitor both utilization metrics while fine-tuning your batch configurations. This will help you find the point where your model’s memory and compute constraints meet, so you can maximize your hardware’s resource utilization.
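You can estimate that crossover point with simple roofline arithmetic. The sketch below is back-of-the-envelope math for a hypothetical 7B-parameter bf16 model; the chip’s peak compute and HBM bandwidth figures are assumptions, roughly in line with published TPU v5e numbers, so substitute your own hardware’s specs.

```python
# All numbers below are illustrative assumptions, not measurements.
params = 7e9            # hypothetical 7B-parameter model
bytes_per_param = 2     # bf16 weights
hbm_bandwidth = 819e9   # bytes/s, roughly a TPU v5e chip's peak HBM bandwidth
peak_flops = 197e12     # bf16 FLOP/s, roughly a TPU v5e chip's peak compute

# Memory-bound floor: each decode step streams all weights from HBM.
weight_bytes = params * bytes_per_param
memory_bound_step = weight_bytes / hbm_bandwidth  # seconds per step

# Compute-bound floor: roughly 2 FLOPs per parameter per token.
def compute_bound_step(batch_size):
    return 2 * params * batch_size / peak_flops

# Below this batch size, steps are memory-bound; above it, compute-bound.
crossover = memory_bound_step / compute_bound_step(1)

print(f"memory-bound step time: {memory_bound_step * 1e3:.1f} ms")
print(f"approximate crossover batch size: {crossover:.0f}")
```

Below the crossover, additional batch elements are nearly free, which is why memory bandwidth utilization, not TensorCore utilization, tells you how close a small-batch workload is to the hardware’s limit.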
Datadog’s dashboard also gives you visibility into your TPU workers with additional granularity, allowing you to visualize a worker’s instance uptime, its resource usage and utilization, and its network traffic. If you notice that workers are consistently underutilized, consider consolidating your workloads onto fewer workers or downsizing to an instance type with fewer cores.
Be alerted to rightsizing opportunities and save on infrastructure costs
As your applications evolve, so will your infrastructure utilization. Configuring monitors based on TPU metrics helps you stay on top of your changing infrastructure and be alerted to possible rightsizing opportunities.
Datadog provides several recommended monitors for TPU metrics that can alert you to underutilization of your infrastructure at both the node and container level. Recommended monitors come preconfigured out of the box for quick installation, and they can help your organization take a more proactive approach to optimizing your TPU usage and spend, especially if you’re new to using Cloud TPUs. If you receive a notification that the TensorCore utilization for your model running on a GKE node has decreased following changes you’ve made, this can be a signal to consolidate workloads on that node onto a smaller instance type. For instance, switching from a TPU v5p pod to a TPU v6e pod can drastically reduce your model training costs if your current infrastructure is underutilized.
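If you prefer to manage alerts as code rather than install them from the recommended monitors catalog, you can create an equivalent monitor with Datadog's Python API client. Here's a minimal sketch; the metric name, the 40 percent threshold, and the notification handle are all assumptions for illustration, so adjust them to your integration's metrics and your own utilization target.

```python
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

body = Monitor(
    name="TPU TensorCore utilization is low",
    type=MonitorType("metric alert"),
    # Hypothetical metric name and threshold -- substitute the TPU
    # metric your integration reports and your utilization target.
    query=(
        "avg(last_4h):avg:gcp.tpu.node.tensorcore.utilization{*} "
        "by {node_name} < 40"
    ),
    message=(
        "TensorCore utilization has been under 40% for 4 hours. "
        "Consider consolidating workloads or downsizing the node pool. "
        "@slack-ml-infra"  # hypothetical notification handle
    ),
    tags=["team:ml-infra", "integration:gcp-tpu"],
)

# Configuration() reads DD_API_KEY and DD_APP_KEY from the environment.
with ApiClient(Configuration()) as api_client:
    monitor = MonitorsApi(api_client).create_monitor(body=body)
    print(f"Created monitor {monitor.id}")
```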
Start monitoring your Google Cloud TPU infrastructure with Datadog
Datadog’s Google Cloud integration enables you to monitor your Cloud TPU metrics so you can detect resource bottlenecks and underutilization and make subsequent changes to optimize performance and cloud spend. Our preconfigured dashboard and recommended monitor templates are quick to install, enabling you to gain immediate value and insight into your cloud AI infrastructure with just a few clicks. To learn more, check out our documentation. If your organization uses Google Cloud Vertex AI to leverage TPUs for model training, learn how our Vertex AI integration can help you gain additional visibility into your AI stack.
If you don’t already have a Datadog account, sign up for a free 14-day trial today.