The Monitor

Evaluate LLMs and LLM applications for accuracy with NVIDIA NeMo Evaluator and Datadog LLM Observability

Shri Subramanian

Barry Eom

Generative AI applications are becoming a core component of how modern enterprises meet customer needs. However, measuring the quality and performance of the underlying models can be a challenge given the nondeterministic nature of their output. NVIDIA NeMo Evaluator—part of NVIDIA's NeMo platform—is a microservice with an easy-to-use API that simplifies the end-to-end evaluation of generative AI applications, including retrieval-augmented generation (RAG) and agentic AI. It supports evaluation for a wide range of custom tasks and domains, including reasoning, coding, retrieval, and instruction-following.

With NVIDIA NeMo Evaluator, developers can automatically evaluate their models against academic benchmarks or custom datasets, and score them with standard metrics such as accuracy, ROUGE, and BLEU, or with LLM-as-a-judge scoring. NeMo Evaluator returns structured scores for each model response. You can seamlessly integrate NeMo Evaluator into your CI/CD pipelines and build data flywheels for continuous evaluation.
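
As a rough sketch, submitting an evaluation job to a NeMo Evaluator microservice over HTTP might look like the snippet below. The host name, endpoint path, and payload fields are illustrative assumptions, not the authoritative API; consult the NeMo Evaluator documentation for the exact request format.

```python
import requests

# Assumption: a NeMo Evaluator microservice is reachable at this host.
NEMO_EVALUATOR_URL = "http://nemo-evaluator.example.com"

# Hypothetical job spec: point the evaluator at a model and a dataset and
# request standard metrics. Field names here are placeholders.
job_spec = {
    "target": {"model": "my-custom-llm"},
    "config": {
        "tasks": ["question-answering"],
        "metrics": ["accuracy", "rouge", "bleu"],
        "dataset": "my-eval-dataset",
    },
}

# Submit the evaluation job; the service returns structured scores per
# response once the job completes (endpoint path is an assumption).
response = requests.post(
    f"{NEMO_EVALUATOR_URL}/v1/evaluation/jobs", json=job_spec, timeout=30
)
response.raise_for_status()
print(response.json())
```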

In this post, we'll look at how you can use Datadog LLM Observability to monitor NVIDIA NeMo Evaluator's model evaluation scores alongside telemetry data from the rest of your LLM stack to better track changes in model quality.

Collect NeMo Evaluator scores in LLM Observability

Datadog LLM Observability provides end-to-end visibility into the health and performance of your LLM applications. For example, you can trace requests as they propagate across RAG components and model inference and evaluation steps, and you can collect and visualize key model metrics and metadata such as latency, token usage, and prompt input. Once you integrate NeMo Evaluator with Datadog, the evaluation scores appear as evaluation metrics tied to the original LLM trace, giving you a complete view of each request.
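
As a minimal sketch of instrumenting an application with the LLM Observability SDK (the app name and function are placeholders, and exact SDK options may vary by ddtrace version):

```python
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow

# Enable LLM Observability for this application (app name is a placeholder).
LLMObs.enable(ml_app="rag-chatbot")

@workflow
def answer_question(question: str) -> str:
    # ... retrieval and model inference would happen here ...
    answer = "placeholder answer"
    # Attach the input and output to the current span so they appear in the trace.
    LLMObs.annotate(input_data=question, output_data=answer)
    return answer
```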

To integrate NeMo Evaluator with Datadog, use the LLM Observability SDK to submit each evaluation score along with its trace and span IDs. This links model quality metrics directly to the corresponding LLM request for unified analysis. For example, in the image below, you can see the LLM Observability span for a question from a dataset alongside the metric generated by NeMo Evaluator.

Datadog displays NVIDIA NeMo Evaluator score in the relevant LLM trace
Collect and visualize NVIDIA NeMo Evaluator scores in your LLM app traces.
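
A minimal sketch of that submission, assuming the ddtrace LLM Observability SDK (the label, tag, and score value are placeholders, and the exact method signature may differ by SDK version):

```python
from ddtrace.llmobs import LLMObs

def report_nemo_score(score: float) -> None:
    # Export the trace and span IDs of the span being evaluated
    # (None refers to the currently active span).
    span_context = LLMObs.export_span(span=None)

    # Submit the NeMo Evaluator score as a custom evaluation metric tied to that span.
    # "accuracy" is a placeholder label; use whatever metric NeMo Evaluator returned.
    LLMObs.submit_evaluation(
        span_context=span_context,
        label="accuracy",
        metric_type="score",
        value=score,
        tags={"evaluation_provider": "nemo_evaluator"},
    )
```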

You can visualize and monitor NeMo Evaluator metrics on dashboards or from your app's overview page, and you can easily filter by model version, task type, or environment to surface performance changes across your models.

View evaluation metrics in context

Datadog places your NeMo Evaluator metrics in context with telemetry data from other parts of your LLM-powered apps. For example, each trace displays data like the total latency, token count, and input size. This enables you to correlate drops in model quality with potential system issues. You can also set Datadog alerts on your evaluation metrics (e.g., NeMo Evaluator's accuracy or helpfulness scores) to notify you if model quality falls below a threshold.
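
For instance, a threshold alert on an evaluation score could be created programmatically with the Datadog API client. The query, metric name, and notification handle below are placeholders; the actual metric name depends on how you label the evaluations you submit.

```python
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

# Placeholder query: alert when average accuracy drops below 0.8 over 15 minutes.
# Replace the metric name and tags with the ones generated by your evaluations.
monitor = Monitor(
    name="NeMo Evaluator accuracy below threshold",
    type=MonitorType("metric alert"),
    query="avg(last_15m):avg:ml_obs.evaluation.accuracy{ml_app:rag-chatbot} < 0.8",
    message="Model accuracy dropped below 0.8. @slack-llm-alerts",
)

# Reads DD_API_KEY and DD_APP_KEY from the environment.
configuration = Configuration()
with ApiClient(configuration) as api_client:
    api = MonitorsApi(api_client)
    print(api.create_monitor(body=monitor))
```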

The LLM Observability Cluster Map groups trace data based on different input or output criteria. Because each evaluation score is tied to the originating LLM span, you can cluster the inputs and outputs of the benchmarks NeMo Evaluator runs to identify broader patterns or underperforming clusters.

Get started

NVIDIA NeMo Evaluator and Datadog LLM Observability make it easier to measure and monitor the quality and reliability of your LLM applications. See our guide for a walkthrough of using both tools to trace requests and evaluate the responses of a sample app, and learn how to visualize and alert on model quality metrics in real time. If you're not already a Datadog customer, sign up for a 14-day free trial.

Related Articles

Integration roundup: Monitoring your AI stack

Monitor your Google Gemini apps with Datadog LLM Observability

Monitor AWS Trainium and AWS Inferentia with Datadog for holistic visibility into ML infrastructure

Monitor your OpenAI LLM spend with cost insights from Datadog
