Monitor your Anthropic applications with Datadog LLM Observability

Author: Shri Subramanian

Published: August 8, 2024

Anthropic is an AI research and development company focused on building reliable and safe artificial intelligence systems. Their flagship product is Claude, an advanced language model and conversational AI assistant known for its strong capabilities in natural language processing, reasoning, and task completion. Anthropic places a particular emphasis on AI safety and ethics, and its models and APIs are used by organizations across various industries to build powerful, safe, and performant AI applications.

We are pleased to announce Datadog LLM Observability’s native integration with Anthropic, which you can use to monitor, troubleshoot, and secure your Anthropic LLM applications. The integration enables Anthropic customers to use Datadog for:

  • Enhanced visibility and control with real-time metrics that provide insights into Anthropic models’ performance and usage
  • Streamlined troubleshooting and debugging with granular visibility into LLM chains via distributed traces
  • Quality and safety assurance with out-of-the-box evaluation checks

In this post, we will discuss how these features within Datadog LLM Observability can help AI engineers and software developers develop accurate, cost-efficient, safe, and secure Anthropic-powered LLM applications at scale.

Track Anthropic usage patterns

Cost efficiency and performance are two of the most important concerns of modern LLM applications. As AI application teams rapidly scale up their usage of Anthropic APIs to tackle more complex use cases, it becomes increasingly crucial to monitor requests, latencies, and token consumption effectively.

Token usage can fluctuate depending on the models employed, each of which comes with its own pricing structure. LLM Observability’s included metrics can help organizations manage and understand these complexities. You can monitor many of these metrics in LLM Observability’s out-of-the-box dashboard, which provides a comprehensive view of application performance and usage trends across your organization. The dashboard surfaces detailed operational performance data, including trace- and span-level errors, latency, token consumption, model usage statistics, and any triggered monitors.

Monitor key Anthropic metrics in the LLM Observability dashboard
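
To make the token accounting concrete, here is a minimal sketch of where those numbers come from: each Anthropic Messages API response reports its own input and output token counts, which LLM Observability rolls up into the usage metrics shown on the dashboard. The model ID and prompt below are placeholder assumptions.

```python
# Minimal sketch: the per-request token counts behind the usage metrics.
# Assumes the anthropic package is installed and ANTHROPIC_API_KEY is set.
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

response = client.messages.create(
    model="claude-3-haiku-20240307",  # example model ID; pricing varies by model
    max_tokens=200,
    messages=[{"role": "user", "content": "List three onboarding tips for new users."}],
)

# Each response carries its own token accounting, which LLM Observability
# aggregates into token-consumption and model-usage metrics.
print("input tokens: ", response.usage.input_tokens)
print("output tokens:", response.usage.output_tokens)
```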

Troubleshoot your Anthropic application faster with end-to-end tracing

Rapid advancements in generative AI technology have made LLMs faster, cheaper, and equipped with larger context windows, allowing developers to create specialized applications. Anthropic Claude models have demonstrated strong performance in reasoning tasks, making them well-suited for complex chain-of-thought applications. This allows developers to design LLM agents for more sophisticated and nuanced tasks.

As customers build complex LLM chains where an initial request can trigger a series of distributed system calls, they also introduce multiple points of failure. An LLM application request could fail not only due to “hard” errors—such as timeouts or bad API calls—but also “soft” errors, where the request executes successfully but returns an incorrect or poor response. These soft errors are particularly important to track—and also particularly difficult and time-consuming to find and diagnose.

LLM Observability’s traces help you identify and solve these errors by providing detailed information about each step in your LLM chain’s execution and highlighting errors and latency bottlenecks. LLM Observability’s deep integration with Anthropic automatically captures Anthropic API requests without requiring any manual instrumentation. This allows you to focus on instrumenting other parts of your LLM application.
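
As a rough illustration, the snippet below sketches how you might enable LLM Observability in a Python service that uses the Anthropic SDK; once enabled, Anthropic calls are captured automatically. The application name, model ID, and environment variable usage are assumptions for illustration, so refer to the LLM Observability setup documentation for the configuration that matches your deployment.

```python
# Minimal sketch: enabling Datadog LLM Observability for an Anthropic app.
# Assumes the ddtrace and anthropic packages are installed, and that
# DD_API_KEY and ANTHROPIC_API_KEY are set in the environment.
import os

from anthropic import Anthropic
from ddtrace.llmobs import LLMObs

LLMObs.enable(
    ml_app="my-anthropic-app",          # placeholder application name
    api_key=os.environ["DD_API_KEY"],   # your Datadog API key
    agentless_enabled=True,             # send data directly, without a local Agent
)

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# This call is captured automatically as an LLM span; no manual
# instrumentation is required for Anthropic API requests.
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # example model ID
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize our refund policy in one sentence."}],
)
print(response.content[0].text)
```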

When tracing an Anthropic API call, you can carefully inspect the input prompt and observe each of the steps your application took to form the final response. By looking at these intermediate steps, you can quickly discover the root cause of unexpected responses.

Inspect each step of an Anthropic LLM chain in an end-to-end trace
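
If you want your own chain steps to appear as spans alongside the automatically captured Anthropic calls, the LLM Observability SDK provides decorators for annotating workflows and tasks. The sketch below assumes LLM Observability is already enabled; the retrieval step, function names, and model ID are hypothetical.

```python
# Minimal sketch: surfacing intermediate chain steps as spans in a trace.
# Assumes LLM Observability has already been enabled (see the previous snippet).
from anthropic import Anthropic
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import task, workflow

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


@task
def retrieve_context(question: str) -> str:
    # Hypothetical retrieval step, recorded as its own task span.
    return "Refunds are processed within 5 business days."


@workflow
def answer_question(question: str) -> str:
    context = retrieve_context(question)
    # The Anthropic call below is captured automatically as a child LLM span.
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # example model ID
        max_tokens=256,
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
    )
    answer = response.content[0].text
    # Attach the workflow's input and output so they show up in the trace side panel.
    LLMObs.annotate(input_data=question, output_data=answer)
    return answer


print(answer_question("How long do refunds take?"))
```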

Evaluate your LLM application inputs and outputs for quality and safety issues

LLM applications must be rigorously evaluated for quality and safety, particularly due to their non-deterministic nature. These applications are susceptible to various attack techniques, posing significant risks. Datadog LLM Observability provides out-of-the-box quality and safety checks to help you monitor the quality of your application’s output, as well as detect any prompt injections and toxic content in your application’s LLM responses. These features enable you to maintain high standards of performance and ethical AI usage, aligning with Anthropic’s commitment to developing safe and effective AI technologies.

The trace side panel allows you to view these quality checks, which include metrics like “Failure to answer” and “Topic relevancy” to assess the success of responses. Additionally, checks for “Toxicity” and “Negative sentiment” are included to indicate potential poor user experiences. By leveraging these tools, you can ensure your LLM applications operate reliably and ethically, addressing both performance and safety concerns.

Monitor quality checks for Anthropic LLM applications
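
The checks above run out of the box and require no code. If you also want to record your own evaluation results against a span, the LLM Observability SDK exposes a submission helper; the sketch below is an assumption-laden illustration (the label and score are hypothetical), so confirm the exact API in the LLM Observability SDK documentation.

```python
# Minimal sketch: attaching a custom evaluation to the active span, alongside
# the out-of-the-box quality and safety checks. The label and value here are
# hypothetical examples.
from ddtrace.llmobs import LLMObs

# Export the context of the span you want to evaluate (here, the active span).
span_context = LLMObs.export_span(span=None)

LLMObs.submit_evaluation(
    span_context=span_context,
    label="response_helpfulness",  # hypothetical evaluation name
    metric_type="score",           # or "categorical"
    value=0.9,
)
```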

LLM Observability also integrates with Sensitive Data Scanner to scrub personally identifiable information (PII) from prompts and responses in traces by default, helping you detect when customer PII has been inadvertently passed into an LLM call or shown to the user.

Monitor your Anthropic applications with Datadog

LLM-based applications are incredibly powerful and unlock many new product opportunities. However, there remains a pressing need for granular visibility into their behavior. By monitoring your LLM applications using Datadog LLM Observability, you can form actionable insights about their health, performance, and security from a consolidated view.

LLM Observability is now generally available for all Datadog customers. See our documentation for more information about how to get started. If you’re brand new to Datadog, sign up for a free trial.