LLMs like GPT-4, Claude, and Llama power popular tools such as intelligent assistants, customer service chatbots, and natural language query interfaces. These solutions are incredibly useful, but they are constrained by the information they were trained on. This often means that LLM applications are limited to providing generic responses that lack proprietary or context-specific knowledge, reducing their usefulness in specialized settings. For example, a customer interacting with a financial services company’s chatbot to ask about the status of their recent loan application, or for specific details about their account, may only receive generic responses about loan processing timelines or account management. That’s likely because the LLM lacks access to customer-specific information, recent transaction details, or real-time updates from the company’s systems.
To address these limitations, organizations often integrate retrieval-augmented generation (RAG) into their LLM applications. RAG enhances the accuracy and relevance of responses by retrieving information from external datasets beyond an LLM’s preexisting knowledge base, such as websites, financial databases, company policy guidelines, and handbooks. By incorporating that data into the prompt-response cycle, models can generate responses enriched with context from unique data sources. For instance, by implementing RAG, our earlier chatbot example can now access the most recent policy documents from the company’s database in real time, providing up-to-date and accurate responses instead of relying on the old information it was originally trained on.
While RAG enhances LLMs, developers and AI engineers still face certain challenges when building these systems. Integrating all of the retrieval and generation components of RAG-based systems at scale introduces complexity in managing latency, ensuring the relevance of retrieved data, and maintaining model accuracy, all while handling large volumes of diverse information in real time. With so many moving pieces, it can be difficult to identify where and when something has gone wrong.
In this post, we’ll explore how to mitigate some of the common challenges you may face by:
- Chunking and choosing the right embedding model to reduce latency
- Implementing hybrid search to limit irrelevant or inaccurate responses
- Filtering with metadata to exclude outdated information
- Scanning prompts and responses to prevent accidental exposure of sensitive data
Chunking and choosing the right embedding model to reduce latency
RAG-based systems retrieve relevant information from a knowledge source or database at inference time, so retrieval needs to be fast. High retrieval latency adds directly to your application’s overall response time, resulting in slow responses, degraded performance, and a negative user experience.
You may, for example, receive complaints that your application is slow to respond to prompts, frustrating users to the point that they close their sessions and leave the application.
There are a few steps you can take to mitigate the issue, starting with your choice of embedding model. Embeddings are numerical representations of text and other data types that LLMs use to capture the semantic and contextual relationships between different pieces of information. In an embedding model, the vectors representing words such as “cat” and “dog” might sit closer to each other than either does to the vector for “rocketship,” because cats and dogs are both animals.

When choosing embedding models, engineers are often tempted to select those with the highest dimensionality, since higher-dimensional embeddings capture more context and finer-grained semantic detail. However, as dimensionality increases, so does the amount of data that needs to be processed, which increases latency and drives up costs. By choosing lower-dimensional embeddings, you can reduce the computational load and speed up retrieval without significantly sacrificing accuracy, leading to faster retrieval and better overall performance.
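To make this concrete, here’s a minimal sketch using the sentence-transformers library with the compact, 384-dimensional all-MiniLM-L6-v2 model (an illustrative choice, not a recommendation). It embeds a few words and compares their cosine similarities, which is the same comparison a vector retriever performs at much larger scale:

```python
# Minimal sketch: a compact, lower-dimensional embedding model still captures
# useful semantic relationships. Assumes the sentence-transformers package.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions, fast to encode

words = ["cat", "dog", "rocketship"]
embeddings = model.encode(words, convert_to_tensor=True)

# "cat" vs. "dog" should score noticeably higher than "cat" vs. "rocketship"
print("cat vs. dog:       ", util.cos_sim(embeddings[0], embeddings[1]).item())
print("cat vs. rocketship:", util.cos_sim(embeddings[0], embeddings[2]).item())
```

When you benchmark candidate models, measure both retrieval quality on your own queries and end-to-end encoding and search latency; a smaller model that is good enough on relevance often wins on response time.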
You should also consider chunking large documents or inputs into smaller, manageable parts that fit within token limits. By processing and retrieving smaller, contextually meaningful chunks, models can locate the most relevant information faster, improving overall response times. Selecting models with smaller context windows can also help improve latency, since they have less data to process per request.
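As a rough illustration of chunking, the sketch below splits a document into overlapping, word-based pieces before they are embedded and indexed. The chunk size and overlap values are arbitrary placeholders; in practice you would tune them for your content and likely split on sentences or model tokens rather than words:

```python
# A simple word-based chunker (illustrative only). Real pipelines often split on
# sentences or model tokens, but the sliding-window-with-overlap idea is the same.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        window = words[start : start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk is embedded and indexed separately, so a query retrieves only the
# most relevant slices of a long document instead of the entire thing.
long_document = "..."  # e.g., a policy handbook or product manual
chunks = chunk_text(long_document)
```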
Implementing hybrid search to limit irrelevant or inaccurate responses
The quality of responses your RAG system generates relies heavily on the relevance of the documents retrieved when it receives a query. If, for example, a customer service chatbot for a clothing brand’s e-commerce site is asked about sales on jackets and then erroneously retrieves documents about sales on jackets and shirts, it will return irrelevant information outside of the scope of the original query. Alternatively, your system might retrieve the correct documents but still generate factually inaccurate responses.
You can improve your system’s document retrieval mechanism by implementing hybrid search. Hybrid search combines term-based retrieval, which looks for exact keyword matches, with vector-based retrieval, which uses embeddings to capture the meaning or semantics of both queries and documents. By using both retrieval methods, your system can filter the list of possible documents down to a manageable volume and then rerank the remaining documents to reorder search results based on semantic similarity. Breaking the list down into a smaller set with term-based retrieval improves the efficiency of vector-based retrieval, allowing the system to perform detailed similarity comparisons without overwhelming resources.
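As a rough sketch of this two-stage approach, the example below uses the rank_bm25 package for term-based retrieval and sentence-transformers embeddings to rerank the BM25 shortlist by semantic similarity. The documents, query, and cutoff values are illustrative placeholders:

```python
# Hybrid search sketch: BM25 narrows the corpus, then embeddings rerank the
# shortlist. Assumes the rank_bm25 and sentence-transformers packages.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

documents = [
    "Winter jackets are 30 percent off through Sunday.",
    "New spring shirts have arrived in all sizes.",
    "Our return policy allows exchanges within 60 days.",
]

# Stage 1: term-based retrieval to cheaply shortlist candidate documents.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
query = "are jackets on sale?"
scores = bm25.get_scores(query.lower().split())
shortlist = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: vector-based reranking of only the shortlisted documents.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = model.encode(query, convert_to_tensor=True)
candidate_embs = model.encode([documents[i] for i in shortlist], convert_to_tensor=True)
similarities = util.cos_sim(query_emb, candidate_embs)[0]
reranked = [shortlist[int(i)] for i in similarities.argsort(descending=True)]
print([documents[i] for i in reranked])
```

In production, both stages typically run inside a search engine or vector database that supports hybrid queries, but the division of labor is the same: cheap lexical filtering first, more expensive semantic scoring second.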
Filtering with metadata to exclude outdated information
It’s crucial that the data your RAG system retrieves stays fresh and up-to-date. This is particularly important in cases where systems are expected to deliver data that may be updated in real time, such as breaking news and stock market reports. If a system relies on outdated information, it not only loses its key advantage but can also lose the trust of its users. For example, if users rely on a RAG system to deliver breaking news but consistently receive stale information, they will leave for more reliable news sources.
You can mitigate the risk of retrieving outdated information by implementing effective metadata filtering, which uses structured data descriptors (such as tags, timestamps, categories, and authors) to refine and target searches. Metadata filtering is crucial for retrieving contextually relevant information at scale, as it narrows results to meet specific criteria (like recent dates or particular topics) and so enhances response quality. You can further lower the risk of serving stale data by establishing data pipelines that periodically refresh your knowledge sources and version them properly, ensuring the data your system retrieves stays current and reliable.
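The snippet below sketches the idea with a plain in-memory example: each document carries metadata (a timestamp and a category), and only documents that pass the filter are considered for similarity scoring. In a real system you would express the same constraints in your vector database’s metadata query syntax rather than in application code, and the field names here are made up for illustration:

```python
# Metadata filtering sketch: restrict candidates by freshness and topic before
# ranking them. Field names ("published_at", "category") are illustrative.
from datetime import datetime, timedelta

documents = [
    {"text": "Q3 earnings beat expectations.", "category": "markets",
     "published_at": datetime(2024, 10, 24)},
    {"text": "Q1 earnings summary.", "category": "markets",
     "published_at": datetime(2024, 4, 25)},
    {"text": "New hiking trails open downtown.", "category": "local",
     "published_at": datetime(2024, 10, 20)},
]

def filter_candidates(docs, category, max_age_days=30, now=None):
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [d for d in docs
            if d["category"] == category and d["published_at"] >= cutoff]

# Only fresh, on-topic documents are passed along to embedding-based ranking.
candidates = filter_candidates(documents, category="markets",
                               now=datetime(2024, 10, 28))
print([d["text"] for d in candidates])  # -> ["Q3 earnings beat expectations."]
```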
Scanning prompts and responses to prevent accidental exposure of sensitive data
Without the proper guardrails, RAG systems can unintentionally retrieve and expose sensitive data that users should not be able to access, creating a serious security and privacy issue. This exposure can stem from a number of factors, including weak access controls, insufficient query filtering, and the inclusion of sensitive information in training data.
Your first line of defense against these vulnerabilities is to put tight permission and access controls in place to restrict access to sensitive information. For instance, let’s say you use Elasticsearch as the retrieval engine of your RAG system. With Elasticsearch, you can define roles that specify permissions and ensure your system only retrieves documents that users have read access to.
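Here’s a rough sketch of what that could look like with the Python Elasticsearch client. The role, index, and field names are illustrative, and the document-level security filter shown in the query clause requires an appropriate Elasticsearch license tier:

```python
# Sketch of restricting what a RAG retriever can read from Elasticsearch.
# Role names, index patterns, and the "visibility" field are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # your cluster and auth

# A read-only role scoped to the knowledge-base indices. The optional "query"
# clause adds document-level security, so the role only sees matching documents.
es.security.put_role(
    name="rag_retriever",
    indices=[
        {
            "names": ["policy-docs-*", "public-faq-*"],
            "privileges": ["read"],
            "query": {"term": {"visibility": "public"}},
        }
    ],
)
```

The key idea is that the retriever’s credentials, not the contents of the prompt, determine which documents can ever reach the model.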
“Security & Safety” detections in Datadog LLM Observability will scan LLM prompts and responses for signs of data leakage and attempted prompt injections. In the screenshot below, a user triggered the Prompt Injection Scanner by asking a chatbot to return an admin password. You can identify the bad actor who attempted this breach and take appropriate security measures, such as logging the attempt, blocking the user, or escalating the issue to the security team for further investigation and mitigation.
Start troubleshooting your RAG-based LLMs today
In this post, we looked at some of the challenges AI developers and engineers face when building out RAG-based LLMs along with some steps you can take to overcome those challenges. And with Datadog LLM Observability, you can identify when your RAG system experiences high latencies, fails to respond accurately, and encounters other issues connected to these challenges so that you can quickly begin troubleshooting. To learn more about LLM Observability, check out our documentation.
And if you aren’t already using Datadog, sign up today for a 14-day free trial.