Tuning the RAG Symphony: A guide to evaluating LLMs

Sebastian Wehkamp · Published in ML6team · Feb 27, 2024 · 14 min read

Over the last year, chatbots leveraging the capabilities of Large Language Models (LLMs) have become a very popular solution for applications like customer support tools, question answering, and fact checking. One approach gaining traction is the Retrieval Augmented Generation (RAG) architecture. It is the Swiss Army knife for LLMs: it combines a knowledge base with a generative model, tapping into up-to-date, domain-specific information to minimize errors. When a user asks a question, a smart retriever finds relevant information in a collection of documents, which the language model then uses to generate an answer. In this blogpost we will focus on evaluating the generator. In case you’d like to read more about what RAG is and how it works, we recommend reading this blogpost.

Simplified RAG architecture. Through leveraging our Smart Retriever, we can force our Generator to stick to the content of our knowledge base that is most relevant for answering the question. Source

Tuning a musical instrument requires precision and attention to detail, much like fine-tuning a RAG solution. Just as a musician adjusts the strings of their instrument to achieve perfect harmony, developers face numerous decisions when crafting a RAG solution: which retrieval method to use, how to chunk the documents for retrieval, which language model to use, and how to prompt or fine-tune that language model. The optimal decision varies depending on factors like the use-case, the type and amount of data being used, and the budget for resources.

An open challenge in this domain is evaluating the quality of the responses. With no industry standards defined, organizations resort to manual or ad-hoc evaluations by developers, A/B testing, or letting users grade the quality of the responses. This approach is time-consuming, requires expertise, and is hard to scale. As a result, it is difficult to assess the impact of changes, which in turn makes it harder to make the best design choices.

Similar to refining a performance with practice sessions, developers can enhance their RAG solutions through evaluation, visualization, and experimentation. This blogpost contains a guide on how to evaluate the generator in a RAG setting. We will go over various metrics that are relevant for RAG to gain insight into how the language model is performing, allowing you to evaluate the impact of changes. This lets you make the right decisions and iterate faster, increasing the performance of your solution. As closed language models (e.g. GPT-3.5-Turbo) are very popular in the world of RAG, we will focus on metrics that also work with black-box models and do not rely on, for example, raw token probabilities.

If you would like to know more about evaluating, and increasing the performance of the other high-level component in a RAG setting, the retriever, we would recommend reading this blogpost.

Dataset creation

For this blogpost we will consider a typical RAG setup, where upon receiving a question, the system initially fetches some contextual information, then utilizes this retrieved context to formulate an answer. In the development of RAG systems, human-annotated datasets or ground-truth answers are not always available. Hence, we will consider various metrics in this blogpost with different dataset requirements allowing you to select the best fitting metrics for your use-case.

In order to evaluate the language model on your own data, the first step is to collect a list of typical questions which your RAG solution should be able to answer. Though it is possible to generate these questions with another language model, it is important that they are representative of your use-case. Once a set of questions is available, we use the retriever of your RAG solution to retrieve the context for the answers, which means that the context also becomes part of your evaluation dataset. This makes the evaluation more robust and faster, while still allowing you to see the impact of changes.

With the question-context pairs set, we can use the current configuration of your RAG language model to generate the answers. These question-context-answer triples form the minimum dataset, which is the basis for most metrics. Some metrics require an additional component: the ground truth. Adding this reference answer can be a tedious task, but it opens up more possibilities and gives you more insight into how your solution is performing. Particularly for a productionized RAG this may be worth adding, as it can easily be reused during development to monitor performance.
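
To make this concrete, here is a minimal sketch of how such an evaluation dataset could be assembled; retrieve_context and generate_answer are placeholders for your own retriever and generator, and the JSONL output is just one convenient format.

```python
import json

def retrieve_context(question: str) -> list[str]:
    """Placeholder: call your RAG retriever here."""
    return ["<retrieved chunk 1>", "<retrieved chunk 2>"]

def generate_answer(question: str, contexts: list[str]) -> str:
    """Placeholder: call your RAG generator (LLM) with its current configuration."""
    return "<generated answer>"

questions = [
    "What is the warranty period for product X?",
    "How do I reset my password?",
]

dataset = []
for question in questions:
    contexts = retrieve_context(question)
    answer = generate_answer(question, contexts)
    dataset.append({"question": question, "contexts": contexts, "answer": answer})
    # Optionally add a hand-written "ground_truth" field per entry.

# Persist the evaluation dataset so every metric run uses the same triples.
with open("eval_dataset.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")
```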

Metrics

A common framework for evaluating a RAG solution is the RAG triad of metrics. The triad proposes three types of metrics which are relevant for evaluating your RAG solution.

  • Context Relevance: making sure that the retrieved context matches the user’s query.
  • Groundedness: checking if the generated response is factually correct with respect to the context.
  • Answer Relevance: verifying if the response accurately addresses the initial question.

The RAG evaluation triad. This blogpost will focus on Answer Relevance, Groundedness, and the Response. Source

As this blogpost focuses on evaluating the generator rather than the retriever, we will leave out the context relevance metrics. On the other hand, we will take a broader look than the framework initially suggests and also cover metrics that evaluate the quality of the response on its own. We therefore introduce five metric categories for this blogpost:

  • Answer relevance: metrics verifying if the response accurately addresses the initial question.
  • Groundedness: metrics attempting to measure how correct an answer is, if hallucinations occurred, and what the quality of the answer is.
  • Linguistic quality: metrics attempting to measure the writing style, grammar mistakes, and readability.
  • Rule-based: in a typical RAG solution, rules often get imposed on the language model. These metrics attempt to measure how often the language model adheres to them.
  • Usage: metrics measuring usage statistics such as costs, latency, and token usage.

Answer relevance

The concept of answer relevance entails that the answer should effectively respond to the posed question. An answer is considered relevant if it directly addresses the question in an appropriate way. This means that the metric does not take factuality into account; however, the score is penalized when the answer is incomplete or contains redundant information.

A common approach to implement this metric, proposed by RAGAS, measures the answer relevance by prompting a language model to generate n questions given the answer. Both the original question and the generated questions are embedded, after which the average cosine similarity is computed. The idea behind the metric is that if the generated answer is very relevant to the question posed, while not containing a lot of redundant information, it should be possible to deduce the question that was asked.

When implementing this metric it is important to take into account that RAGAS uses few-shot prompting to generate the questions. It might therefore be necessary to modify the examples provided in the prompt to be more similar to your use-case in order to get a more accurate measurement of the answer relevance.
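
For illustration, here is a minimal sketch of the scoring step, assuming the reverse-engineered questions have already been generated by prompting a language model with the answer; the embedding model used is just one possible choice.

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: these questions were generated by prompting an LLM with the answer.
original_question = "What is the warranty period for product X?"
generated_questions = [
    "How long is the warranty on product X?",
    "What warranty does product X come with?",
    "For how many years is product X covered?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedding model works
q_emb = model.encode(original_question, convert_to_tensor=True)
gen_emb = model.encode(generated_questions, convert_to_tensor=True)

# Answer relevance: mean cosine similarity between original and generated questions.
answer_relevance = util.cos_sim(q_emb, gen_emb).mean().item()
print(f"Answer relevance: {answer_relevance:.2f}")
```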

An alternative way of measuring the answer relevance is proposed by ARES. They suggest fine-tuning an LLM judge, specifically for your use-case, to measure the answer relevance. According to the authors this results in a better performing solution, however it might be more time consuming to implement due to the dataset requirements.

Groundedness

These metrics measure how correct an answer is, if hallucinations occurred, and what the quality of the answer is.

Faithfulness

RAG systems are often used in applications where factual consistency with respect to the provided context is highly important. Faithfulness refers to the idea that the answer should be grounded in the given context. An answer is considered faithful if the claims made in the answer can be inferred from the context.

To measure the faithfulness, RAGAS proposes a two-step approach. The first step is to prompt a language model to extract the statements made in the generated answer. The second step is a verification step in which the language model is prompted to judge whether these statements can also be inferred from the given context. The faithfulness score is then the ratio of the number of supported statements to the total number of statements extracted from the answer. Similar to the answer relevance, RAGAS uses few-shot prompting in its implementation. To get the most accurate results it might be necessary to update the prompt to be more relevant for your use-case.
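
As a rough sketch of that second step and the final score, assuming the statements have already been extracted and that llm_supports is a hypothetical helper wrapping the verification prompt:

```python
def llm_supports(statement: str, context: str) -> bool:
    """Placeholder: prompt your LLM to judge whether the statement can be
    inferred from the context, and parse a yes/no verdict."""
    return True  # replace with an actual LLM call

def faithfulness(statements: list[str], context: str) -> float:
    """Ratio of supported statements to the total number of extracted statements."""
    if not statements:
        return 0.0
    supported = sum(llm_supports(s, context) for s in statements)
    return supported / len(statements)
```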

This metric is also implemented in the ARES framework, where they fine-tune an LLM judge on your data measuring the faithfulness. Depending on the use-case it might be worth the additional effort of implementing this approach to get more accurate results.

Semantic similarity

A simple performance metric is to check how similar the generated answer is to the ground truth: you want your answer to be as close to the truth as possible. Keep in mind that this comparison requires the ground truth to be part of your dataset.

There are two common strategies for measuring semantic similarity: with a bi-encoder or with a cross-encoder. In a bi-encoder setup, each text is embedded on its own, after which the similarity between the embeddings is measured. A cross-encoder takes both texts as input at once and directly outputs a similarity score. As the cross-encoder has access to both the generated answer and the ground truth at inference time, it is capable of producing more accurate results.

Comparison between Cross-Encoder and Bi-Encoder.
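
Both strategies can be implemented with, for example, the sentence-transformers library; the model names below are common defaults, not a recommendation.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

answer = "The warranty on product X lasts two years."
ground_truth = "Product X comes with a two-year warranty."

# Bi-encoder: embed each text independently, then compare the embeddings.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb_answer = bi_encoder.encode(answer, convert_to_tensor=True)
emb_truth = bi_encoder.encode(ground_truth, convert_to_tensor=True)
bi_score = util.cos_sim(emb_answer, emb_truth).item()

# Cross-encoder: score both texts in a single forward pass.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
cross_score = cross_encoder.predict([(answer, ground_truth)])[0]

print(f"Bi-encoder similarity: {bi_score:.2f}, cross-encoder similarity: {cross_score:.2f}")
```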

Factual similarity

The last metric for the Groundedness section is factual similarity. The idea of factual similarity is to measure how similar the claims presented in the generated answer are to the claims presented in the ground truth.

This metric is implemented in RAGAS as part of answer correctness, and not as a separate metric. Using few-shot prompting, a language model is asked to detect three types of claims:

  • True positive (TP): claims present in both the answer and the ground truth
  • False positive (FP): claims present in the answer but not found in the ground truth
  • False negative (FN): claims found in the ground truth but missing in the answer

Factual similarity formula: factual similarity = TP / (TP + 0.5 × (FP + FN)), i.e. the F1 score over the extracted claims

Just like with the other metrics implemented in RAGAS, updating the prompt and its few-shot examples might be needed to make this metric work reliably. Interestingly, RAGAS chose not to implement this metric on its own but only as part of answer correctness, which is the weighted average of the semantic and factual similarity. However, when doing an extensive evaluation of the language model in your RAG application, having both metrics separately might provide more insight.
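
Given the claim counts and a semantic similarity score, the combination could look like the sketch below; the 0.75/0.25 weighting mirrors the RAGAS default at the time of writing, but treat it as an assumption and tune it to your use-case.

```python
def factual_similarity(tp: int, fp: int, fn: int) -> float:
    """F1 score over the claims extracted from the answer and the ground truth."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0

def answer_correctness(tp: int, fp: int, fn: int, semantic_similarity: float,
                       factuality_weight: float = 0.75) -> float:
    """Weighted average of factual and semantic similarity."""
    factual = factual_similarity(tp, fp, fn)
    return factuality_weight * factual + (1 - factuality_weight) * semantic_similarity

# Example: 3 claims in both, 1 only in the answer, 1 only in the ground truth.
print(answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.85))
```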

Linguistic quality

The metrics in this section attempt to measure the standalone quality of the response generated.

Readability

Gaining insight into how readable the output text of your RAG application is can be very important depending on the use-case. For example, if you are building a customer helpdesk application you would like the output to be simple and easy to understand, while it is fine to output more complex text for an internal RAG application used by domain experts. Readability scores exist to give a measure of how easy or difficult a text is to read. They are, for example, used by publishers to determine the appropriate target audience for books, articles, and other written materials, ensuring that the content is accessible and engaging to the intended readership.

Various ways of calculating a readability score exist; a very common one is the Flesch reading-ease score, part of the Flesch-Kincaid readability tests. This measure scores a text based on its average number of words per sentence and syllables per word. Higher scores indicate material that is easier to read; lower scores mark passages that are more difficult to read.

Formula for the Flesch reading-ease score: 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)

When using Python, this score, and many other readability scores, are available through the textstat package. Note that the scoring formula can change depending on the language.
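
A minimal example using textstat:

```python
import textstat

answer = (
    "Your order will arrive within three business days. "
    "You can track it using the link in your confirmation email."
)

textstat.set_lang("en")  # the formula differs per language
print("Flesch reading ease:", textstat.flesch_reading_ease(answer))
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(answer))
```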

Grammatical Error Rate

The Grammatical Error Rate (GER) gives an indication of how grammatically accurate sentences are. This becomes particularly important if your RAG application utilizes a language for which the LLM was not explicitly trained. In cases where the training dataset lacks sufficient examples of your desired language, the LLM may not have fully learned its grammatical structures. A high GER indicates that the language model struggles with the grammatical structures.

To detect spelling and style errors you can use existing checkers such as LanguageTool, the open-source grammar and spell checker that also integrates with LibreOffice, for which Python wrappers exist. The GER can be defined as the ratio of sentences containing a mistake to the total number of sentences in the answer. When using LanguageTool, keep in mind that by default it outputs all suggestions, which means that a small style suggestion carries as much weight as a genuine error. It is possible to enable and disable individual rules if you want to change this behavior.
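
A minimal sketch using the language_tool_python wrapper; the naive sentence split is a simplification, so swap in a proper sentence tokenizer for real data.

```python
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def grammatical_error_rate(answer: str) -> float:
    """Ratio of sentences containing at least one LanguageTool match."""
    # Naive sentence split; use a real sentence tokenizer in practice.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    with_errors = sum(1 for s in sentences if tool.check(s))
    return with_errors / len(sentences)

print(grammatical_error_rate("He go to the store. The store was closed."))
```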

Aspect critique

During RAG development you sometimes want to evaluate a specific aspect of your language model output that is hard to measure explicitly. Examples are maliciousness, friendliness, or coherence. To evaluate these you can use a language model to grade the answers. It is recommended to use one- or few-shot prompting to give the language model a clear indication of how the grade should be assigned. The grading scale can be binary or integer-valued, depending on the aspect and the desired outcome. RAGAS implements this metric using a one-shot prompt and a binary grading scale for the following aspects: harmfulness, maliciousness, coherence, correctness, and conciseness.
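
A sketch of such an LLM judge for a single aspect, here using the OpenAI chat API as an example backend; the prompt and the binary parsing are simplified assumptions.

```python
from openai import OpenAI

client = OpenAI()

CRITIQUE_PROMPT = """You are grading an answer on the aspect of conciseness.
Example:
Answer: "Paris. It is the capital of France, a country in Europe, which..."
Grade: 0 (contains redundant information)

Now grade the following answer with 1 (concise) or 0 (not concise).
Answer: "{answer}"
Grade:"""

def critique_conciseness(answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": CRITIQUE_PROMPT.format(answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip()
    return 1 if verdict.startswith("1") else 0
```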

Lexical diversity

The Lexical Diversity Score (LDS) serves as a metric for assessing the vocabulary richness of language model responses. It is determined by dividing the number of unique words in a response by its total word count.

LDS evaluates the variety of words utilized in the generated text, offering insight into the model’s capacity to express itself using diverse vocabulary and to steer clear of repetition. High LDS scores suggest the model’s potential for creativity and uniqueness in its responses.
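
Computing the score is straightforward; the whitespace tokenization and lower-casing below are simplifying assumptions.

```python
def lexical_diversity(text: str) -> float:
    """Unique words divided by the total word count."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

print(lexical_diversity("The order was shipped and the order will arrive soon."))
```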

Rule-based

When building a RAG solution, the prompt often contains a set of hardcoded rules to which the model should adhere. This works well with short and simple prompts, but when the prompt increases in size and/or complexity, the language model may not always adhere to the rules, so testing adherence is needed. The scoring for all of the rules is similar: based on whether the rule condition is met, a binary label is assigned, and the final score is the ratio of answers meeting the rule condition to the total number of answers. Although the rule conditions vary a lot depending on the use-case, there are some common ones such as the following:

  • Language: when building a RAG application in a non-English language, the prompt often contains a rule to answer in a certain language. This can easily be verified by checking the language(s) present in the answer.
  • Max output length: a common condition is a rule stating something like: “ALWAYS answer in less than 5 sentences”.

Depending on your use-case and prompt you can implement and evaluate various rules, as in the sketch below. This way it is possible to get some insight into how strictly the language model adheres to them.
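
Here is a sketch of both example rules, using the langdetect package for the language check; the target language, the five-sentence limit, and the naive sentence split are assumptions chosen to match the examples above.

```python
from langdetect import detect

def follows_language_rule(answer: str, expected_lang: str = "nl") -> bool:
    """Rule: the answer must be written in the expected language."""
    return detect(answer) == expected_lang

def follows_length_rule(answer: str, max_sentences: int = 5) -> bool:
    """Rule: 'ALWAYS answer in less than 5 sentences' (naive sentence split)."""
    sentences = [s for s in answer.split(".") if s.strip()]
    return len(sentences) < max_sentences

answers = ["Het antwoord staat in hoofdstuk drie.", "The answer is in chapter three."]
language_score = sum(follows_language_rule(a) for a in answers) / len(answers)
print(f"Language rule adherence: {language_score:.2f}")
```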

Usage

Especially when switching between language models, gathering and evaluating usage statistics becomes very important. It is best to keep track of these while generating the answers for your dataset; since development and evaluation are often iterative, this can easily be integrated into the workflow.

Latency

This represents the overall time it takes for the model to generate the full response for a user. It can be split into two sub-metrics: Time to First Token (TTFT), which is how quickly the user starts seeing model output after querying, and Time Per Output Token (TPOT), the time it takes to generate each output token. The latency is then: TTFT + TPOT × (number of tokens generated).

Often, measuring only the overall latency is already sufficient to get some insight. Depending on the use-case, for example when response streaming is used, measuring the TTFT and TPOT separately becomes relevant.
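
Below is a sketch of measuring TTFT, TPOT, and total latency with a streaming OpenAI call; any streaming client works, and the model name is just an example.

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the warranty period for product X?"}],
    stream=True,
)

ttft = None
num_chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.perf_counter() - start  # Time to First Token
        num_chunks += 1  # roughly one chunk per output token

latency = time.perf_counter() - start
tpot = (latency - ttft) / max(num_chunks - 1, 1)  # Time Per Output Token
print(f"TTFT: {ttft:.2f}s, TPOT: {tpot:.3f}s, total latency: {latency:.2f}s")
```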

Costs

Keeping track of the total cost of the generated answers can be very important for making the right design decisions during development. Given the large cost differences between language models, tracking this can result in significant future cost savings.
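
When using an API-based model, the token counts returned with each response can simply be multiplied by the provider's prices. The prices in the sketch below are placeholders, not actual rates.

```python
# Placeholder prices per 1,000 tokens; look up the current rates of your provider.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (completion_tokens / 1000) * PRICE_PER_1K_OUTPUT

# With the OpenAI client, token counts are available on response.usage:
# cost = request_cost(response.usage.prompt_tokens, response.usage.completion_tokens)
print(request_cost(prompt_tokens=1200, completion_tokens=150))
```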

Visualization

With the dataset created, the metrics chosen, and the evaluation executed, it is time for the last step: visualization. Although having the raw metric results is already useful, visualizing them gives a much better overview of how your model improved compared to the previous iteration. One intuitive way of doing this is through a radar chart, which is ideal for metric comparison because it allows multiple metrics to be visualized simultaneously. With each metric represented along a separate axis, it is easy to compare performance or characteristics across different configurations, making it a quick way to identify strengths, weaknesses, and patterns across the various metrics.

Example of a Radar chart for the RAGAS metrics. Source

This radar chart can be created per entry in your dataset or for the final, averaged scores. Keep in mind that if you are evaluating metrics with differing scoring scales, you will also need to use different axis scales in your radar plot (or normalize the scores beforehand).
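
A minimal matplotlib sketch of such a radar chart, assuming the averaged metric scores have already been normalized to the 0-1 range:

```python
import numpy as np
import matplotlib.pyplot as plt

# Averaged metric scores, assumed to be scaled to [0, 1].
metrics = ["answer relevance", "faithfulness", "semantic similarity",
           "readability", "rule adherence"]
scores = [0.82, 0.91, 0.74, 0.65, 0.95]

# Close the polygon by repeating the first value.
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
plt.show()
```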

Experiment tracking

An optional component which can improve your evaluation setup even further is to add experiment tracking. With experiment tracking you keep track of all requests made to the language model, the answers generated, and the metric scores assigned to each request. A nice way to do this is to set up LangFuse and configure your metric scores as custom scores. This allows you to:

  • Display the scores on each trace to provide a quick overview
  • Segment all execution traces by score to e.g. find all traces with a low quality score
  • Report detailed scores with drill-downs into use cases and user segments

Example of the RAGAS metrics added to a trace view. Source

End-to-end evaluation

Errors made in the retrieval step might also result in errors in the generation step. However, errors occurring in generation do not affect retrieval. Evaluating and monitoring a RAG system end-to-end is still uncommon, but it can be crucial when you want to monitor a live system in order to make the right changes as the underlying data changes. While this topic is best suited for a follow-up blogpost, it is important to briefly acknowledge it here.

Conclusion

There you have it! Make sure to select metrics that align with your objectives and provide meaningful insights into the performance of your RAG solution. Furthermore, adhere to standard best practices, such as splitting your dataset, especially when making improvements based on evaluations. This precaution helps guard against overfitting and ensures that your enhancements generalize well to real-world scenarios.

With visualizations like radar charts and the addition of experiment tracking, it’s like adding some funky lights and special effects to your performance. You can see where you’re shining and where you need a little more practice.

Evaluating your RAG solution is like fine-tuning a musical instrument — it’s all about hitting the right notes.
