Evaluating prompts and responses
The information and exercises here draw on material developed for Arizona State University’s online undergraduate course on prompt engineering using ChatGPT. They concentrate on two critical aspects of prompt engineering: evaluation and metrics. A thorough understanding of how to assess the performance of prompts is vital to ensure their effectiveness and reliability in various applications.
These exercises will help you develop the tools and knowledge necessary to evaluate and measure the success of your prompt development and crafting efforts. They work best with GPT-4, accessed through the paid ChatGPT Plus service from OpenAI.
Introduction
The following video is part of ASU’s Introduction to Prompt Engineering course, and introduces the concept of prompt and response evaluation, as well as the RACCCA framework.
Overview
These exercises concentrate on two critical aspects of prompt engineering: evaluation and metrics. A thorough understanding of how to assess the performance of prompts is vital to ensure their effectiveness and reliability in various applications. Through the exercises, you will develop the tools and knowledge necessary to evaluate and measure the success of your prompt development and crafting efforts.
First, you will explore the significance of prompt response evaluation techniques. You will learn why evaluating ChatGPT-generated responses is important for refining your prompts and ensuring their effectiveness in various contexts.
Next, you will be introduced to the RACCCA framework for evaluating prompt responses. This framework was developed for this course. It isn’t the only one, but it is a useful starting point. RACCCA stands for Relevance, Accuracy, Completeness, Clarity, Coherence, and Appropriateness.
As we progress, you will develop an understanding of metrics-based evaluation. You will learn how to use numeric scores to measure the success of your prompts. This structured approach will help you make informed decisions about further improvements and assess the usefulness of your prompts, all while remembering that although you are using numbers, your evaluations are still subjective.
These exercises will also guide you in applying the insights gained from evaluation and metrics to refine your prompts in an iterative manner. This process will ensure that your prompts continually improve and become more effective and reliable over time.
Finally, we will focus on the comparative evaluation of prompts. You will learn how to leverage prompt response evaluation and metrics to compare different prompts focused on similar outputs. This skill will help you make informed decisions when choosing the most effective prompt for a given situation.
By exploring these key aspects, you will gain a deeper understanding of prompt evaluation methodologies and metrics, which will enable you to create and refine prompts that are effective and reliable across various applications.
Exercise: Exploring prompt evaluation and metrics with ChatGPT
This exercise is designed to allow you to explore approaches to prompt evaluation and metrics by playing and experimenting with ChatGPT.
Note: In machine learning and machine learning-based prompt engineering, there are a number of formal methods for evaluating the quality of AI responses to questions. These include BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and METEOR (Metric for Evaluation of Translation with Explicit ORdering). In the exercise below, ChatGPT may try to teach you about these. Please note that they are beyond the scope of this class, and you do not need to learn about them here!
EXERCISE:
1. Open a session with ChatGPT, making sure that you are in GPT-4 mode.
2. Craft and use a prompt that asks ChatGPT to teach you about evaluating the quality of prompts when using ChatGPT and metrics that may be used to do this. At this stage you should have sufficient skill to do this without further instruction.
3. If your initial prompt doesn’t lead to a useful session with ChatGPT, try different versions of the prompt until you get one that initiates a learning session that is both relevant and at the right level (you will find that some prompts lead to ChatGPT covering material that is not relevant to this course).
By the end of this exercise you should have a better understanding of the importance of iteratively redesigning prompts in ways that improve the quality of responses from ChatGPT.
Exercise: Response Evaluation Criteria
There are a number of factors that come into play when evaluating the quality of responses from ChatGPT to a particular prompt. In the context of this exercise we are focused on developing the professional skills that allow the quality of responses to be assessed so that prompts can be refined and improved.
In this exercise you will explore six specific dimensions of response quality:
1. Relevance: The extent to which the response directly addresses the issue or question.
2. Accuracy: The degree to which the response provides correct, reliable, and fact-based information.
3. Completeness: The degree to which the response covers all essential aspects of the topic or question being asked.
4. Clarity: How easily the response can be understood by the intended audience.
5. Coherence: The extent to which the response is logically structured and well-organized and flows smoothly from one point to another.
6. Appropriateness: How well the response aligns with the intended audience and context and is suitable and respectful in tone and content.
We’ll refer to this as the RACCCA framework — it was introduced in the video at the top of this page.
EXERCISE:
This assignment is complex, so please pay attention!
The assignment is split into three parts, and is designed to develop your understanding of how to improve the quality of ChatGPT responses by applying the RACCCA framework.
• Part 1 is designed to demonstrate how the RACCCA framework can be used to identify disconnects between a prompt and a response. It does this by artificially forcing ChatGPT to give a less than ideal response to a prompt.
• Part 2 examines how disconnects between a prompt and a response can flag up issues, even when the response is technically strong.
• Part 3 looks at how the RACCCA framework can be used to improve prompts.
Part 1
1.1 Come up with a clear and precise question-based prompt. For instance, “How, when, and where did dogs first become domesticated?” or “How do you replace a blocked p-trap on a sink?” Be creative.
1.2 Open a new ChatGPT session, making sure you are in GPT-4 mode.
1.3 Use the following prompt, substituting [question] with your prompt from above: “Hi ChatGPT. I would like you to answer this question in a very imprecise and confusing way: [question]”
1.4 Make a copy of your original prompt and ChatGPT’s response.
1.5 Open a new ChatGPT session using GPT-4.
1.6 Use the following prompt, substituting [question] and [response] with your prompt from above and ChatGPT’s response from above:
“Hi ChatGPT. In response to the question [question] I got the following response: [response]. Please assess this response in terms of: Relevance (The extent to which the response directly addresses the issue or question); Accuracy (The degree to which the response provides correct, reliable, and fact-based information); Completeness (The degree to which the response covers all essential aspects of the topic or question being asked); Clarity (How easily the response can be understood by the intended audience); Coherence (The extent to which the response is logically structured and well-organized and flows smoothly from one point to another); and Appropriateness (How well the response aligns with the intended audience and context and is suitable and respectful in tone and content).”
To make things easier, you can use this seed conversation.
1.7 Pay attention to ChatGPT’s evaluation.
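If you find yourself repeating step 1.6 many times, the substitution of [question] and [response] can be sketched in a few lines of Python. This is only an illustration of how the template is filled in: the function name `build_evaluation_prompt` is hypothetical, the definitions are copied from the RACCCA list above, and you would still paste the resulting text into a new ChatGPT session yourself.

```python
# Definitions copied from the RACCCA list in this exercise.
RACCCA = {
    "Relevance": "The extent to which the response directly addresses the issue or question",
    "Accuracy": "The degree to which the response provides correct, reliable, and fact-based information",
    "Completeness": "The degree to which the response covers all essential aspects of the topic or question being asked",
    "Clarity": "How easily the response can be understood by the intended audience",
    "Coherence": "The extent to which the response is logically structured and well-organized and flows smoothly from one point to another",
    "Appropriateness": "How well the response aligns with the intended audience and context and is suitable and respectful in tone and content",
}


def build_evaluation_prompt(question, response):
    """Fill the evaluation template from step 1.6 (hypothetical helper)."""
    items = [f"{name} ({definition})" for name, definition in RACCCA.items()]
    # Join as "A; B; ...; and F" to mirror the wording of the template.
    criteria = "; ".join(items[:-1]) + "; and " + items[-1]
    return (
        f"Hi ChatGPT. In response to the question {question} "
        f"I got the following response: {response}. "
        f"Please assess this response in terms of: {criteria}."
    )


prompt = build_evaluation_prompt(
    "How, when, and where did dogs first become domesticated?",
    "<paste ChatGPT's response here>",
)
print(prompt)
```

The same helper works unchanged for the evaluation steps in Parts 2 and 3, since they reuse the same template.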
Part 2
2.1 Come up with another clear and precise question-based prompt. Again, feel free to be creative.
2.2 Open a new ChatGPT session, making sure you are in GPT-4 mode.
2.3 Ask ChatGPT the following, substituting [prompt] with your prompt from above: “Hi ChatGPT. I’d like you to take the prompt [prompt] and make it extremely general and non-specific, while sticking with the main focus of it.” Make a note of the “downgraded” prompt.
2.4 Open a new ChatGPT session, making sure you are in GPT-4 mode.
2.5 Ask ChatGPT the downgraded prompt from above. Make a copy of the response.
2.6 Open a new ChatGPT session, making sure you are in GPT-4 mode.
2.7 Use the following prompt, substituting [question] with your original prompt from above (not the downgraded one) and [response] with ChatGPT’s response from above:
“Hi ChatGPT. In response to the question [question] I got the following response: [response]. Please assess this response in terms of: Relevance (The extent to which the response directly addresses the issue or question); Accuracy (The degree to which the response provides correct, reliable, and fact-based information); Completeness (The degree to which the response covers all essential aspects of the topic or question being asked); Clarity (How easily the response can be understood by the intended audience); Coherence (The extent to which the response is logically structured and well-organized and flows smoothly from one point to another); and Appropriateness (How well the response aligns with the intended audience and context and is suitable and respectful in tone and content).”
To make things easier, you can use this seed conversation.
2.8 Pay attention to ChatGPT’s evaluation.
Part 3
3.1 Come up with a prompt for ChatGPT. As usual, feel free to be creative.
3.2 Open a new ChatGPT session, making sure you are in GPT-4 mode.
3.3 Ask ChatGPT the following prompt, substituting [prompt] for your prompt from above:
“Hi ChatGPT. Please provide me with five alternatives to the prompt [prompt] that will substantially increase the quality of your response based on the RACCCA framework. The RACCCA framework refers to Relevance (The extent to which the response directly addresses the issue or question); Accuracy (The degree to which the response provides correct, reliable, and fact-based information); Completeness (The degree to which the response covers all essential aspects of the topic or question being asked); Clarity (How easily the response can be understood by the intended audience); Coherence (The extent to which the response is logically structured and well-organized and flows smoothly from one point to another); and Appropriateness (How well the response aligns with the intended audience and context and is suitable and respectful in tone and content).”
To make things easier, you can use this seed conversation.
3.4 Select the alternative that most resonates with you.
3.5 Open a new ChatGPT session, making sure you are in GPT-4 mode.
3.6 Provide ChatGPT with the alternative prompt from above, and make a copy of the response.
3.7 Use the following prompt, substituting [question] and [response] with the prompt and ChatGPT’s response from above:
“Hi ChatGPT. In response to the question [question] I got the following response: [response]. Please assess this response in terms of: Relevance (The extent to which the response directly addresses the issue or question); Accuracy (The degree to which the response provides correct, reliable, and fact-based information); Completeness (The degree to which the response covers all essential aspects of the topic or question being asked); Clarity (How easily the response can be understood by the intended audience); Coherence (The extent to which the response is logically structured and well-organized and flows smoothly from one point to another); and Appropriateness (How well the response aligns with the intended audience and context and is suitable and respectful in tone and content).”
To make things easier, you can use this seed conversation.
3.8 Pay attention to ChatGPT’s evaluation.
Exercise: Response Evaluation Metrics
This exercise introduces you to the idea of numerically assessing the quality of responses to prompts using a 5-point scale.
It’s important to note that these assessments are qualitative: there is no quantitative basis for the score you give a response, other than your own reasoning, understanding, and intuition. However, this approach does make it easier to gauge when a prompt leads to more or less useful responses.
In this exercise we explore how the quality of a response from ChatGPT can be quantified. This is useful in evaluating responses to different prompts (which we will cover in the next exercise). It’s also useful in learning how to both develop more effective prompts and assess the quality of the responses that you get.
We will stick with the six criteria used in the RACCCA framework:
- Relevance
- Accuracy
- Completeness
- Clarity
- Coherence
- Appropriateness
Each of these will be scored on a scale from 1 (poor) to 5 (good), and you will use these scores to evaluate the response of ChatGPT to an initial prompt. You will then refine the prompt until you get as high an overall score as possible.
This exercise is obviously easy to game by just giving ChatGPT full marks the first time round. You will learn a lot more, though, if you are tough with your grading!
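To make the arithmetic concrete, here is a minimal sketch of how an overall score could be calculated. It assumes the aggregate is simply the sum of the six individual scores, which gives a range from 6 to 30; the exercise does not prescribe a formula, so treat this as one reasonable choice. The function name `aggregate_score` and the example scores are hypothetical.

```python
CRITERIA = (
    "Relevance",
    "Accuracy",
    "Completeness",
    "Clarity",
    "Coherence",
    "Appropriateness",
)


def aggregate_score(scores):
    """Sum six 1-5 scores into one total (assumption: aggregate = simple sum)."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError("missing scores for: " + ", ".join(missing))
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return sum(scores[name] for name in CRITERIA)


# Made-up scores for illustration only; they are your judgment, not ChatGPT's.
example = {
    "Relevance": 4,
    "Accuracy": 3,
    "Completeness": 2,
    "Clarity": 5,
    "Coherence": 4,
    "Appropriateness": 5,
}
print(aggregate_score(example), "out of", 5 * len(CRITERIA))  # prints "23 out of 30"
```

A simple sum weights every criterion equally. If one criterion matters more for your purposes, you could weight it more heavily, but the numbers remain subjective either way.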
EXERCISE:
1. Open a new session with ChatGPT, making sure that you are in GPT-4 mode.
2. Use the following prompt: “Which are best, apples or oranges?”
3. Evaluate ChatGPT’s response in terms of the following, and note the aggregate score. Make sure you refer back to the definitions of each above:
- Relevance (1 – 5)
- Accuracy (1 – 5)
- Completeness (1 – 5)
- Clarity (1 – 5)
- Coherence (1 – 5)
- Appropriateness (1 – 5)
4. Refine the prompt and iterate to produce responses that lead to a higher score (remember, you are doing the scoring, not ChatGPT). You may find it helpful to start a new ChatGPT session with each iteration. You are free to interpret the original prompt in ways that help you refine its usefulness.
At the end of this exercise you should be comfortable assessing the quality of responses to prompts using a numeric (albeit subjective) scale. This is useful as you go into the following exercise.
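If you want to keep track of your iterations, a simple log like the sketch below may help. It assumes the overall score is the sum of the six RACCCA scores, and every number in it is a made-up placeholder rather than a real result. The name `summarize` is hypothetical.

```python
# Each entry: (prompt text, scores in RACCCA order). All scores are placeholders.
history = [
    ("Which are best, apples or oranges?", [3, 2, 2, 4, 4, 4]),
    ("<your refined prompt, iteration 2>", [4, 3, 3, 4, 4, 4]),
    ("<your refined prompt, iteration 3>", [5, 4, 4, 5, 4, 5]),
]


def summarize(history):
    """Return (iteration number, total, change from previous total) per entry."""
    rows = []
    previous = None
    for number, (_prompt, scores) in enumerate(history, start=1):
        total = sum(scores)
        change = None if previous is None else total - previous
        rows.append((number, total, change))
        previous = total
    return rows


rows = summarize(history)
for number, total, change in rows:
    note = "baseline" if change is None else f"{change:+d} vs previous"
    print(f"Iteration {number}: {total}/30 ({note})")

# Pick the iteration with the highest total.
best = max(rows, key=lambda row: row[1])
print("Highest-scoring iteration:", best[0])
```

Recording the change between iterations makes it easier to see which refinements helped and which made little difference.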
Exercise: Comparative Evaluation of Different Prompts
This exercise takes the previous exercise a step further by using the evaluation metrics to compare different versions of a similar prompt.
Even though we will use numeric scores, it’s important to remember that these are subjective, and that this is simply a tool to help refine prompts.
We will use the example of asking ChatGPT to help come up with ideas for a futuristic workspace for undergraduate students who are working on assignments.
You will be asked to evaluate the responses to four different prompts using the “RACCCA” scale:
- Relevance (1 – 5)
- Accuracy (1 – 5)
- Completeness (1 – 5)
- Clarity (1 – 5)
- Coherence (1 – 5)
- Appropriateness (1 – 5)
When you apply this scale, think about the type of workspace you would want to spend time in.
The four prompts are:
1. “Design a workspace for undergraduate students to work on assignments in”
2. “Imagine a place that is designed to be as comfortable and as inviting as possible for undergraduate students to hang out in as they are working on assignments”
3. “Provide details of how to design a space where undergraduate students can come and work on assignments without distraction, and where they can be comfortable and focus on their work”
4. “Hi ChatGPT. I would like you to come up with a great design for a space where undergraduate students can come and work on assignments. The space should be incredibly welcoming, not distracting, and extremely comfortable. It should also help undergrads focus on their work and be as productive as possible. Thank you so much!”
EXERCISE:
1. Open a new session with ChatGPT, making sure that you are in GPT-4 mode.
1.1 Provide ChatGPT with prompt 1 from above.
1.2 Assess the quality of the response using the RACCCA scale and make a note of your assessment.
2. Open a new session with ChatGPT, making sure that you are in GPT-4 mode.
2.1 Provide ChatGPT with prompt 2 from above.
2.2 Assess the quality of the response using the RACCCA scale.
3. Open a new session with ChatGPT, making sure that you are in GPT-4 mode.
3.1 Provide ChatGPT with prompt 3 from above.
3.2 Assess the quality of the response using the RACCCA scale.
4. Open a new session with ChatGPT, making sure that you are in GPT-4 mode.
4.1 Provide ChatGPT with prompt 4 from above.
4.2 Assess the quality of the response using the RACCCA scale.
Reflect on the quality of each prompt and the response it led to. How does the RACCCA framework help differentiate between the quality and usefulness of different prompts and responses?
By the end of this exercise you should have a better sense of how to use the RACCCA framework to distinguish between prompts and responses that provide more or less useful information.
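One way to support this reflection is to lay your scores side by side. The sketch below ranks the four prompts by total score and shows how far apart the scores are on each criterion, which can hint at where the prompts really differ. All of the numbers are hypothetical placeholders, not actual results, and the function names `rank_prompts` and `criterion_spread` are illustrative.

```python
CRITERIA = [
    "Relevance",
    "Accuracy",
    "Completeness",
    "Clarity",
    "Coherence",
    "Appropriateness",
]

# Hypothetical scores in RACCCA order; replace them with your own assessments.
scores = {
    "Prompt 1": [3, 4, 2, 4, 4, 3],
    "Prompt 2": [4, 4, 3, 4, 4, 4],
    "Prompt 3": [4, 4, 4, 5, 4, 4],
    "Prompt 4": [5, 4, 4, 5, 5, 5],
}


def rank_prompts(scores):
    """Order prompt names from highest to lowest total score."""
    return sorted(scores, key=lambda name: sum(scores[name]), reverse=True)


def criterion_spread(scores):
    """For each criterion, the gap between the highest and lowest score given."""
    spread = {}
    for index, criterion in enumerate(CRITERIA):
        column = [row[index] for row in scores.values()]
        spread[criterion] = max(column) - min(column)
    return spread


for name in rank_prompts(scores):
    print(f"{name}: {sum(scores[name])}/30")

for criterion, gap in criterion_spread(scores).items():
    print(f"{criterion}: spread of {gap}")
```

A criterion with a spread of zero did not help you tell the prompts apart, while a large spread points to a dimension where prompt wording made a noticeable difference. Remember that the scores themselves remain subjective.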