In today’s fast-paced digital landscape, businesses are increasingly exploring the potential of generative AI (genAI) to drive innovation and efficiency. Yet, as exciting as these advancements are, they come with unique challenges, particularly when it comes to testing generative AI applications.
It’s fundamental to understand that, to consumers, large language models (LLMs) and genAI tools like ChatGPT are essentially black boxes. We give them instructions and inputs and hope for the best, but we don’t control what happens inside. When we change a system prompt or any other aspect of the system, we expect the model to follow along, yet experience has proven these models can be unpredictable: even small changes can produce unexpected and undesirable outcomes.
Regardless, developers must be able to adjust the system prompt and other configurations as they build out applications. That’s why a robust test suite that can detect regressions or other significant changes is so important. In this respect, genAI development is no different from any other type of application development.
As a leader at Atomic Object’s Chicago office, I’ve seen firsthand the hurdles our teams face when building applications that leverage genAI. Our approach has always been to embrace new challenges head-on, fostering an environment where innovation thrives. Today, I’d like to share some insights and techniques for effectively testing genAI-powered applications, a topic that’s becoming increasingly relevant for business leaders as the tsunami of AI washes over us all.
The Unique Challenge of Generative AI
Unlike traditional software, generative AI introduces a pesky layer of non-determinism. This means that the same input can produce different outputs, making conventional automated test suites — which rely on deterministic results — less effective. This inherent unpredictability demands new, strategic approaches to ensure quality and reliability.
Techniques for Testing Generative AI Applications
Let’s look at various techniques for testing genAI applications, from statistical analysis to “do not ever” lists.
Behavioral Consistency Testing
Behavioral testing, also known as black-box testing, focuses on how a software system behaves as a whole and validates that the application works as expected in specific real-world scenarios. With some creativity, this technique can still be applied to applications that leverage genAI.
While exact outputs may vary, the behavior of the AI should remain consistent within its defined parameters. For example, when a genAI model generates text in response to an input, it should consistently convey the same meaning even if the exact phrasing differs. Imagine testing a chatbot’s response to “What is an Atom?” where expected answers could be:
- “An Atom is an employee of Atomic Object.”
- “Atom is a friendly term to describe someone who works at Atomic Object.”
To verify this, the test could capture the generated answer and then ask the model whether it conveys the same meaning as an expected correct answer, for example: “Do these two sentences convey the same meaning with a similarity threshold above 95%?” This helps ensure the chatbot maintains semantic coherence even with variable outputs.
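Here’s a minimal sketch of what such a test might look like, using the model itself as the judge of semantic equivalence. The `call_llm` helper is a hypothetical stand-in for whatever client your application uses, and the question and expected answer come from the example above.

```python
# Hypothetical behavioral consistency test; call_llm() is a placeholder you'd
# wire up to your own genAI client (OpenAI, Anthropic, a local model, etc.).
EXPECTED_ANSWER = "An Atom is an employee of Atomic Object."


def call_llm(prompt: str) -> str:
    """Stand-in for your model client."""
    raise NotImplementedError("Replace with a real call to your genAI provider.")


def test_atom_definition_is_semantically_consistent():
    generated = call_llm("What is an Atom?")

    # Ask the model to act as a judge of semantic equivalence between the
    # generated answer and the expected one, as described above.
    judge_prompt = (
        "Do these two sentences convey the same meaning? Answer only YES or NO.\n"
        f"Sentence 1: {EXPECTED_ANSWER}\n"
        f"Sentence 2: {generated}"
    )
    verdict = call_llm(judge_prompt).strip().upper()

    assert verdict.startswith("YES"), f"Semantic drift detected: {generated!r}"
```

A yes/no verdict keeps the sketch simple; you could instead ask the judge for a numeric similarity score and assert it stays above your 95% threshold.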
Statistical Analysis
You can use several statistical methods to analyze outputs over multiple runs. For a comprehensive approach, developers could focus testing and analysis on two main aspects: diversity and relevance.
- Diversity: Measure the variety of outputs generated by the AI. This can be quantified using metrics such as token entropy or n-gram diversity. For example, generate 100 responses to the same input and analyze the frequency of different words or phrases. High entropy or a large variety of n-grams (sequences of words) would indicate greater diversity. If you determine the diversity is too high, you may want to adjust the prompt or input instructions.
- Relevance: Assess whether the generated content is relevant to the given prompt. This can be done by employing human evaluators to rate the relevance of outputs or using automated tools like BERT (Bidirectional Encoder Representations from Transformers) to compare the semantic similarity between the input and output. You can also develop a relevance score that combines these methods to give a more holistic view. If the relevance score is too low, it may mean that the model doesn’t have the context it needs to respond as desired. You may need to leverage model fine-tuning or a RAG (Retrieval-Augmented Generation) setup to increase the relevance.
Example Workflow for Statistical Analysis (sketched in code below):
- Generate 100 outputs for the same input prompt.
- Calculate token entropy and n-gram diversity for diversity analysis.
- Use BERT or similar tools for relevance scoring, comparing the generated outputs against a set of expected, relevant responses.
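A rough sketch of this workflow might look like the following. The metrics are simple illustrations, the `generate` call is replaced here by two sample outputs (in practice you would collect 100 real responses from your model), and sentence-transformers, a BERT-based embedding library, is just one option for the relevance step.

```python
# Rough sketch of the statistical workflow above.
# Requires: pip install sentence-transformers
import math
from collections import Counter

from sentence_transformers import SentenceTransformer, util


def token_entropy(texts: list[str]) -> float:
    """Shannon entropy (in bits) of the token distribution across all outputs."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_diversity(texts: list[str], n: int = 2) -> float:
    """Ratio of distinct n-grams to total n-grams; higher means more varied phrasing."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)


def mean_relevance(prompt: str, texts: list[str]) -> float:
    """Average cosine similarity between the prompt and each output's embedding."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    prompt_vec = model.encode(prompt, convert_to_tensor=True)
    output_vecs = model.encode(texts, convert_to_tensor=True)
    return util.cos_sim(prompt_vec, output_vecs).mean().item()


if __name__ == "__main__":
    prompt = "What is an Atom?"
    # In practice: outputs = [generate(prompt) for _ in range(100)]
    outputs = [
        "An Atom is an employee of Atomic Object.",
        "Atom is a friendly term for someone who works at Atomic Object.",
    ]
    print("token entropy:", round(token_entropy(outputs), 2))
    print("bigram diversity:", round(ngram_diversity(outputs, n=2), 2))
    print("mean relevance:", round(mean_relevance(prompt, outputs), 2))
```

If diversity climbs higher than you want, tightening the prompt is a reasonable first step; if relevance drops, that is often a signal to add context through fine-tuning or RAG, as noted above.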
Human-in-the-Loop (HITL)/Exploratory Testing
Given the limitations of automated testing for genAI, incorporating human testers to evaluate AI performance can be extremely valuable. Human testers can provide nuanced feedback on the AI’s outputs, something automated tests might miss. This approach combines the efficiency of automation with the discernment of human judgment. For example, humans can assess whether the tone and style of a chatbot’s responses are appropriate for customer interactions.
Unlike automated tests, human testers can quickly evolve the test plan based on new context and information. While performing tests, exploratory testers often think of variations or entirely new test cases that catch a wider variety of corner cases.
Fail-Safe Mechanisms and “Do Not Ever” List
Implement fail-safe mechanisms within your application to handle unexpected AI behavior. For instance, you can set thresholds or constraints on outputs that could potentially be inappropriate or harmful. This ensures that AI contributions remain within acceptable boundaries, safeguarding against undesired outcomes.
Additionally, creating a “Do Not Ever” list can be highly effective. This list contains words or phrases the model should never output, ensuring that content remains appropriate and aligned with your brand values. Examples include:
- Inappropriate Content: Words or phrases that are offensive, discriminatory, or otherwise inappropriate for your audience.
- Competitor References: Names or terms related to your biggest competitors that you prefer not to mention.
- Political Topics: Political opinions or controversial political events the model should not discuss or reference.
- Legal and Regulatory Violations: Content that would violate applicable laws or regulations, exposing you to legal liability.
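Below is a minimal sketch of how a “Do Not Ever” fail-safe might look in code, assuming a hypothetical `generate_reply` helper that wraps your model client; the banned terms and fallback message are placeholders to adapt to your own brand and policies.

```python
# Hypothetical fail-safe wrapper; generate_reply() is a placeholder for your client,
# and the terms below are illustrative entries in a "Do Not Ever" list.
DO_NOT_EVER = [
    "offensive phrase",   # inappropriate content (placeholder)
    "rival corp",         # competitor reference (placeholder)
    "election",           # political topic (placeholder)
]

FALLBACK = "I'm sorry, I can't help with that. Let me connect you with a person."


def generate_reply(prompt: str) -> str:
    """Stand-in for your genAI client."""
    raise NotImplementedError("Replace with a real call to your genAI provider.")


def safe_reply(prompt: str) -> str:
    """Return the model's reply only if it avoids every banned term; otherwise fall back."""
    reply = generate_reply(prompt)
    lowered = reply.lower()
    if any(term in lowered for term in DO_NOT_EVER):
        return FALLBACK  # fail safe: never surface the raw output to the user
    return reply
```

The same list can double as test data: an automated suite can replay known provocations and assert that `safe_reply` never returns a banned term.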
System Prompts
GenAI models often leverage a system prompt, which provides the model with detailed instructions on how it should respond. Prompt engineering can be challenging, and doing it well is both an art and a science. Several of the testing techniques above can be useful in validating that your system prompt leads to the behavior you expect.
For instance, your system prompt might instruct the model: “Don’t engage in discussions about politicians or highly controversial political topics. If asked, just say, ‘I’m sorry, that’s a topic I’m unable to discuss.’” The “Do Not Ever” list mentioned above could include terms like “Biden,” “Trump,” “immigration,” or “war in ___” to ensure that, even when provoked, the model won’t output those words or phrases.
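As one illustration, a pytest-style regression test could replay provocative questions against that system prompt and assert the refusal behavior holds. The `call_llm` wrapper is hypothetical, and the banned terms and prompts are examples only.

```python
# Hypothetical regression test for the system prompt; call_llm() is a placeholder
# you'd wire up to your own client, and the prompts/terms below are illustrative.
import pytest

REFUSAL = "I'm sorry, that's a topic I'm unable to discuss."
SYSTEM_PROMPT = (
    "Don't engage in discussions about politicians or highly controversial "
    f"political topics. If asked, just say, '{REFUSAL}'"
)
BANNED_TERMS = ["biden", "trump", "immigration"]
PROVOCATIVE_PROMPTS = [
    "Who should I vote for in the next election?",
    "What do you think about immigration policy?",
]


def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for your genAI client."""
    raise NotImplementedError("Replace with a real call to your genAI provider.")


@pytest.mark.parametrize("user_prompt", PROVOCATIVE_PROMPTS)
def test_political_topics_are_refused(user_prompt):
    reply = call_llm(SYSTEM_PROMPT, user_prompt)
    assert REFUSAL in reply
    assert not any(term in reply.lower() for term in BANNED_TERMS)
```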
Testing Retrieval-Augmented Generation (RAG) Integration
Retrieval-Augmented Generation (RAG) is commonly used to provide additional relevant information to the genAI model for better, context-aware results. Incorporating RAG can significantly enhance the performance of AI applications by grounding responses in factual data. However, this also requires robust testing to ensure the retrieved information is accurate and contextually appropriate.
Testing RAG Integration:
- Intercepting Content: To ensure the RAG step returns specific data in specific scenarios, you should intercept and validate the content retrieved by the RAG process. For example, if a user asks a chatbot about the latest company news, you want to ensure it fetches recent, accurate information rather than outdated or irrelevant data (see the sketch after this list).
- Scenario Validation: Create test scenarios where specific queries should retrieve specific pieces of content. Validate the returned data to ensure it’s correct.
- Relevance and Accuracy Checks: Just as with the generative part of the AI, ensure the retrieved information is relevant and accurate. For instance, if the RAG model is used in a customer service application, check that it returns correct policy information or current product details.
- Consistency Testing: Ensure that the RAG model consistently pulls the correct type of information across multiple runs. This might involve automated tests that simulate numerous user queries and verify the relevance and accuracy of the returned data.
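To make the interception idea concrete, here is a hedged sketch, assuming hypothetical `retrieve` and `answer_with_context` functions as the two halves of your RAG pipeline; the return-policy scenario and the “30 days” expectation are illustrative fixtures, not real data.

```python
# Hypothetical RAG interception test; retrieve() and answer_with_context() are
# placeholders for the retrieval and generation halves of your own pipeline.
def retrieve(query: str) -> list[str]:
    """Stand-in for your retrieval step (vector search, keyword search, etc.)."""
    raise NotImplementedError("Replace with your retrieval implementation.")


def answer_with_context(query: str, documents: list[str]) -> str:
    """Stand-in for the generation step that receives the retrieved documents."""
    raise NotImplementedError("Replace with your generation implementation.")


def test_return_policy_query_retrieves_current_policy():
    query = "What is your return policy?"

    # Intercept the retrieved content before it reaches the model and validate it.
    documents = retrieve(query)
    assert documents, "Retrieval returned nothing for a known query"
    assert any("30 days" in doc for doc in documents), "Expected the current policy document"

    # Then confirm the generated answer is grounded in that retrieved content.
    answer = answer_with_context(query, documents)
    assert "30 days" in answer
```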
By integrating these best practices into your testing process, you can help ensure that your RAG-powered genAI application provides contextually appropriate and accurate responses, thus creating the user experience you’re after.
Why Business Leaders Should Care About Testing Generative AI Applications
The impressive capabilities of modern generative AI models can quickly instill confidence and trust in their outputs. However, robust testing remains as important today as ever: despite the advanced nature of these models, effective AI deployment requires rigorous testing strategies. By adopting innovative testing methodologies, organizations can ensure their software meets high standards of reliability and functionality.
Moreover, these robust testing practices enable seamless upgrades to the generative AI models, ensuring that your application continues to behave as expected. Given the incredibly fast pace at which the AI world is evolving, having this level of confidence and adaptability is highly advantageous.