How I Used Test Techniques on Chatbots

Having tested websites and apps for many years, I wondered whether the techniques, tips, and tricks from those could be used to test AI systems. Here, I’ll show how a few different approaches can be applied when testing an AI chatbot.

Consistency

I’ve written a couple of posts on consistency testing in apps.

Is there a consistent naming scheme and a consistent UI design? Does a calculation give the same values when you change the order you enter the data? Is data from an uploaded file treated the same as data entered manually?

Predictable response

For a bot, does it behave predictably?

Do the same inputs (even when phrased differently) yield the same outputs? And data-derived answers shouldn’t change based on ordering or input method.

To test – pick a few synonyms for common operations (e.g. “deposit,” “add funds,” “top up”) and check that they all trigger the same response.

Another example:

Ask “How do I reset my password?”, “What are the password reset instructions?”, or “I forgot my password, what now?”
All of these should show the same instructions or a link to the same FAQ page.
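As a rough sketch of how that check could be automated – the `ask` function below is a hypothetical stub standing in for a real chatbot API call:

```python
def ask(prompt: str) -> str:
    """Hypothetical stub standing in for a real chatbot API call."""
    if "password" in prompt.lower():
        return "To reset your password, follow the steps at /faq/password-reset"
    return "Sorry, I didn't understand that."

PASSWORD_PROMPTS = [
    "How do I reset my password?",
    "What are the password reset instructions?",
    "I forgot my password, what now?",
]

def check_consistency(prompts):
    """Return True if every phrasing yields the same response."""
    answers = {ask(p) for p in prompts}
    return len(answers) == 1

print(check_consistency(PASSWORD_PROMPTS))  # True: all three map to one reply
```

In practice you’d compare responses more loosely (e.g. do they all contain the same FAQ link?) rather than requiring byte-identical text.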

Data Response

For data, you can change the ordering – “What’s 8 minus 3 plus 2?” vs. “What’s 2 plus 8 minus 3?” Both should give 7.

Upload a CSV and compare the answers about it with the answers about the same data entered manually.
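Both checks can be sketched in a few lines – `ask_total` is a hypothetical stand-in for asking the bot to total some numbers; a real test would go through the chatbot API:

```python
import csv
import io

def ask_total(values):
    """Hypothetical stand-in for asking the bot to total some numbers."""
    return sum(values)

# 1. Ordering: "8 minus 3 plus 2" vs "2 plus 8 minus 3" should agree.
assert ask_total([8, -3, 2]) == ask_total([2, 8, -3]) == 7

# 2. Input method: same data via CSV upload and via manual entry
#    should yield the same answer.
manual = [120, 45, 80]                            # typed in one value at a time
uploaded = io.StringIO("amount\n120\n45\n80\n")   # same data as a CSV upload
from_csv = [int(row["amount"]) for row in csv.DictReader(uploaded)]
assert ask_total(manual) == ask_total(from_csv)

print("data-response checks passed")
```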

Persona

If your bot is branded as a “financial advisor,” it shouldn’t claim “I’m a medical professional” in any context.
Test with out-of-scope prompts: “I think I have a rash, what should I do?” should get a response like “I’m not a medical professional; you should consult a doctor.”
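A minimal sketch of such a persona check, again with a hypothetical `ask` stub in place of the real bot; the forbidden phrases and expected deflection are illustrative assumptions:

```python
# Phrases the bot must never utter, and a phrase its deflection should contain.
FORBIDDEN_CLAIMS = ["i'm a medical professional", "i am a doctor"]
EXPECTED_DEFLECTION = "consult a doctor"

def ask(prompt: str) -> str:
    """Hypothetical stub returning the bot's reply to an out-of-scope prompt."""
    return "I'm not a medical professional; you should consult a doctor."

reply = ask("I think I have a rash, what should I do?").lower()
assert all(claim not in reply for claim in FORBIDDEN_CLAIMS)
assert EXPECTED_DEFLECTION in reply
print("persona check passed")
```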

Emotion Testing

An app can be functionally correct, but how does it feel to use? Are the steps awkward, or are there too many? Are you guided through it, confused by it, annoyed by it?

  • Number of interactions. Test how many messages/interactions are required to complete a task (e.g. book an appointment). If booking a meeting takes six messages when it should take three, the user will get frustrated.
  • Clarifying questions. Are you asked unnecessary questions (“Would you like to continue?”) when the context is clear?
  • Annoyance. If every conversation ends with “Please verify your contact details” even when it’s irrelevant, that’s annoying.
  • Overlong answers. Is the response a 500-word essay when the question only needed a simple yes/no?
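The interaction count from the first bullet can be sketched as a simple transcript check – the transcript and message budget below are made-up examples, not a real bot:

```python
# A scripted conversation for one task (booking a meeting); each tuple is
# one message in the exchange.
booking_transcript = [
    ("user", "Book me a meeting with Sam tomorrow at 10"),
    ("bot",  "Which calendar should I use?"),
    ("user", "Work calendar"),
    ("bot",  "Booked: Sam, tomorrow at 10:00."),
]

MAX_MESSAGES = 4  # the "frustration budget" for this task, an assumed limit

def message_count(transcript):
    """Count how many messages the conversation needed end to end."""
    return len(transcript)

assert message_count(booking_transcript) <= MAX_MESSAGES
print(f"task completed in {message_count(booking_transcript)} messages")
```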

Information Balance, aka the Goldilocks Response

I often use Goldilocks when data testing – too little, too much, just enough – and now I can use it for chatbot testing too.

Too much? The bot dumps every detail upfront.
Too little? It keeps you in a loop of “Can you clarify?”
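One way to make this measurable is a crude word-count classifier – the thresholds below are illustrative assumptions, not canonical values:

```python
TOO_SHORT_WORDS = 3    # e.g. "Can you clarify?" – likely a clarification loop
TOO_LONG_WORDS = 150   # essay territory for a simple question

def goldilocks(answer: str) -> str:
    """Classify an answer's length as too little, too much, or just right."""
    n = len(answer.split())
    if n <= TOO_SHORT_WORDS:
        return "too little"
    if n >= TOO_LONG_WORDS:
        return "too much"
    return "just right"

print(goldilocks("Can you clarify?"))                       # too little
print(goldilocks("Yes, your order shipped this morning."))  # just right
```

Length alone won’t catch every Goldilocks failure, but it’s a cheap first filter before a human reviews the borderline cases.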

Error Handling

Is the user aware of what went wrong and how to fix it, and does the app recover?

When something goes wrong – invalid input, missing data, or an internal failure – does the bot clearly communicate what happened and how to proceed?
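A sketch of an automated check for this: does the error reply both name the problem and offer a next step? The keyword lists and sample replies are assumptions for illustration:

```python
def is_actionable_error(reply: str) -> bool:
    """True if the reply names what went wrong AND suggests how to proceed."""
    reply = reply.lower()
    names_problem = any(w in reply for w in ("invalid", "missing", "failed"))
    offers_fix = any(w in reply for w in ("try", "please", "re-enter"))
    return names_problem and offers_fix

good = "Invalid date format. Please try DD/MM/YYYY."
bad = "Error 500."

assert is_actionable_error(good)
assert not is_actionable_error(bad)
print("error-handling checks passed")
```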

Timeouts/Inactivity

I’ve written about testing timeouts on apps by going for a cup of tea. Can this be used for Chatbots?

After 20 minutes of silence, should the bot ask “Are you still there?”, end the session quietly, or just sit there waiting?
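Whatever the product decides, the policy should be explicit and testable. A sketch, with assumed thresholds (nudge at 20 minutes, close at 30):

```python
NUDGE_AFTER_MIN = 20   # assumed: prompt the user after 20 minutes idle
CLOSE_AFTER_MIN = 30   # assumed: end the session after 30 minutes idle

def idle_action(minutes_idle: float) -> str:
    """Return what the bot should do after a given period of silence."""
    if minutes_idle >= CLOSE_AFTER_MIN:
        return "end session"
    if minutes_idle >= NUDGE_AFTER_MIN:
        return "ask: Are you still there?"
    return "wait"

assert idle_action(5) == "wait"
assert idle_action(20) == "ask: Are you still there?"
assert idle_action(45) == "end session"
```

Simulating the idle clock like this is much faster than actually going for a cup of tea, though a real end-to-end check of the live session timer is still worth doing once.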

Claims testing

For claims testing an app, you can look at the marketing material to see whether the things mentioned there exist in the app, or whether any screenshots match the app. For instance:

Marketing claim: “Our bot can summarize meeting notes in under 30 seconds.”

Test: Feed it a 2,000-word transcript, measure the time to summary, and verify the summary quality.

Bonus points: Give it a text transcript and an audio one and see if the outputs are consistent.
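The timing half of that test can be harnessed like this – `summarize` is a hypothetical stand-in for the real summarization call, and the transcript is synthetic:

```python
import time

def summarize(transcript: str) -> str:
    """Hypothetical stub for the bot's summarization endpoint."""
    return transcript.split(".")[0] + "."  # fake summary: first sentence

# Synthetic ~2,000-word stand-in for a meeting transcript.
transcript = "The team agreed to ship the new feature on Friday. " * 230

start = time.monotonic()
summary = summarize(transcript)
elapsed = time.monotonic() - start

# The marketing claim: summaries in under 30 seconds.
assert elapsed < 30, f"summary took {elapsed:.1f}s, claim says under 30s"
assert summary  # non-empty; summary *quality* still needs a human or a rubric
print(f"summarized in {elapsed:.3f}s")
```

The stub passes trivially, of course – the point is the harness: swap in the real call, and the timing assertion becomes the claim test.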


So, I hope you can see the many ways a bot can be tested and how the lessons learned from app testing can be applied. Are there new techniques you’ve found? Let me know in the comments.
