Testing software is, in most cases, black and white: a feature either meets the outlined criteria or it doesn't. Testing AI systems is not always this straightforward. The acceptance criteria may be met, yet the results still fall short of expectations: what the AI comes back with might not be what you wanted, but it is not technically incorrect.
Unpredictable AI
It is common to get an output you don't expect from an AI system; it may even behave differently when given the same inputs. This makes it essential to test the overall robustness and reliability of the system, not just its outputs to particular inputs.
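One simple way to probe this is to re-run the same prompt several times and measure how often the system agrees with itself. The sketch below illustrates the idea; the `generate` function is a hypothetical stand-in for a real model call, stubbed here so the example runs on its own.

```python
from collections import Counter

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (e.g. an LLM API request).
    # A real system may return different text on each call.
    return "The capital of France is Paris."

def consistency_check(prompt: str, runs: int = 10) -> float:
    """Re-run the same prompt and report the share of runs that
    produced the single most common output (1.0 = fully consistent)."""
    outputs = Counter(generate(prompt) for _ in range(runs))
    most_common_count = outputs.most_common(1)[0][1]
    return most_common_count / runs

score = consistency_check("What is the capital of France?")
print(f"consistency: {score:.0%}")
```

With a real, non-deterministic model the score would typically sit below 100%, and tracking it over time gives a rough reliability signal rather than a pass/fail verdict.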
Tweaking a model to improve its answers can drastically, and unintentionally, change the system's responses elsewhere. With this in mind, it is often more useful to rate outputs against defined factors than to make an arbitrary call on whether each one is acceptable. If the model is rated on those factors against the same inputs while changes are ongoing, it becomes much easier to compare the impact of each change.
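As a rough sketch of that idea: pick a fixed set of factors, score each output against them, and average the scores into a single number you can compare across model versions. The factor names and ratings below are purely illustrative assumptions, not a prescribed rubric.

```python
from statistics import mean

# Hypothetical rubric: each factor is scored 1-5 by a reviewer
# (or by a separate judging model) for the same fixed set of inputs.
FACTORS = ("relevance", "accuracy", "tone")

def score_output(ratings: dict[str, int]) -> float:
    """Average the per-factor ratings into a single comparable score."""
    return mean(ratings[f] for f in FACTORS)

# The same input rated before and after a model tweak -- illustrative numbers.
before = score_output({"relevance": 4, "accuracy": 3, "tone": 5})
after = score_output({"relevance": 5, "accuracy": 4, "tone": 4})
print(f"before: {before:.2f}, after: {after:.2f}")
```

Because the inputs and factors stay fixed, the before/after scores isolate the effect of the change rather than the luck of a particular prompt.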
Endless Possibilities
Repeating inputs is important for testing reliability, but when selecting those inputs there are endless scenarios an AI system may come across, making it practically impossible to test them all. While this is true of much software testing, AI responses carry far more unknowns. Protecting users against harmful or unethical information the AI may produce is key, so covering as many of these core circumstances as possible is essential. Nevertheless, it is impossible to cover them all.
Industry Consensus
As things stand, experts in the field draw the same conclusion: no one really knows what the best solution is, yet. Although no one has all the answers to the conundrum of testing AI effectively, it's important to keep the conversation going and stay up to date with new techniques and advances in testing.
In Part 2, we’ll discuss how AI can not only streamline traditional testing but also be used to test AI systems, offering innovative approaches to tackling the challenges discussed here.