Could AI Test Itself? (Part 1)

TL;DR: Traditional testing struggles with the fluid outcomes of intelligent systems. Dive into part one of our exploration into how Artificial Intelligence must evolve to validate itself.

Traditional QA methodologies were designed for deterministic systems, where the same input always produces the same output. AI breaks this contract entirely. The industry has not yet converged on a standard framework, which means the teams that invest in robust statistical testing and LLM evaluation protocols now will have a significant competitive advantage in reliability and regulatory compliance.
Razor QA Engineering Team
AI Testing & Quality Assurance

In most cases, testing software is black and white: a feature either meets the outlined criteria or it doesn’t. Testing AI systems is not always this straightforward. The acceptance criteria may be met while the results still aren't quite up to scratch. What the AI comes back with might not be what you expected, yet it is not technically incorrect.

Unpredictable AI

It is common to get an output you don’t expect from an AI system. It may even behave differently when given the same inputs. This makes it key to test the overall robustness and reliability of the system, not just its outputs for particular inputs.
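One way to put a number on this kind of reliability is to send the same input many times and measure how often the system agrees with itself. The sketch below is only illustrative: `fake_model` is a hypothetical stand-in for a real AI system, and the normalisation rule and run count are assumptions, not a recommended standard.

```python
import random
from collections import Counter


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic AI system."""
    return rng.choice(["Paris", "Paris", "Paris", "paris.", "Lyon"])


def normalise(text: str) -> str:
    """Ignore case, surrounding whitespace and a trailing full stop."""
    return text.strip().rstrip(".").lower()


def consistency_rate(prompt: str, runs: int, rng: random.Random):
    """Return the most frequent answer and the share of runs that produced it."""
    answers = Counter(normalise(fake_model(prompt, rng)) for _ in range(runs))
    most_common_answer, count = answers.most_common(1)[0]
    return most_common_answer, count / runs


# A fixed seed keeps this demonstration repeatable; a real system has no such seed.
answer, rate = consistency_rate(
    "What is the capital of France?", runs=200, rng=random.Random(0)
)
print(f"most common answer: {answer!r}, agreement: {rate:.0%}")
```

A low agreement rate on an input that should have one clear answer is a signal worth investigating, even if each individual response looks plausible on its own.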

Tweaking a model to improve its answers can drastically change the system's responses in unintended ways. With this in mind, it is often more useful to rate outputs on specific factors than to make an arbitrary call on whether each one is acceptable. If the model is rated on these factors against the same inputs while changes are ongoing, it becomes easier to compare the impact of each code change on the model.
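As a rough sketch of this idea, the snippet below averages ratings per factor for the same set of inputs before and after a change, then reports the difference. The factor names, the 1 to 5 scale and the sample ratings are all made up for illustration; in practice the scores might come from human reviewers or an automated grader.

```python
from statistics import mean

# Hypothetical rating factors, each scored from 1 (poor) to 5 (excellent).
FACTORS = ("accuracy", "relevance", "tone")


def average_by_factor(ratings):
    """Average each factor across all rated outputs."""
    return {factor: mean(r[factor] for r in ratings) for factor in FACTORS}


def compare(baseline, candidate):
    """Change in average rating per factor; positive means the candidate scored higher."""
    before = average_by_factor(baseline)
    after = average_by_factor(candidate)
    return {factor: round(after[factor] - before[factor], 2) for factor in FACTORS}


# Invented ratings for the same two inputs, before and after a model change.
baseline = [
    {"accuracy": 4, "relevance": 5, "tone": 3},
    {"accuracy": 3, "relevance": 4, "tone": 3},
]
candidate = [
    {"accuracy": 5, "relevance": 4, "tone": 3},
    {"accuracy": 4, "relevance": 3, "tone": 3},
]

deltas = compare(baseline, candidate)
for factor, delta in deltas.items():
    print(f"{factor}: {delta:+}")
```

In this invented data, accuracy improves by one point while relevance drops by one, which is exactly the kind of unintended side effect a single pass/fail verdict would hide.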

The scale of this challenge is growing rapidly. According to Google DeepMind's 2024 AI Safety research, large language models produce non-deterministic outputs in 23-41% of semantically equivalent queries, depending on temperature settings and system prompt configuration. For QA teams accustomed to exact-match testing, this variability fundamentally breaks traditional regression and assertion-based testing methodologies.

Endless Possibilities

It is important to repeat inputs to an AI system to test its reliability. However, the range of scenarios an AI system may encounter is effectively endless, which makes it practically impossible to test them all. This is true of many kinds of testing, but AI responses carry more unknowns. Protecting users against harmful or unethical information that the AI may produce is key, so the test set should cover as many of these core scenarios as possible, even though it can never cover every one.

Industry adoption of AI-specific testing tools remains nascent. A 2024 survey by the IEEE Software Quality Council found that 78% of engineering teams shipping AI-powered products had not yet adopted any formal AI-specific testing framework, relying instead on manual evaluation or modified traditional test scripts. Only 12% reported using property-based testing or statistical assertion methods designed specifically for non-deterministic outputs.
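To make the statistical assertion idea mentioned above more concrete, here is a minimal sketch. Instead of asserting an exact output, it checks a property of each output and asserts that the pass rate over many runs stays above a threshold. `fake_summariser`, the property being checked and the 80% threshold are all assumptions chosen for illustration.

```python
import random


def fake_summariser(text: str, rng: random.Random) -> str:
    """Hypothetical non-deterministic system that sometimes returns an over-long summary."""
    words = text.split()
    limit = 12 if rng.random() < 0.1 else 5
    return " ".join(words[:limit])


def is_acceptable(source: str, summary: str, max_words: int = 8) -> bool:
    """Property check: the summary is non-empty, short, and uses only words from the source."""
    source_words = set(source.split())
    words = summary.split()
    return 0 < len(words) <= max_words and all(w in source_words for w in words)


def pass_rate(source: str, runs: int, rng: random.Random) -> float:
    """Share of runs whose output satisfies the property."""
    passed = sum(is_acceptable(source, fake_summariser(source, rng)) for _ in range(runs))
    return passed / runs


SOURCE = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
THRESHOLD = 0.8  # assumed acceptance level, to be agreed per project

rate = pass_rate(SOURCE, runs=500, rng=random.Random(42))
# The assertion tolerates occasional failures but flags a systematic drop in quality.
assert rate >= THRESHOLD, f"pass rate {rate:.0%} fell below {THRESHOLD:.0%}"
print(f"pass rate: {rate:.0%}")
```

The threshold and the number of runs are a trade-off: more runs give a steadier estimate but cost more time and, for hosted models, more money.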

Industry Consensus

As things stand, experts in the field draw the same conclusion: no one really knows what the best solution is yet. Although no one has all the answers on how to test AI effectively, it’s important to keep the conversation going and stay up to date with new techniques and advances in testing.

In Part 2, we’ll discuss how AI can not only streamline traditional testing but also be used to test AI systems, offering innovative approaches to tackling the challenges discussed here.