Artificial Intelligence is transforming the technology landscape of the digital age. Adoption of AI-powered smart systems is expected to grow rapidly over the next few years, and alongside these advancements, a key challenge will be the testing of Artificial Intelligence/Machine Learning (AI/ML)-based systems.

There are three major challenges in testing AI systems. Test Case Designer cannot do much about the first one, so we will talk primarily about the benefits related to data availability and quality. After all, 80% of a data scientist’s time is spent preparing the training dataset.

We will use the phase classification from Forbes:

TCD applicability to QA in different phases of AI development

Phase                                                          TCD applicability
1. AI algorithm itself                                         Low
2. Hyperparameter configuration                                Low
3. Training, validation, and test data                         Medium
4. Integration of the AI system with other workflow elements   High

The rest of the article covers phases 2-4 in more detail. Regarding phase 1, significant customization of the algorithm code is not that common and, to borrow the quote from Ron Schmelzer, “There’s just one way to do the math!” The core value proposition of Test Case Designer, exploring possible combinations, is therefore not as relevant (i.e., low applicability due to the “linear” nature of the operations).


Phase 2: Hyperparameter configuration

The general idea is to include each hyperparameter in the TCD model, breaking down the value lists based on the thresholds derived from theory or practical experience.



The specific ranges and value expansions in the screenshot are for example purposes only but should sufficiently communicate the essence of the approach. Further, constraints and risk-based algorithm settings can be used to control the desired interactions:


Or you could use the 4-way setting to get the full scope of possible combinations.
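For illustration outside of TCD, here is a minimal Python sketch of the same idea: each hyperparameter’s value list is broken down by thresholds, the full 4-way scope is generated for four parameters, and a constraint trims invalid profiles. The parameter names, ranges, and the constraint itself are assumptions for example purposes, not TCD output.

```python
from itertools import product

# Hypothetical hyperparameter value lists, broken down by thresholds
# (names and ranges are illustrative assumptions).
hyperparameters = {
    "learning_rate": [0.0001, 0.001, 0.01],  # below / at / above a common default
    "batch_size": [16, 64, 256],             # small / medium / large
    "dropout": [0.0, 0.2, 0.5],              # none / moderate / aggressive
    "optimizer": ["sgd", "adam", "rmsprop"],
}

# Full scope of possible combinations (the 4-way equivalent for 4 parameters).
names = list(hyperparameters)
all_profiles = [dict(zip(names, values))
                for values in product(*hyperparameters.values())]

# An example constraint, analogous to a TCD constraint: skip profiles that
# pair the most aggressive learning rate with the heaviest dropout.
valid = [p for p in all_profiles
         if not (p["learning_rate"] == 0.01 and p["dropout"] == 0.5)]

print(len(all_profiles), "profiles in total")       # 3 * 3 * 3 * 3 = 81
print(len(valid), "profiles after the constraint")  # 81 - 9 = 72
```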

Strength: Systematic approach to identifying relevant hyperparameter configuration profiles.

Weakness: May explore the profiles with too many changes at a time or require numerous constraints to limit the scope.

Phase 3: Training, validation, and test data

Robo-advisors are a popular application of AI/ML systems in finance. They use online questionnaires that obtain information about the client’s degree of risk aversion, financial status, and desired return on investment. For this example, we will use Fidelity GO.

To build the corresponding model in TCD, you will need to set aside (temporarily) some of the usual lessons about parameter and value definitions, because the objective here is different. Instead of optimizing the scenario count, the goal of this data set is to be a representative sample of the real world and to eliminate as much human bias as possible. This means not just data quality but also completeness.

Such a model would include all parameters, regardless of their impact on the business outcome, and utilize lengthy, highly detailed value lists (often more than 10 values per parameter). To distinguish between the review and the “consumption” formats, value names and value expansions can be adjusted accordingly (e.g., the value name can be “sell some” for communication with stakeholders, while the expansion can be “3” given the data encoding).
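To make the review vs. consumption distinction concrete, here is a minimal Python sketch that keeps both formats side by side. The questionnaire parameters, value names, and encodings are assumptions for illustration, not the actual Fidelity GO model.

```python
import random

# Hypothetical robo-advisor questionnaire model: keys are the stakeholder-facing
# value names, values are the encoded expansions the system actually consumes.
model = {
    "risk_aversion": {"very low": 1, "low": 2, "moderate": 3, "high": 4, "very high": 5},
    "market_drop_reaction": {"sell all": 1, "sell some": 3, "hold": 5, "buy more": 7},
    "investment_horizon_years": {"<3": 2, "3-10": 6, ">10": 15},
}

random.seed(42)  # fixed seed so the data set can be regenerated exactly

def sample_row():
    """One synthetic questionnaire response in both formats."""
    readable = {p: random.choice(list(values)) for p, values in model.items()}
    encoded = {p: model[p][name] for p, name in readable.items()}
    return readable, encoded

readable, encoded = sample_row()
print(readable)  # review format for stakeholders, e.g. "sell some"
print(encoded)   # consumption format for the system, e.g. 3
```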

When it comes to the TCD algorithm strength selection, the highest available option is typically the most desirable one (see the caveat under “Weakness” below):


When this approach is used for generating the validation and test data sets, the TCD Analysis capabilities (in addition to standard statistical methods) can be used to evaluate the diversity of the split:
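As a rough stand-in for such a check outside of TCD, the sketch below compares value distributions across a toy train/validation split; a diverse split should show similar frequencies on both sides. The data is randomly generated for illustration only.

```python
from collections import Counter
import random

random.seed(0)

# Toy stand-in for TCD-generated scenarios: 1,000 encoded rows with one
# parameter, split 80/20 into train and validation.
rows = [{"risk_aversion": random.choice([1, 2, 3, 4, 5])} for _ in range(1000)]
train, validation = rows[:800], rows[800:]

def distribution(split, parameter):
    """Relative frequency of each value of `parameter` within a split."""
    counts = Counter(row[parameter] for row in split)
    total = sum(counts.values())
    return {value: round(count / total, 3) for value, count in sorted(counts.items())}

# Similar distributions across splits suggest the diversity carried over.
print("train:     ", distribution(train, "risk_aversion"))
print("validation:", distribution(validation, "risk_aversion"))
```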


Strength: Data sets are intelligently built to cover all relevant permutations and combinations needed to assess the efficiency of trained models while minimizing bias. Further, regenerating such data sets is much faster and easier.


Weakness:

1. The current scope limit is 5,000 scenarios per Test Case Designer model, which may not be sufficient for training or even validation purposes of some AI systems.

As a side note, while “all possible permutations” is a nice goal, it is often not the optimal one: even for representative purposes, performing training on 289,700,167,680,000 scenarios (the possible total for the model above) is not realistic. So, the “right” answer still requires balance and prioritization.

2. Despite certain workarounds, programmatic handling of complex expected results would likely require complementary manual effort.

3. The approach depends on the overall ability to leverage synthetic data instead of production copies, which may or may not be feasible in your environment.

Phase 4: Integration of the AI system with other workflow elements

This phase is the closest to TCD’s “bread and butter.” The model would serve a dual purpose – 1) smoke testing of the AI; 2) integration testing of how it is operationalized.


Given the execution setup, you would likely have to keep all the factors consumed by the AI system but, for this phase, reduce the number of values based on their importance (both business- and algorithm-wise).

Scenario volume would still be largely driven by the “standard” integration priorities (i.e., key parameters affecting multiple systems). That said, the number of values and/or the average mixed-strength dropdown selection would be higher than typical.
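For a rough illustration of such a reduced Phase 4 model, the sketch below uses the third-party allpairspy package as a stand-in for a 2-way (pairwise) generation algorithm: every pair of values appears in at least one scenario, at a fraction of the full combination count. The factors and trimmed value lists are assumptions for illustration.

```python
from allpairspy import AllPairs  # pip install allpairspy; stand-in for 2-way generation

# All factors consumed by the AI system are kept, but value lists are trimmed
# to the most important options (names and values are illustrative assumptions).
parameters = [
    ["very low", "moderate", "very high"],  # risk aversion, trimmed from the Phase 3 list
    ["<3 years", ">10 years"],              # investment horizon
    ["web", "mobile"],                      # channel feeding downstream system B
    ["new client", "existing client"],      # CRM state relevant to system C
]

# Each generated scenario doubles as a smoke/integration test case;
# every value pair is covered at least once.
for i, scenario in enumerate(AllPairs(parameters), 1):
    print(i, scenario)
```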


Focusing on the “just right” level of detail for the high-significance factors will help produce an optimal dataset for sustainable AI testing.

Strength:

  1. Test Case Designer at its best with the thoroughness, speed, and efficiency benefits.
  2. Ability to quickly reuse model elements from Phase 3 and models related to other systems (e.g., the old version of the non-AI advisor for systems B and C).
  3. Higher control over the variety of data at the integration points and the workflow as a whole.


Weakness: Similar to Phase 3 but usually more manageable given the difference in goals (volume in P3 vs. integration in P4).

Conclusion

To summarize, the applicability level by phase is repeated below:

Phase                                                          TCD applicability
1. AI algorithm itself                                         Low
2. Hyperparameter configuration                                Low
3. Training, validation, and test data                         Medium
4. Integration of the AI system with other workflow elements   High


From another perspective, using the stage classification from Infosys, Test Case Designer can deliver the most significant benefits in the highlighted testing areas.

Given the typical scale of AI projects, the number of possible input and output combinations is practically unbounded. Moreover, the techniques used to implement self-learning elements are very complex.

Therefore, fully testing these kinds of applications is not feasible. To overcome this challenge, we need to think more critically about a systematic, risk-based test design approach, such as the one TCD facilitates.


