Systematic error analysis

Going beyond predictive performance

Gustavo Cid

7 minute read · May 3, 2022, 12:41 PM

Trust is of utmost importance if we wish machine learning (ML) models to live up to their full potential. We live in an era of high stakes and high expectations, yet models are still seen as black boxes. Furthermore, ML practitioners won’t be able to avoid the mistakes their models make unless they start looking beyond aggregate metrics, such as accuracy or precision.

The path towards trustworthy ML models starts with treating error analysis as a central component of the development process.

Error analysis is the attempt to analyze when, how, and why models fail. It embraces the process of isolating, observing, and diagnosing erroneous ML predictions, thereby helping understand pockets of high and low performance of the model.

Error analysis should be seen as a systematic process that encompasses various activities. In this post, we explore some of them, explaining why each is necessary and providing guidelines to help practitioners incorporate them into their ML development pipelines.

And hey — if you are already familiar with the different activities under the umbrella of error analysis, feel free to head straight to Unbox to start validating, debugging, and testing your models!

Error cohort analysis

Imagine you have trained a classifier that detects fraudulent transactions based on a set of features such as the dollar amount, the merchant, the number of products being bought, and others. Now, you want to know your model’s performance, so you assess its accuracy on a validation set. The accuracy you obtain is equal to 90%. That’s great and you are feeling proud of your work!

The 90% accuracy, as an aggregate metric, summarizes the performance of your model across your whole validation set. It is a useful first metric to look at, but it certainly doesn’t convey the complete story of how your model behaves.

Is that accuracy sustained across different subgroups of the data? For example, how does your model perform for high-value transactions? What about for transactions with merchants based outside the US?

Notice that from a business perspective, the answers to these questions might be very relevant, so you need to be confident that your model performs consistently across these subgroups.

What you will most likely find out is that the accuracy of your model is not uniform across different cohorts of the data. Furthermore, you may even encounter some data pockets with low accuracies and specific failure modes.

If you looked only at the aggregate metric (the accuracy, in this case), you would have a myopic view of your model’s performance and think that it was satisfactory. This is why analyzing different cohorts of the data is critical to building trust in your model and to avoiding failure modes that surface only after your model is deployed to production.
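To make the idea concrete, here is a minimal sketch of a cohort breakdown. The rows, the field names (`amount`, `country`, `label`, `pred`), and the $1,000 cutoff for a "high-value" transaction are all made up for illustration; in practice they would come from your own validation set and business rules.

```python
from collections import defaultdict

def accuracy_by_cohort(rows, cohort_fn):
    """Return {cohort: (accuracy, n_samples)} for rows holding a label and a prediction."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        cohort = cohort_fn(row)
        total[cohort] += 1
        correct[cohort] += int(row["label"] == row["pred"])
    return {c: (correct[c] / total[c], total[c]) for c in total}

# Hypothetical validation rows (1 = fraudulent, 0 = legitimate).
rows = [
    {"amount": 20, "country": "US", "label": 0, "pred": 0},
    {"amount": 35, "country": "US", "label": 0, "pred": 0},
    {"amount": 50, "country": "US", "label": 1, "pred": 1},
    {"amount": 80, "country": "BR", "label": 0, "pred": 0},
    {"amount": 5000, "country": "US", "label": 1, "pred": 0},
    {"amount": 7200, "country": "BR", "label": 1, "pred": 1},
    {"amount": 9100, "country": "BR", "label": 0, "pred": 1},
    {"amount": 15, "country": "US", "label": 0, "pred": 0},
]

# Aggregate metric: a single number for the whole set.
overall = sum(r["label"] == r["pred"] for r in rows) / len(rows)

# The same predictions, sliced by transaction value and by merchant region.
by_value = accuracy_by_cohort(
    rows, lambda r: "high_value" if r["amount"] >= 1000 else "low_value"
)
by_region = accuracy_by_cohort(
    rows, lambda r: "US" if r["country"] == "US" else "non_US"
)

print(f"overall: {overall:.2f}")
for name, cohorts in (("value", by_value), ("region", by_region)):
    for cohort, (acc, n) in sorted(cohorts.items()):
        print(f"{name}/{cohort}: accuracy={acc:.2f} (n={n})")
```

In this toy data the overall accuracy hides the fact that every mistake falls in the high-value cohort. Reporting the sample count next to each accuracy matters too, since small cohorts give noisy estimates.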

Global and local explanations

What has a model actually learned?

Aggregate metrics alone cannot answer this question.

Being able to produce explainable predictions lies at the center of trustworthy ML. These explanations can be global or local and each one provides a distinct perspective to practitioners and businesses.

Global explanations help reveal which features contributed the most to the (mis)predictions made by the model over a dataset. In a churn classifier, for example, one might find out that the users’ geography is one of the most important features to predict whether a user will churn or not. Note that this information can be directly translated into business insights. A marketing team, for instance, might decide to create specific campaigns targeting the users from a particular location. It also increases the ML practitioners’ confidence that the model is behaving properly, without over-indexing on certain features.
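One simple, model-agnostic way to get a global view is permutation importance: shuffle one feature across the dataset and measure how much the accuracy drops. The sketch below is only an illustration of that idea; `toy_churn_model`, the feature names, and the data are hypothetical, and the toy model deliberately looks at geography alone so the outcome is easy to check.

```python
import random

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, n_repeats=20, seed=0):
    """Average accuracy drop when one feature column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for _ in range(n_repeats):
        column = [x[feature] for x in X]
        rng.shuffle(column)
        # Rebuild the rows with the shuffled column, leaving other features intact.
        X_perm = [{**x, feature: v} for x, v in zip(X, column)]
        drops.append(baseline - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats

def toy_churn_model(x):
    """Hypothetical stand-in for a trained model: predicts churn from geography only."""
    return int(x["geography"] == "north")

X = [
    {"geography": "north", "age": 25},
    {"geography": "south", "age": 31},
    {"geography": "north", "age": 47},
    {"geography": "south", "age": 52},
    {"geography": "north", "age": 38},
    {"geography": "south", "age": 29},
    {"geography": "north", "age": 61},
    {"geography": "south", "age": 44},
]
y = [toy_churn_model(x) for x in X]  # labels agree with the toy model by construction

importances = {
    f: permutation_importance(toy_churn_model, X, y, f) for f in ("geography", "age")
}
print(importances)
```

Shuffling `age` leaves the accuracy untouched, while shuffling `geography` hurts it, which matches what the toy model actually uses. With a real model, a feature that dominates unexpectedly is a cue to investigate further.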

Local explanations provide insights into the individual predictions and help practitioners get to the root cause of problematic predictions their models are making. The justifications provided by local explanations build confidence that the model is taking into consideration reasonable data to make its predictions.

For example, in a sentiment analysis task in natural language processing (NLP), where a phrase is categorized as positive or negative, if a model predicts that the sentence “I’m having a great day” is positive because of the words “having” and “day” and not because of the word “great”, is it really working?

Not really. Ideally, the model, in this case, should be predicting a positive label particularly because of the word “great”, which is a strong positive word. Predicting the right label is only half of the story.
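A rough way to probe a single prediction is to remove one word at a time and see how much the model’s score moves. The sketch below does this with a hypothetical lexicon-based scorer (`toy_sentiment_score` and its weights are invented for the example). It is a plain leave-one-out occlusion, not LIME or SHAP, but it conveys the same intuition.

```python
# Hypothetical word weights standing in for a trained sentiment model.
LEXICON = {"great": 2.0, "terrible": -2.0, "day": 0.1}

def toy_sentiment_score(tokens):
    """Higher means more positive. Unknown words contribute nothing."""
    return sum(LEXICON.get(t.lower(), 0.0) for t in tokens)

def leave_one_out(score_fn, tokens):
    """Contribution of each token = score of the full input minus score without it."""
    full = score_fn(tokens)
    return [
        (tok, full - score_fn(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]

tokens = "I'm having a great day".split()
contributions = leave_one_out(toy_sentiment_score, tokens)
top_token = max(contributions, key=lambda pair: pair[1])[0]

for tok, delta in contributions:
    print(f"{tok:>7}: {delta:+.2f}")
print("most influential word:", top_token)
```

If the most influential word for a real model turned out to be “having” rather than “great”, that would be a sign the model reached the right label for the wrong reasons.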

Explainability is an active area of research. Some models are intrinsically more interpretable than others. For example, a linear regression is more interpretable than a neural network: in the former, the model’s weights are evident and can be seen as proxies for feature importance, while in the latter, the large number of parameters and cascaded nonlinearities often make it hard to understand what’s going on. There are also model-agnostic approaches that apply to a wider set of problems, such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations).

As Marco Tulio Ribeiro, researcher at Microsoft Research and one of LIME’s creators, puts it, “understanding the reasons behind predictions is quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model”.

When used in combination with a systematic error cohort analysis, explanations can be an important step toward understanding why the model might not be performing so well for a specific subset of the data.

Counterfactual and adversarial analysis

Will a model change its prediction if the values of a set of features are varied in an unforeseen way?

It probably will. The question that should be asked, then, is whether such changes are desirable or not.

For example, in a model that assesses the credit risk of a user, the model’s predictions should change depending on the user’s income. But should it vary for different users’ genders, all other features being equal? What about for distinct users’ ethnicities?

It certainly shouldn’t.

Even though the answers are no-brainers, not all situations are as straightforward as that. There are massive biases and failure modes hidden within industry-standard and academic-grade models and datasets.

A good error analysis procedure should test adversarial changes and try to find counterfactual examples where the model does not behave correctly. Without proper counterfactual and adversarial analysis, these behaviors are hard to anticipate, and corrective actions cannot be taken in time.
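As a sketch of what such a check could look like, the snippet below changes only a protected attribute and records every case where the prediction flips. Both models, the feature names, and the income thresholds are hypothetical and exist only to show a passing and a failing case.

```python
def counterfactual_flips(model, samples, feature, values):
    """Return (original, variant) pairs where changing only `feature` flips the prediction."""
    flips = []
    for sample in samples:
        original = model(sample)
        for value in values:
            if value == sample[feature]:
                continue
            variant = {**sample, feature: value}  # all other features stay equal
            if model(variant) != original:
                flips.append((sample, variant))
    return flips

def biased_model(x):
    """Hypothetical credit model with a gender-dependent threshold (1 = approve)."""
    threshold = 50_000 if x["gender"] == "male" else 60_000
    return int(x["income"] >= threshold)

def income_only_model(x):
    """Hypothetical credit model that ignores gender."""
    return int(x["income"] >= 55_000)

applicants = [
    {"income": 40_000, "gender": "female"},
    {"income": 55_000, "gender": "female"},
    {"income": 58_000, "gender": "male"},
    {"income": 90_000, "gender": "male"},
]

genders = ["female", "male"]
biased_flips = counterfactual_flips(biased_model, applicants, "gender", genders)
fair_flips = counterfactual_flips(income_only_model, applicants, "gender", genders)

print("biased model flips:", len(biased_flips))
print("income-only model flips:", len(fair_flips))
```

Note that only the applicants near the decision boundary expose the bias here, which is why such checks benefit from covering many samples rather than a handful. Passing this check also does not rule out bias through features that correlate with the protected attribute.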

For instance, a few years ago, a seminal paper was published indicating severe vulnerabilities of computer vision systems based on deep learning. It was shown that several systems were prone to adversarial attacks that were imperceptible to humans, but could easily fool the models. These results reinforce the importance of incorporating adversarial and counterfactual analysis into the error analysis procedures.

Synthetic data

One of the most challenging parts of building ML models is figuring out all the edge cases. Your training, validation, and test sets represent only a small fraction of the kinds of examples that your model will encounter out in the wild, after deployment. How can you prepare beforehand to deal with categories that are often underrepresented in the samples that you have?

Generating synthetic data to augment underrepresented portions of your training data is a great way to increase your model’s robustness, ensure important invariances, and further explore specific failure modes.

For example, in a sentiment analysis task in NLP, the phrases “John is a happy person” and “Mary is a happy person” should receive the same classification. One can generate a large number of synthetic samples following the template “{name} is a happy person”, where “{name}” is replaced by various common first names, to ensure the model is invariant to first names when performing the task.
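A minimal sketch of the template approach follows. The list of names and both toy classifiers are hypothetical; the second one is deliberately flawed to show what a failed invariance check looks like.

```python
TEMPLATE = "{name} is a happy person"
NAMES = ["John", "Mary", "Aisha", "Wei", "Carlos"]  # illustrative list, extend as needed

def generate_from_template(template, names):
    """Fill the {name} slot of the template with each first name."""
    return [template.format(name=name) for name in names]

def is_invariant(model, sentences):
    """True if the model assigns the same label to every sentence."""
    return len({model(s) for s in sentences}) == 1

def keyword_model(text):
    """Hypothetical classifier: positive (1) whenever the word 'happy' appears."""
    return int("happy" in text.lower())

def name_sensitive_model(text):
    """Hypothetical flawed classifier whose output depends on the first name."""
    return int("happy" in text.lower() and not text.startswith("Mary"))

sentences = generate_from_template(TEMPLATE, NAMES)
print(sentences[:2])
print("keyword model invariant:", is_invariant(keyword_model, sentences))
print("name-sensitive model invariant:", is_invariant(name_sensitive_model, sentences))
```

The same generated sentences can serve two purposes: as extra training samples and as a held-out invariance check.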

There are various ways of generating synthetic data, depending on the datatype of interest. It can be as simple as generating samples from a template, as in the previous example, or as complex as using generative adversarial networks (GANs).

Original data samples can also be perturbed to augment the dataset. In NLP, this can be done, for instance, by introducing small typos (which encourages model robustness to typos) or replacing word tokens with synonyms. In computer vision applications, this can be done by adding noise to sample images, changing image orientation, among a plethora of other ways. Each area has its own idiosyncratic methods of data perturbation for data augmentation.
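For the typo case, one possible perturbation is to swap two adjacent characters inside a word. The sketch below is one of many ways to do this; the function name `swap_typo`, the minimum word length, and the choice to keep the first and last characters in place are assumptions made for the example.

```python
import random

def swap_typo(text, rng):
    """Swap two adjacent inner characters of one randomly chosen word (length >= 4).

    The result can equal the input when the two swapped characters are identical.
    """
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # keep the first and last characters in place
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

rng = random.Random(42)  # fixed seed so the augmentation is reproducible
original = "The service was absolutely wonderful"
augmented = [swap_typo(original, rng) for _ in range(3)]

for sample in augmented:
    print(sample)
```

Each perturbed sample keeps the label of the original, so it can be added to the training set or used to test whether predictions stay stable under small typos.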

The insights from the counterfactual and adversarial analysis can also be used to generate synthetic data, giving rise to adversarial examples. Being able to quickly come up with a set of adversarial examples, either from a hypothesis or from modifications of the original dataset, contributes to more robust models.

Systematic testing

Test-driven development is common practice in software engineering. In ML, a field not that far away, tests are not as common as they should be.

Testing in ML (if done at all) usually consists of a single engineer writing a script to test a few cases that came up during a sloppy error analysis procedure. Thorough testing goes a long way in ensuring model quality, helping practitioners catch mistakes proactively rather than reactively.

For example, borrowing insights from software engineering, Marco Tulio Ribeiro et al. proposed CheckList, a new testing methodology for NLP models. This work shows that “although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models”. Moreover, “NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it”.

Notice that each facet of error analysis presented so far can be systematized into unit and regression test frameworks.

Building on the ideas of error cohort analysis, one might define performance thresholds for certain subgroups of the data and test whether the model surpasses them for a dataset. From counterfactual and adversarial analysis, it is possible to build tests that strive to flip the predictions made by a model by manipulating the feature values. Finally, it is possible to create synthetic test samples that ensure the model’s predictions remain invariant in particular scenarios.
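The three kinds of tests just described could be written as ordinary test functions. The sketch below is an assumption-laden illustration, not the CheckList tool itself: `toy_sentiment_model`, the labeled samples, the cohort names, and the 0.8 threshold are all hypothetical, and the small `run_all` helper stands in for a real test runner.

```python
def toy_sentiment_model(text):
    """Hypothetical stand-in for a trained classifier: 1 = positive, 0 = negative."""
    lowered = text.lower()
    if "not" in lowered.split():
        return 0
    return int(any(word in lowered for word in ("happy", "great", "good")))

# Hypothetical labeled samples: (text, label, cohort).
LABELED = [
    ("I am happy", 1, "short"),
    ("This is not good", 0, "short"),
    ("great", 1, "short"),
    ("The food was bland and the staff ignored us all evening", 0, "long"),
    ("We had a great time and the staff made us feel welcome", 1, "long"),
]
MIN_COHORT_ACCURACY = 0.8  # assumed threshold, set per cohort in practice

def test_cohort_accuracy_thresholds():
    for cohort in sorted({c for _, _, c in LABELED}):
        rows = [(t, y) for t, y, c in LABELED if c == cohort]
        acc = sum(toy_sentiment_model(t) == y for t, y in rows) / len(rows)
        assert acc >= MIN_COHORT_ACCURACY, f"{cohort}: accuracy {acc:.2f} too low"

def test_negation_flips_prediction():
    # Counterfactual check: adding a negation should change the label.
    assert toy_sentiment_model("The movie is good") != toy_sentiment_model(
        "The movie is not good"
    )

def test_invariance_to_first_names():
    # Synthetic samples from a template should all get the same label.
    preds = {toy_sentiment_model(f"{n} is a happy person") for n in ("John", "Mary", "Wei")}
    assert len(preds) == 1

def run_all(tests):
    """Run each test and record whether it passed or failed."""
    results = {}
    for test in tests:
        try:
            test()
            results[test.__name__] = "passed"
        except AssertionError as err:
            results[test.__name__] = f"failed: {err}"
    return results

results = run_all([
    test_cohort_accuracy_thresholds,
    test_negation_flips_prediction,
    test_invariance_to_first_names,
])
for name, status in results.items():
    print(f"{name}: {status}")
```

Because the functions follow the usual `test_` naming convention, they could also be collected by a test runner and executed on every model retrain, turning past error analysis findings into regression tests.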

Error analysis is a topic that we deeply care about at Unbox, as we build the tooling that makes model debugging, testing, and validation easy and intuitive. This post is one of the chapters from our white paper. Feel free to check it out for pointers on how to incorporate systematic error analysis procedures into your ML development pipeline.