Show me your ML development pipeline and I’ll tell you who you are
Error analysis and model monitoring: the two stages that shape the ML behind your product
5 minute read · March 7, 2022, 9:08 PM
Working with machine learning (ML) models can be challenging: not only are the problems complex, but coordinating the efforts of different teams across all the activities that contribute to developing and deploying a model can waste a lot of resources if done poorly.
This is why organizations need processes. They must follow a clear and well-defined development pipeline, where the transition from one stage to the next is motivated by rational decisions.
From a bird’s eye view, most organizations follow the same pipeline, similar to the one shown above. However, the set of activities and prioritized objectives at each stage matters a lot. Consequently, despite their high-level similarity, teams fall somewhere along a broad spectrum: at one end are teams shipping high-quality models that impress their users, and at the other are teams that blindly deploy models that fail silently in production.
In this post, we will explore error analysis and model monitoring, two of the stages that heavily influence which end of the spectrum a team is closer to. Although these two stages share the goal of helping teams ship the best possible model, their activities are by no means interchangeable.
First, we discuss the purpose of model monitoring and the kinds of activities that make it up. We then motivate the necessity of error analysis with a thought-provoking question. To make matters more concrete, we briefly go through a recent famous case that illustrates the consequences of having a model that fails in production. At the end of this post, you will find a table that compares error analysis and model monitoring side by side, which might help you identify gaps in your ML development pipeline.
And hey — if you are already familiar with the difference between these two stages and want to start conducting systematic error analysis, check out Unbox!
Model monitoring: the last line of defense
The world is constantly changing, so the scenarios and patterns that a model picked up during the training stages may, sooner or later, become outdated. Therefore, it is fundamental that models are monitored in production. If not, there is a high risk that they will start failing silently and, over time, become less and less useful as performance degrades.
Monitoring is a broad field, with a wide variety of activities that strive to cover different aspects of the ML system served in production.
There are activities that focus on the model’s environment, making sure that all of the scaffolding around the model continues to work properly over time. Some examples are running checks that ensure the data pipelines are functioning correctly and checking whether the data schema that the model expects has changed.
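To make the schema check concrete, here is a minimal sketch of what such a monitor might look like. The column names, the `EXPECTED_SCHEMA` mapping, and the `schema_violations` helper are all hypothetical, chosen only for illustration; a production system would more likely rely on a dedicated validation tool.

```python
from typing import Any

# Hypothetical schema the model was trained against: column name -> expected Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "square_feet": float,
    "bedrooms": int,
    "zip_code": str,
}


def schema_violations(record: dict[str, Any], expected: dict[str, type]) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    # Every expected column must be present and hold the expected type.
    for column, expected_type in expected.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            problems.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    # Columns the model never saw during training are also worth flagging.
    for column in record:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems
```

Running a check like this on every incoming record (or on a sample of them) and alerting when the returned list is non-empty is one simple way to notice that an upstream pipeline has changed.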
Other activities are centered on the model itself, to verify its performance, measure health metrics, and check if the set of statistical assumptions over which the model is built is still valid. This is where ML engineers might look for incoming drifts (such as data or concept drifts) and spot model performance trends.
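As an illustration of a data drift check, the sketch below computes the population stability index (PSI) between a reference sample of one numeric feature (for example, from training data) and a recent production sample. PSI is only one of several possible drift measures and is our choice for this example, not something the pipeline prescribes; the function name, the equal-width binning, and the `eps` floor are assumptions made for the sketch.

```python
import math


def population_stability_index(reference, production, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature.

    Uses equal-width bins computed from the reference sample.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to the edge bins
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    prod = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A PSI of zero means the two binned distributions are identical, and larger values indicate a bigger shift. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but the right alerting threshold depends on the feature and the application.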
The above list of activities is far from exhaustive, but it illustrates a few key characteristics of monitoring. Model monitoring can be a lifesaver: it is the last line of defense against errors and changes in the environment that might trigger significant modifications in the system as a whole.
Although ML has idiosyncrasies that differentiate it from traditional software systems, one thing that both fields have in common is that the cost of fixing an error grows exponentially with time, as shown in the chart below (modified from this post).
Currently, a lot of organizations spend heavily on monitoring solutions and over-rely on them to catch mistakes and ensure their models are trustworthy. Needless to say, a lot of resources end up being wasted in this process.
Wouldn’t it be better if engineers were able to catch errors proactively rather than retroactively?
To be able to do so, they must conduct rigorous and systematic error analysis.
Error analysis: the compass
Error analysis is the attempt to understand when, how, and why models fail. It encompasses the process of isolating, observing, and diagnosing erroneous ML predictions, thereby helping practitioners find pockets of high and low model performance.
Like model monitoring, error analysis is an umbrella term that comprises many activities. Each activity serves a different purpose and helps practitioners not only identify the models’ possible failure modes but also construct a much more comprehensive view of model quality.
Some activities under the umbrella of error analysis include making use of global and local explanations, to understand why a model is behaving a certain way; leveraging the power of synthetic data, to ensure model robustness and invariances under different scenarios; performing error cohort analysis, to check how the model performance differs for different subgroups of the data; making use of counterfactual and adversarial analysis, to possibly identify biases before the model is shipped; and taking testing seriously, with reproducibility and rigor in the leading roles.
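Of the activities above, error cohort analysis is perhaps the easiest to sketch in a few lines. The example below groups labeled validation predictions by one feature and reports accuracy per group. The `accuracy_by_cohort` helper, the field names, and the data are all hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict


def accuracy_by_cohort(rows, cohort_key):
    """Group labeled predictions by a feature value and return accuracy per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        cohort = row[cohort_key]
        total[cohort] += 1
        correct[cohort] += int(row["prediction"] == row["label"])
    return {cohort: correct[cohort] / total[cohort] for cohort in total}


# Hypothetical validation rows; field names and values are made up for illustration.
validation_rows = [
    {"region": "urban", "label": 1, "prediction": 1},
    {"region": "urban", "label": 0, "prediction": 0},
    {"region": "urban", "label": 1, "prediction": 1},
    {"region": "rural", "label": 1, "prediction": 0},
    {"region": "rural", "label": 0, "prediction": 0},
]

per_cohort = accuracy_by_cohort(validation_rows, "region")
overall = sum(r["prediction"] == r["label"] for r in validation_rows) / len(validation_rows)
```

In this toy data, the aggregate accuracy hides the fact that every mistake falls in one cohort, which is exactly the kind of gap a single overall metric would not reveal.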
In our white paper, we explore each of the above activities in detail, so feel free to check it out if this is a topic you are interested in.
If you look at the ML development pipeline diagram we presented at the beginning, you might notice that error analysis is there, right after model training. Unfortunately, this is a stage that is often neglected.
As we pointed out earlier, being able to catch the models’ mistakes before serving them in production has an enormous potential to save resources. Mature organizations understand that error analysis should lie at the heart of their development pipeline. The insights that arise from it serve as a compass pointing in the right direction. As Stanford professor Andrew Ng puts it, “if you do error analysis well, it will tell you what’s the most efficient use of your time to improve performance”.
Model monitoring is a must, but is it enough?
Let’s go through an example to answer this question.
Zillow had become the go-to place when it came to house valuation. The Zestimate was an estimate of the price of a house produced by a model from a set of features provided by the user. After a while, Zillow created its iBuyer arm and automatically bought houses based on the Zestimate.
The idea worked for some time, and flipping houses relying on the Zestimate seemed to be a profitable business. Eventually, though, the scenario changed and things stopped working.
In 2021, Zillow reported losing over $500,000,000 as a result of buying houses overpriced by its model and selling them at a loss. As a consequence, it shut down its iBuyer arm and said it would lay off 25% of its workforce.
No one knows for sure who is at fault in Zillow’s iBuyer case. One thing is clear, though: ML models are deployed in dynamic environments, and simply monitoring them might not be enough to guarantee that they are working properly.
Monitoring is the last line of defense, but it is not a silver bullet.
There is often confusion between model monitoring and error analysis. Hopefully, by now you can see that although they share the goal of helping teams ship the best possible model, their activities are by no means interchangeable.
Below is a table that highlights some of their key differences. This is a resource that might help you identify gaps in your development pipeline.