The importance of model versioning in machine learning
Systematically iterating through ML models
5 minute read · June 28, 2022, 12:49 PM
Version control is an essential concept in software development. Version control tools, such as Git, are used to manage and track source code. Developers and organizations alike value source code because it is the backbone of their products, from which they earn revenue. Model versioning is the application of version control concepts within the machine learning process.
Machine learning and artificial intelligence give businesses and developers an edge over their competitors, motivating everyone to build the best model they can. As developers optimize for speed and accuracy, minimizing errors also becomes a priority. Model versioning is a great tool for all of those purposes.
In this article, you’ll learn more about the importance of model versioning in machine learning by exploring a demo on how to apply it when building ML models and reviewing best practices to follow, limitations to be aware of, and some of the best tools currently available.
What is model versioning?
Model versioning derives its core processes from version control in software development. The difference is that software development runs mainly on code, while machine learning runs on both code and data, which is the backbone of the best-performing models.
Model versioning, therefore, is the process of tracking and managing changes made to source code and data, including metrics, parameters, and hyperparameters. Popular version control systems and hosting platforms like GitHub, Bitbucket, and Perforce were not designed to track changes made to data, and they limit how much data you can host due to storage caps. Solutions such as Git Large File Storage (Git-LFS) can store large files but still fail to track how data and models change over time.
Applying model versioning when building machine learning applications is important because it allows:
- referencing previous versions of the model;
- reverting changes quickly, making building models less risky;
- easily reproducing and sharing models among development teams.
Benefits of model versioning
ML models are generally built by developers across a variety of teams, which requires a high level of collaboration. Model versioning then becomes necessary as it permits teams to do the following:
- Build reproducible models: Building machine learning models is nondeterministic. In other words, training the same algorithm on the same data as in a past run can produce a different result due to random initialization, data shuffling, and changes in parameters and hyperparameters. Model versioning makes it possible to track the data and everything else involved in the model-building process, allowing models to be reproduced conveniently.
- Build shareable models: Working on machine learning projects requires collaborative efforts depending on the size and complexity of the project. Model versioning allows teams to share models by saving files in a remote storage location and recording/tracking each file that produced a model. Teams can constantly update and share models among themselves.
- Ensure proper data governance: Industries that deal with sensitive information, such as healthcare, need data governance to ensure that data is handled with care and that it remains consistent and accurate. Model versioning allows for proper model auditing to ensure compliance with government laws like the GDPR.
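To make the reproducibility point concrete, here is a minimal sketch (not from the article; the model and data are toy stand-ins) showing why recording the random seed alongside code and data matters: when every source of randomness is pinned, the same training run produces the same model.

```python
# Sketch: pinning the sources of randomness so a training run is reproducible.
# train_toy_model and its synthetic data are illustrative assumptions.
import numpy as np


def train_toy_model(seed: int) -> np.ndarray:
    """Fit a least-squares line to noisy synthetic data with a fixed seed."""
    rng = np.random.default_rng(seed)  # all randomness flows from this seed
    X = rng.normal(size=(100, 1))
    y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
    # Closed-form least squares: w = argmin ||Xw - y||^2
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


# Same seed means identical weights, run after run; without the seed recorded,
# two "identical" training runs can silently diverge.
w1 = train_toy_model(seed=42)
w2 = train_toy_model(seed=42)
assert np.array_equal(w1, w2)
```

A model versioning tool extends this idea beyond the seed: it records the data files, parameters, and hyperparameters that went into each run, so the whole experiment, not just this toy fit, can be replayed.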
Best practices for model versioning
To prevent bugs from reaching production, where they can break the system, here are some best practices to follow:
- Review models regularly: Regularly reviewing your model will help you spot errors that could otherwise cause it to break or produce inaccurate predictions. Be sure to keep an eye out for outliers and data points that can weaken or change your model’s performance.
- Regularly delete models that are not in use: Keeping obsolete models uses up space and causes redundancy. Deleting these models provides clarity and ensures that your resources are invested only in models that are currently in use.
- Test models before deployment: Testing models before deployment is crucial to help you identify errors that may otherwise lead to downtime, increased operating costs, and, for a commercial product, loss of users.
Limitations of model versioning
Model versioning is still in its early stages: tools like DVC and MLflow were made available for public use in 2017 and 2018, respectively, compared to Git, which was built in 2005. Thus, tools used for versioning, especially open-source ones, are not yet built to support end-to-end model versioning. For instance, some prioritize tracking model changes over robust storage options (or vice versa). Similarly, some model versioning tools come with pipeline management, leading to redundancy in cases where your team already has a pipeline system in place.
Tools for model versioning
To tackle the challenges of model versioning and solve the limitations of versioning software such as Git, several open-source and enterprise solutions have been built. Here are a few useful tools to consider:
- Unbox: Unbox is a collaborative quality assurance platform for machine learning that allows you to track and version your models with ease. It helps you discover errors when building models, boost model performance, and test your model before deployment. If you’re building an end-to-end machine learning project, it may be just the right tool for you.
- DVC: Data Version Control (DVC) is an open-source tool used for versioning data sets. DVC commands are similar to Git’s, and both are used in tandem when building models. DVC versions large data sets, while Git stores the .dvc files used to retrieve data from DVC remote storage. DVC supports popular remote storage options like Amazon S3 and Google Cloud Storage and is a great open-source option to version your data.
- LakeFS: LakeFS is an open-source platform similar to DVC that provides Git-like commands for performing model versioning operations. It also integrates storage options like AWS and GCP to store data. LakeFS minimizes data duplication via a copy-on-write mechanism and is well suited for enterprise use due to its high performance over data lakes of any size.
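As a rough sketch of how DVC and Git divide the work described above, the workflow below tracks a dataset with DVC while Git tracks only the small pointer file (the file paths, bucket name, and remote name are assumptions for illustration; this presumes Git and DVC are installed and remote storage is reachable):

```shell
# Initialize DVC inside an existing Git repository
git init && dvc init

# Track a large dataset with DVC; Git only sees the tiny .dvc pointer file
# that dvc add generates (data/train.csv itself is gitignored)
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

# Push the actual data to remote storage; 'storage' and the S3 bucket
# are hypothetical names for this sketch
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push

# Later, a teammate checks out the code and pulls the matching data version
git pull && dvc pull
```

Because the `.dvc` pointer file is committed to Git, checking out any past commit and running `dvc pull` retrieves the exact dataset that commit was built with, which is what makes models reproducible across the team.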
Model versioning is a valuable part of the development process for machine learning projects because of the need for collaboration, tracking changes to code and data, and monitoring the model’s performance over time. If you’re considering which tool might be right for you and your team, here are a few guidelines to keep in mind. The ideal tool will have the following features:
- Ease of use: Model versioning tools should be easy for you and your team to get familiar with. Machine learning is complex enough without these tools—using them should make building models simpler, not more complicated.
- Good stack integration: Whatever tool you choose to use should integrate well with your tech stack for smooth building. Fortunately, most tools built for model versioning are language-agnostic, so finding one that meets your needs shouldn’t be difficult.
- Data set structure support: Choosing a tool or platform that supports the kind of data you are working with ensures faster model development. Versioning tabular-based data is different from versioning audio, video, or image files. You’ll want to pick a tool that supports the data type you work with now—as well as any others you plan on working with in the future—to avoid needing to switch platforms later on.
If you’re in the market for a great ML platform that allows you to expertly version and track your models, be sure to consider Unbox. Get started today to learn about how Unbox can help you with error analysis, synthetic data generation, testing, deploying models, and more.