Understanding each piece in the ML infrastructure stack
Making sense of ML systems
Gourav Singh Bais
10 minute read · May 19, 2022, 4:39 PM
When machine learning (ML) engineers or data scientists work on an ML solution, they progress through a cycle of building, deploying, and scaling it. Data and infrastructure are two of the major areas of focus in building a pipeline for better model performance: more data generally improves model performance, but it also increases the demands on your infrastructure.
The machine learning infrastructure stack is the combination of all components required to build, deploy, and scale machine learning systems, including hardware, software, and other utilities.
As a machine learning engineer or a data scientist, you’ll work on a lot of projects throughout your career. These projects will present you with a multitude of technology options, from broad platforms that cover many functions to specialized tools that handle just one. Tools can be open source or paid, and some require deep expertise while others are simple enough for beginners to get started with. Which tools and technologies fit best depends on your use case and your level of experience, and a proper understanding of the ML infrastructure stack will make it easier for you to choose suitable technologies for your project.
In this article, you will learn about each piece of ML infrastructure stack in detail and explore some of the tools and technologies recommended for certain categories.
The importance of understanding the ML infrastructure stack
Usually, the ML infrastructure stack comprises three layers:
1. Data layer: This layer comprises all the tools required for data preprocessing, from data gathering to data analysis, data cleaning, and data visualization. Tools such as SQL, Python, pandas, and Tableau are part of this layer.
2. Model layer: This layer is responsible for building the machine learning models for your use cases, covering tasks like model experimentation, logging, and version control. Tools like Python, Jupyter Notebook, and Git are part of this layer.
3. Deployment layer: Once you are done with all the data preprocessing and model experimentation, you get a final model that is handed over to the deployment team as an artifact. This layer contains all the tools required to deploy the models, such as AWS, Azure, Kubernetes, or Jenkins.
Now that you know what the ML infrastructure stack is and have an understanding of the three layers that comprise it, let’s take a look at some of the important use cases of the ML stack:
- It allows you to choose which tools you want to use, as well as how they connect. When machine learning engineers work on any use case, they typically go through a series of steps wherein each step requires different tools and technologies. In the development of a project pipeline, each step is closely interlinked with the others—the output from one stage becomes the input for the next stage. Thus, it’s important that these tools be selected in such a way that they are easily embedded together to complete the project pipeline.
- It provides you with the facilities required to do effective ML operations. Machine learning solutions are built with the help of effective data analysis, data visualization, model experimentation, and successful deployment. Tools like Python and SQL make it easy for ML engineers to do the data gathering and analysis part, while tools like Tableau and WEKA make visualizations easy and effective. Libraries like scikit-learn and TensorFlow help ML engineers easily create machine learning and deep learning models. Finally, platforms like AWS, Azure, and Google Cloud help them deploy the models.
- It allows you to compare different tools. When someone develops any traditional software, there are many programming languages to choose from to write that application. Similarly, for a machine learning use case, each stage has several tool options based on your specific needs. By comparing different tools, you can select the best combination for your particular use case.
The various pieces of the ML infrastructure stack
The traditional way of developing a machine learning pipeline is to create a model using an ML stack. Here, data scientists and machine learning engineers develop models that are then passed to the operations team for deployment. However, miscommunication among teams can sometimes lead to project failure. Many organizations now tackle this issue with MLOps, which combines machine learning and operations practices to streamline ML pipeline development. The technologies used in the ML stack and the MLOps stack are largely identical, though MLOps incorporates some extra tools to make pipeline development more robust.
Image courtesy of INNOQ
Let’s take a look at the different pieces of the MLOps stack to get a better understanding of the role of each tool involved in ML pipeline development.
Data is the foundational element when developing machine learning solutions. If you do not have a good amount of data—or any data at all—having the right tool or technology to create the solution is irrelevant.
Once you are done with requirement analysis and have your data ready to use, you need to analyze the data well before commencing the model building phase. Data collected for any use case cannot be used directly, as it may not be uniform or it could contain some corrupted values. The data is first cleaned to remove those corrupted values, and then the data is transformed into a format suitable for model training. For data cleaning and transformation, there are several tools available.
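As a minimal sketch of that cleaning-and-transformation step, the snippet below uses pandas on a tiny, hypothetical dataset (the column names and values are made up for illustration): rows with corrupted values are dropped, an inconsistent text format is normalized, and a numeric column is scaled for model training.

```python
import pandas as pd

# Hypothetical raw data with a missing value and an inconsistent format
df = pd.DataFrame({
    "age": [25, None, 47, 31],
    "salary": ["50k", "62000", "75000", "58000"],
})

# Clean: drop rows with missing values, normalize the salary format
df = df.dropna()
df["salary"] = df["salary"].str.replace("k", "000").astype(int)

# Transform: scale age to the [0, 1] range for model training
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
```

Real pipelines apply the same ideas (handling missing values, fixing formats, scaling) across many columns, often wrapped in reusable functions so the transformation can be replayed on new data.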
Python is an open-source programming language often preferred by data scientists. It provides many libraries that can be imported directly into a program, which reduces the effort of implementing machine learning algorithms and related concepts from scratch. R is another popular language choice among data scientists.
While working with tabular data, you’ll often read it from Excel or CSV files. The pandas library not only helps you read these files but also provides many functions for working with the resulting DataFrames, such as merging, replacing values, searching for elements, and so on.
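The operations mentioned above can be sketched in a few lines. The DataFrames and column names here are invented purely for illustration:

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "dept": ["ml", "ops", "ml"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 3], "salary": [70000, 65000, 72000]})

# Merge two DataFrames on a shared key
df = employees.merge(salaries, on="emp_id")

# Replace values in a column
df["dept"] = df["dept"].replace({"ml": "machine-learning"})

# Search/filter rows matching a condition
ml_team = df[df["dept"] == "machine-learning"]
```

In practice, you'd read the DataFrames from files with `pd.read_csv` or `pd.read_excel` rather than constructing them inline.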
Once your data is cleansed and transformed, your next task is to find a machine learning model that performs best on your data. Getting to the best model involves trying out different models with different sets of parameters. This is referred to as model experimentation, and it is done with the help of a programming language: you write different code snippets to try out different models. To write the code, data scientists commonly use two different tools:
Jupyter Notebook is a web-based platform that allows you to write and run code in small blocks called cells. These notebooks are very helpful in the data cleaning and data exploration phases, as you often need to inspect your data and graphs after each step.
Code editors are generally preferred by software engineers, but data scientists and ML engineers also use them at the end of the model-experimentation phase since applications can’t be directly deployed in the form of notebooks. Code editors help developers convert code snippets from notebooks into Python files. There are plenty of options available for code editors; you can check out a list of those widely used by data scientists and machine learning engineers here.
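A simple model-experimentation loop might look like the sketch below, which uses scikit-learn to compare a few candidate models by cross-validation on a synthetic dataset (the candidate list and dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for your cleaned, transformed dataset
X, y = make_classification(n_samples=500, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Score each candidate with 5-fold cross-validation and pick the best
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
```

Real experimentation usually adds hyperparameter search (e.g. `GridSearchCV`) and logs every run, which is where the experiment-tracking and registry tools discussed later come in.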
Imagine you are working on a project and have tried a particular model out for a day. The next day, your system restarts, which means you have to repeat the data cleaning and data transformation phase all over again. Feature stores can help you avoid having to redo all that work. After preprocessing your data, you can store it in a feature store, giving you the opportunity to reuse it whenever it’s needed. Some examples of feature stores include the following:
Databricks Feature Store is built around Spark DataFrames. It provides a UI for easy interaction and also supports SQL querying.
There are various other options available for feature stores, which you can check out here.
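To make the idea concrete, here is a toy, file-backed stand-in for a feature store; real products like Databricks Feature Store add lineage, SQL access, and online serving on top of this basic write-once, read-many pattern. The class, directory name, and features below are all hypothetical:

```python
import pandas as pd
from pathlib import Path

class SimpleFeatureStore:
    """Toy illustration of the feature-store idea: persist
    preprocessed features once, then reload them by name later
    instead of redoing the preprocessing."""

    def __init__(self, root="feature_store"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def write(self, name, df):
        df.to_csv(self.root / f"{name}.csv", index=False)

    def read(self, name):
        return pd.read_csv(self.root / f"{name}.csv")

store = SimpleFeatureStore()
features = pd.DataFrame({"user_id": [1, 2], "avg_spend": [120.5, 80.0]})
store.write("user_features", features)

reloaded = store.read("user_features")  # reusable in a later session
```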
Data scientists typically don’t work alone; many ML solutions are developed in a team environment. If you’re working on one module of a project while other members are working on other modules, at the end of the day, all the changes need to be merged to form the complete system. You also need to manage different versions of your application, as you may add new features over time. Doing this manually through file transfers or a local file store is impractical and error-prone; version-control tools such as Git handle both merging and versioning for you.
A machine learning pipeline is a series of steps needed to build, deploy, and scale your model. It’s a very crucial thing for an organization, as it comprises all the stages required to create web-based or API-based applications. Each stage’s completion may require different libraries and dedicated hardware resources in order to make the ML project successful. To learn more about machine learning pipelines in detail, you can refer to this article.
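On a small scale, the pipeline idea can be sketched with scikit-learn's `Pipeline`, which chains preprocessing and training into one object so the same steps run identically in development and production (the steps and data here are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each stage's output feeds the next, just like a full ML pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),      # preprocessing stage
    ("model", LogisticRegression()),  # training stage
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
```

Organization-level pipelines extend this pattern across machines and teams, adding data ingestion, validation, deployment, and monitoring stages with orchestration tools.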
Kubeflow is mainly used for deploying machine learning applications. If you want to deploy your applications as containers on Kubernetes, Kubeflow is a great choice.
A model registry is a repository used for version control of trained machine learning models. It keeps track of ML models as they move from team to team, from training all the way through deployment and retirement. It’s very difficult for data scientists to keep track of everything they have tried throughout a project’s lifecycle; a model registry helps them document all the models and parameters by assigning each a unique ID. Examples of popular model registries include the following:
SageMaker Model Registry is a part of multiple offerings from Amazon’s SageMaker. It can be used to easily store models and their different parameters at any time during development.
MLflow Model Registry is an open-source model registry. It offers model versioning, stage transitions, annotations, and more, all of which help you manage machine learning models easily during development.
To learn more about model registries in detail, take a look at this article.
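The core bookkeeping a registry does (unique IDs, version numbers, lifecycle stages) can be illustrated with a toy in-memory sketch; products like MLflow and SageMaker persist this information and add much more, but the shape of the data is similar. All names and stages below are hypothetical:

```python
import uuid

class ModelRegistry:
    """Toy sketch of what a model registry tracks: each registered
    model gets a unique ID, a version number, and a lifecycle stage."""

    def __init__(self):
        self._models = {}

    def register(self, name, artifact, params):
        # Version number increments per model name
        version = sum(1 for m in self._models.values() if m["name"] == name) + 1
        model_id = uuid.uuid4().hex
        self._models[model_id] = {
            "name": name, "version": version,
            "artifact": artifact, "params": params, "stage": "staging",
        }
        return model_id

    def promote(self, model_id, stage):
        self._models[model_id]["stage"] = stage  # e.g. "production"

    def get(self, model_id):
        return self._models[model_id]

registry = ModelRegistry()
mid = registry.register("fraud_detector", artifact="model_v1.pkl",
                        params={"n_estimators": 100})
registry.promote(mid, "production")
```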
When data scientists develop machine learning modules, those modules can be made available via APIs so they can be incorporated into other applications as well. Consider an example: you have created a face-detection model for marking employee attendance in an office, and later you are working on a new use case like KYC (“know your customer”) verification. For this new project, you don’t need to implement face detection from scratch—you can reuse your previously created face-detection model. Model serving is simply the process of hosting machine learning models in the cloud (or on-premises) and making them accessible in the form of APIs so that they can be incorporated into other applications.
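A bare-bones serving setup separates the model's inference call from a thin HTTP layer. The sketch below uses only the Python standard library; the `predict` function, its inputs, and the scoring formula are hypothetical stand-ins for a real trained model (frameworks like FastAPI, Flask, or TensorFlow Serving are what you'd actually reach for):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a trained model's inference call; the formula
    and feature names are invented for illustration."""
    score = 0.7 * features["amount_zscore"] + 0.3 * features["country_risk"]
    return {"fraud_score": round(score, 3)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON request body, run inference, return JSON
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = predict(json.loads(body))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(result).encode())

# To actually serve:
# HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```

Other applications would then call `POST /` with a JSON payload of features and receive a score, never needing to know how the model works internally.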
Once you are done with model development and deployment, the work isn’t over. There may be scenarios where your model performs very well in training but fails miserably on real-world data. The only way to know for sure is to monitor it continuously after deployment. This process is called model monitoring, and it is the best way to ensure an acceptable level of performance from your model.
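One common monitoring signal is data drift: comparing a feature's distribution in production against what the model saw in training. The sketch below implements the Population Stability Index (PSI), a widely used drift metric, on synthetic data invented for illustration:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index (PSI): compares a feature's
    distribution at training time vs. in production."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((actual_pct - expected_pct)
                        * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 5000)   # distribution seen in training
live_stable = rng.normal(0, 1, 5000)     # production data, no drift
live_drifted = rng.normal(0.5, 1, 5000)  # production data, mean has shifted

stable_psi = psi(train_feature, live_stable)    # small value: no alarm
drifted_psi = psi(train_feature, live_drifted)  # larger value: investigate
```

A common rule of thumb is that PSI above roughly 0.1 warrants investigation and above 0.25 signals significant drift, though thresholds should be tuned per feature.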
When you train a machine learning model, a lot of information is generated, including model parameters, evaluation metrics, dataset versions, pipeline outputs, weight files, model-architecture files, and so on. This information is called model metadata, and it needs to be stored so that you can easily use it again as needed.
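At its simplest, metadata storage is just writing a structured record per training run; dedicated tools add search, comparison, and UI on top. The sketch below uses only the standard library, and the run directory, parameters, and metrics are hypothetical:

```python
import json
import time
from pathlib import Path

def log_run_metadata(run_dir, params, metrics, dataset_version):
    """Persist training metadata so a run can be looked up
    and reproduced later."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    metadata = {
        "timestamp": time.time(),
        "params": params,                    # e.g. hyperparameters
        "metrics": metrics,                  # e.g. evaluation scores
        "dataset_version": dataset_version,  # which data produced this model
    }
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata

meta = log_run_metadata("runs/exp_001",
                        params={"learning_rate": 0.01, "epochs": 20},
                        metrics={"accuracy": 0.91},
                        dataset_version="v3")
```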
In the real world, you’ll often see use cases where positive class samples occur far less often than negative class samples. One great example of this is credit card fraud detection. In this use case, you might notice that the number of true fraud cases is significantly lower than the number of nonfraudulent cases. Here, the negative/nonfraudulent class is called the majority class, and the positive/fraudulent class is called the minority class. These highly skewed proportions are referred to as imbalanced data. Due to these imbalanced classes, machine learning models tend to become biased toward the majority class and perform poorly on the minority class.
One of the solutions to this problem is to use synthetic data for minority classes. In the process of synthetic-data generation, new data points are generated for minority classes in order to bring more balance to the class samples. To learn more about synthetic data and how you can generate it, you can refer to this article.
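The interpolation idea behind SMOTE-style synthetic-data generation can be sketched in a few lines of NumPy: pick a minority sample, find its nearest minority neighbor, and create a new point somewhere on the line between them. The minority points below are made up for illustration, and real libraries (e.g. imbalanced-learn) are more robust:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, seed=None):
    """Generate synthetic minority samples by interpolating between
    a minority point and its nearest minority neighbor (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Nearest neighbor within the minority class (excluding itself)
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        dists[i] = np.inf
        j = int(np.argmin(dists))
        # New point lies on the segment between the two minority samples
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
new_points = smote_like_oversample(minority, n_new=5, seed=0)
```

Because each synthetic point is an interpolation between two real minority samples, the generated data stays inside the region the minority class already occupies rather than inventing arbitrary values.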
Unbox provides a wide variety of solutions for machine learning, including synthetic-data generation, data cleansing, and the relabeling and tuning of model architecture to improve the overall performance of your ML models.
In this article, you learned what an ML infrastructure stack is and what tools and technologies come together to create it. When data scientists and machine learning engineers work on a use case, their focus is typically on the problem statement and the machine learning models that solve it, not necessarily on the hardware and software needed to do so well. However, as you’ve seen in this guide, selecting the right infrastructure is essential to developing and deploying any solution, machine learning or traditional software, and to making an ML project successful.
As you’ve seen, choosing the right tools for your ML infrastructure stack is not a one-time process that works for all use cases within your organization—each type of machine learning solution requires a different kind of tech stack. Be sure to check out the resources mentioned in this article to learn more about the different tools available to create the right combination of technologies for your particular use case.