Preface
Introduction
Chapter 1: Building Machine Learning Systems
Chapter 2: Machine Learning Pipelines
Chapter 3: Your Friendly Neighborhood Air Quality Forecasting Service (available)
Chapter 4: Feature Stores (available)
Chapter 5: Hopsworks Feature Store (unavailable)
Chapter 6: Model-Independent Transformations (unavailable)
Chapter 7: Model-Dependent Transformations (unavailable)
Chapter 8: Batch Feature Pipelines (unavailable)
Chapter 9: Streaming Feature Pipelines (unavailable)
Chapter 10: Training Pipelines (unavailable)
Chapter 11: Inference Pipelines (unavailable)
Chapter 12: MLOps (unavailable)
Chapter 13: Feature and Model Monitoring (unavailable)
Chapter 14: Vector Databases (unavailable)
Chapter 15: Case Study: Personalized Recommendations (unavailable)
Chapter 1. Building Machine Learning Systems
A NOTE FOR EARLY RELEASE READERS
With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.
This will be the 1st chapter of the final book. The GitHub repo can be found
at https://github.com/featurestorebook/mlfs-book.
If you have comments about how we might improve the content and/or examples in this book, or
if you notice missing material within this chapter, please reach out to the editor
at .
Imagine you have been tasked with producing a financial forecast for the upcoming financial year.
You decide to use machine learning as there is a lot of available data, but, not unexpectedly, the
data is spread across many different places—in spreadsheets and many different tables in the data
warehouse. You have been working for several years at the same organization, and this is not the
first time you have been given this task. Every year to date, the final output of your model has
been a PowerPoint presentation showing the financial projections. Each year, you trained a new model, it made a single prediction, and you were finished with it. Each year, you started effectively from scratch: you had to find the data sources (again), re-request access to the data to create the features for your model, and then dig out last year’s Jupyter notebook and update it with new data and improvements to your model.
This year, however, you realize that it may be worth investing the time in building the scaffolding
for this project so that you have less work to do next year. So, instead of delivering a PowerPoint presentation, you decide to build a dashboard. Instead of requesting one-off access to the data, you build feature pipelines that extract the historical data from its source(s) and compute the features (and labels) used in your model. You have an insight: the same feature pipelines can do two things, compute the historical features used to train your model and compute the features that will be used to make predictions with your trained model. Now, after training your model, you can connect it to the feature pipelines to make predictions that power your dashboard. You thank yourself one year later when you only have to tweak this ML system by adding, updating, or removing features and training a new model. The time you saved on the grunt work of data sourcing, cleaning, and feature engineering, you now use to investigate new ML frameworks and model architectures, resulting in a much improved financial model, much to the delight of your boss.
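The dual use of feature pipelines described above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the book: the column names, the toy revenue data, and the pandas-based pipeline are all assumptions.

```python
import pandas as pd

def compute_features(raw: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical feature pipeline: the *same* function computes features
    # from historical data (for training) and from new data (for inference).
    out = raw.copy()
    out["revenue_growth"] = out["revenue"].pct_change()         # period-on-period growth
    out["revenue_rolling_mean"] = out["revenue"].rolling(2).mean()
    return out.dropna()  # drop rows where the windowed features are undefined

# Training time: compute features over historical data (labels come elsewhere).
historical = pd.DataFrame({"revenue": [100.0, 110.0, 121.0, 133.1]})
train_features = compute_features(historical)

# Inference time: reuse the exact same pipeline on this year's raw data.
new_data = pd.DataFrame({"revenue": [133.1, 150.0, 160.0]})
inference_features = compute_features(new_data)
```

Because one function produces both training features and inference features, the two can never silently drift apart, which is the point of the insight above.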
The above example shows the difference between training a model to make a one-off prediction
on a static dataset versus building a batch ML system - a system that automates reading from data
sources, transforming data into features, training models, performing inference on new data with
the model, and updating a dashboard with the model’s predictions. The dashboard is the value
delivered by the model to stakeholders.
If you want a model to generate repeated value, the model should make predictions more than once.
That means, you are not finished when you have evaluated the model’s performance on a test set
drawn from your static dataset. Instead, you will have to build ML pipelines: programs that transform raw data into features, feed those features to your model for retraining, and feed new features to your model so that it can make predictions, generating more value with every prediction it makes.
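The kinds of programs just described can be sketched as follows. This is a toy illustration under stated assumptions: the function names are invented, and a trivial growth-factor "model" stands in for a real ML framework.

```python
# A minimal sketch of a batch ML system's pipelines. All names and the
# toy growth-factor "model" are illustrative assumptions, not the book's code.

def feature_pipeline(raw_values):
    """Turn raw historical values into (feature, label) training examples,
    where the label is simply the next period's value."""
    return [(raw_values[i], raw_values[i + 1]) for i in range(len(raw_values) - 1)]

def training_pipeline(examples):
    """'Train' a trivial model: the average period-on-period growth factor."""
    factors = [label / feature for feature, label in examples]
    avg_factor = sum(factors) / len(factors)
    return lambda x: x * avg_factor  # the trained model

def inference_pipeline(model, latest_value):
    """Run the model on new data; the prediction would update the dashboard."""
    return model(latest_value)

history = [100.0, 110.0, 121.0]                  # raw data from the warehouse
model = training_pipeline(feature_pipeline(history))
forecast = inference_pipeline(model, 121.0)      # ≈ 133.1
```

Each function maps onto one of the pipelines named above; in a real system, each would run as its own scheduled program rather than as calls in one script.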
You have now embarked on the journey from training models on static datasets to building ML systems. The most important part of that journey is working with dynamic data, see Figure 1-1. This means moving from static data, such as the hand-curated datasets used in ML competitions found on Kaggle.com, to batch data, datasets that are updated at some interval (hourly, daily, weekly, yearly), to real-time data.
Figure 1-1. An ML system that only generates a one-off prediction on a static dataset generates less business value than an ML system that can make predictions on a schedule with batches of input data. ML systems that can make predictions with real-time data are more technically challenging, but can create even more business value.
An ML system is a software system that manages the two main life cycles of a model: training and inference (making predictions).
The Evolution of Machine Learning Systems
In the mid-2010s, revolutionary ML systems started appearing in consumer Internet applications, such as image tagging on Facebook and Google Translate. The first generation of ML systems were either batch ML systems that make predictions on a schedule, see Figure 1-2, or interactive online ML systems that make predictions in response to user actions, see Figure 1-3.