How to train an ML model
In this article, you will learn how to manage ML model training with Quix. In this example, we will train a model to predict car braking on a racing circuit 5 seconds ahead of time.
Why this is important
With the Quix platform, you can use historic data to train a model that reacts to incoming source data with millisecond latency.
End result
At the end of this article, we will have a pickle file containing a model trained on historic data.
Preparation
You will need Python 3 installed.
You’ll need some data stored in the Quix platform. You can use any of the Data Sources available in the samples Library, or just follow the onboarding process when you sign up to Quix.
Tip
If in doubt, log in to the Quix Portal, navigate to the Library and deploy "Demo Data - Source".
This will provide you with some real-time data for your experiments.
You’ll also need a Jupyter notebook environment to run your experiments and load data for training. To set one up, see "How to work with Jupyter notebook".
Install required libraries
python3 -m pip install seaborn
python3 -m pip install scikit-learn
python3 -m pip install mlflow
python3 -m pip install matplotlib
Note
If you get an 'Access Denied' error when installing mlflow, try adding '--user' to the install command, or run the installer from an Anaconda PowerShell Prompt (with --user).
Tip
If you don’t see Python3 kernel in your Jupyter notebook, execute the following commands in your python environment:
Necessary imports
To execute the code blocks below, start by importing these libraries. Add this code to the top of your Jupyter notebook.
import math
import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier
Training ML model
Getting training data
The Quix web application has a Python code generator to help you connect your Jupyter notebook with Quix.
You need to be logged into the platform for this:
- Select your workspace (you likely only have one)
- Go to the Data Explorer
- Add a query to visualize some data. Select parameters, events, aggregation and time range
Note
Select the Brake, Motion_WorldPositionX, Steer, Speed and Gear parameters and turn off aggregation!
- Select the Code tab
- Ensure Python is the selected language
Copy the Python code to your Jupyter notebook and execute it.
Tip
If you want to use this generated code for a long time, replace the temporary token with a personal access token (PAT). See authenticate your requests for how to do that.
Preprocessing of features
We will prepare the data for training by applying a transformation to the downloaded data.
Execute this in your notebook:
## Convert motion to continuous values
df["Motion_WorldPositionX_sin"] = df["Motion_WorldPositionX"].map(lambda x: math.sin(x))
df["Motion_WorldPositionX_cos"] = df["Motion_WorldPositionX"].map(lambda x: math.cos(x))
Preprocessing of label
Here we simplify braking to a boolean value.
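The conversion itself is not shown here, so the following is only a minimal sketch. It assumes the Brake column holds values between 0 (released) and 1 (fully pressed) and that 0.5 is an acceptable cut-off; both are assumptions to adjust for your data. The small DataFrame is a stand-in for the df loaded from Quix.

```python
import pandas as pd

# Stand-in for the DataFrame downloaded from Quix (values are illustrative only).
df = pd.DataFrame({"Brake": [0.0, 0.2, 0.7, 1.0, 0.4]})

# Assumed threshold: treat anything above half pedal travel as "braking".
BRAKE_THRESHOLD = 0.5

# Store the label as 0/1 so it can later be shifted and compared with an all-zeros baseline.
df["Brake_bool"] = (df["Brake"] > BRAKE_THRESHOLD).astype(int)

print(df["Brake_bool"].tolist())
```

In your notebook, skip the stand-in DataFrame and apply only the last assignment to your own df.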
Generate advanced brake signal for training
Now we need to shift the braking signal by 5 seconds so that the model learns to predict braking 5 seconds ahead.
## Offset dataset and trim it
NUM_PERIODS = -round(5e9/53852065.77281786)
df["Brake_shifted_5s"] = df["Brake_bool"].shift(periods=NUM_PERIODS)
df = df.dropna(axis='rows')
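The constant 53852065.77281786 in the snippet above appears to be the average interval between samples in nanoseconds (roughly 54 ms), which makes 5e9 divided by it the number of rows that span five seconds. If your data has a different sampling rate, the value could be derived from the timestamps instead of being hard-coded. The sketch below shows one way to do that on synthetic data; the column name "time" and the nanosecond unit are assumptions, so check them against your own DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 200 samples spaced 50 ms apart, timestamps in nanoseconds.
# The column name "time" is an assumption; use your DataFrame's timestamp column.
df = pd.DataFrame({"time": np.arange(200, dtype="int64") * 50_000_000})
df["Brake_bool"] = (np.arange(200) % 40 < 5).astype(int)

HORIZON_NS = 5e9  # 5 seconds expressed in nanoseconds

# Mean interval between consecutive samples, in nanoseconds.
mean_period_ns = df["time"].diff().mean()

# A negative shift pulls future values up to the current row.
num_periods = -int(round(HORIZON_NS / mean_period_ns))

df["Brake_shifted_5s"] = df["Brake_bool"].shift(periods=num_periods)

# The last rows have no value 5 seconds ahead, so they are dropped.
df = df.dropna(axis="rows")

print(num_periods, len(df))
```

Each remaining row now pairs the current features with the braking state roughly five seconds later.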
Let's review it in a plot:
plt.figure(figsize=(15, 8))
plt.plot(df["Brake_shifted_5s"])
plt.plot(df["Brake_bool"])
plt.legend(['Shifted', 'Unshifted'])
Fit, predict and score a model
Calculate class weighting in case we gain any accuracy by performing class balancing.
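The weighting code is not shown here. As a minimal sketch, the snippet below computes "balanced" weights, where each class gets n_samples / (n_classes * class_count), so the rarer braking class receives the larger weight. The short Series is a stand-in for the label column (assumed to be df["Brake_shifted_5s"]), and the choice of this particular weighting scheme is an assumption rather than the method used in the original experiment.

```python
import pandas as pd

# Stand-in label column; in the notebook this would be df["Brake_shifted_5s"].
Y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Brake_shifted_5s")

# Number of samples per class.
counts = Y.value_counts()

# "Balanced" weighting: n_samples / (n_classes * count_of_class).
class_weight = {
    int(label): float(len(Y) / (len(counts) * count))
    for label, count in counts.items()
}

print(class_weight)
```

A dictionary like this can be passed as the class_weight argument of DecisionTreeClassifier in place of None. scikit-learn also accepts class_weight="balanced", which applies the same formula internally.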
Experiment
In the following code snippet, we run an experiment using MLflow. Notice that the last three lines log MLflow metrics for each experiment so that experiments can be compared later.
kfold = KFold(5, shuffle=True, random_state=1)
fold_results = []  # one dict of accuracies per fold

with mlflow.start_run():
    class_weight = None
    max_depth = 5
    features = ["Motion_WorldPositionX_cos", "Motion_WorldPositionX_sin", "Steer", "Speed", "Gear"]

    mlflow.log_param("class_weight", class_weight)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("features", features)
    mlflow.log_param("model_type", "DecisionTreeClassifier")

    X = df[features]
    Y = df["Brake_shifted_5s"]  # label: braking 5 seconds ahead
    decision_tree = DecisionTreeClassifier(class_weight=class_weight, max_depth=max_depth)

    for train, test in kfold.split(X):
        X_train = X.iloc[train]
        Y_train = Y.iloc[train]
        X_test = X.iloc[test]
        Y_test = Y.iloc[test]

        # Train model
        decision_tree.fit(X_train, Y_train)
        Y_pred = decision_tree.predict(X_test)

        # Assess accuracy against an "always predict no braking" baseline
        train_accuracy = round(decision_tree.score(X_train, Y_train) * 100, 2)
        test_accuracy = round(decision_tree.score(X_test, Y_test) * 100, 2)
        Y_baseline_zeros = np.zeros(Y_train.shape)
        baseline_train_accuracy = round(accuracy_score(Y_train, Y_baseline_zeros) * 100, 2)
        Y_baseline_zeros = np.zeros(Y_test.shape)
        baseline_test_accuracy = round(accuracy_score(Y_test, Y_baseline_zeros) * 100, 2)

        fold_results.append({
            "Baseline Training Accuracy": baseline_train_accuracy,
            "Model Training Accuracy": train_accuracy,
            "Baseline Testing Accuracy": baseline_test_accuracy,
            "Model Testing Accuracy": test_accuracy
        })

    # DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
    model_accuracy = pd.DataFrame(fold_results)

    mlflow.log_metric("train_accuracy", model_accuracy["Model Training Accuracy"].mean())
    mlflow.log_metric("test_accuracy", model_accuracy["Model Testing Accuracy"].mean())
    mlflow.log_metric("fit_quality", 1/abs(model_accuracy["Model Training Accuracy"].mean() - model_accuracy["Model Testing Accuracy"].mean()))
Review the model accuracy for each fold of the experiment:
Fold | Baseline Training Accuracy | Model Training Accuracy | Baseline Testing Accuracy | Model Testing Accuracy |
0 | 88.97 | 97.93 | 86.49 | 86.49 |
1 | 87.59 | 97.24 | 91.89 | 83.78 |
2 | 89.04 | 96.58 | 86.11 | 88.89 |
3 | 88.36 | 97.95 | 88.89 | 83.33 |
4 | 88.36 | 97.95 | 88.89 | 80.56 |
Table with model accuracy preview
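To make the logged metrics concrete, the following sketch recomputes them by hand from the per-fold values in the table above. It uses plain Python, and the fit_quality formula is the one from the experiment code.

```python
# Per-fold accuracies copied from the table above (percent).
model_train = [97.93, 97.24, 96.58, 97.95, 97.95]
model_test = [86.49, 83.78, 88.89, 83.33, 80.56]
baseline_test = [86.49, 91.89, 86.11, 88.89, 88.89]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


train_accuracy = mean(model_train)
test_accuracy = mean(model_test)
baseline_test_accuracy = mean(baseline_test)

# Same formula as the fit_quality metric logged to MLflow:
# the smaller the gap between training and testing accuracy, the larger the value.
fit_quality = 1 / abs(train_accuracy - test_accuracy)

print(round(train_accuracy, 2))          # 97.53
print(round(test_accuracy, 2))           # 84.61
print(round(baseline_test_accuracy, 2))  # 88.45
print(round(fit_quality, 3))             # 0.077
```

For this particular run, the mean testing accuracy is below the all-zeros baseline and about 13 points below the training accuracy, which suggests the tree is overfitting and that other max_depth or class_weight settings are worth trying.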
Prediction preview
Let’s plot actual versus predicted braking using the trained model:
f, (ax1, ax2) = plt.subplots(2, 1, sharey=True, figsize=(50,8))
ax1.plot(Y)
ax1.plot(X["Speed"]/X["Speed"].max())
ax2.plot(decision_tree.predict(X))
ax2.plot(X["Speed"]/X["Speed"].max())
Saving model
When you are confident with the results, save the model into a file.
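A minimal sketch of saving and reloading a model with pickle is shown below. The tiny classifier is a stand-in for the decision_tree fitted earlier, and the file name is hypothetical; pick whatever name suits your project.

```python
import pickle

from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model; in the notebook you would use the fitted decision_tree.
decision_tree = DecisionTreeClassifier(max_depth=5, random_state=1)
decision_tree.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

MODEL_PATH = "decision_tree_model.pkl"  # hypothetical file name

# Serialize the fitted model to disk.
with open(MODEL_PATH, "wb") as f:
    pickle.dump(decision_tree, f)

# Load it back and check that it still predicts.
with open(MODEL_PATH, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict([[0.5], [2.5]]))
```

Only unpickle files you trust, and load the model with the same scikit-learn version that was used to save it where possible.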
Tip
The pickle file will be located in the folder where the jupyter notebook command was executed.
MLflow
To help you manage experiments, you can review them in MLflow.
Warning
MLflow works only on macOS, Linux, or Windows Subsystem for Linux (WSL).
Tip
To have some meaningful data, run the experiment with three different values of the max_depth parameter.
Let’s leave the Jupyter notebook for now, go back to the command line, and run the MLflow server:
Select experiments to compare:
Plot metrics from experiments: