Python SDK Tutorial: Drift Analysis

Drift analysis shows how a model behaves when the data distribution changes. This notebook uses TruEra's Python SDK to examine drift between a model's train and test data; the same workflow can also be used to examine a model's behavior over time.

Before you begin ⬇️

Install the TruEra Python SDK
Check our primer on Explainer objects
Read our tutorial on local compute mode for the Python SDK

What we'll cover ☑️

First, we'll create a TruEra project. We'll use sample data from scikit-learn and create a project with a sample gradient boosted tree model. We'll also ingest train and test split data for this model.
We'll then compare the performance of the model on the train and test sets with an Explainer object.
Finally, we'll drill into the root causes of the instability between the two distributions, so we can understand and debug the model.

Step 1: Create a TruEra workspace

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

INFO:truera.client.remote_truera_workspace:Connecting to 'https://app.truera.net'

Step 2: Download sample data

Here we'll use the data from scikit-learn's California Housing dataset, which is a regression dataset. This is available directly from the sklearn.datasets module.

# Retrieve the data.

import pandas as pd
from sklearn.datasets import fetch_california_housing

data_bunch = fetch_california_housing()
XS_ALL = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_ALL = pd.DataFrame(data=data_bunch["target"], columns=["label"])

# Create train and test data splits.

from sklearn.model_selection import train_test_split

XS_TRAIN, XS_TEST, YS_TRAIN, YS_TEST = train_test_split(XS_ALL, YS_ALL, test_size=0.5, random_state=0)

Step 3: Add targeted noise to data

We'll add two kinds of noise to exaggerate the differences between our train and test sets for the purpose of this notebook:

1. Shift the HouseAge feature in the test data (but not the train data) by 10. This is an example of data drift.
2. In the train data, set the label to 0 wherever the HouseAge feature is at least 20 and less than 30. This is an example of mislabelled data points.

XS_TEST["HouseAge"] += 10
YS_TRAIN[(20 <= XS_TRAIN["HouseAge"]) & (XS_TRAIN["HouseAge"] < 30)] = 0
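
The effect of the two operations above can be checked on a small stand-in dataset. The following is a minimal sketch in plain Python, assuming randomly generated ages and labels (the names `ages_train`, `ages_test`, and `labels_train` are hypothetical and do not come from the California Housing data). It confirms that the shift moves the mean of the feature by 10 and that exactly the rows in the 20 to 30 band have their labels zeroed.

```python
import random
import statistics

random.seed(0)

# Hypothetical stand-ins for the HouseAge column and the label column.
ages_train = [random.randint(1, 52) for _ in range(1000)]
ages_test = [random.randint(1, 52) for _ in range(1000)]
labels_train = [random.uniform(0.5, 5.0) for _ in ages_train]

# Noise 1: shift the feature in the test data only (data drift).
ages_test_shifted = [age + 10 for age in ages_test]

# Noise 2: zero the train label wherever 20 <= age < 30 (mislabelling).
labels_train_noisy = [
    0.0 if 20 <= age < 30 else label
    for age, label in zip(ages_train, labels_train)
]

mean_shift = statistics.mean(ages_test_shifted) - statistics.mean(ages_test)
n_zeroed = sum(1 for label in labels_train_noisy if label == 0.0)
n_in_band = sum(1 for age in ages_train if 20 <= age < 30)

print(f"mean shift in test feature: {mean_shift}")
print(f"zeroed labels: {n_zeroed} (rows in band: {n_in_band})")
```

The same two checks could be run on the real splits before ingesting them, which helps confirm that the noise landed where intended.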

Step 4a: Create a project

tru.add_project("California Housing", score_type="regression")

Step 4b: Create the data collection and add split data

from truera.client.ingestion import ColumnSpec

column_spec = ColumnSpec(
    id_col_name="id",
    label_col_names="label",
    pre_data_col_names=data_bunch["feature_names"]
)

data_test = XS_TEST.merge(YS_TEST, left_index=True, right_index=True).reset_index(names="id")
data_train = XS_TRAIN.merge(YS_TRAIN, left_index=True, right_index=True).reset_index(names="id")

tru.add_data_collection("sklearn_data")
tru.add_data(data_train, data_split_name="train", column_spec=column_spec)
tru.add_data(data_test, data_split_name="test", column_spec=column_spec)

Uploading tmpgt0oaqxb.parquet (263.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: c93a200a-787b-4ebc-91e4-fdaf8ceb4a3b finished with status: SUCCEEDED.

Uploading tmp_6bulfjk.parquet (261.1KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 4b22bfed-b2c4-47b0-93c2-097595255da9 finished with status: SUCCEEDED.

Step 4c: Train and add a model to the data collection

# Train the model.

from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(random_state=0)
# Pass the label column as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning.
gb_model.fit(XS_TRAIN, YS_TRAIN["label"])

GradientBoostingRegressor(random_state=0)


# Add to TruEra workspace.

tru.add_python_model("gradient boosted", gb_model)
tru.compute_all()

INFO:truera.client.remote_truera_workspace:Uploading sklearn model: GradientBoostingRegressor
WARNING:truera.client.services.aiq_client:The number of records returned will not be the exact number requested but in the neighborhood of the start and stop limit provided.
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!

Uploading MLmodel (214.0B) -- ### -- file upload complete.
Uploading tmp1v9s7tvz (130.5KiB) -- ### -- file upload complete.
Uploading conda.yaml (210.0B) -- ### -- file upload complete.
Uploading sklearn_regression_predict_wrapper.py (431.0B) -- ### -- file upload complete.
Uploading sklearn_regression_predict_wrapper.cpython-310.pyc (1.1KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Model "gradient boosted" added and associated with data collection "sklearn_data". "gradient boosted" is set as the model in context.
INFO:truera.client.remote_truera_workspace:Model uploaded to: https://app.truera.net/home/p/California%20Housing/m/gradient%20boosted/
WARNING:truera.client.intelligence.remote_explainer:Background split for `data_collection` "sklearn_data" is currently not set. Setting it to "train"
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpfv1lb5j4
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

Uploading tmp59bf2b12.parquet (82.5KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 896eb82b-46b2-4437-a8ef-d90de8963dbd finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpfv1lb5j4
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

Uploading tmp28ash457.parquet (85.0KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: fa838de4-a8db-4e6a-b597-307922b06289 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Inferred error `score_type` to be "mean_absolute_error_for_regression"
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpfv1lb5j4
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

Uploading tmpg4wy9pjl.parquet (85.0KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 3154417a-cf11-4d1a-802c-3a7c21e6ee5d finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpfv1lb5j4
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

Uploading tmpgo91ut7i.parquet (81.5KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: be8bbfe5-898b-4fc7-b4e8-0eb508192a03 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpfv1lb5j4
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

Uploading tmp6tkbu238.parquet (85.0KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 0b6c6a2d-eb12-4399-85af-e81e611cc823 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Inferred error `score_type` to be "mean_absolute_error_for_regression"
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpfv1lb5j4
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

Uploading tmp953mfyt3.parquet (85.0KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 316e5fac-1cb1-44a4-862e-94782c2666aa finished with status: SUCCEEDED.
INFO:truera.client.remote_truera_workspace:Data collection in workspace context set to "sklearn_data".
INFO:truera.client.remote_truera_workspace:Setting model context to "gradient boosted".

Step 5: Examine model accuracy between train and test

Here, we create an explainer object with train as the base data split and test as the comparison data split. This makes it easy to compare performance across splits.

explainer = tru.get_explainer(base_data_split="train", comparison_data_splits=["test"])
explainer.compute_performance("RMSE")

	Split	RMSE
0	train	1.759295
1	test	1.701742

The RMSE differs between the two splits (about 1.76 on train versus 1.70 on test), so the model does not behave identically on the two distributions.
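
For reference, RMSE is the square root of the mean squared difference between labels and predictions. The sketch below is a plain-Python illustration of that definition, not the SDK's implementation; the helper `rmse` and the sample values are made up for the example. It also computes the absolute and relative gap from the two values in the table above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values, not taken from the California Housing data.
y_true = [3.0, 0.5, 2.0, 4.0]
y_pred = [2.5, 0.5, 2.0, 5.0]
print("toy RMSE:", rmse(y_true, y_pred))

# Gap between the two RMSE values reported in the table above.
train_rmse, test_rmse = 1.759295, 1.701742
gap = train_rmse - test_rmse
relative_gap = gap / train_rmse
print(f"absolute gap: {gap:.6f}, relative to train: {relative_gap:.1%}")
```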

Step 6: Find the cause of the instability

By examining which features shift in the influence space, we can identify potential causes of the instability. This analysis can be carried out in the Python SDK as below, or in the Stability workflow of the TruEra UI.

# Find the feature that has shifted the most.

instability = explainer.compute_feature_contributors_to_instability("regression")
instability.T.sort_values(by="test", ascending=False)

WARNING:truera.client.services.aiq_client:The number of records returned will not be the exact number requested but in the neighborhood of the start and stop limit provided.

	test
HouseAge	0.508728
MedInc	0.265151
AveOccup	0.093151
Latitude	0.066001
Longitude	0.038567
AveRooms	0.018439
AveBedrms	0.006592
Population	0.003371

Since the HouseAge feature contributes the most to the instability, let's plot its distribution in both train and test.

import matplotlib.pyplot as plt

plt.figure(figsize=(21, 6))
# Semi-transparent bars keep both histograms visible where they overlap.
XS_TRAIN["HouseAge"].hist(alpha=0.5)
XS_TEST["HouseAge"].hist(alpha=0.5)
plt.legend(["Train", "Test"])
plt.xlabel("`HouseAge` value")
plt.ylabel("Frequency")
plt.show()

The histogram shows that the distribution of HouseAge differs between the train data and the test data: the test values appear to be shifted upward by about 10. This matches the data drift we introduced in Step 3, so the analysis caught the issue.
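
A visual comparison can be backed up with a number. One generic option, which is not part of the TruEra workflow shown here, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDFs of the two samples. The sketch below implements it in plain Python on randomly generated ages that stand in for the real HouseAge columns; the function name `ks_statistic` and the sample data are assumptions made for illustration. In practice, `scipy.stats.ks_2samp` computes the same statistic along with a p-value.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

random.seed(0)
# Hypothetical ages standing in for XS_TRAIN["HouseAge"] and XS_TEST["HouseAge"].
train_ages = [random.randint(1, 52) for _ in range(2000)]
test_ages = [random.randint(1, 52) for _ in range(2000)]
shifted_test_ages = [age + 10 for age in test_ages]

ks_unshifted = ks_statistic(train_ages, test_ages)
ks_shifted = ks_statistic(train_ages, shifted_test_ages)
print("no shift:   ", ks_unshifted)
print("shifted +10:", ks_shifted)
```

A value near 0 means the two samples are distributed alike, while larger values indicate a bigger difference. On this toy data the shifted sample should score clearly higher than the unshifted one.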

Given the problematic behavior, let's also look at the influence sensitivity plot (ISP) of the feature.

explainer = tru.get_explainer(base_data_split="train")
explainer.plot_isp("HouseAge")

In the ISP, the region where HouseAge is between 20 and 30 looks anomalous, as we would expect: this is the range whose training labels we set to 0 in Step 3.
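
A simple cross-check that does not rely on the plot is to average the label within bands of the feature. The sketch below is a plain-Python illustration on synthetic data; the helper `mean_label_by_bin` and the generated ages and labels are hypothetical and only mimic the mislabelling from Step 3. A band whose mean label collapses toward 0 points at the same region the ISP highlights.

```python
import random
from collections import defaultdict

def mean_label_by_bin(feature_values, labels, bin_width=10):
    """Average label within fixed-width bins of a feature."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(feature_values, labels):
        bin_start = int(value // bin_width) * bin_width
        sums[bin_start] += label
        counts[bin_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}

random.seed(0)
# Hypothetical data: labels are zeroed where 20 <= age < 30, as in Step 3.
ages = [random.randint(1, 52) for _ in range(1000)]
labels = [0.0 if 20 <= age < 30 else random.uniform(0.5, 5.0) for age in ages]

binned = mean_label_by_bin(ages, labels)
for start, mean_label in binned.items():
    print(f"HouseAge in [{start}, {start + 10}): mean label = {mean_label:.2f}")
```

On the real train split, the same grouping could be done with the label column from YS_TRAIN and the HouseAge column from XS_TRAIN.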