Python SDK Tutorial: Basic Local Compute Flow¶

In this notebook tutorial, we'll use TruEra's Python SDK to show a basic local compute flow.

What does "local compute" mean? What's different about it?¶

Sometimes it is desirable to perform computations locally, especially if your model is hard to execute in a remote environment. In local compute mode, you use the SDK to analyze your model on your local machine. This unlocks a limited set of TruEra features wherever you have the Python SDK installed.

Before you begin ⬇️¶

⚠️ Make sure you have the truera SDK package installed before going through this tutorial. See the Installation Instructions for additional help. You will also need to install an explanation package of your choice. We recommend TruEra's truera-qii package if you have access to it; otherwise, you can use SHAP, which may be slower and less accurate.

👉 You can download and run this notebook by navigating to the Downloads page of your deployment and downloading the "Python SDK Local Quickstart" example notebook.

What we'll cover ☑️¶

Create a project with some split data and models.
Compare performance and explanations across models.

In this local compute flow, we create the data splits from plain pandas DataFrame objects.

Step 1: Connect to TruEra endpoint¶

What do I need to connect to my TruEra deployment?¶

Your TruEra deployment URL. For most users, it takes the form https://<your-truera-access-url>.
Some form of authentication (basic auth or token auth).

For examples of how to authenticate, see Authentication in the Diagnostics Quickstart. Here, we will use token authentication.

# FILL ME! 

TRUERA_URL = "<TRUERA_URL>"
TOKEN = '<AUTH_TOKEN>'

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

INFO:truera.client.remote_truera_workspace:Connecting to 'https://app.truera.net'
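
If you would rather not paste a token into the notebook, one option is to read both values from environment variables. The sketch below is only an illustration: the helper load_truera_credentials and the variable names TRUERA_URL and TRUERA_TOKEN are assumptions made for this example and are not defined by the SDK.

```python
import os


def load_truera_credentials(environ=None):
    """Return (url, token) read from environment variables.

    The variable names TRUERA_URL and TRUERA_TOKEN are hypothetical
    choices for this sketch; the SDK does not define them.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in ("TRUERA_URL", "TRUERA_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Drop a trailing slash so the URL has a single canonical form.
    return env["TRUERA_URL"].rstrip("/"), env["TRUERA_TOKEN"]


# Example with an explicit mapping, so the sketch runs without a real deployment.
url, token = load_truera_credentials(
    {"TRUERA_URL": "https://example.invalid/", "TRUERA_TOKEN": "not-a-real-token"}
)
print(url)
```

Called without arguments, the helper reads from os.environ; the two returned values can then be passed on in place of the hard-coded TRUERA_URL and TOKEN above.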

Step 2: Download sample data¶

Here we'll use scikit-learn's California housing dataset, which can be downloaded with fetch_california_housing from the sklearn.datasets module.

# Retrieve the data.

import pandas as pd
from sklearn.datasets import fetch_california_housing

data_bunch = fetch_california_housing()
XS_ALL = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_ALL = pd.DataFrame(data=data_bunch["target"], columns=["label"])

# Create train and test data splits.

from sklearn.model_selection import train_test_split

XS_TRAIN, XS_TEST, YS_TRAIN, YS_TEST = train_test_split(XS_ALL, YS_ALL, test_size=100)

data_all = XS_ALL.merge(YS_ALL, left_index=True, right_index=True).reset_index(names="id")
data_test = XS_TEST.merge(YS_TEST, left_index=True, right_index=True).reset_index(names="id")
data_train = XS_TRAIN.merge(YS_TRAIN, left_index=True, right_index=True).reset_index(names="id")
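
The merge-and-reset pattern above joins features and labels on their shared row index and then turns that index into an "id" column, so every row keeps the identifier it had in the full dataset even after shuffling. A minimal sketch on toy data (the values below are made up for illustration):

```python
import pandas as pd

# Toy features and labels that share a non-contiguous index, as a split would.
xs = pd.DataFrame({"MedInc": [8.3, 7.2, 5.6]}, index=[17412, 2176, 6196])
ys = pd.DataFrame({"label": [4.5, 3.5, 3.4]}, index=[17412, 2176, 6196])

# Join on the index, then expose the index as a regular "id" column.
# rename_axis(...).reset_index() gives the same result as
# reset_index(names="id"), which needs pandas 1.5 or newer.
toy_split = (
    xs.merge(ys, left_index=True, right_index=True)
    .rename_axis("id")
    .reset_index()
)

print(toy_split)
print("ids unique:", toy_split["id"].is_unique)
```

Checking that the "id" column is unique before uploading is a cheap sanity test, since the column is later declared as the id column of each split.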

Step 3: Create a project¶

Note how similar this is to the remote flow demonstrated in the Diagnostics Quickstart. The notable difference is that we set model execution to local with tru.set_model_execution("local"); the subsequent commands are nearly identical.

tru.set_model_execution("local")
tru.add_project("California Housing-2", score_type="regression")

INFO:truera.client.truera_workspace:Model execution environment set to 'local'

Step 4: Add the data collection and data split¶

Here we're adding data via simple pd.DataFrames.

from truera.client.ingestion import ColumnSpec

column_spec = ColumnSpec(
    id_col_name="id",
    pre_data_col_names=XS_ALL.columns.to_list(),
    label_col_names=YS_ALL.columns.to_list()
)

tru.add_data_collection("sklearn_data")
tru.add_data(data_all, data_split_name="all", column_spec=column_spec)
tru.add_data(data_train, data_split_name="train", column_spec=column_spec)
tru.add_data(data_test, data_split_name="test", column_spec=column_spec)

Uploading tmpfnulj0_c.parquet (927.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 7b3ddd3b-98fd-4c6b-9066-ff8b48e4beeb finished with status: SUCCEEDED.

Uploading tmp0qg4kbic.parquet (933.0KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: ef4b9ea8-e310-438f-8bbd-c0f4db42694f finished with status: SUCCEEDED.

Uploading tmp2hfxwk0i.parquet (13.6KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 832e5316-e5a7-4fff-9e90-2ed9ac48a8b4 finished with status: SUCCEEDED.
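
Because the column spec refers to columns by name, a quick local check that the names match the DataFrame can catch mistakes before anything is uploaded. The helper below, check_column_spec, is a hypothetical function written for this sketch; it works on plain lists of names and does not call the TruEra SDK.

```python
def check_column_spec(frame_columns, id_col, feature_cols, label_cols):
    """Return a list of problems found; an empty list means the spec fits the frame.

    Hypothetical helper for this sketch; it only inspects plain column names.
    """
    problems = []
    available = set(frame_columns)
    declared = [id_col, *feature_cols, *label_cols]

    # Every declared column must exist in the frame.
    for name in declared:
        if name not in available:
            problems.append(f"missing column: {name}")

    # A column should play only one role (id, feature, or label).
    duplicates = {name for name in declared if declared.count(name) > 1}
    for name in sorted(duplicates):
        problems.append(f"column declared more than once: {name}")

    # Columns in the frame that the spec does not mention.
    for name in sorted(available - set(declared)):
        problems.append(f"column not covered by the spec: {name}")
    return problems


columns = ["id", "MedInc", "HouseAge", "label"]
print(check_column_spec(columns, "id", ["MedInc", "HouseAge"], ["label"]))
print(check_column_spec(columns, "id", ["MedInc"], ["target"]))
```

With the objects from this tutorial, the call would be check_column_spec(data_all.columns, "id", XS_ALL.columns.to_list(), YS_ALL.columns.to_list()).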

Step 5: Train and add a linear regression model¶

# Train the model.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr_model = LinearRegression()
lr_model.fit(XS_TRAIN, YS_TRAIN)
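# Take the square root explicitly: the squared=False argument was deprecated
# in scikit-learn 1.4 and is not accepted by newer releases.
lr_rmse = mean_squared_error(YS_TEST, lr_model.predict(XS_TEST)) ** 0.5
print(f"RMSE = {lr_rmse}")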

RMSE = 0.553543059676275
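
RMSE is the square root of the mean squared difference between the true and predicted targets. Since the California housing target is expressed in units of $100,000, an RMSE of about 0.55 corresponds to a typical error of roughly $55,000. The computation can be sketched in plain Python on toy values (not taken from the housing data):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values: the errors are 1, 0 and -2, so the mean squared error is 5/3.
toy_rmse = rmse([3.0, 1.0, 2.0], [2.0, 1.0, 4.0])
print(toy_rmse)
```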

We can add the model itself via tru.add_python_model(), which supports a number of model frameworks out of the box.

# Add to TruEra workspace.

tru.add_python_model("linear regression", lr_model)
tru.compute_all()

INFO:truera.client.remote_truera_workspace:Uploading sklearn model: LinearRegression
WARNING:truera.client.services.aiq_client:The number of records returned will not be the exact number requested but in the neighborhood of the start and stop limit provided.
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!

Uploading MLmodel (214.0B) -- ### -- file upload complete.
Uploading conda.yaml (210.0B) -- ### -- file upload complete.
Uploading tmp98t53eh8 (774.0B) -- ### -- file upload complete.
Uploading sklearn_regression_predict_wrapper.py (431.0B) -- ### -- file upload complete.
Uploading sklearn_regression_predict_wrapper.cpython-310.pyc (1.1KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Model "linear regression" added and associated with data collection "sklearn_data". "linear regression" is set as the model in context.
INFO:truera.client.remote_truera_workspace:Model uploaded to: https://app.truera.net/home/p/California%20Housing-2/m/linear%20regression/
WARNING:truera.client.intelligence.remote_explainer:Background split for `data_collection` "sklearn_data" is currently not set. Setting it to "all"
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

Uploading tmpf_zwt6vg.parquet (337.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: c428fe02-f1b9-48d5-b79b-249ea1ed1789 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

|          | 0.000% [00:00<?]

Uploading tmp7qpo31hg.parquet (85.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: fe594ba2-4399-4d50-a19e-935bcf7f83f0 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Inferred error `score_type` to be "mean_absolute_error_for_regression"
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

|          | 0.000% [00:00<?]

Uploading tmp2dgiv0tr.parquet (85.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: b6c1d667-9cc4-4c84-8c43-557efceb0b31 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

Uploading tmpfq_2bz43.parquet (335.7KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 6696a72e-cfbd-4496-8c8c-f25a281620e0 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

|          | 0.000% [00:00<?]

Uploading tmpjycxue1q.parquet (85.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 1b7064bb-703f-4a2e-9fe9-fe54fe5189ee finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Inferred error `score_type` to be "mean_absolute_error_for_regression"
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

|          | 0.000% [00:00<?]

Uploading tmp6y4i8lys.parquet (85.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: fc29854e-b5d8-42b1-92af-4fafd510f17c finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

Uploading tmpi6kzbnh7.parquet (3.7KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 8fbcef8e-11b0-4d0d-8cb2-c1643b39477c finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

|          | 0.000% [00:00<?]

Uploading tmpi48qsp3y.parquet (14.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 9cb8a7dd-c65a-4a66-99a2-d22b54bd1072 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Inferred error `score_type` to be "mean_absolute_error_for_regression"
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model linear regression...

|          | 0.000% [00:00<?]

Uploading tmptgmg2ovm.parquet (14.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: a0b70f44-4b4b-4382-9a4c-6ebaaa5e2ee5 finished with status: SUCCEEDED.
INFO:truera.client.remote_truera_workspace:Data collection in workspace context set to "sklearn_data".
INFO:truera.client.remote_truera_workspace:Setting model context to "linear regression".

# View the influence sensitivity plot (ISP) for the HouseAge feature.

tru.get_explainer(base_data_split="test").plot_isp(feature='HouseAge')
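
An ISP plots, for each record, the value of a feature against that feature's influence on the prediction. To build intuition for the shape you should expect from a linear model, the sketch below uses a simple attribution, coefficient × (value − mean value), which is what a linear model gets when features are treated as independent. This is an assumption made for illustration and not the SDK's QII or SHAP computation; the function name and the numbers are hypothetical.

```python
def linear_isp_points(values, coefficient):
    """Return (feature value, influence) pairs for one feature of a linear model.

    Assumes the influence of a feature is coefficient * (value - mean value).
    This is an illustration, not the SDK's computation.
    """
    mean_value = sum(values) / len(values)
    return [(v, coefficient * (v - mean_value)) for v in values]


# Hypothetical house ages and a hypothetical coefficient.
points = linear_isp_points([10.0, 20.0, 30.0, 40.0], coefficient=0.01)
for value, influence in points:
    print(f"HouseAge={value:5.1f}  influence={influence:+.3f}")
```

Under this assumption the points fall on a straight line through the feature's mean, so curvature or spread in an ISP points to a nonlinear model or to feature interactions.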

Once the computations finish, you can view the project on the remote TruEra platform.

After some investigation, you decide that adding a boosted tree model is necessary...

Step 6: Add a new boosted tree model¶

Let's explore how we can continue to work with local model execution. Here, we will add a new model to our workspace, under the same project as before.

# Train the model.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gb_model = GradientBoostingRegressor()
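# Pass the label as a 1-D Series to avoid scikit-learn's column-vector warning.
gb_model.fit(XS_TRAIN, YS_TRAIN["label"])
# Take the square root explicitly: the squared=False argument was deprecated
# in scikit-learn 1.4 and is not accepted by newer releases.
gb_rmse = mean_squared_error(YS_TEST, gb_model.predict(XS_TEST)) ** 0.5
print(f"RMSE = {gb_rmse}")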

RMSE = 0.38900533769706286

# Add to TruEra workspace.

tru.add_python_model("gradient boosted", gb_model)
tru.compute_predictions()
tru.compute_feature_influences()
tru.compute_error_influences()

INFO:truera.client.remote_truera_workspace:Uploading sklearn model: GradientBoostingRegressor
WARNING:truera.client.services.aiq_client:The number of records returned will not be the exact number requested but in the neighborhood of the start and stop limit provided.
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!

Uploading MLmodel (214.0B) -- ### -- file upload complete.
Uploading conda.yaml (210.0B) -- ### -- file upload complete.
Uploading tmp0611kikq (130.2KiB) -- ### -- file upload complete.
Uploading sklearn_regression_predict_wrapper.py (431.0B) -- ### -- file upload complete.
Uploading sklearn_regression_predict_wrapper.cpython-310.pyc (1.1KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Model "gradient boosted" added and associated with data collection "sklearn_data". "gradient boosted" is set as the model in context.
INFO:truera.client.remote_truera_workspace:Model uploaded to: https://app.truera.net/home/p/California%20Housing-2/m/gradient%20boosted/
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

Uploading tmpmb3j5ili.parquet (3.7KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: f4aef95c-c38a-481b-9a61-26c92e6e7d45 finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

|          | 0.000% [00:00<?]

Uploading tmpec9l4gpe.parquet (14.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: a4b23a67-f425-454d-b9f0-26d38e9f6ded finished with status: SUCCEEDED.
INFO:truera.client.truera_workspace:Inferred error `score_type` to be "mean_absolute_error_for_regression"
INFO:truera.client.truera_workspace:Downloading artifacts to temp_dir: /var/folders/6g/rp51n4c10mldf_61mqzc0mc00000gn/T/tmpxe39sgyk
INFO:truera.client.truera_workspace:Downloading model gradient boosted...

|          | 0.000% [00:00<?]

Uploading tmplnu_9768.parquet (14.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 90a8de31-2d98-48a3-b567-00e0deb27b81 finished with status: SUCCEEDED.

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude
37933744-9ffb-4a52-8433-a79b74221b88
17412	-0.680593	0.026112	-0.039409	0.071118	-0.002071	-0.212579	-0.293294	0.192401
18251	-0.074921	-0.029566	-0.029226	0.006284	0.006347	-0.575807	0.072580	-0.500345
2176	-0.649592	-0.050351	-0.006631	0.015247	-0.000573	0.061959	-0.504981	-0.029984
15894	-0.376487	0.102599	-0.025148	0.000218	-0.004234	-0.016403	-0.240849	0.282641
12643	-0.668289	0.007974	-0.047890	0.001900	0.009146	-0.054827	-0.715779	0.124736
...	...	...	...	...	...	...	...	...
10059	-0.245932	-0.042514	0.020028	-0.002460	-0.001626	-0.025479	-0.280734	0.055589
16295	-0.550952	0.013674	-0.058326	0.021343	-0.002680	-0.144939	-0.580034	0.098909
6196	-0.443268	0.043260	0.003702	-0.008924	-0.003104	0.260163	0.037626	-0.204878
9125	-0.113079	0.021497	-0.021557	0.010158	0.003754	-0.265667	0.011572	-0.221737
11321	-0.193623	-0.008758	-0.019624	-0.006581	0.003570	-0.150544	0.038078	-0.215544

100 rows × 8 columns

Step 7: Compare models¶

Even in local mode, the SDK offers a rich set of analytics. As an example, here we use Explainer objects to compare how well each model performs and how it uses each feature.

# Retrieve explainers.

tru.set_model("linear regression")
lr_explainer = tru.get_explainer(base_data_split="test")

tru.set_model("gradient boosted")
gb_explainer = tru.get_explainer(base_data_split="test")

INFO:truera.client.remote_truera_workspace:Setting model context to "linear regression".
INFO:truera.client.remote_truera_workspace:Setting model context to "gradient boosted".

# Compare various error metrics.

for error_metric in ["RMSE", "MSE", "MAE", "R_SQUARED", "EXPLAINED_VARIANCE"]:
    print(f"{error_metric}:")
    print(f"\tLR = {lr_explainer.compute_performance(error_metric)}")
    print(f"\tGB = {gb_explainer.compute_performance(error_metric)}")

RMSE:
    LR = 0.5535430312156677
    GB = 0.3890053331851959
MSE:
    LR = 0.30640992522239685
    GB = 0.15132515132427216
MAE:
    LR = 0.45746567845344543
    GB = 0.2957201302051544
R_SQUARED:
    LR = 0.7342666983604431
    GB = 0.8687636256217957
EXPLAINED_VARIANCE:
    LR = 0.7360846996307373
    GB = 0.8700141310691833
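
R_SQUARED and EXPLAINED_VARIANCE are close but not identical in the output above. Assuming the SDK follows the usual definitions (the ones scikit-learn uses), R² penalizes any systematic offset in the predictions, while explained variance only looks at the variance of the residuals. A minimal plain-Python sketch on toy data shows the difference:

```python
def r_squared(y_true, y_pred):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); a constant offset in the residuals is ignored."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_res = sum(residuals) / len(residuals)
    mean_true = sum(y_true) / len(y_true)
    var_res = sum((r - mean_res) ** 2 for r in residuals)
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - var_res / var_true


# Toy data: every prediction is the true value shifted up by 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
print("R_SQUARED          =", r_squared(y_true, y_pred))
print("EXPLAINED_VARIANCE =", explained_variance(y_true, y_pred))
```

On this toy data the constant shift lowers R² to 0.8 while explained variance stays at 1.0. The two metrics coincide when the residuals average to zero, so a small gap between them suggests a small systematic bias on the test split.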

# Compare the ISPs of the MedInc feature.

lr_explainer.plot_isp("MedInc", figsize=(500, 300))
gb_explainer.plot_isp("MedInc", figsize=(500, 300))

You can view the models in your TruEra UI.