Python SDK Tutorial: Basic Local Compute Flow¶
In this notebook tutorial, we'll use TruEra's Python SDK to show a basic local compute flow.
What does "local compute" mean? What's different about it?¶
Sometimes it can be desirable to perform computations locally, especially if it is hard to execute your model in a remote environment. In local compute mode, you can use the SDK to analyze your model from your local machine. This unlocks a limited set of TruEra features wherever you have the Python SDK installed.
Before you begin ⬇️¶
⚠️ Make sure you have the truera SDK package installed before going through this tutorial. See the Installation Instructions for additional help. You will also need to install an explanation package of your choice. We recommend TruEra's truera-qii package if you have access to it; otherwise, you may use SHAP, which might be slower and less accurate.
👉 You can download and run this notebook by navigating to the Downloads page of your deployment and downloading the "Python SDK Local Quickstart" example notebook.
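If you still need these prerequisites, a typical notebook install looks like the sketch below. The package names follow the ones mentioned above; truera-qii in particular is only available to licensed users, so adjust to your environment.
# Install the TruEra SDK plus one explanation backend (run in a notebook cell).
!pip install truera
!pip install truera-qii  # if you have access to TruEra's QII package
!pip install shap        # otherwise, the open-source alternative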
What we'll cover ☑️¶
- Create a project with data splits and models.
- Compare performance and explanations across models.
In this basic local compute flow, we're creating splits from basic pandas DataFrame objects.
Step 1: Connect to TruEra endpoint¶
What do I need to connect to my TruEra deployment?¶
- TruEra deployment URL. For most users, the TruEra URI will take the form https://<your-truera-access-url>.
- Some form of authentication (basic auth or token auth).
For examples on how to authenticate, see Authentication in the Diagnostics Quickstart. Here, we will use token authentication.
# FILL ME!
TRUERA_URL = "<TRUERA_URL>"
TOKEN = '<AUTH_TOKEN>'
from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

# Authenticate to the deployment and connect a workspace.
auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)
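To confirm the connection before going further, you can ask the workspace what it can see. The get_projects() call below is an assumption about the workspace API; check your SDK reference if your version differs.
# Sanity check: list existing projects on the deployment.
# (Assumes TrueraWorkspace.get_projects(); adjust to your SDK version.)
print(tru.get_projects())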
Step 2: Download sample data¶
Here we'll use data from scikit-learn's California housing dataset, which can be fetched via the sklearn.datasets module.
# Retrieve the data.
import pandas as pd
from sklearn.datasets import fetch_california_housing
data_bunch = fetch_california_housing()
XS_ALL = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_ALL = pd.DataFrame(data=data_bunch["target"], columns=["label"])
# Create train and test data splits.
from sklearn.model_selection import train_test_split
XS_TRAIN, XS_TEST, YS_TRAIN, YS_TEST = train_test_split(XS_ALL, YS_ALL, test_size=100)
data_all = XS_ALL.merge(YS_ALL, left_index=True, right_index=True).reset_index(names="id")
data_test = XS_TEST.merge(YS_TEST, left_index=True, right_index=True).reset_index(names="id")
data_train = XS_TRAIN.merge(YS_TRAIN, left_index=True, right_index=True).reset_index(names="id")
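Before ingesting anything, a quick shape check (plain pandas, nothing TruEra-specific) confirms the splits line up:
# Confirm row counts and that each merged frame carries the id, the features, and the label.
print(data_all.shape, data_train.shape, data_test.shape)
assert list(data_test.columns) == ["id"] + list(XS_ALL.columns) + ["label"]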
Step 3: Create a project¶
Note how this is very similar to the remote flow demonstrated in the Diagnostics Quickstart -- the notable difference is that we set our model execution to be local with tru.set_model_execution("local"), and the subsequent commands are nearly identical.
tru.set_model_execution("local")
tru.add_project("California Housing-2", score_type="regression")
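If you want to control which explanation backend from the prerequisites is used, the workspace exposes an influence-type setting. The call and the "truera-qii"/"shap" values below are our assumption based on the packages named earlier; consult your SDK reference if your version differs.
# Pick the influence algorithm to match the explanation package you installed.
# (Assumed API: use "truera-qii" if you have that package, otherwise "shap".)
tru.set_influence_type("truera-qii")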
Step 4: Add the data collection and data split¶
Here we're adding data via simple pd.DataFrames.
from truera.client.ingestion import ColumnSpec
column_spec = ColumnSpec(
id_col_name="id",
pre_data_col_names=XS_ALL.columns.to_list(),
label_col_names=YS_ALL.columns.to_list()
)
tru.add_data_collection("sklearn_data")
tru.add_data(data_all, data_split_name="all", column_spec=column_spec)
tru.add_data(data_train, data_split_name="train", column_spec=column_spec)
tru.add_data(data_test, data_split_name="test", column_spec=column_spec)
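As a quick check that ingestion worked, you can list the splits registered in the data collection. The get_data_splits() name is an assumption in the same spirit as above; method names vary across SDK versions.
# Sanity check: the three splits we just added should appear here.
print(tru.get_data_splits())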
Step 5: Train and add a linear regression model¶
# Train the model.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lr_model = LinearRegression()
lr_model.fit(XS_TRAIN, YS_TRAIN)
print(f"RMSE = {mean_squared_error(YS_TEST, lr_model.predict(XS_TEST), squared=False)}")
We can add the model itself via tru.add_python_model(), which accepts a number of out-of-the-box model frameworks.
# Add to TruEra workspace.
tru.add_python_model("linear regression", lr_model)
tru.compute_all()
# View the influence sensitivity plot (ISP) for a single feature.
tru.get_explainer(base_data_split="test").plot_isp(feature='HouseAge')
After the computations complete, go view the project on the remote TruEra platform!
After some investigation, you decide that adding a boosted tree model is necessary...
Step 6: Adding a (new) boosted tree model¶
Let's explore how we can continue to work with local model execution. Here, we will add a new model to our workspace, under the same project as before.
# Train the model.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
gb_model = GradientBoostingRegressor()
# Pass a 1-D target to avoid a DataConversionWarning from scikit-learn.
gb_model.fit(XS_TRAIN, YS_TRAIN["label"])
print(f"RMSE = {mean_squared_error(YS_TEST, gb_model.predict(XS_TEST), squared=False)}")
# Add the new model to the TruEra workspace, then compute its predictions,
# feature influences, and error influences.
tru.add_python_model("gradient boosted", gb_model)
tru.compute_predictions()
tru.compute_feature_influences()
tru.compute_error_influences()
Step 7: Compare models¶
Even in the SDK's local environment, there is a rich set of analytics to use. As an example, here we use Explainer objects to analyze how each feature is used by each model.
# Retrieve explainers.
tru.set_model("linear regression")
lr_explainer = tru.get_explainer(base_data_split="test")
tru.set_model("gradient boosted")
gb_explainer = tru.get_explainer(base_data_split="test")
# Compare various performance metrics.
for error_metric in ["RMSE", "MSE", "MAE", "R_SQUARED", "EXPLAINED_VARIANCE"]:
print(f"{error_metric}:")
print(f"\tLR = {lr_explainer.compute_performance(error_metric)}")
print(f"\tGB = {gb_explainer.compute_performance(error_metric)}")
# Compare the ISPs of the MedInc feature.
lr_explainer.plot_isp("MedInc", figsize=(500, 300))
gb_explainer.plot_isp("MedInc", figsize=(500, 300))
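Beyond per-feature ISPs, a global importance view can make the comparison more direct. The get_global_feature_importances() call below is an assumption about the Explainer API; check your SDK reference if your version differs.
# Compare global feature importances across the two models.
# (Assumes Explainer.get_global_feature_importances(); adjust to your SDK version.)
print(lr_explainer.get_global_feature_importances())
print(gb_explainer.get_global_feature_importances())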
You can view the models in your TruEra UI.