Ingesting Custom Python Models via the Callable Function¶
This notebook gives an example of ingesting a custom ensemble Python model through the Python SDK by creating a Callable function.
Before you begin¶
- Install the TruEra Python SDK
- Check our overview on Ingestion and Python ingestion
What we'll cover ☑️¶
In this tutorial we will:
- Package/serialize a custom Python model (i.e. not a standard xgboost, scikit-learn, etc. model) by wrapping it in a Callable Python function and packaging it for ingestion.
- Verify locally that the packaged model works.
- Ingest the model.
Connecting to your TruEra endpoint¶
- Provide your TruEra deployment URI as the TRUERA_URL
- Provide your TOKEN (example is for token authentication)
- Create your TruEra workspace object!
# Some preliminaries we'll need for the model...
import pandas as pd
import numpy as np
import sklearn
import xgboost
import pickle
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# FILL ME IN!
TRUERA_URL = "<TRUERA_URL>"
TOKEN = "<TRUERA_TOKEN>"
from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication
auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)
Add a project with some data¶
Before ingesting a model, it is recommended that you first ingest data. This will help validate the model later on. Here, we'll use the sample "California Housing" data from scikit-learn.
# First, let's create a regression project.
tru.add_project("CustomCallableFunctionModelExample", score_type="regression")
# Next, let's retrieve our data from sklearn and split our data into train and test splits.
data_bunch = fetch_california_housing()
XS_ALL = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_ALL = pd.DataFrame(data_bunch["target"], columns=["label"])
XS_TRAIN, XS_TEST, YS_TRAIN, YS_TEST = train_test_split(XS_ALL, YS_ALL, test_size=100)
data_all = XS_ALL.merge(YS_ALL, left_index=True, right_index=True).reset_index(names="id")
data_test = XS_TEST.merge(YS_TEST, left_index=True, right_index=True).reset_index(names="id")
from truera.client.ingestion import ColumnSpec
column_spec = ColumnSpec(
id_col_name="id",
pre_data_col_names=XS_ALL.columns.to_list(),
label_col_names=YS_ALL.columns.to_list()
)
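As an optional local sanity check before uploading (plain pandas, nothing TruEra-specific), we can confirm the id column is unique and that the ColumnSpec accounts for every column in the merged frame:
# Optional: the id column must uniquely identify rows, and all columns in
# data_all should be covered by the ColumnSpec above.
assert data_all["id"].is_unique
assert set(XS_ALL.columns) | set(YS_ALL.columns) | {"id"} == set(data_all.columns)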
# And then upload them to TruEra's system.
tru.add_data_collection("sklearn_data")
tru.add_data(data_all, data_split_name="all", column_spec=column_spec)
tru.add_data(data_test, data_split_name="test", column_spec=column_spec)
Train the model¶
We're now ready to train a model. We'll create a simple ensemble regressor that averages predictions between two base models:
* An xgboost.XGBRegressor boosted decision tree
* An sklearn.linear_model.LinearRegression linear regressor
class CustomEnsembleRegressor:
    def __init__(self):
        self.base_model1 = xgboost.XGBRegressor()  # xgboost regressor
        self.base_model2 = LinearRegression()  # sklearn linear regression
        self.weights = [0.5, 0.5]

    def train_model(self, x: pd.DataFrame, y: np.ndarray):
        self.base_model1.fit(x, y)
        self.base_model2.fit(x, y)

    def get_ensemble_predictions(self, x: pd.DataFrame) -> np.ndarray:
        # Weighted average of the two base models' predictions.
        base_model1_preds = self.base_model1.predict(x).flatten()
        base_model2_preds = self.base_model2.predict(x).flatten()
        return np.average(np.vstack((base_model1_preds, base_model2_preds)), axis=0, weights=self.weights)
c = CustomEnsembleRegressor()
# XS_TRAIN and YS_TRAIN come straight from train_test_split and do not
# contain the "id" column, so we can fit on them directly.
c.train_model(XS_TRAIN, YS_TRAIN)
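As a quick, optional sanity check before packaging (plain scikit-learn, not part of the TruEra flow), we can evaluate the ensemble on the held-out test split:
# Optional: holdout RMSE for the ensemble, using sklearn.metrics.
from sklearn.metrics import mean_squared_error
test_preds = c.get_ensemble_predictions(XS_TEST)
print("Test RMSE:", np.sqrt(mean_squared_error(YS_TEST, test_preds)))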
Package the model¶
Now that the model is trained, let's package it. There are a few paths to do so, but in this notebook, we'll package our model by implementing a Callable Python function.
So long as the function has the same signature as the predict function one normally finds in a model wrapper (i.e. taking in a pd.DataFrame of feature values), TruEra will try to serialize it as a pickled object.
Note that in this case it's also necessary to pass in additional_pip_dependencies directly when packaging the model, so that the desired packages are available when the Python conda environment is constructed on the remote server.
In this case, our Callable can be a simple wrapper around our custom model:
def truera_predict_callable(x: pd.DataFrame):
return c.get_ensemble_predictions(x)
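Before packaging, it's worth a quick local smoke test that the callable takes a pd.DataFrame of features and returns one score per row (an optional check, not required by the SDK):
# Optional smoke test: the callable should return one prediction per input row.
sample_preds = truera_predict_callable(XS_TEST.head(5))
assert len(sample_preds) == 5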
Now, let's package it via the create_packaged_python_model function.
import uuid
packaged_model_path = f"/tmp/custom_callable_function_model_example_{uuid.uuid4()}/" # empty directory to package model in
tru.create_packaged_python_model(
    packaged_model_path,
    model_obj=truera_predict_callable,
    additional_pip_dependencies=[
        f"xgboost=={xgboost.__version__}",
        f"scikit-learn=={sklearn.__version__}"
    ]
)
Let's check out the packaged model path... things look good!
! tree {packaged_model_path}
Our pip dependencies were automatically captured in the generated conda.yaml file.
! cat {packaged_model_path}/conda.yaml
Uploading the packaged model¶
We're almost there -- let's locally verify the packaged model and then upload it.
Local verification will attempt to load the model, call its predict function, and test that the output matches what's expected for the project type. It will not attempt to create the conda environment necessary to run your model, so you may want to independently verify that the conda.yaml looks correct.
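If you want to check the environment yourself, one option (optional; assumes conda is installed locally, and truera_model_check is just a placeholder environment name) is to try building it from the generated file:
# Optional: build the environment locally to surface dependency problems early.
! conda env create -f {packaged_model_path}/conda.yaml -n truera_model_check
# Clean up the throwaway environment afterwards.
! conda env remove -n truera_model_check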
tru.verify_packaged_model(packaged_model_path)
Things look good, so finally, let's upload the packaged model via tru.add_packaged_python_model. We'll call our model "ensemble_model".
tru.add_packaged_python_model("ensemble_model", packaged_model_path)
Access model predictions and influences¶
We can hit the UI link above to access information about the uploaded model, but we can also start the analysis directly from the Python SDK. Now that the model is uploaded, we can grab model outputs like predictions and feature influences from the TruEra system:
# Compute and retrieve predictions for the first 10 rows of the split.
tru.compute_predictions(0, 10)
tru.get_ys_pred(0, 10)
# Compute and retrieve feature influences for the same rows.
tru.compute_feature_influences(0, 10)
tru.get_feature_influences(0, 10)
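As a final, optional sanity check, we could compare the system's predictions against the local ensemble. This sketch assumes the active data split is "all" and that get_ys_pred returns rows in the split's original order; both are assumptions about the deployment rather than guarantees:
# Hypothetical cross-check: the remote predictions should mirror the local
# ensemble's output (assumes the "all" split is active and row order is preserved).
remote_preds = tru.get_ys_pred(0, 10)
local_preds = c.get_ensemble_predictions(XS_ALL.head(10))
print(remote_preds)
print(local_preds)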