Ingesting Custom Python Models via the `Callable` Function¶

This notebook shows how to ingest a custom ensemble Python model through the Python SDK by wrapping it in a Callable function.

Before you begin¶

Install the TruEra Python SDK
Check our overview on Ingestion and Python ingestion

What we'll cover ☑️¶

In this tutorial we will:

Package/serialize a custom Python model (i.e., not a standard xgboost, scikit-learn, etc. model) by wrapping it in a Callable Python function and packaging it for ingestion.
Verify locally that the packaged model works.
Ingest the model.

Connecting to your TruEra endpoint¶

Provide your TruEra deployment URI as TRUERA_URL
Provide your authentication token as TOKEN (this example uses token authentication)
Create your TruEra workspace object!

# Some preliminaries we'll need for the model...

import pandas as pd
import numpy as np 

import sklearn
import xgboost 
import pickle

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# FILL ME IN!

TRUERA_URL = "<TRUERA_URL>"
TOKEN = "<TRUERA_TOKEN>"

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)
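Rather than pasting the URL and token into the notebook, they can be read from the environment. The following is a minimal sketch of that idea; the helper `load_truera_credentials` and the variable names `TRUERA_URL` and `TRUERA_TOKEN` are assumptions made for illustration and are not part of the TruEra SDK.

```python
import os
from typing import Mapping, Tuple


def load_truera_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (url, token) from an environment-style mapping, failing loudly if unset.

    The names TRUERA_URL and TRUERA_TOKEN are assumptions made for this
    sketch; they are not defined by the TruEra SDK.
    """
    url = env.get("TRUERA_URL", "").strip()
    token = env.get("TRUERA_TOKEN", "").strip()
    missing = [name for name, value in (("TRUERA_URL", url), ("TRUERA_TOKEN", token)) if not value]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return url, token


# Example with an explicit mapping standing in for the real environment.
example_env = {"TRUERA_URL": "https://truera.example.com", "TRUERA_TOKEN": "not-a-real-token"}
url, token = load_truera_credentials(example_env)
```

The returned values can then be passed to TokenAuthentication and TrueraWorkspace exactly as in the cell above.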

Add a project with some data¶

Before ingesting a model, it is recommended that you first ingest data. This will help validate the model later on. Here, we'll use the sample "California Housing" data from scikit-learn.

# First, let's create a regression project.

tru.add_project("CustomCallableFunctionModelExample", score_type="regression")

# Next, let's retrieve our data from sklearn and split our data into train and test splits.

data_bunch = fetch_california_housing()
XS_ALL = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_ALL = pd.DataFrame(data_bunch["target"], columns=["label"])
XS_TRAIN, XS_TEST, YS_TRAIN, YS_TEST = train_test_split(XS_ALL, YS_ALL, test_size=100)

data_all = XS_ALL.merge(YS_ALL, left_index=True, right_index=True).reset_index(names="id")
data_test = XS_TEST.merge(YS_TEST, left_index=True, right_index=True).reset_index(names="id")

from truera.client.ingestion import ColumnSpec

column_spec = ColumnSpec(
    id_col_name="id",
    pre_data_col_names=XS_ALL.columns.to_list(),
    label_col_names=YS_ALL.columns.to_list()
)
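The id column named in the ColumnSpec comes from the original row index, so each test row keeps the id it has in the full dataset. The toy example below sketches that behaviour on three rows; the helper `with_id_column` and the toy values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for XS_ALL / YS_ALL: three rows with the default 0..2 index.
xs = pd.DataFrame({"MedInc": [8.3, 7.2, 5.6]})
ys = pd.DataFrame({"label": [4.5, 3.6, 3.4]})

# Pretend rows 2 and 0 were selected for the test split.
xs_test, ys_test = xs.loc[[2, 0]], ys.loc[[2, 0]]


def with_id_column(x: pd.DataFrame, y: pd.DataFrame) -> pd.DataFrame:
    """Join features and labels on the row index, then expose the index as "id"."""
    merged = x.merge(y, left_index=True, right_index=True)
    # rename_axis + reset_index gives the same result as reset_index(names="id")
    # and also works on pandas versions older than 1.5.
    return merged.rename_axis("id").reset_index()


toy_all = with_id_column(xs, ys)
toy_test = with_id_column(xs_test, ys_test)
```

Because the ids are the original index values, a row in the test split can be looked up in the full split by the same id.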

# And then upload them to TruEra's system.

tru.add_data_collection("sklearn_data")
tru.add_data(data_all, data_split_name="all", column_spec=column_spec)
tru.add_data(data_test, data_split_name="test", column_spec=column_spec)

WARNING:truera.client.remote_truera_workspace:Number of rows in the data split (20640) is larger than 5000. Will downsample the rows to 5000. Pass `sample_count=x` to override the default max number of samples.

Uploading tmp4w6ge5op.parquet -- ### -- file upload complete.
Put resource done.
Uploading tmp24uscbeq.parquet -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Call to join: rowsets ['42e7bfe3-c046-4879-9367-910c394b259a', '6363d152-bb38-4eac-aa29-2c2a6c2b2851'] on ['id'] with default inner join.

Uploading tmp9uq73cq9.parquet -- ### -- file upload complete.
Put resource done.
Uploading tmp7tb7vqgs.parquet -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Call to join: rowsets ['ad8371f6-7f6f-4739-b291-612825151db3', '87bef41f-4523-4107-80b4-2c8cbf86bfd3'] on ['id'] with default inner join.

Train the model¶

We're now ready to train a model. We'll create a simple ensemble regressor that averages the predictions of two base models:

An xgboost.XGBRegressor boosted decision tree
An sklearn.linear_model.LinearRegression linear regressor

class CustomEnsembleRegressor:
    def __init__(self):
        self.base_model1 = xgboost.XGBRegressor() # xgboost regressor
        self.base_model2 = LinearRegression() # sklearn linear regression
        self.weights = [0.5, 0.5]

    def train_model(self, x: pd.DataFrame, y: np.ndarray):
        self.base_model1.fit(x, y)
        self.base_model2.fit(x, y)

    def get_ensemble_predictions(self, x: pd.DataFrame) -> np.ndarray:
        base_model1_preds = self.base_model1.predict(x).flatten()
        base_model2_preds = self.base_model2.predict(x).flatten()
        return np.average(np.vstack((base_model1_preds, base_model2_preds)), axis=0, weights=self.weights)

c = CustomEnsembleRegressor()
# The "id" column exists only in data_all/data_test, so there is nothing to drop here.
c.train_model(XS_TRAIN, YS_TRAIN["label"].to_numpy())
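The averaging step in get_ensemble_predictions can be checked by hand on toy arrays. The sketch below uses made-up predictions in place of the two base models and shows what np.average does with equal and unequal weights.

```python
import numpy as np

preds_a = np.array([1.0, 2.0, 3.0])  # stand-in for the first base model's predictions
preds_b = np.array([3.0, 4.0, 5.0])  # stand-in for the second base model's predictions

# Stack to shape (2, 3): one row per model, one column per data point.
stacked = np.vstack((preds_a, preds_b))

# Averaging over axis 0 combines the two models point by point.
equal = np.average(stacked, axis=0, weights=[0.5, 0.5])     # [2.0, 3.0, 4.0]
skewed = np.average(stacked, axis=0, weights=[0.25, 0.75])  # [2.5, 3.5, 4.5]
```

With weights of [0.5, 0.5], as in the class above, this is the plain mean of the two models' predictions.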

Package the model¶

Now that the model is trained, let's package it. There are a few paths to do so, but in this notebook, we'll package our model by implementing a Callable Python function.

As long as the function behaves like the predict function one normally finds in a model wrapper (taking in a pd.DataFrame of feature values and returning predictions), TruEra will try to serialize it as a pickled object.

Note that in this case you must also pass additional_pip_dependencies when packaging the model, so that the required packages are available when the conda environment is built on the remote server.

In this case, our Callable can be a simple wrapper around our custom model:

def truera_predict_callable(x: pd.DataFrame):
    return c.get_ensemble_predictions(x)
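The wrapper above reads the trained model from the notebook-level variable c. One alternative, sketched here as an assumption rather than a documented TruEra pattern, is a small factory that binds the model explicitly. `MeanOfColumnsModel` and `make_predict_callable` are hypothetical names, and whether such a closure is serialized identically to a top-level function has not been verified here.

```python
from typing import Callable

import numpy as np
import pandas as pd


class MeanOfColumnsModel:
    """Hypothetical stand-in for a trained model such as CustomEnsembleRegressor."""

    def get_ensemble_predictions(self, x: pd.DataFrame) -> np.ndarray:
        return x.mean(axis=1).to_numpy()


def make_predict_callable(model) -> Callable[[pd.DataFrame], np.ndarray]:
    """Return a predict function that holds its own reference to `model`."""

    def predict(x: pd.DataFrame) -> np.ndarray:
        return model.get_ensemble_predictions(x)

    return predict


predict_fn = make_predict_callable(MeanOfColumnsModel())
toy_preds = predict_fn(pd.DataFrame({"a": [1.0, 3.0], "b": [3.0, 5.0]}))
```

Binding the model this way makes it harder to accidentally package a callable that points at a variable which is later reassigned.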

Now, let's package it via the create_packaged_python_model function.

import uuid

packaged_model_path = f"/tmp/custom_callable_function_model_example_{uuid.uuid4()}/" # empty directory to package model in
tru.create_packaged_python_model(
    packaged_model_path,
    model_obj=truera_predict_callable,
    additional_pip_dependencies=[
        f"xgboost=={xgboost.__version__}",
        f"scikit-learn=={sklearn.__version__}"
    ])

INFO:truera.client.remote_truera_workspace:Successfully generated template model package at path '/tmp/custom_callable_function_model_example_a29b7936-9830-4090-93b0-14ec3b07d353/'
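The pinned dependency strings can also be built from installed package metadata instead of each module's __version__ attribute. This is a small sketch using the standard library's importlib.metadata; the helper name `pinned_requirements` is made up for illustration.

```python
from importlib import metadata
from typing import Iterable, List


def pinned_requirements(distribution_names: Iterable[str]) -> List[str]:
    """Build "name==version" strings for the distributions installed locally."""
    pins = []
    for name in distribution_names:
        try:
            pins.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Skip anything that is not installed rather than guessing a version.
            continue
    return pins


# Distribution names, not import names: "scikit-learn" rather than "sklearn".
pins = pinned_requirements(["xgboost", "scikit-learn"])
```

The resulting list has the same shape as the additional_pip_dependencies argument used above; packages that are not installed are silently left out, so it is worth checking the list before packaging.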

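Let's check out the packaged model path. (The tree utility was not installed in the environment used for this run, so the command below printed an error instead of a directory listing.)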

! tree {packaged_model_path}

/bin/bash: tree: command not found

Our pip dependencies were automatically captured in the generated conda.yaml file.

! cat {packaged_model_path}/conda.yaml

dependencies:
- python=3.10.6
- pip=22.2.2
- pip:
  - cloudpickle==2.2.0
  - dynaconf==3.1.12
  - numpy==1.23.5
  - pandas==1.5.2
  - scikit-learn==1.2.0
  - scipy==1.9.3
  - xgboost==1.6.2
name: truera-env
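A quick way to confirm that the expected pins made it into the file is to scan its list entries. The sketch below works on an abbreviated copy of the output above and uses a plain line scan rather than a YAML parser; `missing_pins` is a hypothetical helper, and the lightgbm pin is included only to show what a missing entry looks like.

```python
from typing import Iterable, List

# Abbreviated copy of the conda.yaml shown above.
CONDA_YAML_TEXT = """\
dependencies:
- python=3.10.6
- pip=22.2.2
- pip:
  - cloudpickle==2.2.0
  - scikit-learn==1.2.0
  - xgboost==1.6.2
name: truera-env
"""


def missing_pins(conda_yaml_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected "name==version" pins that do not appear as list entries."""
    entries = {
        line.strip()[2:].strip()
        for line in conda_yaml_text.splitlines()
        if line.strip().startswith("- ")
    }
    return [pin for pin in expected if pin not in entries]


# "lightgbm==3.3.5" is deliberately absent from the file (illustrative only).
missing = missing_pins(
    CONDA_YAML_TEXT,
    ["xgboost==1.6.2", "scikit-learn==1.2.0", "lightgbm==3.3.5"],
)
```

In practice the text would be read from the conda.yaml file inside the packaged model directory.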

Uploading the packaged model¶

We're almost there -- let's locally verify the packaged model and then upload it.

Local verification will attempt to load the model, call its predict function, and test that the output matches what's expected for the project type. It will not attempt to create the conda environment necessary to run your model, so you may want to independently verify that the conda.yaml looks correct.

tru.verify_packaged_model(packaged_model_path)

INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!
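For a regression project, a rough independent check of the callable is that it returns one finite number per input row. The sketch below is not what verify_packaged_model does internally (that is not described here); `check_regression_output` is a hypothetical helper for a manual sanity check.

```python
import numpy as np
import pandas as pd


def check_regression_output(predict_fn, x: pd.DataFrame) -> np.ndarray:
    """Call predict_fn on x and require one finite numeric prediction per row."""
    preds = np.asarray(predict_fn(x), dtype=float)
    if preds.shape != (len(x),):
        raise ValueError(f"expected shape ({len(x)},), got {preds.shape}")
    if not np.isfinite(preds).all():
        raise ValueError("predictions contain NaN or infinite values")
    return preds


sample = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# A toy callable that sums the two columns stands in for the real model.
ok = check_regression_output(lambda df: df["a"] + df["b"], sample)
```

The same helper could be applied to truera_predict_callable with a few rows of XS_TEST before packaging.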

Things look good, so finally, let's upload the packaged model via tru.add_packaged_python_model. We'll call our model "ensemble_model".

tru.add_packaged_python_model("ensemble_model", packaged_model_path)

INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!

Uploading tmpv7p8n2je -- ### -- file upload complete.
Uploading MLmodel -- ### -- file upload complete.
Uploading conda.yaml -- ### -- file upload complete.
Uploading truera_pred_func_regression_predict_wrapper.py -- ### -- file upload complete.
Uploading truera_model.py -- ### -- file upload complete.
Uploading truera_model.cpython-310.pyc -- ### -- file upload complete.
Uploading truera_pred_func_regression_predict_wrapper.cpython-310.pyc -- ### -- file upload complete.
Put resource done.
Model uploaded to: http://localhost:8000/home/p/CustomCallableFunctionModelExample/m/ensemble_model/

INFO:truera.client.remote_truera_workspace:Model "ensemble_model" is added and associated with remote data collection "sklearn_data". "ensemble_model" is set as the model for the workspace context.

WARNING:truera.client.intelligence.remote_explainer:Background split for `data_collection` "sklearn_data" is currently not set. Setting it to "all"
INFO:truera.client.remote_truera_workspace:Triggering computations for model predictions on split all.
INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "sklearn_data". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "ensemble_model".

Access model predictions and influences¶

We can follow the UI link above to view information about the uploaded model, but we can also start the analysis directly from the Python SDK. Now that the model is uploaded, we can retrieve model outputs such as predictions and feature influences from the TruEra system:

tru.compute_predictions(0, 10)
tru.get_ys_pred(0, 10)

point_640      1.942866
point_2712     0.508349
point_1257     1.707198
point_10269    1.758623
point_17044    3.999858
point_1652     2.850370
point_1907     0.980726
point_17385    1.775845
point_19475    1.147184
point_1591     4.762083
Name: score, dtype: float64

tru.compute_feature_influences(0, 10)
tru.get_feature_influences(0, 10)

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude
point_640	-0.393895	0.017119	-0.027818	-0.006644	0.021310	0.210646	-0.953511	1.010976
point_2712	-0.688113	-0.051078	0.027414	0.011555	-0.004505	-0.065471	1.190519	-1.951096
point_1257	-0.231951	0.003520	0.037004	0.095989	0.006378	0.052209	-1.166368	0.795085
point_10269	-0.213583	-0.053171	-0.022700	-0.000577	-0.001815	0.008619	0.687959	-0.700089
point_17044	1.178157	0.015594	0.023466	-0.030801	0.012536	-0.021975	-0.557910	1.309175
point_1652	1.025732	-0.127811	0.076188	-0.075044	-0.001919	-0.079224	-0.935064	0.940869
point_1907	-0.047006	-0.070450	-0.004030	0.022820	-0.008273	-0.018247	-1.228289	0.288147
point_17385	-0.298667	-0.108607	-0.003217	0.050328	0.010681	0.012052	-0.084753	0.150556
point_19475	-0.307144	-0.064951	-0.057349	-0.068266	-0.003497	-0.013060	-0.756818	0.349468
point_1591	2.486856	-0.030581	0.098111	-0.009027	0.001794	-0.044701	-0.751739	0.954977
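One simple way to summarize a table like this is to rank features by their mean absolute influence. The sketch below does so with pandas on two rows and three columns copied from the output above; with the full result of get_feature_influences the same two lines of analysis would apply, assuming it is returned as a DataFrame as the printed output suggests.

```python
import pandas as pd

# Two rows and three columns copied from the influence table above.
influences = pd.DataFrame(
    {
        "MedInc": [-0.393895, 1.178157],
        "HouseAge": [0.017119, 0.015594],
        "Latitude": [-0.953511, -0.557910],
    },
    index=["point_640", "point_17044"],
)

# Mean absolute influence per feature, largest first.
ranking = influences.abs().mean().sort_values(ascending=False)
```

Taking absolute values first keeps positive and negative influences from cancelling out when averaged across points.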
