Ingesting Custom Python Models via Custom Model Wrapper

This notebook shows how to ingest a custom ensemble Python model through the Python SDK by creating a custom model wrapper.

Before you begin

Install the TruEra Python SDK

What we'll cover ☑️

Packaging/serializing a custom Python model (i.e., not a standard xgboost, scikit-learn, etc. model) by creating our own custom model wrapper.
Locally verifying that the packaged model works.
Ingesting the model.

Connecting to your TruEra endpoint

Provide the TruEra URL.
Provide your authentication token (see Quickstart).
Create your TruEra workspace object!

# Some preliminaries we'll need for the model...

import pandas as pd
import pickle
import platform
import numpy as np 
import sklearn
import xgboost

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# FILL ME IN!

TRUERA_URL = "<TRUERA_URL>"
TOKEN = "<TOKEN>"

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)
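
If you'd rather not paste the URL and token into the notebook, one option is to read them from the environment. The following is a minimal sketch under that assumption; the variable names `TRUERA_URL` and `TRUERA_TOKEN` and the helper `load_truera_credentials` are hypothetical and not part of the TruEra SDK.

```python
import os


def load_truera_credentials(env=os.environ):
    """Return (url, token) read from a mapping, failing early if either is missing."""
    # TRUERA_URL and TRUERA_TOKEN are hypothetical names chosen for this sketch.
    url = env.get("TRUERA_URL", "").strip()
    token = env.get("TRUERA_TOKEN", "").strip()
    missing = [
        name
        for name, value in (("TRUERA_URL", url), ("TRUERA_TOKEN", token))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return url, token
```

With this in place, `TRUERA_URL, TOKEN = load_truera_credentials()` could stand in for the hard-coded placeholders, and the rest of the cell stays the same.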

Add a project with some data

Before ingesting a model, it is recommended that you first ingest data. This will help validate the model later on. Here, we'll use the sample "California Housing" data from scikit-learn.

# First, let's create a regression project.

tru.add_project("CustomModelWrapperExample", score_type="regression")

INFO:truera.client.remote_truera_workspace:remaining items: []
INFO:truera.client.remote_truera_workspace:Delete resource succeeded. Project_id: CustomModelWrapperExample intra_artifact_path:

# Next, let's retrieve our data from sklearn and split our data into train and test splits.

data_bunch = fetch_california_housing()
XS_ALL = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_ALL = pd.DataFrame(data_bunch["target"], columns=["label"])
XS_TRAIN, XS_TEST, YS_TRAIN, YS_TEST = train_test_split(XS_ALL, YS_ALL, test_size=100)

data_all = XS_ALL.merge(YS_ALL, left_index=True, right_index=True).reset_index(names="id")
data_test = XS_TEST.merge(YS_TEST, left_index=True, right_index=True).reset_index(names="id")

from truera.client.ingestion import ColumnSpec

column_spec = ColumnSpec(
    id_col_name="id",
    pre_data_col_names=XS_ALL.columns.to_list(),
    label_col_names=YS_ALL.columns.to_list()
)

# And then upload them to TruEra's system.

tru.add_data_collection("sklearn_data")
tru.add_data(data_all, data_split_name="all", column_spec=column_spec)
tru.add_data(data_test, data_split_name="test", column_spec=column_spec)

Uploading tmp2v6x5sa3.parquet (919.4KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...

Uploading tmpay7k4znq.parquet (13.5KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...

Train the model

We're now ready to train a model. We'll create a simple ensemble regressor that averages the predictions of two base models:
* An xgboost.XGBRegressor boosted decision tree
* An sklearn.linear_model.LinearRegression linear regressor

class CustomEnsembleRegressor:
    def __init__(self):
        self.base_model1 = xgboost.XGBRegressor() # xgboost regressor
        self.base_model2 = LinearRegression() # sklearn linear regression
        self.np = np
        self.weights = [0.5, 0.5]

    def train_model(self, x: pd.DataFrame, y: pd.DataFrame):
        # Fit both base models on the same training data.
        self.base_model1.fit(x, y)
        self.base_model2.fit(x, y)

    def get_ensemble_predictions(self, x: pd.DataFrame) -> np.ndarray:
        base_model1_preds = self.base_model1.predict(x).flatten()
        base_model2_preds = self.base_model2.predict(x).flatten()
        return self.np.average(self.np.vstack((base_model1_preds, base_model2_preds)), axis=0, weights=self.weights)

c = CustomEnsembleRegressor()
c.train_model(XS_TRAIN, YS_TRAIN)
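
To make the averaging step concrete, here is a small sketch that repeats the `np.vstack` / `np.average` pattern from `get_ensemble_predictions` on made-up arrays and compares it with the explicit weighted sum. The prediction values are invented for illustration only.

```python
import numpy as np

# Toy predictions standing in for the two base models' outputs (made-up values).
preds_xgb = np.array([2.0, 4.0, 6.0])
preds_linear = np.array([1.0, 3.0, 5.0])
weights = [0.5, 0.5]

# Same pattern as get_ensemble_predictions: stack to shape (2, n), then take a
# weighted average along the model axis (axis=0).
stacked = np.vstack((preds_xgb, preds_linear))
ensemble = np.average(stacked, axis=0, weights=weights)

# Explicit form of the same computation: w1 * p1 + w2 * p2.
manual = weights[0] * preds_xgb + weights[1] * preds_linear
```

Note that `np.average` divides by the sum of the weights, so the two forms agree here because the weights sum to 1.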

Package the model

Now that the model is trained, let's package it. There are several ways to do this; in this notebook, we'll generate a template wrapper and then edit it manually. First, let's serialize the trained model object:

model_path = "/tmp/custom_regressor.p"
import cloudpickle
with open(model_path, "wb") as h:
    cloudpickle.dump(c, h)

Now, let's generate a template wrapper in a path of our choice, and pass in the pip dependencies we think we'll need to load the model. We do this with the tru.create_packaged_python_model command.

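This generates a directory of files like the following. The file names shown are generic; in this example the wrapper is generated as code/custom_template_regression_predict_wrapper.py and the serialized model keeps its name, custom_regressor.p.

📂 DIRECTORY OF PACKAGED MODEL
 ┣ 📜`conda.yaml` - controls the environment in which the model is loaded (NEEDS TO BE MODIFIED)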
 ┣ 📜`MLmodel` - manifest file with information about the model type
 ┣ 📜`model.pkl` - serialized model, such as a pickle
 ┗ 📂`code`
    ┗ 📜`model_wrapper.py` - entry point to load and launch the model (NEEDS TO BE MODIFIED)

You can move any code needed to load the model into the code folder, edit the model_wrapper.py file to expose the necessary functions, and add any necessary dependencies to the conda.yaml file.

import uuid

packaged_model_path = f"/tmp/custom_model_wrapper_example_{uuid.uuid4()}" # empty directory to package model in
tru.create_packaged_python_model(
    packaged_model_path,
    model_path=model_path,
    python_version=platform.python_version(),
    additional_pip_dependencies=[
        f"xgboost=={xgboost.__version__}",
        f"scikit-learn=={sklearn.__version__}",
        f"cloudpickle=={cloudpickle.__version__}"
    ])

INFO:truera.client.remote_truera_workspace:Successfully generated template model package at path '/tmp/custom_model_wrapper_example_56730aa3-e145-4077-a352-8d2a34d8b38f'
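
The cell above builds the version pins from each library's `__version__` attribute. An alternative sketch, assuming Python 3.8+, looks the versions up by distribution name with the standard library's `importlib.metadata`. The helper name `pin_installed` is hypothetical.

```python
from importlib import metadata


def pin_installed(packages):
    """Build 'name==version' pins for distributions installed in this environment.

    Returns (pins, not_found) so that missing packages are reported, not guessed.
    """
    pins, not_found = [], []
    for name in packages:
        try:
            pins.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            not_found.append(name)
    return pins, not_found
```

For this notebook the call would be `pin_installed(["xgboost", "scikit-learn", "cloudpickle"])`. The lookup uses the distribution name (`scikit-learn`), not the import name (`sklearn`).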

Let's look at the packaged model path. The directory contains a few key files:
* Our model wrapper, in code/custom_template_regression_predict_wrapper.py
* Our conda.yaml file, which specifies the environment for the model
* Our serialized model, in custom_regressor.p

! tree {packaged_model_path}

/bin/bash: tree: command not found
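
The `tree` command isn't installed in this environment. As a fallback, a few lines of standard-library Python can list the same directory. This is a minimal sketch; `list_tree` is a hypothetical helper name.

```python
from pathlib import Path


def list_tree(root):
    """Return sorted relative paths under root, with a trailing '/' on directories."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() + ("/" if p.is_dir() else "")
        for p in root.rglob("*")
    )
```

Calling `print("\n".join(list_tree(packaged_model_path)))` should show the wrapper under `code/` alongside `conda.yaml`, `MLmodel`, and the serialized model.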

First, let's sanity check that the conda.yaml looks good to go:

! cat {packaged_model_path}/conda.yaml

dependencies:
- python=3.10.6
- pip=22.1.2
- pip:
  - cloudpickle==2.2.1
  - scikit-learn==1.3.0
  - xgboost==1.7.6
name: truera-env
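
For a check that goes beyond eyeballing the file, the sketch below pulls the pip entries out of the text and flags any that lack an exact `==` pin. It is a naive line-based parse that assumes the layout shown above (pip entries indented under `- pip:`); a real YAML parser would be more robust. The helper name is hypothetical.

```python
# The conda.yaml contents printed above, reproduced as a string for this sketch.
CONDA_YAML = """dependencies:
- python=3.10.6
- pip=22.1.2
- pip:
  - cloudpickle==2.2.1
  - scikit-learn==1.3.0
  - xgboost==1.7.6
name: truera-env
"""


def pip_dependencies(conda_yaml_text):
    """Collect the entries listed under '- pip:' (assumes two-space indentation)."""
    deps, in_pip = [], False
    for line in conda_yaml_text.splitlines():
        if line.strip() == "- pip:":
            in_pip = True
        elif in_pip and line.startswith("  - "):
            deps.append(line.strip()[2:].strip())
        else:
            in_pip = False
    return deps


deps = pip_dependencies(CONDA_YAML)
unpinned = [d for d in deps if "==" not in d]
```

An empty `unpinned` list means every pip dependency carries an exact version.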

Great! Our dependencies were picked up. Now, let's check out what the template wrapper looks like right now...

! cat {packaged_model_path}/code/custom_template_regression_predict_wrapper.py

from typing import Dict, Union

import numpy
import pandas


class PredictWrapper(object):

    def __init__(self, model):
        self.model = model

    def predict(
        self, model_input: Union[pandas.DataFrame, Dict[str, numpy.ndarray],
                                 numpy.ndarray]
    ) -> Union[numpy.ndarray, pandas.Series, pandas.DataFrame]:
        """
            This function is a wrapper around your model's predict method.  
            Using the model artifacts that you have loaded, implement this predict function.

        Args:
            model_input (Union[
                pandas.DataFrame, 
                Dict[str, numpy.ndarray (ndim=1)], 
                numpy.ndarray (ndim=2)]): 
                Data that can be given directly to the model.

        Returns:
            Union[numpy.ndarray, pandas.Series, pandas.DataFrame]: Model output
        """
        pass

    def get_model(self):
        """
            **Optional**
            Only implement if you are using a model that is supported by truera-tree influence compuations.

        Returns:
            Model Object: Provides an api to access the underlying model structure
        """
        return self.model


def _load_pyfunc(path: str):
    """
        The purpose of this function is to load any serialized components of a given model and return a model wrapper that implements a predict method.
        This function is provided with the path to the model data file/directory that was specified during packaging.
        This function should return an instance of a class implementing a predict method with the signature shown above.  

    Args:
        path (str): Contains the path to the serialized model that was provided during packaging.  
            This is specified by the "data" field in the MLmodel file at the root of the packaged model

    Returns:
        PredictWrapper: Instance of the class defined above with an implemented predict method
    """

    loaded_model = "Loaded and Deserialized Model"
    return PredictWrapper(loaded_model)

There are a few things that we need to modify. First, the predict function isn't implemented, so we'll need to implement it. Second, the get_model function doesn't apply to our custom model, so we can remove that method. Finally, we'll need to make sure the wrapper deserializes our serialized model. Let's rewrite the wrapper to look like the following:

wrapper_text = """
import cloudpickle

class PredictWrapper(object):

    def __init__(self, model):
        self.model = model

    def predict(self, model_input):
        return self.model.get_ensemble_predictions(model_input)


def _load_pyfunc(path: str):
    with open(path, "rb") as f:
        return PredictWrapper(cloudpickle.load(f))
"""

with open(f"{packaged_model_path}/code/custom_template_regression_predict_wrapper.py", "w") as h:
    h.write(wrapper_text)

Uploading the packaged model

We're almost there -- let's locally verify the packaged model and then upload it.

Local verification will attempt to load the model, call its predict function, and test that the output matches what's expected for the project type. It will not attempt to create the conda environment necessary to run your model, so you may want to independently verify that the conda.yaml looks correct.

tru.verify_packaged_model(packaged_model_path)

INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!
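
The load, predict, and output-check steps described above can also be exercised by hand, which helps when debugging a wrapper. The following is a rough sketch of that idea, not TruEra's implementation: it imports a wrapper file by path, calls its `_load_pyfunc` entry point, and checks that `predict` returns one value per input row. The toy wrapper, the toy "model" (a plain dict), and the helper name `smoke_test_wrapper` are all made up for this example.

```python
import importlib.util
import pickle
import tempfile
from pathlib import Path

# Toy wrapper source; it stands in for the real wrapper file in this sketch.
TOY_WRAPPER = '''
import pickle

class PredictWrapper(object):
    def __init__(self, model):
        self.model = model

    def predict(self, model_input):
        # The toy "model" is a dict holding one constant prediction.
        return [self.model["constant"] for _ in model_input]

def _load_pyfunc(path: str):
    with open(path, "rb") as f:
        return PredictWrapper(pickle.load(f))
'''


def smoke_test_wrapper(wrapper_file, model_file, rows):
    """Import a wrapper file by path, load the model, and call predict on rows."""
    spec = importlib.util.spec_from_file_location("wrapper_under_test", str(wrapper_file))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    wrapper = module._load_pyfunc(str(model_file))
    preds = wrapper.predict(rows)
    if len(preds) != len(rows):
        raise ValueError("expected one prediction per input row")
    return preds


with tempfile.TemporaryDirectory() as tmp:
    wrapper_file = Path(tmp) / "toy_wrapper.py"
    model_file = Path(tmp) / "toy_model.p"
    wrapper_file.write_text(TOY_WRAPPER)
    with open(model_file, "wb") as f:
        pickle.dump({"constant": 1.5}, f)
    toy_preds = smoke_test_wrapper(wrapper_file, model_file, rows=[[0.0], [1.0], [2.0]])
```

Pointing `smoke_test_wrapper` at the real wrapper and serialized model in `packaged_model_path`, with a few rows of `XS_TEST`, would exercise the same path, provided the current environment has the model's dependencies installed.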

Things look good, so finally, let's upload the packaged model via tru.add_packaged_python_model.

tru.add_packaged_python_model("ensemble_model", packaged_model_path)

INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!

Uploading custom_regressor.p (423.8KiB) -- ### -- file upload complete.
Uploading conda.yaml (133.0B) -- ### -- file upload complete.
Uploading MLmodel (228.0B) -- ### -- file upload complete.
Uploading custom_template_regression_predict_wrapper.py (325.0B) -- ### -- file upload complete.
Uploading custom_template_regression_predict_wrapper.cpython-310.pyc (935.0B) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Model "ensemble_model" is added and associated with remote data collection "sklearn_data". "ensemble_model" is set as the model for the workspace context.

Model uploaded to: http://localhost:8000/home/p/CustomModelWrapperExample/m/ensemble_model/

WARNING:truera.client.intelligence.remote_explainer:Background split for `data_collection` "sklearn_data" is currently not set. Setting it to "all"
INFO:truera.client.remote_truera_workspace:Triggering computations for model predictions on split all.
INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "sklearn_data". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "ensemble_model".

Access model predictions and influences

You can follow the UI link above to see information about the uploaded model, but you can also start the analysis directly from the Python SDK. Now that the model is uploaded, we can retrieve model outputs such as predictions and feature influences from the TruEra system:

tru.set_data_split(tru.get_data_splits()[0])
tru.compute_predictions(0, 10)
tru.get_ys_pred(0, 10)

INFO:truera.client.truera_workspace:Download temp_dir: /tmp/tmp4u4dzmuj
INFO:truera.client.truera_workspace:Syncing data collection "sklearn_data" to local.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "sklearn_data". 
INFO:truera.client.truera_workspace:Syncing data split "all" to local.
INFO:truera.client.local.local_truera_workspace:Data split "all" is added to local data collection "sklearn_data", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing model ensemble_model to local.
INFO:truera.client.local.local_truera_workspace:Model "ensemble_model" is added and associated with local data collection "sklearn_data". "ensemble_model" is set as the model for the workspace context.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:The previous data collection ("sklearn_data") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "sklearn_data". 
INFO:truera.client.local.local_truera_workspace:Setting local model context to "ensemble_model".

	__truera_prediction__
2211ec2e-444a-451b-b204-d98b9593d146
5329	2.916948
1970	1.245052
20074	1.478647
18829	0.653502
9700	1.709219
4612	2.251629
6095	1.974272
11610	3.252408
16774	2.492151
4897	1.462586

tru.compute_feature_influences(0, 10)

INFO:truera.client.truera_workspace:Download temp_dir: /tmp/tmp4u4dzmuj
INFO:truera.client.truera_workspace:Syncing data collection "sklearn_data" to local.
WARNING:truera.client.truera_workspace:Data split "all" exists in local. Skipping.
WARNING:truera.client.truera_workspace:Model "ensemble_model" already exists in local. Skipping.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
WARNING:truera.client.local.intelligence.local_explainer:Background split for `data_collection` "sklearn_data" is currently not set. Setting it to "all"

Uploading tmpo7xrxab1.parquet (7.3KiB) -- ### -- file upload complete.
Put resource done.

INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude
2211ec2e-444a-451b-b204-d98b9593d146
5329	-0.466371	-0.005681	0.096497	0.062989	0.010212	0.313065	0.852138	-0.041220
1970	0.021534	-0.080446	0.020525	-0.005404	0.003543	-0.027333	-1.078163	0.343200
20074	-0.002819	-0.017145	0.013989	0.261323	-0.009127	-0.001227	-1.127563	0.322442
18829	-0.401656	-0.013827	-0.019198	0.038256	-0.003111	-0.020035	-2.249670	1.232295
9700	-0.410976	0.021429	-0.005982	-0.014728	0.004310	-0.078340	-0.606478	0.750406
4612	-0.545172	-0.040164	0.265856	0.011569	0.017463	0.108899	0.715300	-0.326737
6095	0.001393	0.039902	0.010634	0.006687	-0.002032	-0.158460	0.584464	-0.605674
11610	0.967600	-0.094058	0.104133	-0.028412	-0.015054	-0.038764	0.707584	-0.465888
16774	-0.128434	-0.009906	0.013805	-0.018367	0.015722	-0.053234	-0.839429	1.409132
4897	-0.565491	0.019805	0.052607	-0.001495	0.007384	-0.168918	0.609027	-0.569837
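
One common way to summarize a table like this is to rank features by their mean absolute influence. The sketch below does so in plain Python for the first two rows above (ids 5329 and 1970); it is an illustrative summary, not a TruEra API, and `rank_by_mean_abs` is a hypothetical helper.

```python
# Two rows copied from the influence table above.
influences = {
    5329: {"MedInc": -0.466371, "HouseAge": -0.005681, "AveRooms": 0.096497,
           "AveBedrms": 0.062989, "Population": 0.010212, "AveOccup": 0.313065,
           "Latitude": 0.852138, "Longitude": -0.041220},
    1970: {"MedInc": 0.021534, "HouseAge": -0.080446, "AveRooms": 0.020525,
           "AveBedrms": -0.005404, "Population": 0.003543, "AveOccup": -0.027333,
           "Latitude": -1.078163, "Longitude": 0.343200},
}


def rank_by_mean_abs(rows):
    """Return (feature, mean |influence|) pairs, largest first."""
    features = list(next(iter(rows.values())))
    n = len(rows)
    scores = {f: sum(abs(r[f]) for r in rows.values()) / n for f in features}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_by_mean_abs(influences)
```

If the influences are held in a pandas DataFrame instead, `df.abs().mean().sort_values(ascending=False)` computes the same ranking.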