Ingesting Custom Python Models via Custom Model Wrapper¶
This notebook gives an example of ingesting a custom ensemble Python model through the Python SDK by creating a custom model wrapper.
Before you begin¶
- Install the TruEra Python SDK
What we'll cover ☑️¶
- Packaging/serializing a custom Python model (i.e., not a standard xgboost, scikit-learn, etc. model) by creating our own custom model wrapper.
- Locally verifying that this packaged model works.
- Ingesting the model.
Connecting to your TruEra endpoint¶
- Provide the TruEra URL.
- Provide your authentication token (see Quickstart).
- Create your TruEra workspace object!
# Some preliminaries we'll need for the model...
import pandas as pd
import pickle
import platform
import numpy as np
import sklearn
import xgboost
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# FILL ME IN!
TRUERA_URL = "<TRUERA_URL>"
TOKEN = "<TOKEN>"
from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication
auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)
Add a project with some data¶
Before ingesting a model, it is recommended that you first ingest data. This will help validate the model later on. Here, we'll use the sample "California Housing" dataset from scikit-learn.
# First, let's create a regression project.
tru.add_project("CustomModelWrapperExample", score_type="regression")
# Next, let's retrieve our data from sklearn and split our data into train and test splits.
data_bunch = fetch_california_housing()
XS_ALL = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_ALL = pd.DataFrame(data_bunch["target"], columns=["label"])
XS_TRAIN, XS_TEST, YS_TRAIN, YS_TEST = train_test_split(XS_ALL, YS_ALL, test_size=100)
data_all = XS_ALL.merge(YS_ALL, left_index=True, right_index=True).reset_index(names="id")
data_test = XS_TEST.merge(YS_TEST, left_index=True, right_index=True).reset_index(names="id")
from truera.client.ingestion import ColumnSpec
column_spec = ColumnSpec(
    id_col_name="id",
    pre_data_col_names=XS_ALL.columns.to_list(),
    label_col_names=YS_ALL.columns.to_list()
)
# And then upload them to TruEra's system.
tru.add_data_collection("sklearn_data")
tru.add_data(data_all, data_split_name="all", column_spec=column_spec)
tru.add_data(data_test, data_split_name="test", column_spec=column_spec)
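As a quick check that ingestion succeeded, we can list the splits now registered in the data collection (a minimal sketch using the same workspace object):
# Quick check: both splits should now appear in the data collection.
print(tru.get_data_splits())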
Train the model¶
We're now ready to train a model. We'll create a simple ensemble regressor that averages predictions between two base models:
* An xgboost.XGBRegressor boosted decision tree
* An sklearn.linear_model.LinearRegression linear regressor
class CustomEnsembleRegressor:
    def __init__(self):
        self.base_model1 = xgboost.XGBRegressor()  # xgboost regressor
        self.base_model2 = LinearRegression()  # sklearn linear regression
        self.np = np  # store numpy on the instance so the unpickled model doesn't rely on a global import
        self.weights = [0.5, 0.5]  # equal weighting of the two base models

    def train_model(self, x: pd.DataFrame, y: pd.DataFrame):
        self.base_model1.fit(x, y)
        self.base_model2.fit(x, y)

    def get_ensemble_predictions(self, x: pd.DataFrame) -> np.ndarray:
        base_model1_preds = self.base_model1.predict(x).flatten()
        base_model2_preds = self.base_model2.predict(x).flatten()
        return self.np.average(
            self.np.vstack((base_model1_preds, base_model2_preds)),
            axis=0,
            weights=self.weights
        )

c = CustomEnsembleRegressor()
c.train_model(XS_TRAIN, YS_TRAIN)
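Before packaging, let's quickly sanity-check the trained ensemble locally. This snippet is just an illustrative check (the RMSE computation isn't required by TruEra):
# Illustrative sanity check: score the ensemble on the held-out test split.
test_preds = c.get_ensemble_predictions(XS_TEST)
rmse = np.sqrt(np.mean((test_preds - YS_TEST["label"].to_numpy()) ** 2))
print("First few predictions:", test_preds[:5])
print("Test RMSE:", rmse)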
Package the model¶
Now that the model is trained, let's package it. There are a few paths to do so, but in this notebook, we'll package our model by generating a template wrapper that we will manually edit. First, let's serialize the trained model object:
import cloudpickle

model_path = "/tmp/custom_regressor.p"
with open(model_path, "wb") as h:
    cloudpickle.dump(c, h)
Now, let's generate a template wrapper in a path of our choice, passing in the pip dependencies we think we'll need to load the model. We do this with the tru.create_packaged_python_model command.
This generates a sequence of files like so:
📂 DIRECTORY OF PACKAGED MODEL
┣ 📜`conda.yaml` - controls the environment that the model is loading (NEEDS TO BE MODIFIED)
┣ 📜`MLmodel` - manifest file with information about the model type
┣ 📜`model.pkl` - serialized model, such as a pickle
┗ 📂`code`
┗ 📜`model_wrapper.py` - entry point to load and launch the model (NEEDS TO BE MODIFIED)
You can move any code needed to load the model into the code folder, edit the model_wrapper.py file to expose the necessary functions, and add any necessary dependencies to the conda.yaml file.
import uuid
packaged_model_path = f"/tmp/custom_model_wrapper_example_{uuid.uuid4()}" # empty directory to package model in
tru.create_packaged_python_model(
    packaged_model_path,
    model_path=model_path,
    python_version=platform.python_version(),
    additional_pip_dependencies=[
        f"xgboost=={xgboost.__version__}",
        f"scikit-learn=={sklearn.__version__}",
        f"cloudpickle=={cloudpickle.__version__}"
    ]
)
Let's check out the packaged model path... we can see that there's a directory structure here with a few key files:

* Our model wrapper itself, in code/custom_template_regression_predict_wrapper.py
* Our conda.yaml file, which specifies the environment for the model
* Our serialized model itself, in custom_regressor.pkl
! tree {packaged_model_path}
First, let's sanity-check that the conda.yaml looks good to go:
! cat {packaged_model_path}/conda.yaml
Great! Our dependencies were picked up. Now, let's check out what the template wrapper looks like right now...
! cat {packaged_model_path}/code/custom_template_regression_predict_wrapper.py
There are a few things that we need to modify. First, the predict function isn't implemented, so we'll need to go ahead and do that. Second, the get_model function doesn't apply to our custom model, so we can remove that signature. Finally, we'll need to make sure the wrapper deserializes our serialized model. Let's rewrite the wrapper to look like the following:
wrapper_text = """
import cloudpickle


class PredictWrapper(object):
    def __init__(self, model):
        self.model = model

    def predict(self, model_input):
        # Delegate to our custom ensemble's prediction function.
        return self.model.get_ensemble_predictions(model_input)


def _load_pyfunc(path: str):
    # Deserialize the pickled CustomEnsembleRegressor and wrap it.
    with open(path, "rb") as f:
        return PredictWrapper(cloudpickle.load(f))
"""

with open(f"{packaged_model_path}/code/custom_template_regression_predict_wrapper.py", "w") as h:
    h.write(wrapper_text)
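Optionally, we can smoke-test the rewritten wrapper directly before handing it to TruEra. This is a minimal sketch: it imports the wrapper file under the arbitrary module name wrapper_check and calls its predict function on a few test rows.
# Optional smoke test (illustrative): load the wrapper module from its file path
# and verify that _load_pyfunc plus predict round-trip our serialized model.
import importlib.util

spec = importlib.util.spec_from_file_location(
    "wrapper_check",
    f"{packaged_model_path}/code/custom_template_regression_predict_wrapper.py"
)
wrapper_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(wrapper_module)

loaded_model = wrapper_module._load_pyfunc(model_path)
print(loaded_model.predict(XS_TEST.head()))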
Uploading the packaged model¶
We're almost there -- let's locally verify the packaged model and then upload it.
Local verification will attempt to load the model, call its predict function, and test that the output matches what's expected for the project type. It will not attempt to create the conda environment necessary to run your model, so you may want to independently verify that the conda.yaml looks correct.
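One lightweight way to do that independent check is to parse the conda.yaml and confirm the pip pins match the versions used at training time. A minimal sketch, assuming PyYAML is available in the notebook environment:
# Illustrative check: parse conda.yaml and print the environment spec.
import yaml

with open(f"{packaged_model_path}/conda.yaml") as f:
    conda_env = yaml.safe_load(f)
print(conda_env)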
tru.verify_packaged_model(packaged_model_path)
Things look good, so finally, let's upload the packaged model via tru.add_packaged_python_model.
tru.add_packaged_python_model("ensemble_model", packaged_model_path)
Access model predictions and influences¶
We can follow the UI link above to explore the uploaded model, but we can also start the analysis directly from the Python SDK. Now that the model is uploaded, we can retrieve model outputs like predictions and feature influences from the TruEra system:
tru.set_data_split(tru.get_data_splits()[0])  # point the workspace at an ingested split
tru.compute_predictions(0, 10)  # compute predictions for the first 10 records
tru.get_ys_pred(0, 10)  # retrieve those predictions
tru.compute_feature_influences(0, 10)  # compute feature influences for the first 10 records