Pipeline Models: Ingesting Splits and Models with a Transform

This quickstart covers how to ingest models and data that use a feature transformation. It assumes familiarity with basic TruEra Python SDK commands, so if you haven't already, work through the Diagnostics Quickstart first. When you're ready, download the sample data (TAR or ZIP) from Resources in Account Settings and extract the census_income folder, which contains the Census Income project.

Why are feature transformations needed?

When computing influences, you'll want to reason about the model with respect to human-readable input features rather than post-processed, model-readable data. For example, consider a categorical feature taking values \(A\), \(B\), or \(C\). Suppose it's one-hot encoded into a vector in \(\{0, 1\}^3\) before being fed into the model. Rather than computing influences with respect to this binary vector, it's more helpful to analyze the model in terms of the human-readable, pre-transformed feature with values \(A, B, C\).
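As a minimal sketch of the encoding described above (the helper name `one_hot` and the category order are illustrative assumptions, not part of the TruEra SDK), a single categorical value maps to a three-element binary vector:

```python
# Illustrative only: a hand-rolled one-hot encoder for a feature with values A, B, C.
CATEGORIES = ["A", "B", "C"]  # assumed category order


def one_hot(value, categories=CATEGORIES):
    """Return a binary list with a 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


encoded = one_hot("B")
print(encoded)  # [0, 1, 0]
```

The model sees three columns, but an influence is usually easier to interpret when it is reported for the single original feature.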

The recommended way to transform ingested data into model-readable data is to pass the transformation function or object along with the model when calling add_python_model. This quickstart walks through that flow first, then covers some alternative ways to add transforms.

1. Connect to your TruEra endpoint

Provide the TruEra URL.
Provide your authentication token (see Quickstart).
TrueraWorkspace creation will also verify the connectivity to TruEra services.

TRUERA_URL = "<TRUERA_URL>"
TOKEN = "<TRUERA_TOKEN>"
QUICKSTART_CENSUS_INCOME_DIRECTORY = "<QUICKSTART_CENSUS_INCOME_DIRECTORY>"

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

2. Create project and data collection

A project is a collection of models and datasets solving a single problem statement. Users can be provided access to collaborate on a project.

tru.add_project("Quickstart_FeatureTransform", score_type="probits")
tru.get_projects()

3. Transform data and train model

Let's load our data and one-hot encode all categorical variables as an example of a transform. For illustration, we then train a scikit-learn GradientBoostingClassifier on the transformed data.

import os
import pandas as pd 
import numpy as np 

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X_raw = pd.read_csv(os.path.join(QUICKSTART_CENSUS_INCOME_DIRECTORY, "data_raw.csv"))
Y = pd.read_csv(os.path.join(QUICKSTART_CENSUS_INCOME_DIRECTORY, "label.csv"))

encoder = OneHotEncoder(handle_unknown='ignore')
categorical_columns = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "native-country"]
transformer = ColumnTransformer([("one hot encoding", encoder, categorical_columns)], remainder="passthrough")
X_num = transformer.fit_transform(X_raw.drop(columns=["id"]))

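# Flatten the label column to 1-D; scikit-learn warns when y is passed as a column vector.
y = Y.drop(columns=["id"]).values.ravel()
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.4, random_state=0)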
model = GradientBoostingClassifier(n_estimators=50, max_depth=3, subsample=0.7)
model.fit(X_train_num, y_train)
model.score(X_test_num, y_test)

4. Add a split and feature map

Now we can upload some data to our data collection to prepare for analyzing the model. We only need to upload the "pre-transformed" raw data, and can encode the transform later when we ingest our model.

However, we must also add a feature map that tells the system how pre-transform columns map to post-transform columns.

Using the same toy example from before, suppose the pre-transform data contains a single categorical column grade taking values A, B, and C, which maps to three one-hot encoded post-transform columns: grade_A, grade_B, and grade_C. The feature map might then look like:

{
    "grade" : ["grade_A", "grade_B", "grade_C"]
}

Let's generate a feature map and add it to this data collection:

import re

# Start with an empty list of post-transform columns for every pre-transform column.
columns = [c for c in X_raw.columns if c != "id"]
feature_map = {pre_feature: [] for pre_feature in columns}
for post_feature in transformer.get_feature_names_out():
    # Capture the pre-transform column name that follows the "__" prefix separator.
    pattern = re.findall(r"__([A-Za-z0-9\-]+)_?", post_feature)
    feature_map[pattern[0]].append(post_feature)

tru.add_data_collection(
    "data_collection_transform",
    pre_to_post_feature_map=feature_map,
    provide_transform_with_model=True)
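To see how the regular expression in the loop assigns post-transform names to pre-transform columns, here is a small standalone sketch. The feature names are hypothetical examples written in the "<step name>__<column>_<category>" style, and the `demo_` variables are illustrative so they do not overwrite the quickstart's own feature_map.

```python
import re

# Hypothetical post-transform names; real names come from get_feature_names_out().
demo_post_features = [
    "one hot encoding__workclass_Private",
    "one hot encoding__marital-status_Never-married",
    "remainder__age",
]
demo_pre_features = ["workclass", "marital-status", "age"]

demo_map = {name: [] for name in demo_pre_features}
for post_feature in demo_post_features:
    # Capture the run of letters, digits, and hyphens right after "__".
    match = re.findall(r"__([A-Za-z0-9\-]+)_?", post_feature)
    demo_map[match[0]].append(post_feature)

print(demo_map)
```

Note that the character class excludes underscores, so this pattern would truncate a column name that itself contains one (for example, a hypothetical hours_per_week would be captured as hours). A different parsing rule would be needed for such columns.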

And now we'll add a data split as well.

data = X_raw.merge(Y, on="id")

from truera.client.ingestion import ColumnSpec

column_spec = ColumnSpec(
    id_col_name="id",
    label_col_names="label",
    pre_data_col_names=[c for c in X_raw.columns if c != "id"]
)

tru.add_data(
    data, 
    data_split_name = "split-all",
    column_spec=column_spec
)

5. Add transform and model

This is the last step before we can start analyzing the model in TruEra dashboards. Our transform and model are simple scikit-learn objects, so you can add them directly to TruEra.

tru.add_python_model("model_with_transform", model, transformer=transformer)

tru.compute_feature_influences(0, 2).T

What if my transform is more complicated?

If the transform is not supported in the above way, there are a couple of alternative ways to ingest your model.

- You could use create_packaged_python_model to custom package the transform and model. This will give you the ability to update the wrapper yourself.
- If it is difficult to encapsulate in a single transform object or function, or it is difficult to expose to the TruEra system, you can manually provide pre-processed (human-readable) and post-processed (model-readable) data to the system, along with a feature map that is a one-to-many mapping between the pre-processed and post-processed columns.

Package and add a model

We will package the same model as before using create_packaged_python_model. This is ONLY supported on remote.

If the model is among the supported models, TruEra can automatically package the model in full. We will just need to edit the model wrapper to add our transformation before ingesting it.

First, let's package the base model:

import uuid

packaged_model_path = f"/tmp/model_feature_transform_{uuid.uuid4()}"
tru.create_packaged_python_model(packaged_model_path, model_obj=model)

Checking out what's in our packaged model directory...

! tree {packaged_model_path}
! cat {packaged_model_path}/code/sklearn_classification_predict_wrapper.py

The wrapper looks good but we'll need to go through two additional steps:

First, we'll need to add a method to our wrapper with the following signature:

def transform(self, pre_transform_df: pd.DataFrame) -> pd.DataFrame

We'll also need to serialize our scikit-learn transformer with the packaged model, and make sure we can load it in the model wrapper. We can save the transformer in the same directory as the pickled model:

import pickle

with open(f"{packaged_model_path}/transformer.pkl", "wb") as h:
    pickle.dump(transformer, h)

wrapper_text = """
import pickle
import pandas as pd
import os


class PredictProbaWrapper(object):

    def __init__(self, model, transformer):
        self.model = model
        self.transformer = transformer

    def predict(self, df):
        return pd.DataFrame(
            self.model.predict_proba(df), columns=self.model.classes_
        )

    def get_model(self):
        return self.model

    def transform(self, pre_transform_df):
        return pd.DataFrame(
            data=self.transformer.transform(pre_transform_df).toarray(), 
            columns=self.transformer.get_feature_names_out()
        )

def _load_model_from_local_file(path):
    parent_dir = os.path.dirname(path)
    with open(path, "rb") as f:
        model = pickle.load(f)
    with open(os.path.join(parent_dir, "transformer.pkl"), "rb") as f:
        transformer = pickle.load(f)
    return PredictProbaWrapper(model, transformer)


def _load_pyfunc(path):
    return _load_model_from_local_file(path)
"""

with open(f"{packaged_model_path}/code/sklearn_classification_predict_wrapper.py", "w") as h:
    h.write(wrapper_text)
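The loader in the wrapper above looks for transformer.pkl in the same directory as the pickled model. The following stdlib-only sketch illustrates that sibling-file pattern in isolation. It uses plain dictionaries as stand-ins for the fitted scikit-learn objects, and the function name `load_pair` and the file name model.pkl are hypothetical, chosen only for this illustration.

```python
import os
import pickle
import tempfile


def load_pair(model_path):
    """Load a pickled model and the transformer.pkl stored next to it."""
    parent_dir = os.path.dirname(model_path)
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(os.path.join(parent_dir, "transformer.pkl"), "rb") as f:
        transformer = pickle.load(f)
    return model, transformer


# Stand-in objects; in the quickstart these would be the fitted model and transformer.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    with open(model_path, "wb") as f:
        pickle.dump({"kind": "model"}, f)
    with open(os.path.join(tmp_dir, "transformer.pkl"), "wb") as f:
        pickle.dump({"kind": "transformer"}, f)
    loaded_model, loaded_transformer = load_pair(model_path)

print(loaded_model, loaded_transformer)
```

If the transformer file is saved anywhere other than next to the model file, a loader written this way will raise FileNotFoundError, so it is worth checking the directory listing before verifying the package.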

To test things out locally, let's verify the packaged model:

tru.verify_packaged_model(packaged_model_path)

Things look good so we can add our packaged model now.

tru.add_packaged_python_model("model_with_packaged_transform", packaged_model_path)

And to make sure things work, let's get some influences with respect to our pre-transformed data!

tru.compute_feature_influences(0, 2).T

Using pre/post data

To provide feature transformations this way, you do not need to package the model with any additional transform function. Instead, you will specify separate pre- and post-processed data when creating a split, and add the feature map to the corresponding data collection.
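Because this approach relies on the feature map lining up with the columns you upload, it can help to sanity-check the map first. The sketch below is one possible check, not a TruEra API: the function `check_feature_map` is hypothetical, and it assumes each post-transform column should belong to exactly one pre-transform column, as a one-to-many map implies.

```python
from collections import Counter


def check_feature_map(feature_map, pre_columns, post_columns):
    """Return a list of problems; an empty list means the map looks consistent."""
    problems = []
    mapped = [post for posts in feature_map.values() for post in posts]

    missing_pre = sorted(set(pre_columns) - set(feature_map))
    if missing_pre:
        problems.append(f"pre columns missing from map: {missing_pre}")

    repeated = sorted(p for p, n in Counter(mapped).items() if n > 1)
    if repeated:
        problems.append(f"post columns mapped more than once: {repeated}")

    unmapped = sorted(set(post_columns) - set(mapped))
    if unmapped:
        problems.append(f"post columns not in map: {unmapped}")

    unknown = sorted(set(mapped) - set(post_columns))
    if unknown:
        problems.append(f"mapped columns absent from post data: {unknown}")
    return problems


# Toy example from earlier: one pre column, three one-hot post columns.
toy_map = {"grade": ["grade_A", "grade_B", "grade_C"]}
ok = check_feature_map(toy_map, ["grade"], ["grade_A", "grade_B", "grade_C"])
bad = check_feature_map(toy_map, ["grade"], ["grade_A", "grade_B"])
print(ok)
print(bad)
```

In the quickstart's terms, the pre columns would be the non-id columns of X_raw and the post columns would be the names returned by transformer.get_feature_names_out().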

First, let's create a new data collection with the same feature map from before.

tru.add_data_collection(
    "data-collection-with-pre-post",
    pre_to_post_feature_map=feature_map,
    provide_transform_with_model=False
)

Now, let's add a split with our separate pre- and post- dataframes:

X_num_df = pd.DataFrame(X_num.toarray(), columns=transformer.get_feature_names_out())
X_num_df["id"] = X_raw.id

data = X_raw.merge(X_num_df, on="id").merge(Y, on="id")

column_spec = ColumnSpec(
    id_col_name="id",
    pre_data_col_names=[c for c in X_raw.columns if c != "id"],
    post_data_col_names=[c for c in X_num_df.columns if c != "id"],
    label_col_names=[c for c in Y if c != "id"]
)

tru.add_data(data, data_split_name="split-with-pre-post", column_spec=column_spec)

And lastly, we can add our model from the trained model object. There is no need to repackage it with a transform function, because we've already provided this transform implicitly with the pre/post split data.

tru.add_python_model("model_without_transform", model)

And finally, getting some influences...

tru.compute_feature_influences(0, 2).T