Diagnostics Quickstart Notebook

Follow the steps below to use the Python SDK to upload your models and datasets. A sample project is provided for download.

Prerequisites

Make sure the Python SDK or CLI client is already installed. See Installation and Access for instructions.

Connecting to TruEra

For free users, the TruEra URL takes the form: http://app.truera.net.

TruEra URL

For most users, the TruEra URL will take the form http://<your-truera-access-url>. Enterprise users may need to consult with their IT group for specific instructions.

Using the API

Depending on your organization's access configuration, connect to TruEra using one of the following methods:

Generating an authentication token

  1. Navigate to Users in the TruEra dashboard.
  2. Click Generate Credentials.
  3. Click Copy Auth Token.

Use a username/password

Alternatively, you can use a username/password to connect to your deployment via the SDK or CLI.

Refer to API Authentication for additional details.

Using the Python SDK or CLI

Although the connection can be set on individual commands, it is much more convenient to set it once and then verify the connection. The examples below use the Python SDK; the CLI follows the same pattern. See TruEra URL above for the typical URL format.

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication, BasicAuthentication

TRUERA_URL = "<TRUERA_URL>"
USERNAME = "<YOUR-USERNAME>"
PASSWORD = "<YOUR-PASSWORD>"
TOKEN = "<YOUR-AUTHENTICATION-TOKEN>"
QUICKSTART_DOWNLOAD_DIRECTORY = "<QUICKSTART_DOWNLOAD_DIRECTORY_FOR_CENSUS_INCOME>"

# Connect with username/password (basic) authentication
auth = BasicAuthentication(USERNAME, PASSWORD)
tru = TrueraWorkspace(TRUERA_URL, auth)

# Or connect with token authentication instead:
# auth = TokenAuthentication(TOKEN)
# tru = TrueraWorkspace(TRUERA_URL, auth)

Adding a Project with Sample Data and Model

Based on datasets for Census Income and Diabetes Readmission, two sample projects are provided for download from the TruEra Web App (http://<your-truera-deployment-url>/downloads/).

The Census Income project, used throughout this quickstart tutorial, includes a formatted version of the data and a pickled scikit-learn Python model to illustrate the model ingestion process. For other frameworks, the process is similar.

To download the sample model and data set:

  1. Click the ZIP or TAR link in the Web App's Downloads page (select Downloads in the left-side navigator).

  2. Extract the census_income folder to the desired destination directory on your local machine.

Within the census_income folder you'll find the following file structure:

📂 census_income
 ┣ 📜 quickstart_model.pkl
 ┣ 📜 data_raw.csv
 ┣ 📜 data_num.csv
 ┣ 📜 label.csv
 ┗ 📜 extra_data.csv

The files contain:

  • quickstart_model.pkl – pickled Python model for the quickstart
  • data_raw.csv – pre-transform training data (human-readable)
  • data_num.csv – the same data in model-readable (numerical) form
  • label.csv – single column containing the ground-truth labels
  • extra_data.csv – additional columns used for defining segments
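
If you want to inspect the sample files before ingesting them, a quick preview with pandas works well (an optional sketch; QUICKSTART_DOWNLOAD_DIRECTORY is the path you set when connecting above):

import os
import pandas as pd

# Preview the human-readable training data and the ground-truth labels
pd.read_csv(os.path.join(QUICKSTART_DOWNLOAD_DIRECTORY, "data_raw.csv")).head()
pd.read_csv(os.path.join(QUICKSTART_DOWNLOAD_DIRECTORY, "label.csv")).head()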

Next, to upload both the data and model to TruEra, you'll need to:

  1. Create a TruEra project
  2. Define a data collection
  3. Add split data and labels
  4. Package a trained model
  5. Upload the model
  6. Start using TruEra Diagnostics

Let's go through each step in sequence.

Step 1. Create a TruEra project

A TruEra project contains all model versions, datasets, and metadata related to the AI/ML project. See Project Structure for a structural introduction and overview. Project names are case-sensitive and referenced often during model iteration, so a short yet descriptive project name is recommended. Other key parameters in project creation include:

  • Score type – set to probits or logits for classification projects; set to regression for regression projects.
tru.add_project("AdultCensus_DemoNB", score_type="probits")
tru.get_projects()

The new project now appears in the Web App's project list (select Projects at the top of the left-side navigator).

Step 2. Add a data collection

A data collection is an organized inventory of data that can be used with a particular model. It consists of individual data splits, which are slices of input features, labels, and extra metadata. All splits within a given data collection share the same schema, meaning they have the same column names and data types, so in general any split in a model's data collection can be fed into the model. Broadly, a data collection contains:

  • Data splits – a set of in-sample data (train, test, validate) or out-of-sample (OOS)/out-of-time (OOT) data to test model quality, stability and generalizability.
  • Feature Metadata – an (optional) set of metadata defining the set of features for a set of splits and the various models trained and evaluated on them; defines the relationship between pre- and post-transform data and provides feature groups and descriptions for use throughout the tool.

Because our sample data is already transformed and prepped for the model, we can skip providing feature metadata. Note that all splits associated with a data collection are assumed to follow the same input format: if a model can read one split in a data collection, it should be able to read every other split in the same collection.

tru.add_data_collection("demo_data_collection")
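
As with projects, you can confirm that the collection was registered by listing the data collections in the current project:

tru.get_data_collections()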

Step 3. Add a split

A split is a subset of data points used to train or evaluate a set of models. A data collection can have multiple splits added as needed from flat files, pandas DataFrames, or a variety of external data sources, like S3 blobs. For our sample project, we'll create splits from CSV files.

The following additional data can also be specified:

  • Labels – the target variable (ground truth) for the model; must be specified for training and test data.
  • Extra – additional feature columns that may be useful even if not used by the model; for instance, to define customer or data segments for evaluating fairness.

There is also some mandatory metadata required when adding a split, including the split's split_type, which must be one of all, train, test, validate, or oot. As a general rule of thumb:

  • train/test/validate – for uploaded data used in model training, testing and validating steps
  • all – needed when train and test data are combined in a single file
  • oot – needed when data is from a production stream or for purposes other than model building

For more about split types, see Project Structure.

For our sample project, add a single split specifying label data, split type, and the name of the data's id column using the following command syntax:

import os
import pandas as pd

from truera.client.ingestion import ColumnSpec

# Path where you downloaded the census_income quickstart data
X = pd.read_csv(os.path.join(QUICKSTART_DOWNLOAD_DIRECTORY, "data_num.csv"))
Y = pd.read_csv(os.path.join(QUICKSTART_DOWNLOAD_DIRECTORY, "label.csv"))

# Data can be a pandas DataFrame
tru.add_data(
    pd.merge(X, Y, on="id"),
    data_split_name="demo-all",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=[c for c in X.columns if c != "id"],
        label_col_names=["label"]
    )
)
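
The sample bundle's extra_data.csv can be attached in the same call so its columns are available later for defining segments. A sketch, assuming extra_data.csv shares the same id column and that your SDK version's ColumnSpec accepts extra_data_col_names:

# Optional: ingest extra (non-model) columns alongside features and labels
extra = pd.read_csv(os.path.join(QUICKSTART_DOWNLOAD_DIRECTORY, "extra_data.csv"))
tru.add_data(
    pd.merge(pd.merge(X, Y, on="id"), extra, on="id"),
    data_split_name="demo-all-with-extra",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=[c for c in X.columns if c != "id"],
        label_col_names=["label"],
        extra_data_col_names=[c for c in extra.columns if c != "id"]
    )
)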

Step 4. Create the model

Models need to be serialized and packaged according to the specific formats discussed in Model Packaging and Execution. The Python SDK automatically formats many common Python models (scikit-learn, xgboost, lightgbm, catboost, and others), so for Python models you can simply pass in your trained model object and TruEra will take care of the packaging.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Index both frames by id so features and labels stay aligned
X = X.set_index("id")
Y = Y.set_index("id")
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

# Train a simple gradient-boosted classifier on the sample data
model = GradientBoostingClassifier(n_estimators=50, max_depth=3, subsample=0.7)
model.fit(X_train, y_train.values.ravel())  # ravel the single-column label frame
model.score(X_test, y_test)

With the SDK, you can also perform advanced verification by loading a few rows of split data to test your model; see the TrueraWorkspace.verify_packaged_model command for details.
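
For instance, a packaging-and-verify pass might look like the following (a sketch; the output directory is hypothetical and the exact options can vary by SDK version):

# Package the model to a local directory, then verify that TruEra can
# load it and score it against the ingested split (sketch only)
packaged_dir = "/tmp/quickstart_packaged_model"  # hypothetical path
tru.create_packaged_python_model(packaged_dir, model_obj=model)
tru.verify_packaged_model(packaged_dir)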

Step 5. Ingest the model

Using the Python SDK, the new model is added to the current context with the syntax shown below. With the CLI, you would instead link the model to the project and data collection created in Step 1 and Step 2, respectively.

model_name = "quickstart_demo"
tru.add_python_model(model_name, model)
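
To confirm the upload and make this model the active context for subsequent calls, you can list and then set the model (assuming the standard workspace methods get_models and set_model):

# Confirm ingestion and set the active model context
tru.get_models()
tru.set_model(model_name)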

Congratulations! You’ve created a TruEra project and added your first data collection and model.

Step 6. Start your analysis

To surface analytics for a model in the Diagnostics product, TruEra triggers computations for model predictions and feature influences. This happens automatically when you first visit the model's page in the dashboard or when you ingest the model using the Python SDK.

Provided both the model and data were uploaded properly, these computations take anywhere from a few minutes to a few hours, depending on the scale of the data and the complexity of the model. When complete, the results are displayed.

If computations fail, see the recommended actions in the Model Troubleshooting Guide.

Otherwise, you can begin your analysis!

# Set a background data split for feature influence computations
tru.set_influences_background_data_split("demo-all")
# Compute predictions, feature influences, and error influences
tru.compute_all()
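
Once compute_all finishes, you can pull results back into the notebook, for example the computed feature influences (a sketch assuming the computations above completed successfully):

# Retrieve feature influences for the active model and data split
influences = tru.get_feature_influences()
influences.head()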