
Quickstart

Make sure the Python SDK is installed (see Installation and Access), then follow the guidance below to upload sample data and models with the SDK.

First-time users

TruEra users working in Python are strongly encouraged to get started with the Python SDK.

See Python SDK Tutorial: Local Compute Flow for an in-depth tutorial.

Connecting to TruEra

Connecting to the TruEra platform is a straightforward authentication process using a token. To make the connection, you'll need to provide the TRUERA_URL and your TOKEN. The TruEra URL is http://app.truera.net; replace the <TRUERA_URL> placeholder with this URL in the SDK code snippet below.

To generate your authentication token:

  1. Click your avatar in the Web App toolbar.
  2. Click Authentication, then click GENERATE CREDENTIALS.

    [Screenshot: generating authentication token credentials]

  3. Click the clipboard icon to copy the token.

  4. Paste the copied token in place of the <TRUERA_TOKEN> placeholder in the SDK code snippet below.

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

TRUERA_URL = "<TRUERA_URL>"  # e.g., http://app.truera.net
TOKEN = "<TRUERA_TOKEN>"     # your generated token

auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

Although connection details can be passed to individual commands, it's more convenient to set them once on the workspace and then verify the connection as necessary.
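
For example, a quick way to verify the connection is to list the workspace's projects (a minimal sketch; get_projects() is the same call used later in this quickstart):

# Listing projects fails fast if the URL or token is wrong
try:
    print(tru.get_projects())
except Exception as e:
    print(f"Could not connect to TruEra: {e}")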

Downloading Samples

The sample data and model you'll need for this quickstart are available in the TruEra Web App on the Account settings page (pictured next).

[Screenshot: sample quickstart data and model downloads on the Account settings page]

For this quickstart, fetch the dataset from the sklearn.datasets module.

# fetch the data from sklearn
from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing()

Pull the data into a pandas DataFrame.

# format data into a pandas DataFrame
import pandas as pd

feature_col_names = dataset["feature_names"]
label_col_name = dataset["target_names"][0]
XS = dataset["data"]
YS = dataset["target"]
data = pd.DataFrame(data=XS, columns=feature_col_names)
data[label_col_name] = YS
data["id"] = range(len(data))
You're now ready to create a TruEra project and upload both the sample data and model.

Adding a Project with Sample Data and Model

The California Housing project, based on scikit-learn's California housing dataset, is used throughout this quickstart to illustrate the model ingestion process; the process is similar for other frameworks. A parallel quickstart notebook with different sample projects to practice on is also available as a supplement to the guidance presented here. See also TruEra's Python SDK Tutorial covering the Local Compute Flow.

Additional Notebooks

Additional Python notebooks can be found in TruEra's Notebook Library.

Step 1. Create a TruEra project

A TruEra project contains all model versions, datasets and metadata related to the AI/ML project. See Project Structure for a structural introduction and overview. Project names are case-sensitive and invoked many times during model iterations, so a short yet descriptive project name is recommended. Other key parameters in project creation include:

  • Score type – set to probits or logits for classification projects; set to regression for regression projects.

  • Input type – For tabular projects, set to tabular. For NLP projects, set to text. Defaults to tabular.

Here's the SDK call:

tru.add_project("California_Housing", score_type="regression")
tru.add_project("Covid_Tweets", score_type="classification", input_type="text")

To confirm, get the projects.

# Tabular quickstart
tru.get_projects()
['California_Housing']

# NLP quickstart
tru.get_projects()
['Covid_Tweets']

The new project is now included in the Web App's project list. Click Projects at the top of the left-side navigator to see the list.

Step 2. Add a data collection

A data collection is an organized inventory of data that can be used for a particular model. A data collection consists of individual data splits, which are slices of input features, labels, and extra metadata. All splits within a given data collection share the same schema, meaning that they all have the same column names and data types. In general, all splits within a model's data collection can be fed into the model. Broadly, a data collection contains:

  • Data splits – a set of in-sample data (train, test, validate) or out-of-sample (OOS)/out-of-time (OOT) data to test model quality, stability and generalizability.
  • Feature Metadata – an (optional) set of metadata defining the set of features for a set of splits and the various models trained and evaluated on them; defines the relationship between pre- and post-transform data and provides feature groups and descriptions for use throughout the tool.

Note

All splits associated with a data collection are assumed to follow the same input format. As a rule of thumb, if a model can read one split in a data collection, it should be able to read all other splits in the same collection.

tru.add_data_collection("sklearn_data")
tru.add_data_collection("pytorch_data")

Step 3: Add split data to your local workspace

A split is a subset of data points used to train or evaluate a set of models. A data collection can have multiple splits, added as needed from flat files, pandas DataFrames, or a variety of external data sources, like S3 blobs. For our sample projects, we'll create splits from pandas DataFrames.
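
If your split data lives in a flat file instead, load it into a DataFrame first (a sketch; quickstart_data.csv is a hypothetical path):

import pandas as pd

# Hypothetical flat-file source; any DataFrame with a point-id column works
data = pd.read_csv("quickstart_data.csv")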

Note

All splits added to TruEra require point identifiers. The column containing those identifiers should be specified using id_col_name, which must be a string.

The following additional data can also be specified:

  • Labels – the target variable for the model (ground truth); should be specified for training and test data.
  • Extra – additional feature columns that may be useful even if not used by the model; for instance, to define customer or data segments for evaluating fairness.

There is also some mandatory metadata required when adding a split, including the split_type, which must be one of all, train, test, validate, or oot. As a general rule of thumb:

  • train/test/validate – for uploaded data used in model training, testing and validating steps
  • all – needed when train and test data are combined in a single file
  • oot – needed when data is from a production stream or for purposes other than model building

For more about split types, see Project Structure. (A sketch ingesting separate train and test splits follows the examples below.)

For the sample project, add data using the add_data() method. Use the ColumnSpec class from truera.client.ingestion to specify which columns are features, labels, or point ids. See Reading Data from DataFrames for more information.

Here's the syntax for the tabular (California Housing) project:

# Fetch the data from sklearn
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()

# Format data into a pandas DataFrame
import pandas as pd
feature_col_names = dataset["feature_names"]
label_col_name = dataset["target_names"][0]
XS = dataset["data"]
YS = dataset["target"] 
data = pd.DataFrame(data=XS, columns=feature_col_names)
data[label_col_name] = YS
data["id"] = range(len(data))

# Ingest data into Truera
from truera.client.ingestion import ColumnSpec
tru.add_data(
    data,
    data_split_name="all",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=feature_col_names,
        label_col_names=label_col_name
    )
)

For the NLP (Covid_Tweets) project, the flow is similar:

import pandas as pd

# Fetch the data from sklearn
from sklearn.datasets import fetch_20newsgroups
data_bunch = fetch_20newsgroups()

df = pd.DataFrame({
    "text": data_bunch["data"],
    "label": data_bunch["target"]
}).reset_index(names="id")

# Ingest to workspace through .add_data()
from truera.client.ingestion import NLPColumnSpec
tru.add_data(
    data=df,
    data_split_name="all",
    column_spec=NLPColumnSpec(
        id_col_name="id",
        text_col_name="text",
        label_col_names="label"
    )
)
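
The examples above ingest a single "all" split. As a sketch of the train/test pattern described earlier, reusing the data, feature_col_names, and label_col_name variables from the tabular example (the 80/20 split is an arbitrary choice):

from sklearn.model_selection import train_test_split
from truera.client.ingestion import ColumnSpec

# Split the DataFrame 80/20 and ingest each part as its own named split
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

for split_name, split_df in [("train", train_df), ("test", test_df)]:
    tru.add_data(
        split_df,
        data_split_name=split_name,
        column_spec=ColumnSpec(
            id_col_name="id",
            pre_data_col_names=feature_col_names,
            label_col_names=label_col_name
        )
    )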

Step 4. Add the model to your local workspace

Train a model on the ingested data and add it to your workspace.

# Tabular quickstart
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate and fit the model
gb_model = GradientBoostingRegressor()
gb_model.fit(XS, YS)

# Add to TruEra workspace
tru.add_python_model("model 1", gb_model)

# NLP quickstart: NLP currently supports only virtual models
tru.add_model("model")

When the model is ready, continue to Step 5.

Step 5. Compute and ingest predictions and feature influences

Now that we have ingested the features as well as the model, we can move on to predictions and feature influences.

You will need a method for computing influences available in your Python environment. This can be TruEra QII (Quantitative Input Influence), which calculates the cumulative influence/impact of a group of features, or SHAP (SHapley Additive exPlanations), a game-theory-based method that explains the predictions of any machine learning model by calculating the contribution of each feature to the prediction. SHAP is already packaged with your TruEra client.

Use the following syntax to compute and ingest predictions.

# Compute predictions
predictions = pd.DataFrame({
    "predictions": gb_model.predict(XS),
    "id": data["id"]
})

# Ingest predictions
tru.add_data(
    predictions,
    data_split_name="all",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="predictions"
    )
)
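
Before ingesting, it can help to sanity-check the predictions frame against the ids ingested earlier (a minimal sketch; r2_score comes from scikit-learn, not the TruEra SDK):

from sklearn.metrics import r2_score

# The predictions frame must align row-for-row with the ingested ids
assert predictions["id"].equals(data["id"])
print("R^2 on the ingested split:", r2_score(YS, predictions["predictions"]))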

Use the following syntax to compute and ingest feature influences.

tru.set_model_execution('local')
# Set a background data split for feature influence computations
tru.set_influences_background_data_split("all")
tru.set_influence_type("shap")

# Compute feature influences
feature_influences = tru.get_feature_influences()
feature_influences.index.name = "id"
feature_influences.reset_index(inplace=True)

# Create ModelOutputContext to describe feature influences
from truera.client.ingestion import ModelOutputContext
model_output_context = ModelOutputContext(
    model_name='model 1',           # The model
    score_type='regression',        # Score type of project
    background_split_name='all',    # Background split
    influence_type='kernel-shap'    # Algorithm used to compute influences
)

# Ingest feature influences
tru.add_data(
    feature_influences,
    data_split_name="all",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=feature_col_names
    ),
    model_output_context=model_output_context
)

Congratulations! You’ve created a TruEra project and added your first data collection and model.

Depending on the scale of the data and the complexity of the model, further TruEra computations can take from a few minutes to a few hours. When complete, the results are displayed.

Step 6. Start your analysis

You can now use the TruEra web application to surface analytics for the model.

Click Next below to continue.