Diagnostics Quickstart Notebook¶
Follow the steps below to upload your models and datasets with the Python SDK. A sample project is available for download.
Make sure the Python SDK or CLI client is already installed. See Installation and Access for instructions.
Connecting to TruEra¶
For free users, the TruEra URL takes the form:
For most users, the TruEra URL takes the form http://<your-truera-access-url>. Enterprise users may need to consult with their IT group for specific instructions.
Using the API¶
Depending on your organization's access configuration, connect to TruEra using one of the following methods:
Generating an authentication token¶
- Navigate to Users in the TruEra dashboard
- Click Generate Credentials
- Click Copy Auth Token
Use a username/password¶
Alternatively, you can use a username and password to connect to your deployment via the SDK or CLI.
Refer to API Authentication for additional details.
Using the Python SDK or CLI¶
Although the connection can be set on individual commands, it is more convenient to set it once and then verify the connection. The example below uses the Python SDK. See Connecting to TruEra above for the typical URL format.
TRUERA_URL = "<TRUERA_URL>"
USERNAME = "<YOUR-USERNAME>"
PASSWORD = "<YOUR-PASSWORD>"
TOKEN = "<YOUR-AUTHENTICATION-TOKEN>"
QUICKSTART_DOWNLOAD_DIRECTORY = "<QUICKSTART_DOWNLOAD_DIRECTORY_FOR_CENSUS_INCOME>"

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication, BasicAuthentication

auth = BasicAuthentication(USERNAME, PASSWORD)
tru = TrueraWorkspace(TRUERA_URL, auth)

# For token auth, use instead:
# auth = TokenAuthentication(TOKEN)
# tru = TrueraWorkspace(TRUERA_URL, auth)
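Credentials typed directly into a notebook are easy to leak when the notebook is shared. The following is a minimal sketch of reading them from the environment and deciding which authentication method applies. The environment variable names (TRUERA_TOKEN, TRUERA_USERNAME, TRUERA_PASSWORD) and the helper pick_credentials are hypothetical and not defined by TruEra.

```python
import os
from typing import Mapping, Optional, Tuple


def pick_credentials(env: Optional[Mapping[str, str]] = None) -> Tuple[str, Tuple[str, ...]]:
    """Return ("token", (token,)) or ("basic", (username, password)).

    The variable names below are hypothetical; rename them to suit your setup.
    """
    env = os.environ if env is None else env

    # Prefer a token when one is available.
    token = env.get("TRUERA_TOKEN")
    if token:
        return "token", (token,)

    # Otherwise fall back to username/password, which must both be set.
    username = env.get("TRUERA_USERNAME")
    password = env.get("TRUERA_PASSWORD")
    if username and password:
        return "basic", (username, password)

    raise ValueError("Set TRUERA_TOKEN, or both TRUERA_USERNAME and TRUERA_PASSWORD")
```

The returned values can then be passed to TokenAuthentication or BasicAuthentication in place of the hard-coded TOKEN, USERNAME and PASSWORD constants.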
Adding a Project with Sample Data and Model¶
The Census Income project, used throughout this quickstart tutorial, includes a formatted version of the data and a pickled
scikit-learn Python model to illustrate the model ingestion process. For other frameworks, the process is similar.
Click the ZIP or TAR link in the Web App's Downloads page (select Downloads in the left-side navigator).
Extract the census_income folder to the desired destination directory on your local machine.
Within the census_income folder you'll find the following file structure:
📂 census_income
┣ 📜 quickstart_model.pkl
┣ 📜 data_raw.csv
┣ 📜 data_num.csv
┣ 📜 label.csv
┣ 📜 extra_data.csv
File content comprises:
- quickstart_model.pkl – Pickled Python model for quickstart
- data_raw.csv – training data, pre-transformed data (human-readable)
- data_num.csv – data in model-readable form
- label.csv – single column containing ground truth labels
- extra_data.csv – used for defining segments
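Before continuing, it can help to confirm that the archive was extracted completely. This is a minimal sketch using only the standard library; the helper name missing_quickstart_files is illustrative, and the file names are taken from the list above.

```python
from pathlib import Path
from typing import List

# File names as listed in the census_income folder structure.
EXPECTED_FILES = [
    "quickstart_model.pkl",
    "data_raw.csv",
    "data_num.csv",
    "label.csv",
    "extra_data.csv",
]


def missing_quickstart_files(directory: str) -> List[str]:
    """Return the expected quickstart files that are not present in `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling missing_quickstart_files(QUICKSTART_DOWNLOAD_DIRECTORY) should return an empty list when the directory points at the extracted census_income folder.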
Next, to upload both the data and model to TruEra, you'll need to:
- Create a TruEra project
- Define a data collection
- Add split data and labels
- Package a trained model
- Upload the model
- Start using TruEra Diagnostics
Let's go through each step in sequence.
Step 1. Create a TruEra project¶
A TruEra project contains all model versions, datasets and metadata related to the AI/ML project. See Project Structure for a structural introduction and overview. Project names are case-sensitive and invoked many times during model iterations, so a short yet descriptive project name is recommended. Other key parameters in project creation include:
- Score type – set to logits (or probits, as in the example below) for classification projects; set to regression for regression projects.
- Environment – using the Python SDK, set_environment() sets the current workspace environment. Much like version control software such as git, the workspace environment can be either local or remote. When set to remote, all data collections, splits, models, and computations live inside of an external TruEra deployment. In contrast, in a local environment, any added project artifacts and computations remain local to your machine and do not require a connection. Remote and local projects can always be synced later as needed.
Choosing the appropriate environment
Remote environments are recommended if you wish to:
- Directly ingest models and data into your remote deployment (and this is allowed for your deployment).
- Trigger computations within the remote deployment rather than on your local machine.
- Add artifacts to a remote deployment so you can view them later in the UI.
Local environments are recommended if:
- You lack access to a remote deployment.
- Your model dependencies are difficult to define / cannot easily run in the remote deployment.
- You wish to trigger computations on your local machine.
- You wish to carry out basic analysis without the overhead of ingesting to a remote deployment.
tru.set_environment("remote")
tru.add_project("AdultCensus_DemoNB", score_type="probits")
tru.get_projects()
The new project is now included in the Web App's project list (select Projects at the top of the left-side navigator).
Step 2. Add a data collection¶
A data collection is an organized inventory of data that can be used for a particular model. A data collection consists of individual data splits, which are slices of input features, labels, and extra metadata. All splits within a given data collection share the same schema - meaning that they all have the same column names and data types. In general, all splits within a model's data collection can be fed into the model. Broadly, a data collection contains:
- Data splits – a set of in-sample data (train, test, validate) or out-of-sample (OOS)/out-of-time (OOT) data to test model quality, stability and generalizability.
- Feature Metadata – an (optional) set of metadata defining the set of features for a set of splits and the various models trained and evaluated on them; it defines the relationship between pre- and post-transform data and provides feature groups and descriptions for use throughout the tool.
Because our sample data is already transformed and prepped for the model, we can skip the optional feature metadata. In any case, all splits associated with a data collection are assumed to follow the same input format. As a rule of thumb, if a model can read one split in a data collection, it should be able to read all other splits in the same collection.
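The shared-schema rule can be checked before any data is ingested. The sketch below compares column names only (data types are not checked) across CSV headers, using the standard library. The helper names and the in-memory sample splits are made up for illustration and are not part of the TruEra SDK or the sample project.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_header(handle: TextIO) -> List[str]:
    """Return the column names from the first row of a CSV file."""
    return next(csv.reader(handle))


def schema_mismatches(headers: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Compare every split's columns to those of the first split.

    Returns {split_name: columns} for each split whose header differs.
    """
    if not headers:
        return {}
    names = list(headers)
    reference = headers[names[0]]
    return {n: headers[n] for n in names[1:] if headers[n] != reference}


# Illustrative in-memory splits; these column names are invented.
train = io.StringIO("id,age,hours_per_week\n1,39,40\n")
test = io.StringIO("id,age,hours_per_week\n2,50,13\n")
oot = io.StringIO("id,age\n3,38\n")

headers = {
    "train": read_header(train),
    "test": read_header(test),
    "oot": read_header(oot),
}
print(schema_mismatches(headers))  # prints {'oot': ['id', 'age']}
```

With real files, open each CSV and pass the file handle to read_header in the same way.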
Step 3. Add a split¶
A split is a subset of data points used to train or evaluate a set of models. A data collection can have multiple splits added as needed from flat files, pandas DataFrames, or a variety of external data sources, like S3 blobs. For our sample project, we'll create splits from CSV files.
The following additional data can also be specified:
- Labels – the target variable for the model (ground truth) should definitely be specified for training and test data.
- Extra – additional feature columns that may be useful even if not used by the model; for instance, to define customer or data segments for evaluating fairness.
Some metadata is mandatory when adding a split, including the split's split_type, which must be one of all, train, test, validate, or oot. As a general rule of thumb:
- train, test, validate – for uploaded data used in model training, testing and validating steps
- all – needed when train and test data are combined in a single file
- oot – needed when data is from a production stream or for purposes other than model building
For more about split types, see Project Structure.
For our sample project, add a single split specifying label data, split type, and the name of the data's id column using the following command syntax:
import os

import pandas as pd
from truera.client.ingestion import ColumnSpec

# Path where you downloaded the census_income quickstart data
X = pd.read_csv(os.path.join(QUICKSTART_DOWNLOAD_DIRECTORY, "data_num.csv"))
Y = pd.read_csv(os.path.join(QUICKSTART_DOWNLOAD_DIRECTORY, "label.csv"))

# Data can be a pandas DataFrame
tru.add_data(
    pd.merge(X, Y, on="id"),
    data_split_name="demo-all",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=[c for c in X.columns if c != "id"],
        label_col_names=["label"],
    ),
)
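Note that pd.merge performs an inner join by default, so rows whose id appears in only one of the two files are dropped silently. The following is a minimal sketch of a pre-merge check, using only the standard library; the helper name id_alignment_report is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def id_alignment_report(feature_ids: Iterable, label_ids: Iterable) -> Dict[str, List]:
    """Summarize id problems that an inner join would otherwise hide."""
    feature_ids = list(feature_ids)
    label_ids = list(label_ids)
    feature_set, label_set = set(feature_ids), set(label_ids)

    def duplicated(values: List) -> List:
        # ids that occur more than once within a single file
        return sorted(v for v, n in Counter(values).items() if n > 1)

    return {
        "missing_labels": sorted(feature_set - label_set),
        "missing_features": sorted(label_set - feature_set),
        "duplicate_feature_ids": duplicated(feature_ids),
        "duplicate_label_ids": duplicated(label_ids),
    }
```

For the sample data, id_alignment_report(X["id"], Y["id"]) should return four empty lists if every row has exactly one label.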
Step 4. Create the model¶
Models need to be serialized and packaged according to the formats discussed in Model Packaging and Execution. The Python SDK automatically formats many common Python models (catboost and others), so for Python models you can simply pass in your trained model object using the SDK and TruEra will take care of the packaging.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X = X.set_index('id')
Y = Y.set_index('id')
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3, subsample=0.7)
model.fit(X_train, y_train)
model.score(X_test, y_test)
With the SDK you can perform advanced verification by loading a few rows of split data to test your model. See the TrueraWorkspace.verify_packaged_model command for more details.
Step 5. Ingesting the model¶
Using the Python SDK, the new model is added to the current context with the syntax shown below. With the CLI, you need to link the model to the project and data collection created in Step 1 and Step 2, respectively.
model_name = "quickstart_demo"
tru.add_python_model(model_name, model)
Congratulations! You’ve created a TruEra project and added your first data collection and model.
Step 6. Start your analysis¶
To surface analytics for a model in the Diagnostics product, TruEra triggers computations for model predictions and feature influences. This happens automatically when you first visit the model's page in the dashboard or when you ingest the model using the Python SDK.
Provided both the model and data were uploaded properly, TruEra computations can take from a few minutes to a few hours, depending on the scale of the data and the complexity of the model. When complete, the results are displayed.
If computations fail, see the recommended actions to take in the Model Troubleshooting Guide of your docs.
Otherwise, you can begin your analysis!