Data Collection: An organized inventory of data consisting of individual data splits used for a particular model.
Basic Ingestion Methods¶
As with data ingestion, you can use the Python SDK to ingest or import a model. Several paths are available, depending on whether and how your model is already packaged. If you're a new user, you may want to do a quick review of the general project structure supported by TruEra.
Otherwise, this topic on basic methods includes starter guidance on adding a model, computing and adding predictions, as well as computing and adding feature and error influences.
First things first
Data Split: One of two or more subsets of the data collection. Typically, with a two-part split, one part is used to train the model, while the other is used to evaluate or test it.
Packaged Model: Comprises an executable Python model object, along with a collection of modules arranged in a hierarchy of folders that includes an __init__.py file.
Most Commonly Used Methods¶
As briefly introduced in Quickstart for Diagnostics, the most commonly used model ingestion method is add_python_model().
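from sklearn.ensemble import GradientBoostingRegressor
# Instantiate and fit the model. XS and YS are assumed to hold the training features and labels.
gb_model = GradientBoostingRegressor()
gb_model.fit(XS, YS)
# Add the fitted model to the TruEra workspace (tru is assumed to be an existing workspace object).
tru.add_python_model("model 1", gb_model)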
Natural Language Processing (NLP): Combines computational linguistics (rule-based modeling of human language) with statistical, machine learning, and deep learning models.
# NLP only supports virtual models currently
tru.add_model("model")
Additional methods are discussed next for:
- Packaged Models
- Unpackaged Models (models lacking an executable object)
- Models requiring feature transforms
Packaged Models¶
Use TrueraWorkspace.add_packaged_python_model() to ingest a packaged model. This method registers and adds the new model, including the executable model object provided, and deduces the model framework in order to serialize and upload the model object to the TruEra server.
Models from supported frameworks can be passed directly: scikit-learn, XGBoost, LightGBM, CatBoost, and PySpark (tree models only).
If you cannot ingest your model with this method because of custom logic, feature transforms, or similar, consider using create_packaged_python_model().
Models Without an Executable Object¶
Models without an executable object can be ingested by TruEra using the SDK's add_model() method. However, without the executable object, model predictions and influences must be computed externally, then added.
Add your externally computed predictions for a given split with add_model_predictions().
Add externally computed feature influences with add_model_feature_influences().
Add externally computed error influences with add_model_error_influences().
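As a rough illustration of what computing predictions externally can look like, the sketch below scores each row of a split with a stand-in model and keeps the row ID next to each prediction. Everything in it is hypothetical: the score function, the column names, and the row IDs are placeholders. The arguments that add_model_predictions() expects are not described on this page, so the final hand-off is left as a comment.

```python
# Hypothetical stand-in for a model that lives outside TruEra
# (for example, a scoring service or a model written in another language).
def score(row):
    return 2.0 * row["rooms"] + 0.5 * row["age"]

# Hypothetical split: each row has a unique ID plus the model's input features.
split_rows = [
    {"id": "r1", "rooms": 3.0, "age": 10.0},
    {"id": "r2", "rooms": 5.0, "age": 4.0},
]

# Compute predictions externally, keeping the row ID so that each
# prediction can be matched back to the row it belongs to.
predictions = [
    {"id": row["id"], "prediction": score(row)} for row in split_rows
]

# These predictions would then be added for the split with
# add_model_predictions(); its arguments are not shown here.
```

Keeping an explicit ID per row is a design choice in this sketch, so that predictions computed elsewhere can be aligned with the rows of the split they were computed from.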
For more on adding feature and error influences, click here.
See Tutorial: Adding a Model for a notebook tutorial on virtual model ingestion.
Ingesting Feature Transformations¶
Using QII for influences, you can wrap any set of model transformations and provide influences with respect to pre-transformed (human-readable) features. However, this must be done locally and then added.
Principal Component Analysis: Unsupervised learning technique for reducing the dimensionality of data; increases interpretability while minimizing information loss.
One-hot encoding: Represents categorical variables as numerical values.
Multi-hot encoding: Binary encoding of multiple tokens in a single vector.
To enable these optimizations, post-ingestion, you have two options (the first is recommended):
- Python models: if the transformation from raw human-readable data to model-readable data can be expressed as a function, add this function to the packaged model wrapper as an additional transform function. Details can be found in Custom Data Transformation.
- Java models: if the transformation cannot be simply expressed as a function, you can instead capture data before and after the transformation and ingest them as pre-transform data and post-transform data.
In both cases, add a feature mapping from the columns of the pre-transformed data to the columns of the post-transformed data. This can be done with the Python SDK when creating the data collection.
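To make the feature mapping concrete, here is a minimal sketch of a one-hot transform together with a map from pre-transform columns to the post-transform columns derived from them. The column names, the transform function body, and the dictionary layout are illustrative assumptions; the exact structure the SDK expects when the data collection is created is not specified on this page.

```python
# Hypothetical categories for the one categorical raw column.
COLOR_CATEGORIES = ["red", "green", "blue"]

def transform(row):
    """Turn one raw row into model-readable features via one-hot encoding."""
    out = {"age": float(row["age"])}
    for category in COLOR_CATEGORIES:
        out[f"color_{category}"] = 1.0 if row["color"] == category else 0.0
    return out

# Map each pre-transform column to the post-transform columns derived from it.
# The dict-of-lists layout is an assumption made for this sketch.
feature_map = {
    "age": ["age"],
    "color": [f"color_{c}" for c in COLOR_CATEGORIES],
}

raw_row = {"age": 42, "color": "green"}
post_row = transform(raw_row)

# Sanity check: the map covers exactly the columns the transform produces.
mapped = [col for cols in feature_map.values() for col in cols]
assert sorted(mapped) == sorted(post_row)
```

The sanity check at the end is worth keeping in some form: a map that misses a post-transform column, or names one that the transform never produces, is an easy mistake to make when the two are maintained separately.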
Post-ingestion Linear and Tree Model Optimizations¶
TruEra optimizations for tree-based models and scikit-learn sklearn.linear_model models are enabled by ingesting the model object directly using the Python SDK. Alternatively, you can add a get_models() function to your packaged model wrapper (see Model Packaging and Execution).