Reading Data from DataFrames¶
Pandas DataFrame: A two-dimensional data structure in tabular format that organizes data into rows and columns.
First things first
The TruEra Python SDK must be installed before proceeding with data ingestion. If you have not yet installed the SDK, see Installation and Access for guidance.
The following discussion covers various ways in which Pandas DataFrames can be ingested into the TruEra ecosystem. A working knowledge of project, data collection, and data split structures is recommended (see Project Structure for a quick review). Knowledge of basic ingestion commands and concepts covered in the Quickstart will also be helpful.
Important Classes¶
Python class: A means of bundling data and functionality together. Each class creates a new type of object, allowing new instances of that type to be made. Each class instance can have attributes attached to it for maintaining its state, and methods (defined by its class) for modifying that state.
ColumnSpec¶
The ColumnSpec class is imported from truera.client.ingestion and maps the columns in your data to a specific kind of data. In most cases, a Python dictionary can be passed as an argument in place of a ColumnSpec.
The following table defines each ColumnSpec parameter.
Parameter | Column Description | Required |
---|---|---|
id_col_name | Corresponds to the unique ID of each data point. | Yes |
timestamp_col_name | Corresponds to the timestamp of each data point. | Yes, for production data |
pre_data_col_names | Column(s) corresponding to feature data. If post_data_col_names are not provided, pre_data_col_names columns are assumed to be both human- and model-readable. | Optional |
post_data_col_names | Column(s) corresponding to model-readable post-processed data; can be ignored if pre_data_col_names data is provided. | Optional |
prediction_col_names | Column(s) corresponding to model predictions. | Optional |
label_col_names | Column(s) corresponding to labels or ground truth data. | Optional |
extra_data_col_names | Column(s) corresponding to columns not used/consumed by the model, but which could be used for other analysis, such as defining segments. | Optional |
feature_influence_col_names | Column(s) corresponding to feature influence data; can be suffixed with _truera-qii_influence or _shap_influence to prevent duplicate name issues. | Optional |
tags_col_name | Column corresponding to tags attached to the data. | Optional |
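As a rough illustration of how the table might be applied, the sketch below checks a dictionary-style column spec against a list of column names before ingestion. The helper `check_column_spec` and its `production` flag are hypothetical and not part of the TruEra SDK; only the parameter names come from the table above.

```python
# Hypothetical pre-ingestion check; not part of the TruEra SDK.
# Parameter names are taken from the ColumnSpec table.
SINGLE_COLUMN_PARAMS = ("id_col_name", "timestamp_col_name", "tags_col_name")
MULTI_COLUMN_PARAMS = (
    "pre_data_col_names",
    "post_data_col_names",
    "prediction_col_names",
    "label_col_names",
    "extra_data_col_names",
    "feature_influence_col_names",
)


def check_column_spec(column_spec, available_columns, production=False):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    available = set(available_columns)

    # id_col_name is always required; timestamp_col_name only for production data.
    required = ["id_col_name"] + (["timestamp_col_name"] if production else [])
    for param in required:
        if not column_spec.get(param):
            problems.append(f"missing required parameter: {param}")

    for param, value in column_spec.items():
        if param not in SINGLE_COLUMN_PARAMS + MULTI_COLUMN_PARAMS:
            problems.append(f"unknown parameter: {param}")
            continue
        if value is None:
            continue
        # The examples in this guide pass either a single name or a list of names.
        names = [value] if isinstance(value, str) else list(value)
        for name in names:
            if name not in available:
                problems.append(f"{param} refers to a column not in the data: {name}")
    return problems


spec = {
    "id_col_name": "id",
    "pre_data_col_names": ["feature1", "feature2"],
    "label_col_names": "label",
}
# With a DataFrame, pass list(data.columns) as the second argument.
columns = ["id", "feature1", "feature2", "extra", "label"]
print(check_column_spec(spec, columns))                   # no problems for this spec
print(check_column_spec(spec, columns, production=True))  # reports the missing timestamp
```

Returning a list of problems rather than raising on the first one makes it easier to fix several naming mistakes in a single pass.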
ModelOutputContext¶
The ModelOutputContext class is provided along with any data during ingestion to provide contextual information about the data. For example, predictions need to be associated with a specific model, which can be done with the model_name parameter of the class. In most cases, the ModelOutputContext is inferred from the workspace context and should be provided only when there is a need to override certain parameters.
The following table defines each ModelOutputContext parameter.
Parameter | Description | Required when ingesting |
---|---|---|
model_name | Name of the model | Predictions and influences |
score_type | Score type of the model output, e.g., 'probits' | Predictions and influences |
background_split_name | Name of the background split used for influence computation | Influences |
influence_type | Algorithm used for influence computation | Influences |
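The "Required when ingesting" column can be read as a small lookup. The sketch below is a hypothetical helper (not an SDK function) that lists which ModelOutputContext parameters the table marks as required for a given dictionary-style column spec. Whether those values are inferred from the workspace or passed explicitly is a separate question.

```python
# Hypothetical helper; encodes the "Required when ingesting" column of the table above.
REQUIRED_CONTEXT = {
    "prediction_col_names": ["model_name", "score_type"],
    "feature_influence_col_names": [
        "model_name",
        "score_type",
        "background_split_name",
        "influence_type",
    ],
}


def required_context_params(column_spec):
    """Return the ModelOutputContext parameters needed for this column spec."""
    needed = []
    for spec_param, context_params in REQUIRED_CONTEXT.items():
        if column_spec.get(spec_param):
            for name in context_params:
                if name not in needed:
                    needed.append(name)
    return needed


print(required_context_params({"id_col_name": "id", "prediction_col_names": "prediction"}))
print(required_context_params({"id_col_name": "id", "label_col_names": "label"}))
```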
Ingesting Feature Data, Extra Data, and Labels¶
Assume that you've already created a project and specified a data collection. Set them for the current workspace:
tru.set_project(project_name)
tru.set_data_collection(data_collection_name)
Next, create a basic DataFrame with an id column, feature data, extra data, and labels. The following code is strictly an example.
import pandas as pd
data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
    "extra": [True, True, False],
    "label": [1, 1, 0],
})
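Because id_col_name must refer to a unique ID for each data point, it may be worth checking the DataFrame before ingesting it. The following is a minimal sketch using standard pandas calls; beyond ID uniqueness, the checks are a suggestion rather than a documented TruEra requirement.

```python
import pandas as pd

data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
    "extra": [True, True, False],
    "label": [1, 1, 0],
})

# Every row needs an ID, and no ID may repeat.
ids_present = bool(data["id"].notna().all())
ids_unique = bool(data["id"].is_unique)

# Rows that share an ID with another row (empty when IDs are unique).
duplicated_rows = data[data["id"].duplicated(keep=False)]

print(ids_present, ids_unique, len(duplicated_rows))
```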
Ingest this using the tru.add_data method, which creates a data split based on the data_split_name and data you specify. Specify columns with the ColumnSpec class.
from truera.client.ingestion import ColumnSpec
tru.add_data(
    data=data,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=["feature1", "feature2"],
        extra_data_col_names=["extra"],
        label_col_names="label"
    )
)
Alternatively, use a Python dictionary to specify your columns in lieu of the ColumnSpec class.
tru.add_data(
    data=data,
    data_split_name="split_1",
    column_spec={
        "id_col_name": "id",
        "pre_data_col_names": ["feature1", "feature2"],
        "extra_data_col_names": ["extra"],
        "label_col_names": "label"
    }
)
Verify successful ingestion with the following code:
tru.get_xs(extra_data=True)
tru.get_ys()
Ingesting Model Predictions¶
When ingesting predictions, you need to associate the prediction data with an existing model, i.e., a model already added to your TruEra project. Assuming you've already created one, set the model for your workspace by replacing model_1 with the name of your model.
tru.set_model("model_1")
This is a model ingested in accordance with Basic Model Ingestion. If you haven't ingested a model, do so now before attempting to add predictions.
If that sounds somewhat recursive (ingesting data without predictions before ingesting the model, and then ingesting prediction data after the model), it is. In TruEra, a model must be associated with an ingested data collection, and model predictions added later must be associated with an ingested model. Therefore, if predictions aren't included in the initially ingested data collection for the project, they can be added only after the model is ingested.
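The ordering described above can be sketched as a single function. This is only an outline: `ingest_model` is a hypothetical callable standing in for the steps in Basic Model Ingestion, and `tru` is assumed to be an already-configured workspace object exposing the `set_project`, `set_data_collection`, `add_data`, and `set_model` methods used in this guide. A small recording stub replaces the real workspace so that the snippet runs on its own.

```python
def ingest_in_order(tru, project_name, data_collection_name, split_name,
                    data, data_spec, ingest_model, model_name,
                    predictions, prediction_spec):
    """Ingest data first, then the model, then the model's predictions."""
    tru.set_project(project_name)
    tru.set_data_collection(data_collection_name)

    # 1. Feature data, extra data, and labels (no predictions yet).
    tru.add_data(data=data, data_split_name=split_name, column_spec=data_spec)

    # 2. The model, which must be associated with the ingested data collection.
    #    ingest_model is a placeholder for the Basic Model Ingestion steps.
    ingest_model(tru, model_name)
    tru.set_model(model_name)

    # 3. Predictions, which must be associated with an ingested model.
    tru.add_data(data=predictions, data_split_name=split_name,
                 column_spec=prediction_spec)


class RecordingWorkspace:
    """Stand-in for the real workspace; records method names instead of ingesting."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, method_name):
        # Only reached for attributes that do not exist, i.e. the SDK-style methods.
        def record(*args, **kwargs):
            self.calls.append(method_name)
        return record


stub = RecordingWorkspace()
ingest_in_order(
    stub, "project_1", "collection_1", "split_1",
    data=[], data_spec={"id_col_name": "id"},
    ingest_model=lambda tru, name: tru.calls.append("ingest_model"),
    model_name="model_1",
    predictions=[],
    prediction_spec={"id_col_name": "id", "prediction_col_names": "prediction"},
)
print(stub.calls)
```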
So, after setting the model, you can now define the predictions and ingest them using the SDK's add_data() method.
data_pred = pd.DataFrame({
    "id": ["1", "2", "3"],
    "prediction": [1, 0, 0]
})
tru.add_data(
    data=data_pred,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="prediction"
    )
)
Since we have both the project and model defined in the workspace, the workspace automatically infers the model and score type. In cases where an override of these parameters is desired, specify this with the ModelOutputContext class.
from truera.client.ingestion import ModelOutputContext
tru.add_data(
    data=data_pred,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="prediction"
    ),
    model_output_context=ModelOutputContext(
        model_name="model_2",
        score_type="probits"
    )
)
See the Python SDK Technical Reference for additional details.
Ingesting Feature Influences¶
The method for ingesting feature influences is similar to ingesting model predictions. Although you could have TruEra compute feature influences, for this example, assume you've already computed influences, which you now wish to ingest. First, define the influences, replacing the example values below with your values, then ingest them using the add_data() method.
data_fi = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": [0.7, 0.4, 0.5],
    "feature2": [0.3, 0.6, 0.5],
})
tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=["feature1", "feature2"]
    )
)
In this case, the workspace automatically inferred information based on its context. In cases where you want to override these parameters, specify them with the ModelOutputContext class.
tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=["feature1", "feature2"]
    ),
    model_output_context=ModelOutputContext(
        model_name="model_1",
        score_type="regression",
        background_split_name="split_1",
        influence_type="truera-qii"
    )
)
Again, refer to the Python SDK Technical Reference for additional details.
Alternatively, we can specify both ColumnSpec and ModelOutputContext as Python dictionaries.
tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec={
        "id_col_name": "id",
        "feature_influence_col_names": ["feature1", "feature2"]
    },
    model_output_context={
        "model_name": "model_1",
        "score_type": "regression",
        "background_split_name": "split_1",
        "influence_type": "truera-qii"
    }
)
If you choose to ingest both feature data and feature influences together in one DataFrame, you can add suffixes to the feature influence columns to avoid duplicate name issues.
Here, make the suffix either _truera-qii_influence or _shap_influence, as shown in the following example.
data_all = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
    # Feature influence columns have a suffix
    "feature1_truera-qii_influence": [0.7, 0.4, 0.5],
    "feature2_truera-qii_influence": [0.3, 0.6, 0.5]
})
tru.add_data(
    data=data_all,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=[
            "feature1",
            "feature2"
        ],
        feature_influence_col_names=[
            "feature1_truera-qii_influence",
            "feature2_truera-qii_influence"
        ]
    )
)
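When there are many features, the suffixed column names can be generated rather than typed by hand. The sketch below uses a hypothetical helper, `influence_column_names`, that applies the suffix convention described above. It is not an SDK function, and the "shap" key is an assumed label inferred from the _shap_influence suffix.

```python
# Hypothetical helper; applies the suffix convention described above.
SUFFIXES = {
    "truera-qii": "_truera-qii_influence",
    "shap": "_shap_influence",  # key name is an assumption based on the suffix
}


def influence_column_names(feature_names, influence_type="truera-qii"):
    """Map each feature name to its suffixed feature influence column name."""
    suffix = SUFFIXES[influence_type]
    return {name: name + suffix for name in feature_names}


features = ["feature1", "feature2"]
mapping = influence_column_names(features)

# Dictionary-style column spec built from the generated names.
column_spec = {
    "id_col_name": "id",
    "pre_data_col_names": features,
    "feature_influence_col_names": list(mapping.values()),
}
print(column_spec["feature_influence_col_names"])
```

The same mapping can be passed to pandas' DataFrame.rename(columns=mapping) to rename the columns of an influences-only DataFrame before combining it with the feature data.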
Ingesting Production Data for Monitoring¶
Ingesting production data for monitoring is almost identical to the previous examples. The only differences are that the add_production_data() method is used instead of add_data() and that a timestamp column must be specified.
production_data = pd.DataFrame({
    ...
    "timestamp": pd.Timestamp.now(),  # Required for monitoring
    ...
})
tru.add_production_data(
    data=production_data,
    column_spec=ColumnSpec(
        ...
        timestamp_col_name="timestamp",  # Required for monitoring
        ...
    ),
)
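As a minimal sketch of preparing the timestamp column, assume timestamps arrive as strings and that datetime values are the appropriate form (the example above uses pd.Timestamp.now()). The following converts them with pandas before the DataFrame would be passed to add_production_data(). The column values are made up for illustration, and whether to drop or repair rows with unusable timestamps is a judgment call.

```python
import pandas as pd

production_data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature2": [0.5, 0.7, 0.8],
    "timestamp": [
        "2024-01-01 00:00:00",
        "2024-01-01 01:00:00",
        "not a timestamp",
    ],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
production_data["timestamp"] = pd.to_datetime(
    production_data["timestamp"], errors="coerce"
)

# Set aside rows without a usable timestamp, then keep only the valid ones.
bad_rows = production_data[production_data["timestamp"].isna()]
production_data = production_data.dropna(subset=["timestamp"])

print(len(production_data), len(bad_rows))
```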