
Reading Data from DataFrames

First things first

The TruEra client appropriate to your environment must be installed before proceeding with data ingestion. If you have not yet installed the client, see Installation and Access for guidance.

This section covers how Pandas DataFrames can be ingested into the TruEra ecosystem using add_data() and add_production_data(). A working knowledge of project, data collection, and data split structures is recommended (see Project Structure for a quick review).

Important Classes

Depending on your use case, ingesting data locally from a Pandas DataFrame (a two-dimensional tabular data structure that organizes data into rows and columns) hinges on one or more of three important SDK classes. The examples below use ColumnSpec and ModelOutputContext.


Ingesting Pre-production Data

Feature Data, Extra Data, and Labels

Assuming you have already created a project and a data collection, set them for the current workspace.

tru.set_project(project_name)
tru.set_data_collection(data_collection_name)

Next, create a basic DataFrame with an id column, feature data, extra data, and labels using the following code strictly as an example.

import pandas as pd

data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
    "extra": [True, True, False],
    "label": [1, 1, 0],
})

Ingest this using the tru.add_data method, which creates a data split based on the data_split_name and data you specify. Specify columns with the ColumnSpec class.

from truera.client.ingestion import ColumnSpec

tru.add_data(
    data=data,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=["feature1", "feature2"],
        extra_data_col_names=["extra"],
        label_col_names="label"
    )
)

Alternatively, use a Python dictionary to specify your columns in lieu of the ColumnSpec class.

tru.add_data(
    data=data,
    data_split_name="split_1",
    column_spec={
        "id_col_name":"id",
        "pre_data_col_names":["feature1", "feature2"],
        "extra_data_col_names":["extra"],
        "label_col_names":"label"
    }
)
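
Before calling add_data(), it can help to confirm that every column named in the column specification exists in the DataFrame. The following is a minimal sketch of such a check using only pandas. Note that check_column_spec is a hypothetical helper, not part of the TruEra SDK, and the expectation that ids are unique within a split is an assumption rather than something stated on this page.

```python
import pandas as pd


def check_column_spec(df: pd.DataFrame, spec: dict) -> list:
    """Return a list of problems; an empty list means none were found.

    `spec` uses the same keys as the dictionary form of the column
    specification. Illustrative helper only, not a TruEra SDK function.
    """
    problems = []
    for value in spec.values():
        # Each entry may be a single column name or a list of names.
        names = [value] if isinstance(value, str) else list(value)
        for name in names:
            if name not in df.columns:
                problems.append(f"column {name!r} is not in the DataFrame")
    id_col = spec.get("id_col_name")
    # Assumption: row ids are expected to be unique within a split.
    if id_col is not None and id_col in df.columns:
        if df[id_col].duplicated().any():
            problems.append(f"id column {id_col!r} contains duplicate values")
    return problems


data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
    "extra": [True, True, False],
    "label": [1, 1, 0],
})
spec = {
    "id_col_name": "id",
    "pre_data_col_names": ["feature1", "feature2"],
    "extra_data_col_names": ["extra"],
    "label_col_names": "label",
}
problems = check_column_spec(data, spec)

# A spec that names a column the DataFrame does not have.
bad = check_column_spec(data, {"id_col_name": "id", "label_col_names": "target"})
```

An empty result means the specification and the DataFrame agree on column names; it does not guarantee that ingestion itself will succeed.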

Verify successful ingestion with the following code:

tru.get_xs(extra_data=True)
tru.get_ys()

Ingesting Model Predictions

When ingesting model predictions after a model is trained, you need to associate the predictions with a model. Assuming you've already created one, set the model for your workspace by replacing model_1 with the name of your model.

tru.set_model("model_1")

Next, define the predictions and ingest them using the SDK's add_data() method.

data_pred = pd.DataFrame({
    "id": ["1", "2", "3"],
    "prediction": [1, 0, 0]
})

tru.add_data(
    data=data_pred,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="prediction"
    )
)

Since we have both the project and model defined in the workspace, the workspace automatically infers the model and score type. In cases where an override of these parameters is desired, specify this with the ModelOutputContext class.

from truera.client.ingestion import ModelOutputContext

tru.add_data(
    data=data_pred,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="prediction"
    ),
    model_output_context=ModelOutputContext(
        model_name="model_2",
        score_type="probits"
    )
)

See the Python SDK Technical Reference for additional details.
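
Because predictions are ingested into an existing split, it may be worth checking that every prediction id also appears in the feature data before ingesting. The sketch below assumes predictions are matched to rows through the id column, as the examples on this page suggest. Here unmatched_ids is a hypothetical helper, not an SDK function.

```python
import pandas as pd


def unmatched_ids(features: pd.DataFrame, predictions: pd.DataFrame,
                  id_col: str = "id") -> list:
    """Return prediction ids that have no matching row in the feature data."""
    known = set(features[id_col])
    return sorted(set(predictions[id_col]) - known)


data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature2": [0.5, 0.7, 0.8],
})
data_pred = pd.DataFrame({
    "id": ["1", "2", "3"],
    "prediction": [1, 0, 0],
})
# A prediction for an id that was never ingested as feature data.
data_pred_extra = pd.DataFrame({
    "id": ["1", "4"],
    "prediction": [1, 0],
})

missing = unmatched_ids(data, data_pred)
missing_extra = unmatched_ids(data, data_pred_extra)
```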

Ingesting Feature Influences

The method for ingesting feature influences is similar to ingesting model predictions and also requires you to associate the data with a model. Although you could have TruEra compute feature influences, for this example, assume you've already computed influences, which you now wish to ingest. First, define the influences, replacing the example values below with your values, then ingest them using the add_data() method.

data_fi = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": [0.7, 0.4, 0.5],
    "feature2": [0.3, 0.6, 0.5],
})

tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=["feature1", "feature2"]
    )
)

In this case, the workspace automatically inferred the model output context from the current workspace settings. To override these parameters, specify them with the ModelOutputContext class.

tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=["feature1", "feature2"]
    ),
    model_output_context=ModelOutputContext(
        model_name="model_1",
        score_type="regression",
        background_split_name="split_1",
        influence_type="truera-qii"
    )
)

Again, refer to the Python SDK Technical Reference for additional details.

Alternatively, we can specify both ColumnSpec and ModelOutputContext as Python dictionaries.

tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec={
        "id_col_name": "id",
        "feature_influence_col_names": ["feature1", "feature2"]
    },
    model_output_context={
        "model_name": "model_1",
        "score_type": "regression",
        "background_split_name": "split_1",
        "influence_type": "truera-qii"
    }
)

If you choose to ingest both feature data and feature influences together in one DataFrame, you can add suffixes to the feature influence columns to avoid duplicate name issues.

Here, make the suffix either _truera-qii_influence or _shap_influence as shown in the following example.

data_all = pd.DataFrame({
    "id": ["1", "2", "3"],

    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],

    # Feature influence columns have a suffix
    "feature1_truera-qii_influence": [0.7, 0.4, 0.5],
    "feature2_truera-qii_influence": [0.3, 0.6, 0.5]
})

tru.add_data(
    data=data_all,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=[
            "feature1", 
            "feature2"
        ],
        feature_influence_col_names=[
            "feature1_truera-qii_influence",
            "feature2_truera-qii_influence"
        ]
    )
)
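
If feature values and feature influences start out in separate DataFrames, the combined frame can be built with pandas rather than typed by hand. The following is a sketch under that assumption. It uses only standard pandas calls (rename and merge), and the variable names are illustrative.

```python
import pandas as pd

INFLUENCE_SUFFIX = "_truera-qii_influence"  # or "_shap_influence"
feature_cols = ["feature1", "feature2"]

data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
})
data_fi = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": [0.7, 0.4, 0.5],
    "feature2": [0.3, 0.6, 0.5],
})

# Rename only the influence columns; the id column keeps its name.
renamed = data_fi.rename(columns={c: c + INFLUENCE_SUFFIX for c in feature_cols})

# Join on id. validate="one_to_one" raises if either side has duplicate ids.
data_all = data.merge(renamed, on="id", how="inner", validate="one_to_one")

# Column names to pass as feature_influence_col_names.
influence_cols = [c + INFLUENCE_SUFFIX for c in feature_cols]
```

Deriving influence_cols from feature_cols keeps the two lists in step, which avoids copy-and-paste mistakes when the names are written out manually.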

Production Data

Ingesting production data for monitoring is very similar to ingesting pre-production data. The differences are that:

  • the add_production_data() method is used instead of add_data()
  • a timestamp column must be specified in the ColumnSpec
  • no split name is specified, as data is written to a single production data stream
  • all data (including features, labels, and extra data) is associated with a model

production_data = pd.DataFrame({
    ...
    "timestamp": pd.Timestamp.now(),    # Required for monitoring
    ...
})

# The model context is either inferred from the workspace or passed explicitly as a ModelOutputContext.
tru.add_production_data(
    data=production_data,
    column_spec=ColumnSpec(
        ...
        timestamp_col_name="timestamp", # Required for monitoring
        ...
    ),
)
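
Production logs often store times as strings, so a timestamp column may need to be parsed before ingestion. The sketch below shows one way to do that with pandas and to set aside rows whose time cannot be parsed. The column names and sample values are hypothetical, and the assumption that the timestamp column should hold datetime values is ours; consult the SDK reference for the accepted types.

```python
import pandas as pd

# Hypothetical raw log with times stored as strings.
raw = pd.DataFrame({
    "id": ["1", "2", "3"],
    "prediction_time": [
        "2024-01-01 09:00:00",
        "2024-01-01 09:05:00",
        "not a time",
    ],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
raw["timestamp"] = pd.to_datetime(
    raw["prediction_time"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Rows to investigate separately, and rows that are ready to use.
bad_rows = raw[raw["timestamp"].isna()]
clean = raw.dropna(subset=["timestamp"])
```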

Optional Prediction Tagging for Monitoring Segmentation

Monitoring dashboard panels can be filtered to predictions with a given tag. You can supply tags associated with a prediction in the ColumnSpec provided to add_production_data().

  • Each prediction tag can be an arbitrary string up to 30 characters long.
  • A given prediction can have at most 12 tags attached.
  • The data type of the input column must be either string or a list of strings.

# After batch inference, labels not known.
# production_data is a DataFrame; pre_data_names is a list of feature column names.
tru.add_production_data(
    data=production_data,
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=pre_data_names,
        timestamp_col_name="prediction_time",
        prediction_col_names=["prediction"],
        tags_col_name="tags"
    )
)

# Once labels are known
tru.add_production_data(
    data=production_data,
    column_spec=ColumnSpec(
        id_col_name="id",
        timestamp_col_name="label_time",
        label_col_names=["actual"]
    )
)
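
The tag limits listed above can be checked locally before ingestion. The following is a minimal sketch in plain Python; tag_problems is a hypothetical helper, not part of the TruEra SDK, and it encodes only the three rules stated in the list.

```python
MAX_TAG_LENGTH = 30
MAX_TAGS_PER_PREDICTION = 12


def tag_problems(value) -> list:
    """Return a list of problems for one cell of the tags column."""
    if isinstance(value, str):
        tags = [value]
    elif isinstance(value, list) and all(isinstance(t, str) for t in value):
        tags = value
    else:
        return ["value must be a string or a list of strings"]

    problems = []
    if len(tags) > MAX_TAGS_PER_PREDICTION:
        problems.append(
            f"{len(tags)} tags exceeds the limit of {MAX_TAGS_PER_PREDICTION}"
        )
    for tag in tags:
        if len(tag) > MAX_TAG_LENGTH:
            problems.append(
                f"tag {tag!r} is longer than {MAX_TAG_LENGTH} characters"
            )
    return problems


ok_single = tag_problems("batch-a")
ok_list = tag_problems(["batch-a", "region-eu"])
wrong_type = tag_problems(3)
too_many = tag_problems(["x"] * 13)
```

To check a whole DataFrame column, the helper could be applied to each value, for example with production_data["tags"].map(tag_problems).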
