Reading Data from DataFrames

First things first

The TruEra client appropriate to your environment must be installed before proceeding with data ingestion. If you have not yet installed the client, see Installation and Access for guidance.

The following discussion covers various ways in which Pandas DataFrames can be ingested into the TruEra ecosystem. A working knowledge of project, data collection, and data split structures is recommended (see Project Structure for a quick review). Knowledge of basic ingestion commands and concepts covered in Diagnostics Quickstart will also be helpful to you.

Important Classes

Depending on your use case, ingesting data locally from a DataFrame hinges on one or more of three important SDK classes:

ColumnSpec
NLPColumnSpec
ModelOutputContext

The parameters of each class are described in the reference sections at the end of this page.

Ingesting Feature Data, Extra Data, and Labels

Assuming you've already created a project and a data collection, set them for the current workspace.

tru.set_project(project_name)
tru.set_data_collection(data_collection_name)

Next, create a basic DataFrame with an id column, feature data, extra data, and labels using the following code strictly as an example.

import pandas as pd

data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
    "extra": [True, True, False],
    "label": [1, 1, 0],
})

Ingest this using the tru.add_data method, which creates a data split based on the data_split_name and data you specify. Specify columns with the ColumnSpec class.

from truera.client.ingestion import ColumnSpec

tru.add_data(
    data=data,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=["feature1", "feature2"],
        extra_data_col_names=["extra"],
        label_col_names="label"
    )
)

Alternatively, use a Python dictionary to specify your columns in lieu of the ColumnSpec class.

tru.add_data(
    data=data,
    data_split_name="split_1",
    column_spec={
        "id_col_name":"id",
        "pre_data_col_names":["feature1", "feature2"],
        "extra_data_col_names":["extra"],
        "label_col_names":"label"
    }
)
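
Because the column specification can be a plain dictionary, it is easy to sanity-check against the DataFrame before calling add_data. The following is a minimal sketch using only pandas; check_column_spec is a hypothetical helper written for this example, not part of the TruEra SDK, and it checks only that the named columns exist and that the id column is unique.

```python
import pandas as pd

def check_column_spec(data, column_spec):
    """Return a list of problems; an empty list means the spec matches the DataFrame."""
    problems = []
    id_col = column_spec.get("id_col_name")
    if id_col is None:
        problems.append("id_col_name is missing")
    for key, value in column_spec.items():
        # A spec value may be a single column name or a list of names.
        names = [value] if isinstance(value, str) else list(value)
        for name in names:
            if name not in data.columns:
                problems.append(f"{key}: column '{name}' is not in the DataFrame")
    if id_col is not None and id_col in data.columns and not data[id_col].is_unique:
        problems.append(f"'{id_col}' contains duplicate values")
    return problems

data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
    "extra": [True, True, False],
    "label": [1, 1, 0],
})
column_spec = {
    "id_col_name": "id",
    "pre_data_col_names": ["feature1", "feature2"],
    "extra_data_col_names": ["extra"],
    "label_col_names": "label",
}
problems = check_column_spec(data, column_spec)
```

If problems is empty, the same dictionary can be passed as column_spec to tru.add_data.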

Verify successful ingestion with the following code:

tru.get_xs(extra_data=True)
tru.get_ys()

Ingesting Model Predictions

When ingesting model predictions, a model needs to be associated with the data. Assuming you've already created one, set the model for your workspace by replacing model_1 with the name of your model.

tru.set_model("model_1")

Next, define the predictions and ingest them using the SDK's add_data() method.

data_pred = pd.DataFrame({
    "id": ["1", "2", "3"],
    "prediction": [1, 0, 0]
})

tru.add_data(
    data=data_pred,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="prediction"
    )
)

Since we have both the project and model defined in the workspace, the workspace automatically infers the model and score type. In cases where an override of these parameters is desired, specify this with the ModelOutputContext class.

from truera.client.ingestion import ModelOutputContext

tru.add_data(
    data=data_pred,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="prediction"
    ),
    model_output_context=ModelOutputContext(
        model_name="model_2",
        score_type="probits"
    )
)

See the Python SDK Technical Reference for additional details.
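
Predictions are identified by the same id column as the feature data. Assuming predictions are matched to existing rows of the split by id (an assumption of this sketch, not something stated above), a quick pandas check before ingestion can catch ids that would not line up, including mismatches caused by differing dtypes.

```python
import pandas as pd

# Ids already ingested for the split, and the predictions to be added.
data = pd.DataFrame({"id": ["1", "2", "3"], "label": [1, 1, 0]})
data_pred = pd.DataFrame({"id": ["1", "2", "3"], "prediction": [1, 0, 0]})

# Ids present in the predictions but absent from the split's data.
unmatched = sorted(set(data_pred["id"]) - set(data["id"]))

# A common cause of mismatches is a dtype difference, e.g. "1" (str) versus 1 (int).
same_dtype = data["id"].dtype == data_pred["id"].dtype
```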

Ingesting Feature Influences

The method for ingesting feature influences is similar to ingesting model predictions. Although you could have TruEra compute feature influences, for this example, assume you've already computed influences, which you now wish to ingest. First, define the influences, replacing the example values below with your values, then ingest them using the add_data() method.

data_fi = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": [0.7, 0.4, 0.5],
    "feature2": [0.3, 0.6, 0.5],
})

tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=["feature1", "feature2"]
    )
)

In this case, the workspace automatically inferred the model output context, such as the model and score type, from its current settings. To override these parameters, specify them with the ModelOutputContext class.

tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=["feature1", "feature2"]
    ),
    model_output_context=ModelOutputContext(
        model_name="model_1",
        score_type="regression",
        background_split_name="split_1",
        influence_type="truera-qii"
    )
)

Again, refer to the Python SDK Technical Reference for additional details.

Alternatively, we can specify both ColumnSpec and ModelOutputContext as Python dictionaries.

tru.add_data(
    data=data_fi,
    data_split_name="split_1",
    column_spec={
        "id_col_name": "id",
        "feature_influence_col_names": ["feature1", "feature2"]
    },
    model_output_context={
        "model_name": "model_1",
        "score_type": "regression",
        "background_split_name": "split_1",
        "influence_type": "truera-qii"
    }
)

If you choose to ingest both feature data and feature influences together in one DataFrame, you can add suffixes to the feature influence columns to avoid duplicate name issues.

Here, make the suffix either _truera-qii_influence or _shap_influence as shown in the following example.

data_all = pd.DataFrame({
    "id": ["1", "2", "3"],

    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],

    # Feature influence columns have a suffix
    "feature1_truera-qii_influence": [0.7, 0.4, 0.5],
    "feature2_truera-qii_influence": [0.3, 0.6, 0.5]
})

tru.add_data(
    data=data_all,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=[
            "feature1", 
            "feature2"
        ],
        feature_influence_col_names=[
            "feature1_truera-qii_influence",
            "feature2_truera-qii_influence"
        ]
    )
)
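
If the feature data and the influences start out in separate DataFrames, the combined frame can be built with pandas alone. This is a sketch under the assumption that both frames share the same id column; the variable names features, influences, and SUFFIX are illustrative only.

```python
import pandas as pd

features = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": ["a", "a", "b"],
    "feature2": [0.5, 0.7, 0.8],
})
influences = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature1": [0.7, 0.4, 0.5],
    "feature2": [0.3, 0.6, 0.5],
})

SUFFIX = "_truera-qii_influence"
feature_cols = ["feature1", "feature2"]

# Rename only the influence columns, leaving the id column untouched.
renamed = influences.rename(columns={c: c + SUFFIX for c in feature_cols})

# validate="one_to_one" raises if an id is duplicated in either frame.
data_all = features.merge(renamed, on="id", how="inner", validate="one_to_one")

# Column lists to pass as pre_data_col_names and feature_influence_col_names.
influence_cols = [c + SUFFIX for c in feature_cols]
```

Generating influence_cols from feature_cols avoids typing each suffixed name by hand, which makes it harder to list the same column twice.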

Ingesting Production Data for Monitoring

Ingesting production data for monitoring is almost identical to the previous examples. The only differences are that the add_production_data() method is used instead of add_data() and that a timestamp column must be specified.

production_data = pd.DataFrame({
    ...
    "timestamp": pd.Timestamp.now(),    # Required for monitoring
    ...
})

tru.add_production_data(
    data=production_data,
    column_spec=ColumnSpec(
        ...
        timestamp_col_name="timestamp", # Required for monitoring
        ...
    ),
)
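
Production records often arrive with timestamps stored as strings. The sketch below shows one way to prepare the timestamp column with pandas before ingestion. The event_time column and its values are hypothetical, and converting to a datetime dtype is an assumption of this example rather than a documented SDK requirement.

```python
import pandas as pd

production_data = pd.DataFrame({
    "id": ["1", "2", "3"],
    "feature2": [0.5, 0.7, 0.8],
    # Hypothetical raw timestamps as they might arrive from a log or export.
    "event_time": ["2024-01-01 08:00:00", "2024-01-01 09:30:00", "2024-01-02 10:15:00"],
})

# Parse the strings into a datetime column named "timestamp".
production_data["timestamp"] = pd.to_datetime(production_data["event_time"])
production_data = production_data.drop(columns=["event_time"])

# Every production row needs a timestamp, so check for missing values.
has_all_timestamps = bool(production_data["timestamp"].notna().all())
```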


ColumnSpec

The ColumnSpec class is imported from truera.client.ingestion and correlates the columns in your data with a specific kind of data. In most cases, a Python dictionary can be passed as an argument in place of a ColumnSpec.

The following table defines each ColumnSpec parameter and whether it's required or optional.

Parameter	Column Description	Required?
`id_col_name`	Corresponds to unique `id` of each data point	Yes
`timestamp_col_name`	Corresponds to timestamp of each data point	Yes, in production data
`pre_data_col_names`	Column(s) corresponding to feature data. If `post_data_col_names` are not provided, `pre_data_col_names` columns are assumed to be both human- and model-readable	Optional
`prediction_col_names`	Column corresponding to model predictions	Optional
`label_col_names`	Column(s) corresponding to labels or ground truth data	Optional
`extra_data_col_names`	Column(s) corresponding to columns not used/consumed by the model, but which could be used for other analysis, such as defining segments	Optional
`feature_influence_col_names`	Column(s) corresponding to feature influence data; can be suffixed with `_truera-qii_influence` or `_shap_influence` to prevent duplicate name issues	Optional
`tags_col_name`	Column corresponding to tags attached to the data	Optional

Parameter	Column Description	Required for
`ranking_group_id_column_name`	Non-unique Group ID to which each item belongs	• ingesting predictions • calculating ranking metrics
`ranking_item_id_column_name`	Non-unique Item ID of the record	• ingesting predictions
`id_column_name`	Unique ID of the record	ingesting anything, including predictions
`label_column_name`	Relevance of the item to the query (of type `double`).	• ingesting predictions • calculating ranking metrics
`prediction_column_names`	List of columns associated with ranking the model’s outputs. This is either of type `double` (assuming `score_type="RANKING_SCORE"`) or integer (assuming `score_type="RANK"`)	• ingesting predictions • calculating ranking metrics

Important: Item ID (ranking_item_id_column_name) and the TruEra ID (id_column_name) are not the same. The former identifies the item that the current record is associated with. It can be duplicated across different records. The latter uniquely identifies the record and cannot be duplicated.
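
The distinction is easiest to see in a small table. In this illustrative pandas sketch, the column names record_id, query_id, and item_id and all values are made up for the example; they stand in for the columns you would name in id_column_name, ranking_group_id_column_name, and ranking_item_id_column_name.

```python
import pandas as pd

ranking = pd.DataFrame({
    "record_id": ["r1", "r2", "r3", "r4"],  # unique per row (TruEra ID)
    "query_id": ["q1", "q1", "q2", "q2"],   # group each item belongs to
    "item_id": ["A", "B", "A", "C"],        # item "A" appears in two groups
})

record_ids_unique = ranking["record_id"].is_unique  # must hold
item_ids_unique = ranking["item_id"].is_unique      # need not hold
```

Here item "A" is ranked in both groups, so the item ID repeats while every record ID stays distinct.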

NLPColumnSpec

The NLPColumnSpec class is a variant of the ColumnSpec class for NLP projects and is also imported from truera.client.ingestion. It serves the same function as ColumnSpec, correlating the columns of text data with a specific kind of data. As with ColumnSpec, a Python dictionary can be passed as an argument in place of an NLPColumnSpec.

The following table defines each NLPColumnSpec parameter and whether it's required or optional.

Parameter	Column Description	Required?
`id_col_name`	Corresponds to unique `id` of each data point	Yes
`timestamp_col_name`	Corresponds to timestamp of each data point	Yes, in production data
`text_col_name`	Column corresponding to raw text data	Optional
`prediction_col_name`	Column corresponding to model predictions	Optional
`label_col_name`	Column corresponding to labels or ground truth data	Optional
`extra_data_col_names`	Column(s) corresponding to columns not used/consumed by the model, but which could be used for other analysis, such as defining segments	Optional
`token_influence_col_name`	Column corresponding to token-level influence data; can be suffixed with `_truera-qii_influence` or `_shap_influence` to prevent duplicate name issues	Optional
`tags_col_name`	Column corresponding to tags attached to the data	Optional
`token_col_name`	Corresponds to the tokenized version of `text_col_name`	Optional
`embeddings_col_name`	Corresponds to sentence embeddings for each record	Optional

ModelOutputContext

The ModelOutputContext class is passed along with data during ingestion to supply contextual information about that data. For example, predictions need to be associated with a specific model, which can be done with the model_name parameter of the class. In most cases, the ModelOutputContext is inferred from the workspace context and should be provided only when there is a need to override certain parameters.

The following table defines each ModelOutputContext parameter and when it's required.

Parameter	Description	Required when ingesting
`model_name`	Name of the model	Predictions and influences
`score_type`	Score type of the model output, e.g., `probits`	Predictions and influences
`background_split_name`	Name of the background split used for influence computation	Influences
`influence_type`	Algorithm used for influence computation	Influences
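
When passing the context as a dictionary, the table above can be encoded as a small lookup to catch missing keys early. This is a plain-Python sketch: REQUIRED and missing_context_keys are hypothetical names introduced here, and the check applies only when you supply the context explicitly rather than relying on workspace inference.

```python
# Keys required for each kind of ingested data, transcribed from the table above.
REQUIRED = {
    "predictions": {"model_name", "score_type"},
    "influences": {
        "model_name",
        "score_type",
        "background_split_name",
        "influence_type",
    },
}

def missing_context_keys(kind, model_output_context):
    """Return the required keys absent from an explicit context dictionary."""
    return sorted(REQUIRED[kind] - set(model_output_context))

context = {"model_name": "model_1", "score_type": "regression"}
missing_for_predictions = missing_context_keys("predictions", context)
missing_for_influences = missing_context_keys("influences", context)
```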