QII Computation Parameters
This notebook will help you understand how to tune the parameters involved in computing QIIs.
There are two main parameters used for QII computation:
- number of default influences: The number of data points (i.e., rows) for which QIIs are computed. Increasing it means more data points are used in any QII analysis.
- number of samples: A parameter used internally during QII computation. Increasing it improves the accuracy of the QII approximations.
Both of these parameters are project-level settings, and both scale computation time roughly proportionally. That is, if the number of samples is doubled, the runtime should double as well.
Changing the parameters in the TruEra Web App
In the Web App toolbar, click Project Settings and scroll down to Analysis, where both parameters can be adjusted.
Changing the parameters in the Python SDK
In the Python SDK, the settings can be accessed and changed via:
- number of default influences: `get_num_default_influences` and `set_num_default_influences`.
- number of samples: `get_num_internal_qii_samples` and `set_num_internal_qii_samples`.
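For example, once you have a workspace object (created in the next section), the current values can be inspected and updated as follows. This is a minimal sketch assuming an authenticated `TrueraWorkspace` named `tru` with a project already set, since these are project-level settings:
# Minimal sketch: assumes `tru` is an authenticated TrueraWorkspace with a project set
print(tru.get_num_default_influences())     # how many rows QIIs are computed for by default
print(tru.get_num_internal_qii_samples())   # how many internal samples each QII estimate uses

tru.set_num_default_influences(500)         # use 500 rows for QII analyses
tru.set_num_internal_qii_samples(2000)      # trade longer runtime for lower-variance estimates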
Setting Up QII with the SDK
First, if you haven't already done so, install the TruEra Python SDK.
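In a notebook this is typically a one-liner; the package name `truera` below is an assumption about the PyPI distribution:
# Install the TruEra Python SDK (package name assumed to be `truera`)
!pip install truera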
Next, set up QII by creating a simple project and model with the data from the sklearn California Housing dataset.
# FILL ME!
TRUERA_URL = "<TRUERA_URL>"
TOKEN = '<AUTH_TOKEN>'
from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication
auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from truera.client.ingestion import ColumnSpec
data_bunch = fetch_california_housing()
XS = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS = pd.DataFrame(data=data_bunch["target"], columns=["labels"])
data_all = XS.merge(YS, left_index=True, right_index=True).reset_index(names="id")
tru.add_project("California Housing", score_type="regression")
tru.add_data_collection("sklearn_data")
tru.add_data(
data_all,
data_split_name="all",
column_spec=ColumnSpec(
id_col_name="id",
pre_data_col_names=XS.columns.to_list(),
label_data_col_names=YS.columns.to_list()
)
)
MODEL = GradientBoostingRegressor()
MODEL.fit(XS, YS["labels"])  # fit on the 1-D label column to avoid a shape-conversion warning
tru.add_python_model("GBM", MODEL)
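As a quick sanity check that the model trained sensibly (not required for QII), we can reuse the `mean_squared_error` import above on the training data:
# In-sample error, so this is an optimistic estimate of model quality
print("Training MSE:", mean_squared_error(YS["labels"], MODEL.predict(XS)))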
First, let's use only a single internal sample with 300 explanation instances, and plot the influence sensitivity plot (ISP) for the MedInc feature.
tru.set_num_internal_qii_samples(1)
tru.set_num_default_influences(300)
tru.get_explainer("all").plot_isp("MedInc")
Now compare that to using 1000 samples with 1000 explanation instances. The 1000x increase in samples reduces the noise in each QII estimate, while increasing the number of explanation instances from 300 to 1000 increases the number of points plotted.
tru.set_num_internal_qii_samples(1000)
tru.set_num_default_influences(1000)
tru.get_explainer("all").plot_isp("MedInc")
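Finally, a rough way to see the proportional-runtime behavior described at the top of this notebook is to time the same plot at different sample counts. This is an illustrative sketch only: wall-clock timing in a notebook is noisy, and caching in the workspace may hide recomputation, so treat the numbers as indicative.
import time

for n_samples in (10, 100, 1000):
    tru.set_num_internal_qii_samples(n_samples)
    start = time.time()
    tru.get_explainer("all").plot_isp("MedInc")  # assumed to recompute influences at the new setting
    print(f"{n_samples} samples took {time.time() - start:.1f}s")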