TruEra vs SHAP comparison

This notebook benchmarks SHAP against TruEra.

To use, fill out the Parameters section and run the notebook.

Main takeaways

Running the notebook should corroborate the takeaways listed in this section.

Advantages of TruEra over SHAP:

  1. Among black-box/kernel methods, KernelSHAP is often both less accurate and slower than TrueraBlackBox, especially for large comparison groups, where KernelSHAP can be extremely slow.
  2. TreeSHAP is less accurate than TrueraTree in the presence of non-linearities, such as explaining probabilities in boosted models. The problem is worse for strongly non-linear functions, such as thresholding a score to produce a classification; for this reason, TreeSHAP does not support explaining classification outcomes at all.
  3. TreeSHAP is less accurate than TrueraTree for larger intervention set sizes, because TreeSHAP automatically samples the intervention set down to at most 100 rows.

Further advantages of TruEra over SHAP not seen in the results:

  1. Feature mapping support: when the features we want to explain differ from the features actually fed to the model, we would generally need to fall back to black-box/kernel methods, but in many such cases TruEra handles the mapping natively. For example, given a feature grade that takes on values A, B, ..., F, TruEra can compute a single QII value for grade itself, rather than one for each one-hot encoded feature grade=A, grade=B, ..., grade=F, while still retaining the speed advantages of the tree methods (see the sketch after this list).
  2. TruEra and KernelSHAP (but not TreeSHAP) allow an accuracy/speed trade-off: lowering a sampling parameter makes TruEra's computation faster at the cost of accuracy. TreeSHAP offers no such knob and only gives a biased approximation.
  3. TruEra handles linear models (specifically sklearn.linear_model.LogisticRegression and sklearn.linear_model.LinearRegression) natively, much like tree models.
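
To make the feature-mapping point concrete, here is a minimal sketch (not TruEra's API; the attribution numbers are made up) of the usual post-hoc workaround: summing per-encoded-column attributions back to their source feature. TruEra's tree methods compute the grouped QII natively rather than approximating it this way.

import pandas as pd

raw = pd.DataFrame({"grade": ["A", "B", "F", "C"]})
encoded = pd.get_dummies(raw, columns=["grade"])  # grade_A, grade_B, grade_C, grade_F

# Hypothetical per-encoded-column attributions for a single row.
encoded_attrs = pd.Series(
    {"grade_A": 0.10, "grade_B": -0.02, "grade_C": 0.00, "grade_F": 0.01})

# Post-hoc workaround: map each encoded column back to its source feature
# and sum. This only approximates a true grouped attribution.
source_feature = {col: col.split("_")[0] for col in encoded_attrs.index}
print(encoded_attrs.groupby(source_feature).sum())  # grade    0.09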

Parameters

Instantiate the variables MODEL and XS.

  1. MODEL should be the model itself. For purposes of this exercise, it should be a gradient boosted tree-based model that works with TreeSHAP (i.e. one of XGBClassifier, LGBMClassifier, GradientBoostingClassifier, CatBoostClassifier). This is so we can run the full gamut of performance/accuracy tests.
  2. XS should be a pandas DataFrame that can be fed to the model directly via MODEL.predict_proba(XS). Ensure XS has at least 5K samples so that all experiments can run.
import pandas as pd
from xgboost import XGBClassifier

XS = pd.read_csv("~/Work/data/artifact_repo/fico/datasets/xgboost-dataset/data_num.csv")
YS = pd.read_csv("~/Work/data/artifact_repo/fico/datasets/xgboost-dataset/label.csv", header=None).to_numpy().ravel()
MODEL = XGBClassifier(n_estimators=100, random_state=0, max_depth=3).fit(XS.to_numpy(), YS)

Setup

# Common imports.

import numpy as np
import pandas as pd

from catboost import CatBoostClassifier
from copy import copy
from datetime import datetime
from datetime import timedelta
from lightgbm import LGBMClassifier
from scipy.special import expit
from scipy.special import logit
from shap.explainers._kernel import Kernel as KernelExplainer
from shap.explainers._tree import Tree as TreeExplainer
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from truera_qii.qii.explainers.tree import TreeExplainer as TrueraTreeExplainer
from truera_qii.qii.explainers.blackbox import BlackBoxExplainer as TrueraBlackBoxExplainer
from typing import Callable
from typing import Mapping
from typing import Optional
from typing import Sequence
from typing import Union
from xgboost import XGBClassifier
# Display settings.

pd.set_option('display.max_rows', 5000)

Ensure validity of data and model

# For this test MODEL must be one that TreeSHAP supports. This check isn't exhaustive; it covers four common TreeSHAP-compatible model classes.

assert isinstance(MODEL, (
    XGBClassifier,
    LGBMClassifier,
    GradientBoostingClassifier,
    CatBoostClassifier))
# Validate data.

assert isinstance(XS, pd.DataFrame), "XS must be a pandas DataFrame object!"
assert 5000 <= XS.shape[0], "XS must have at least 5000 rows so that all experiments can run!"
assert MODEL.predict_proba(XS.to_numpy()).shape[0] == XS.shape[0], "Failed to call MODEL's `predict_proba` function!"

Helper methods

# Processing methods.

def process_qiis(a, b):
    # Normalize attribution arrays: some explainers return per-class values of
    # shape (2, n_rows, n_features) for binary classifiers; keep class 1 only.
    a = np.array(a)
    b = np.array(b)
    if len(a.shape) == 3 and a.shape[0] == 2:
        a = a[1, :, :]
    if len(b.shape) == 3 and b.shape[0] == 2:
        b = b[1, :, :]
    assert a.shape == b.shape, "Shapes don't match! {} != {}".format(a.shape, b.shape)
    return a, b

def qii_diff_norm(a, b, ord):
    a, b = process_qiis(a, b)
    return np.linalg.norm(np.array(a - b).ravel(), ord=ord)

def topk_agreement(a, b, k, sort=True):
    a, b = process_qiis(a, b)
    assert len(a.shape) == len(b.shape)
    assert len(a.shape) <= 2 and len(b.shape) <= 2, "a.shape ({}) and b.shape ({}) must be of length <= 2".format(a.shape, b.shape)
    ret = 0
    if len(a.shape) == 2 and a.shape[0] > 1:
        for i in range(a.shape[0]):
            ret += topk_agreement(a[i, :], b[i, :], k, sort)
    else:
        a_idxs = np.argsort(np.abs(a.ravel()))[-k:][::-1]
        b_idxs = np.argsort(np.abs(b.ravel()))[-k:][::-1]
        if not sort:
            # For the order-insensitive comparison we sort both index arrays
            # numerically, so only the *set* of top-k features must match.
            a_idxs = np.sort(a_idxs)
            b_idxs = np.sort(b_idxs)
        ret = 1 if np.all(a_idxs == b_idxs) else 0
    return ret

def topk_value_ratio(qiis_computed, qiis_true, k):
    a = qiis_computed
    b = qiis_true
    a, b = process_qiis(a, b)
    assert len(a.shape) == len(b.shape)
    assert len(a.shape) <= 2 and len(b.shape) <= 2, "a.shape ({}) and b.shape ({}) must be of length <= 2".format(a.shape, b.shape)
    ret = 0
    if len(a.shape) == 2 and a.shape[0] > 1:
        for i in range(a.shape[0]):
            ret += topk_value_ratio(a[i, :], b[i, :], k)
    else:
        a = np.abs(a.ravel())
        b = np.abs(b.ravel())
        a_idxs = np.argsort(a)[-k:][::-1]
        b_idxs = np.argsort(b)[-k:][::-1]
        num = np.sum([b[i] for i in a_idxs])
        den = np.sum([b[i] for i in b_idxs])
        ret = 1 if den == 0 else (num / den)
    return ret    
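
As a quick sanity check, here is how these metrics behave on a toy pair of attribution arrays (two rows, three features; the numbers are made up). Note that the top-k metrics return counts/ratios summed over rows, which is why the comparison code below divides them by the number of points.

a = np.array([[0.5, -0.3, 0.1], [0.2, 0.9, -0.1]])
b = np.array([[0.4, -0.35, 0.05], [0.15, 0.8, -0.2]])

print(qii_diff_norm(a, b, ord=1))   # 0.45: total L1 difference over all entries
print(topk_agreement(a, b, k=1))    # 2: both rows agree on their top feature
print(topk_value_ratio(a, b, k=1))  # 2.0: per-row ratio of 1.0, summed over rows
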
# All algorithms.

def _get_model_func(model, score_type):
    # Wrap the model as a single-output function of the requested score type.
    if score_type == "logits":
        return lambda xs: logit(model.predict_proba(xs)[:, 1])
    if score_type == "probits":
        return lambda xs: model.predict_proba(xs)[:, 1]
    if score_type == "classification":
        return lambda xs: (model.predict_proba(xs)[:, 1] > 0.5).astype(float)
    raise ValueError(f"score_type must be one of ['logits', 'probits', 'classification'], got {score_type}")

def _get_shap_model_output(score_type):
    # Map our score types onto SHAP's `model_output` argument.
    if score_type == "logits":
        return "raw"
    if score_type == "probits":
        return "probability"
    if score_type == "classification":
        return "classification"
    raise ValueError(f"score_type must be one of ['logits', 'probits', 'classification'], got {score_type}")

class Algorithms:
    def __init__(self,
                 truera_num_samples: int = 3000,
                 tree_shap_intervention_samples: Optional[int] = None,
                 truera_black_box_use_preprocessing: bool = False):
        self.truera_num_samples = truera_num_samples
        self.tree_shap_intervention_samples = tree_shap_intervention_samples
        self.truera_black_box_use_preprocessing = truera_black_box_use_preprocessing

    def kernel_shap(self, model, score_type, intervention_data, x):
        return KernelExplainer(_get_model_func(model, score_type), intervention_data.to_numpy()).shap_values(x, nsamples=1000)    

    def tree_shap(self, model, score_type, intervention_data, x):
        return TreeExplainer(model, data=intervention_data.to_numpy(), model_output=_get_shap_model_output(score_type)).shap_values(x)

    def random_tree_shap(self, model, score_type, intervention_data, x):
        sampled_intervention_data = intervention_data.to_numpy()
        if self.tree_shap_intervention_samples is not None and self.tree_shap_intervention_samples < intervention_data.shape[0]:
            sampled_intervention_data = intervention_data.sample(self.tree_shap_intervention_samples, random_state=0).to_numpy()
        return TreeExplainer(model, data=sampled_intervention_data, model_output=_get_shap_model_output(score_type)).shap_values(x)

    def centroid_tree_shap(self, model, score_type, intervention_data, x):
        centroids = intervention_data.to_numpy()
        if self.tree_shap_intervention_samples is not None and self.tree_shap_intervention_samples < intervention_data.shape[0]:
            centroids = KMeans(n_clusters=self.tree_shap_intervention_samples, random_state=0).fit(intervention_data.to_numpy()).cluster_centers_
        return TreeExplainer(model, data=centroids, model_output=_get_shap_model_output(score_type)).shap_values(x)

    def truera_tree(self, model, score_type, intervention_data, x):
        return TrueraTreeExplainer(model, data=intervention_data.to_numpy(), model_output=_get_shap_model_output(score_type)).truera_qii_values(x, num_samples=self.truera_num_samples)

    def truera_black_box(self, model, score_type, intervention_data, x):
        explainer = TrueraBlackBoxExplainer(_get_model_func(model, score_type), data=intervention_data.to_numpy())
        if self.truera_black_box_use_preprocessing:
            return explainer.truera_qii_values(x, preprocess=True, num_samples=int(self.truera_num_samples/3))
        else:
            return explainer.truera_qii_values(x, preprocess=False, num_samples=self.truera_num_samples)

    def true(self, model, score_type, intervention_data, x):
        # Ground truth: for logits (linear in the ensemble's additive leaf
        # values) with a background set of at most 100 rows, interventional
        # TreeSHAP is exact. Otherwise fall back to TruEra's tree method with
        # a very large sample count.
        if score_type == "logits" and intervention_data.shape[0] <= 100:
            return TreeExplainer(model, data=intervention_data.to_numpy(), model_output=_get_shap_model_output(score_type)).shap_values(x)
        return TrueraTreeExplainer(model, data=intervention_data.to_numpy(), model_output=_get_shap_model_output(score_type)).truera_qii_values(x, num_samples=max(1000*self.truera_num_samples, 1000000))
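
Each of the methods above shares the signature (model, score_type, intervention_data, x) and returns one attribution per feature per explained row. A minimal usage sketch (assuming the MODEL and XS defined in the Parameters section):

algos = Algorithms()
# Explain 3 rows against a 200-row baseline/intervention set.
demo_qiis = algos.tree_shap(MODEL, "logits", XS.head(200), XS.head(3).to_numpy())
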
# Comparison methods.

# KernelSHAP can take so long sometimes that it's best to limit how many points it'll compute.
KERNEL_SHAP_NUM_POINTS_LIMIT = 10

def create_comparison_df(
        model: Union[XGBClassifier, LGBMClassifier, GradientBoostingClassifier, RandomForestClassifier, CatBoostClassifier],
        score_types: Sequence[str],
        intervention_sizes: Sequence[int],
        x: pd.DataFrame,
        num_points: int,
        algorithm_map: Mapping[str, Callable],
        random_seed: Optional[int] = 0):
    ret = []
    for score_type in score_types:
        for intervention_size in intervention_sizes:
            # Fix seed.
            np.random.seed(random_seed)
            # Convenience variables.
            intervention_data = x
            if intervention_size is not None:
                intervention_data = x.sample(n=intervention_size)
            # Choose points.
            x_sampled = x.sample(n=num_points).to_numpy()
            # Go through algorithms.
            res = {}
            for algorithm in algorithm_map.keys():
                np.random.seed(random_seed)
                start_time = datetime.now()
                x_curr = x_sampled
                if algorithm == "KernelSHAP":
                    x_curr = x_sampled[:KERNEL_SHAP_NUM_POINTS_LIMIT, :]
                qiis = algorithm_map[algorithm](model, score_type, intervention_data, x_curr)
                time = datetime.now() - start_time
                if isinstance(qiis, list) and len(qiis) == 2:
                    qiis = qiis[1]
                print(f"Done {algorithm} in {time}s.")
                res[algorithm] = [qiis, time.total_seconds()]
            # Create column.
            curr = {
                "score type": score_type,
                "intervention size": intervention_size,
            }
            true_qiis = res["True"][0]
            for algorithm in algorithm_map.keys():
                if algorithm == "KernelSHAP":
                    num_points_curr = min(num_points, KERNEL_SHAP_NUM_POINTS_LIMIT)
                    true_qiis_curr = true_qiis[:num_points_curr, :]
                else:
                    num_points_curr = num_points
                    true_qiis_curr = true_qiis
                qiis = res[algorithm][0]
                curr[algorithm + " qiis"] = qiis
                curr[algorithm + " time"] = res[algorithm][1] / num_points_curr
                curr[algorithm + " L_1 error"] = qii_diff_norm(qiis, true_qiis_curr, 1) / num_points_curr
                curr[algorithm + " top 1 disagreement"] = 1 - topk_agreement(qiis, true_qiis_curr, 1) / num_points_curr
                curr[algorithm + " top 3 sorted disagreement"] = 1 - topk_agreement(qiis, true_qiis_curr, 3) / num_points_curr
                curr[algorithm + " top 3 unsorted disagreement"] = 1 - topk_agreement(qiis, true_qiis_curr, 3, False) / num_points_curr     
                curr[algorithm + " top 1 missed value ratio"] = 1 - topk_value_ratio(qiis, true_qiis_curr, 1) / num_points_curr     
                curr[algorithm + " top 3 missed value ratio"] = 1 - topk_value_ratio(qiis, true_qiis_curr, 3) / num_points_curr     
            ret.append(curr)
    return pd.DataFrame.from_dict(ret).T
# Result summarization methods.

def summarize_results(
        special_rows: Sequence[str],
        suffixes_sorted: Sequence[str],
        algorithms: Sequence[str],
        score_type: Optional[str] = None) -> pd.DataFrame:
    # `df` (built in the Experiments section) is transposed: its rows are
    # metrics and its columns are experiment configurations.
    sorted_rows = special_rows.copy()
    for suffix in suffixes_sorted:
        for algorithm in algorithms:
            sorted_rows.append(algorithm + " " + suffix)
    # Filter by score type first so this works even when "score type" is not
    # among the requested rows.
    df_filtered = df
    if score_type:
        df_filtered = df.T[df.T["score type"] == score_type].T
    return df_filtered.T[sorted_rows].T

Experiments

# Parameters.

# The number of points to compute influences for.
NUM_INFLUENCES = 20

# The number of samples to use for TruEra (which are sampling techniques).
NUM_INTERNAL_QII_SAMPLES = 3000

# A map from algorithm names to algorithms to run.
# Many are commented out as they don't add to the story in any meaningful way. That said, feel free to try them out as well.
ALGORITHM_MAP = {
    "KernelSHAP": Algorithms().kernel_shap,
    "TreeSHAP": Algorithms().tree_shap,
    #"RandomTreeSHAP1": Algorithms(tree_shap_intervention_samples=1).random_tree_shap,
    #"RandomTreeSHAP10": Algorithms(tree_shap_intervention_samples=10).random_tree_shap,
    #"RandomTreeSHAP100": Algorithms(tree_shap_intervention_samples=100).random_tree_shap,
    #"RandomTreeSHAP1000": Algorithms(tree_shap_intervention_samples=1000).random_tree_shap,
    #"CentroidTreeSHAP1": Algorithms(tree_shap_intervention_samples=1).centroid_tree_shap,
    #"CentroidTreeSHAP10": Algorithms(tree_shap_intervention_samples=10).centroid_tree_shap,
    #"CentroidTreeSHAP100": Algorithms(tree_shap_intervention_samples=100).centroid_tree_shap,
    #"CentroidTreeSHAP1000": Algorithms(tree_shap_intervention_samples=1000).centroid_tree_shap,
    "TrueraTree": Algorithms(truera_num_samples=NUM_INTERNAL_QII_SAMPLES).truera_tree,
    "TrueraBlackBox": Algorithms(truera_num_samples=NUM_INTERNAL_QII_SAMPLES).truera_black_box,
    #"TrueraBlackBoxWithPreprocessing": Algorithms(truera_num_samples=NUM_INTERNAL_QII_SAMPLES, truera_black_box_use_preprocessing=True).truera_black_box,
    "True": Algorithms().true,
}

# The model-outputs/score-types to run.
SCORE_TYPES = ["probits", "logits"]

# The baseline/intervention/comparison/background data size to run (we will sample the points to reach this many rows).
BASELINE_SPLIT_SIZES = [5000, 500, 1]
# Run experiments.

df = create_comparison_df(
    MODEL,
    SCORE_TYPES,
    BASELINE_SPLIT_SIZES,
    XS,
    NUM_INFLUENCES,
    ALGORITHM_MAP)
df_orig = df.copy()

Results

In this section we break down the comparison between TruEra and SHAP. All metrics are averaged per data-point/row and all times are in seconds.

Note that we run experiments for both probability and logit model-outputs/score-types, but not for classification (i.e. whether the probability exceeds 0.5 or some other threshold), because TreeSHAP cannot handle it --- KernelSHAP and TruEra can.

Black-box/kernel methods performance comparison

summarize_results(["intervention size"], ["time"], ["TrueraBlackBox", "KernelSHAP"], score_type="probits")
                                0          1          2
intervention size            5000        500          1
TrueraBlackBox time      0.065248   0.066718   0.065747
KernelSHAP time         21.849974   2.409709   0.052767

The main takeaway here is that KernelSHAP is far slower even on moderately sized baseline/intervention/comparison/background sets, which makes it practically unusable in many settings.
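
A back-of-envelope count explains this scaling: KernelSHAP fills in each of its sampled coalitions with every background row, so the number of model evaluations per explained point grows linearly with the background size (a rough sketch, assuming the nsamples=1000 used above):

nsamples = 1000  # as passed to shap_values(...) above
for background_rows in [5000, 500, 1]:
    # Each sampled coalition is completed with every background row.
    print(background_rows, nsamples * background_rows)
# 5,000,000 vs. 500,000 vs. 1,000 evaluations, roughly matching the
# linear blow-up in the timings above.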

All methods (black-box/kernel and tree) performance comparison

summarize_results(["intervention size", "score type"], ["time"], ["TrueraBlackBox", "KernelSHAP", "TrueraTree", "TreeSHAP"])
                                0          1          2          3          4          5
intervention size            5000        500          1       5000        500          1
score type                probits    probits    probits     logits     logits     logits
TrueraBlackBox time      0.065248   0.066718   0.065747   0.064368   0.065615   0.065204
KernelSHAP time         21.849974   2.409709   0.052767  24.317333   2.403169   0.051223
TrueraTree time          0.015084   0.013756    0.01409   0.015421   0.005498   0.002511
TreeSHAP time            0.003711   0.004018   0.002834   0.003068   0.003796   0.002834

There are a few takeaways here:

  1. Tree methods are generally much faster than kernel/black-box methods, regardless of whether SHAP or TruEra is used.
  2. TreeSHAP is faster than TruEra's tree method here, but as section 7.4 shows, this is largely because TreeSHAP is much less accurate. Since TruEra allows a speed vs. accuracy trade-off, it could be tuned to be similar to TreeSHAP in both accuracy and speed if desired; for purposes of this demo we focused more on accuracy.
  3. When there is no non-linearity (as with logits for boosted tree models) and the baseline/intervention/comparison/background set is of size \(\leq\) 100, TruEra uses the same algorithm as TreeSHAP since it is essentially optimal there. As a result, in the last column the tree algorithms take roughly the same time.

Black-box/kernel methods accuracy comparison

The main takeaway in this section for both probits and logits is that TruEra and SHAP are comparable in accuracy. The time performance of TruEra is far superior though as seen in section 7.2.

Probits

summarize_results(["intervention size"], ["L_1 error"], ["TrueraBlackBox", "KernelSHAP"], score_type="probits")
                                     0          1          2
intervention size                 5000        500          1
TrueraBlackBox L_1 error      0.015883   0.012012   0.009313
KernelSHAP L_1 error          0.019636   0.018917   0.030743

Logits

summarize_results(["intervention size"], ["L_1 error"], ["TrueraBlackBox", "KernelSHAP"], score_type="logits")
                                     3          4          5
intervention size                 5000        500          1
TrueraBlackBox L_1 error      0.085304   0.060117   0.048903
KernelSHAP L_1 error          0.074934   0.074097   0.119642

All methods (black-box/kernel and tree) accuracy comparison

Probits

summarize_results(["intervention size"], ["L_1 error"], ["TrueraBlackBox", "KernelSHAP", "TrueraTree", "TreeSHAP"], score_type="probits")
                                     0          1          2
intervention size                 5000        500          1
TrueraBlackBox L_1 error      0.015883   0.012012   0.009313
KernelSHAP L_1 error          0.019636   0.018917   0.030743
TrueraTree L_1 error          0.015064   0.013614   0.008665
TreeSHAP L_1 error            0.077259   0.079759   0.110122

Here we see that TreeSHAP is far less accurate than the others. This is because of the non-linearity (probits for boosted tree models), which essentially makes TreeSHAP incorrect.
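
To see the non-linearity concretely: for a binary boosted classifier, the predicted probability is the logistic transform of the raw additive margin, so probabilities are a non-linear function of the ensemble. A quick check, assuming the XGBClassifier MODEL from the Parameters section:

# predict_proba is expit(raw margin) for binary logistic XGBoost models, so
# the "probits" score type is a non-linear transform of the additive margin.
margins = MODEL.predict(XS.to_numpy()[:5], output_margin=True)
probs = MODEL.predict_proba(XS.to_numpy()[:5])[:, 1]
assert np.allclose(expit(margins), probs, atol=1e-6)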

Logits

summarize_results(["intervention size"], ["L_1 error"], ["TrueraBlackBox", "KernelSHAP", "TrueraTree", "TreeSHAP"], score_type="logits")
                                     3          4          5
intervention size                 5000        500          1
TrueraBlackBox L_1 error      0.085304   0.060117   0.048903
KernelSHAP L_1 error          0.074934   0.074097   0.119642
TrueraTree L_1 error          0.079186        0.0        0.0
TreeSHAP L_1 error            0.357846   0.392497        0.0

Here we see that TreeSHAP is perfectly accurate when the baseline/intervention/comparison/background set is of size \(\leq\) 100 and there is no non-linearity (logits are linear for boosted tree models), but it is otherwise fairly inaccurate. In fact, for the cases where TreeSHAP is perfectly accurate, we use the same algorithm.

Appendix: full results

# Display results.

SPECIAL_ROWS = ["intervention size", "score type"]
SUFFIXES_SORTED = ["time", "L_1 error"]

sorted_rows = SPECIAL_ROWS.copy()
for suffix in SUFFIXES_SORTED:
    for algorithm in ALGORITHM_MAP.keys():
        sorted_rows.append(algorithm + " " + suffix)
df.T[sorted_rows].T
                                     0          1          2          3          4          5
intervention size                 5000        500          1       5000        500          1
score type                     probits    probits    probits     logits     logits     logits
KernelSHAP time              21.849974   2.409709   0.052767  24.317333   2.403169   0.051223
TreeSHAP time                 0.003711   0.004018   0.002834   0.003068   0.003796   0.002834
TrueraTree time               0.015084   0.013756    0.01409   0.015421   0.005498   0.002511
TrueraBlackBox time           0.065248   0.066718   0.065747   0.064368   0.065615   0.065204
True time                    15.387591  11.342512   9.825126   0.029718   0.006172   0.002355
KernelSHAP L_1 error          0.019636   0.018917   0.030743   0.074934   0.074097   0.119642
TreeSHAP L_1 error            0.077259   0.079759   0.110122   0.357846   0.392497        0.0
TrueraTree L_1 error          0.015064   0.013614   0.008665   0.079186        0.0        0.0
TrueraBlackBox L_1 error      0.015883   0.012012   0.009313   0.085304   0.060117   0.048903
True L_1 error                     0.0        0.0        0.0        0.0        0.0        0.0