Interesting/High-Error Segments: Finding Hotspots

Segmentation is an analytical technique for identifying and isolating interesting and/or high-error data points so that you can take targeted, corrective measures during model performance debugging. It entails dividing and organizing (segmenting) your data into defined groups that share one or more characteristics in order to find points at variance with overall model output.

However, manually sifting through a data split, large or small, for points that can be construed as "interesting" or "high-error" is a time-consuming and often cumbersome task requiring many choices along the way, starting with what constitutes "interesting" and how high a "high" error rate is. In other words, what are your segmentation criteria? Which features that share a value or fall within a particular range should be included?

The TruEra Python SDK's find_hotspots() method simplifies and automates precisely this type of manual exploration.

How does it work?

Given user-specified parameters, find_hotspots() searches features or sets of features in a greedy fashion to return segments that maximize or minimize the "metric of interest" you specify.

Note

Here, the term "hotspots" is interchangeable with "interesting/high-error segments."
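
TruEra doesn't document the internal search beyond "greedy over features or sets of features," but the simplified, hypothetical sketch below illustrates the general idea: scan candidate (feature, value) segments and rank them by an aggregated pointwise metric. Every name in it is illustrative, not part of the SDK.

```python
# Hypothetical sketch only; this is NOT TruEra's internal implementation.
# It illustrates a greedy, feature-by-feature scan that ranks candidate
# segments by an aggregated pointwise metric (here, the mean).
import pandas as pd

def find_hotspots_sketch(X: pd.DataFrame, pointwise_metric: pd.Series,
                         minimum_size: int = 50, top_k: int = 3):
    """Scan every (feature, value) segment and return the top_k segments
    with the highest mean pointwise metric (e.g., squared error)."""
    candidates = []
    for feature in X.columns:
        for value in X[feature].unique():
            mask = X[feature] == value
            if mask.sum() < minimum_size:
                continue  # ignore segments below the minimum size
            candidates.append((feature, value, pointwise_metric[mask].mean()))
    # Sort so the highest-error ("hottest") segments come first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)[:top_k]
```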

Pointwise Metric Calculation

To calculate a given metric on various segments in a split, it's necessary to establish a pointwise metric over which to aggregate. This could be, for example (a minimal sketch follows this list):

  • Pointwise classification accuracy – assign the value of 1 to correctly classified points; assign all other points a value of 0.

  • Pointwise squared error – assign the squared difference between its prediction and its ground truth to each point.

  • Pointwise precision – two per-point indicators are maintained: membership in the numerator (e.g., true positives) and membership in the denominator (e.g., all predicted positives); each indicator assigns a value of 1 to points that belong and 0 to all other points.
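
As a concrete illustration, here is a minimal sketch of these three pointwise metrics using NumPy (the arrays are made-up examples):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])              # ground-truth labels
y_pred = np.array([1, 1, 0, 1, 0])              # hard class predictions
y_score = np.array([0.9, 0.7, 0.3, 0.8, 0.2])   # model scores/predictions

# Pointwise classification accuracy: 1 if correct, else 0.
pointwise_accuracy = (y_true == y_pred).astype(int)      # [1, 0, 0, 1, 1]

# Pointwise squared error: squared difference per point.
pointwise_sq_error = (y_true - y_score) ** 2

# Pointwise precision: numerator membership (true positives) and
# denominator membership (all predicted positives).
numerator = ((y_pred == 1) & (y_true == 1)).astype(int)  # TP indicator
denominator = (y_pred == 1).astype(int)                  # TP + FP indicator
```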

Pointwise metrics can then be aggregated for various segments, allowing comparisons across segments in order to return the most "interesting" ones, where "interesting" means higher or lower values depending on the metric of interest.

Pointwise Metric Aggregation: Simple Mean

Aggregation of pointwise metrics depends on the metric of interest. For aggregated metrics expressed as the mean of the pointwise metrics, TruEra uses the approach cited next.

Denoting a set of model predictions on a segment as $S$ of size $|S| = N_{\text{segment}}$, and $P(x_i)$ as the pointwise metric for the $i$-th point in the segment, the aggregated metric for a segment can be given as:

$$M_{\text{segment}} = \frac{1}{N_{\text{segment}}^{\,1-s}} \sum_{i=1}^{N_{\text{segment}}} P(x_i)$$

where $s$ is the 'size exponent' factor used to scale $M_{\text{segment}}$ by the size of the segment.
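
Assuming the scaling reconstructed above, which matches the documented behavior of size_exponent (s = 0 gives the mean, s = 1 gives the sum), a minimal sketch of the aggregation looks like this:

```python
import numpy as np

def aggregate_simple_mean(pointwise: np.ndarray, size_exponent: float) -> float:
    """Aggregate pointwise metrics with size-exponent scaling:
    size_exponent = 0 gives the plain mean, size_exponent = 1 the sum."""
    n = len(pointwise)
    return float(n ** (size_exponent - 1) * pointwise.sum())

p = np.array([1.0, 0.0, 1.0, 1.0])
assert aggregate_simple_mean(p, 0.0) == 0.75  # mean over the segment
assert aggregate_simple_mean(p, 1.0) == 3.0   # sum over the segment
```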

If a comparison_data_split_name is provided, here's the mean aggregation used:

Denoting split A and split B as the base and comparison splits, respectively, and the sizes of the segment in these splits as $N_A$ and $N_B$, the aggregated mean metric can be given as:

[Equation image: aggregation of simple mean with a comparison split]

Pointwise Metric Aggregation: Confusion Matrix

With respect to metrics derived from the confusion matrix (e.g., precision, recall, true/false positive/negative rate), the aggregation method must take into account the numerator and the denominator of the metric of interest. For instance, if the definition of precision is

$$\text{Precision} = \frac{TP}{TP + FP}$$

then the pointwise metrics denote membership in the numerator (e.g., TP for precision) and the denominator (e.g., TP + FP for precision) of the corresponding confusion-matrix metric. Therefore, the aggregated metric for the given segment is derived as follows.

Denoting a set of model predictions on a segment as $S$ of size $|S| = N_{\text{segment}}$, and functions $n(x_i)$ and $d(x_i)$ indicating numerator and denominator membership, respectively, of the $i$-th point in the segment, the aggregated metric for a segment can then be given as:

$$M_{\text{segment}} = \frac{\sum_{i=1}^{N_{\text{segment}}} n(x_i)}{\sum_{i=1}^{N_{\text{segment}}} d(x_i)}$$
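
A minimal sketch of this numerator/denominator aggregation, reusing the precision indicators from the earlier pointwise example:

```python
import numpy as np

def aggregate_ratio(numerator: np.ndarray, denominator: np.ndarray) -> float:
    """Aggregate a confusion-matrix metric over a segment as the ratio of
    numerator memberships to denominator memberships."""
    d = denominator.sum()
    return float(numerator.sum() / d) if d else float("nan")

numerator = np.array([1, 0, 0, 1, 0])    # TP membership per point
denominator = np.array([1, 1, 0, 1, 0])  # TP + FP membership per point
print(aggregate_ratio(numerator, denominator))  # 0.666...: 2 of 3 predicted positives are TPs
```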

If a comparison_data_split_name is provided, then the following numerator/denominator aggregation is used.

Denoting the sizes of the segment in splits A and B as $N_A$ and $N_B$, the aggregated mean metric can be given as:

[Equation image: aggregation of confusion-matrix metrics with a comparison split]

Parameters

Although the find_hotspots() method definition in the Python SDK Technical Reference provides the full list of parameters, here is some additional context:

  • size_exponent – float in the range [0, 1] that encourages the method to return smaller segments. Looking at the equation above, size_exponent = 0 yields the mean of the pointwise metrics over the segment, while size_exponent = 1 yields their sum.

  • comparison_data_split_name – defaults to None; required for metric_of_interest = UNDER_OR_OVERSAMPLING; optional for any other metric of interest.

  • metric_of_interest – name of the metric to use when searching for hotspots. Allowable values for classification and regression projects are listed in the tables that follow, with a short usage sketch after them.

Classification Metrics of Interest

| Metric | Description | Notes |
| --- | --- | --- |
| SEGMENT_GENERALIZED_AUC | common threshold-independent metric | default |
| CLASSIFICATION_ACCURACY | common classification metric | |
| LOG_LOSS | threshold-independent metric | |
| PRECISION | confusion-matrix derivative metric | |
| RECALL | confusion-matrix derivative metric | |
| TRUE_POSITIVE_RATE | confusion-matrix derivative metric | |
| FALSE_POSITIVE_RATE | confusion-matrix derivative metric | |
| TRUE_NEGATIVE_RATE | confusion-matrix derivative metric | |
| FALSE_NEGATIVE_RATE | confusion-matrix derivative metric | |
| UNDER_OR_OVERSAMPLING | model-agnostic, data-dependent | experimental* |

Regression Metrics of Interest

| Metric | Description | Notes |
| --- | --- | --- |
| MEAN_ABSOLUTE_ERROR | common regression metric | default |
| MEAN_SQUARED_ERROR | common regression metric | |
| MEAN_SQUARED_LOG_ERROR | useful when the range of the target is large | |
| UNDER_OR_OVERSAMPLING | model-agnostic, data-dependent | experimental* |
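
To put the parameters together, here is a hypothetical call sketch. The workspace/explainer setup and the metric-name format (string vs. enum) are assumptions; adapt them to your SDK version and project configuration:

```python
# Hypothetical usage sketch; see the Python SDK Technical Reference for
# the authoritative signature of find_hotspots().
explainer = tru.get_explainer(base_data_split="train")  # "tru" is an assumed, already-connected workspace

hotspots = explainer.find_hotspots(
    metric_of_interest="PRECISION",  # one of the allowable values above
    size_exponent=0.25,              # bias the search toward smaller segments
    minimum_size=50,                 # ignore segments below this size
)
print(hotspots)
```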

Under/Oversampling Metric of Interest (Experimental)

UNDER_OR_OVERSAMPLING is a data-dependent metric of interest that TruEra exposes to search for interesting segments based on the percentage difference between segment sizes in two data splits. Note that this metric requires a comparison split (comparison_data_split_name) in addition to the explainer's base split.

An under/oversampled segment will have a high size diff (%) value: the absolute difference between the segment's share of each split, expressed as a percentage and formally defined as:

$$\text{size diff (\%)} = \left| \frac{N_{\text{segment}}^{A}}{N_{\text{split}}^{A}} - \frac{N_{\text{segment}}^{B}}{N_{\text{split}}^{B}} \right| \times 100$$

where $A$ and $B$ denote the two splits being compared, $N_{\text{segment}}$ denotes the number of points in the segment, and $N_{\text{split}}$ denotes the number of points in the split.
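
A quick sketch of this computation (the function name is illustrative):

```python
def size_diff_pct(n_seg_a: int, n_split_a: int,
                  n_seg_b: int, n_split_b: int) -> float:
    """Absolute difference between the segment's share of each split,
    expressed as a percentage (the size diff (%) defined above)."""
    return abs(n_seg_a / n_split_a - n_seg_b / n_split_b) * 100

# A segment covering 30% of split A but only 10% of split B:
print(size_diff_pct(300, 1000, 50, 500))  # 20.0
```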

Hence, the output of find_hotspots() when metric_of_interest=UNDER_OR_OVERSAMPLING will look similar to this:

[Image: example find_hotspots() output with metric_of_interest=UNDER_OR_OVERSAMPLING]

Important

The find_hotspots() method requires that a model be defined on the explainer, even though calculating UNDER_OR_OVERSAMPLING itself does not require a model.

"What If" Metric (Experimental)

At one point or another, you'll undoubtedly ask yourself whether it's worth your time to take action on these "interesting" segments. To help you assess a proposed segment's actionability, TruEra defines a 'what if' metric as:

[Equation image: 'what if' metric]

where M denotes the metric on a set of points and N denotes the number of points.

An abstract of the "what if" metric calculation looks like this:

[Image: abstract example of the 'what if' metric calculation]
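
The authoritative formula is the one pictured above. For metrics that are plain averages of pointwise values, one natural reading (an assumption here, not necessarily TruEra's exact definition) is the split-level metric recomputed with the segment's points excluded:

```python
def what_if_metric(m_split: float, n_split: int,
                   m_segment: float, n_segment: int) -> float:
    """Assumed reading of the 'what if' metric: the split-level mean
    metric with the segment's contribution removed. Only meaningful for
    metrics that are averages of pointwise values (see the caveat below)."""
    return (m_split * n_split - m_segment * n_segment) / (n_split - n_segment)

# Split accuracy is 0.90 over 1,000 points; a 100-point segment scores 0.50.
# Removing that segment, the remaining 900 points score about 0.944.
print(what_if_metric(0.90, 1000, 0.50, 100))
```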

When requested using the show_what_if_performance parameter, the 'what if' metric is returned in addition to each segment-wise metric (pictured next).

[Image: example output with show_what_if_performance enabled]

Keep in mind that the 'what if' metric can only be defined for a metric_of_interest expressed as the average of a linear combination of pointwise metrics (e.g., classification accuracy, mean squared error). This means that certain metrics of interest (e.g., AUC, precision) will not return a 'what if' metric even if requested, as shown in the next example.

[Image: example of a non-viable metric of interest, which returns no 'what if' metric]

Here's the current list of supported "what if" metrics:

Classification

  • Classification Accuracy
  • Log Loss

Regression

  • Mean Absolute Error
  • Mean Squared Error
  • Mean Squared Log Error

See Performance Metrics for definitions.

Important Caveats Regarding Web App Support

With the release of TruEra v1.33, the TruEra Web App's support for find_hotspots() exposes only these parameters:

  • num_features
  • minimum_size
  • metric_of_interest

Also, because the Web App's find_hotspots() workflow does not currently use a comparison split, the experimental metric of interest UNDER_OR_OVERSAMPLING is not enabled.
