
Performance Metrics

TruEra supports a wide variety of performance metrics for both classification and regression models. Accuracy checks are configurable: you can change the metrics for the selected project and/or model in Project Settings (shown below) after signing in at app.truera.net.

Figure: Project Settings

Note

Measurements are labeled TP (true positive), FP (false positive), TN (true negative), and FN (false negative).

Classification Models

| Name | Range | Interpretation | Notes |
| --- | --- | --- | --- |
| AUC | [0, 1] | higher is better | Also referred to as AUC-ROC. Threshold-independent. |
| Segment generalized AUC | [0, 1] | higher is better | A segment generalized version of the AUC metric. Threshold-independent. |
| Classification accuracy | [0, 1] | higher is better | Requires a model threshold. |
| Precision | [0, 1] | higher is better | Requires a model threshold. Measures TP / (TP + FP). |
| Recall | [0, 1] | higher is better | Requires a model threshold. Measures TP / (TP + FN). |
| F1 score | [0, 1] | higher is better | Requires a model threshold. Measures the harmonic mean of precision and recall. |
| True positive rate | [0, 1] | higher is better | Requires a model threshold. Measures TP / (TP + FN). Identical to recall. |
| False positive rate | [0, 1] | lower is better | Requires a model threshold. Measures FP / (FP + TN). |
| True negative rate | [0, 1] | higher is better | Requires a model threshold. Measures TN / (FP + TN). |
| False negative rate | [0, 1] | lower is better | Requires a model threshold. Measures FN / (TP + FN). |
| Negative predictive value | [0, 1] | higher is better | Requires a model threshold. Measures TN / (TN + FN). |
| Average precision | [0, 1] | higher is better | Corresponds to the area under the precision-recall curve. Threshold-independent. For a precise definition, see here. |
| Jaccard index | [0, 1] | higher is better | Requires a model threshold. Measures the similarity between the label and predicted sets. |
| Matthews correlation coefficient | [-1, 1] | higher is better | Requires a model threshold. Calculates the correlation coefficient between the observed and predicted binary classifications. For more details, see here. |
| Logloss | [0, +inf] | lower is better | Requires model probabilities. Also called logistic or cross-entropy loss. Calculates the negative log likelihood of the classifier given the true label per sample. |
| Gini coefficient | [-1, 1] | higher is better | Can be derived from ROC-AUC; see Glossary for a formal definition. |
| Segment generalized Gini | [-1, 1] | higher is better | A segment generalized version of the Gini coefficient metric. |
| Accuracy ratio | [-inf, +inf] | higher is better | Calculated as Gini score / (1 - P[Y=1]). |
| Segment generalized accuracy ratio | [-inf, +inf] | higher is better | A segment generalized version of the accuracy ratio metric. |
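
As an illustrative sketch (not TruEra's internal implementation), the snippet below recomputes several of these metrics with scikit-learn from labels and predicted probabilities; the toy data and the 0.5 decision threshold are assumptions made for the example.

```python
# Illustrative only: recomputes a few of the classification metrics above with
# scikit-learn. Not TruEra's internal implementation; the 0.5 decision
# threshold and the toy data are assumptions for the example.
import numpy as np
from sklearn.metrics import (
    roc_auc_score, average_precision_score, f1_score, log_loss,
    matthews_corrcoef, confusion_matrix,
)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # ground-truth labels
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2, 0.9, 0.55])   # model probabilities
y_pred = (y_prob >= 0.5).astype(int)                             # assumed threshold of 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

auc = roc_auc_score(y_true, y_prob)                              # threshold-independent
metrics = {
    "AUC": auc,
    "Average precision": average_precision_score(y_true, y_prob),
    "Classification accuracy": (tp + tn) / (tp + tn + fp + fn),
    "Precision": tp / (tp + fp),
    "Recall / TPR": tp / (tp + fn),
    "FPR": fp / (fp + tn),
    "F1 score": f1_score(y_true, y_pred),
    "Matthews correlation coefficient": matthews_corrcoef(y_true, y_pred),
    "Logloss": log_loss(y_true, y_prob),
    "Gini coefficient": 2 * auc - 1,                             # derived from AUC
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```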

Calculation of Segment Generalized Metrics

Many model metrics (e.g., classification accuracy) are calculated as an average of a point-wise metric. Metrics such as AUC, on the other hand, cannot be broken down in this way and therefore don't naturally generalize to segments of the data. As an illustrative example, suppose a data split is divided into two segments: the points with label 0 and the points with label 1. Both segments would have an undefined AUC on their own, regardless of the overall AUC.

To work around this, we generalize the AUC for a segment in the context of a data split in the following way:

\[ E\left[ \begin{aligned} &\delta[f(x) < f(\tilde{x}), y = 0] \\ &+ \delta[f(x) > f(\tilde{x}), y = 1] \\ &+ \delta[f(x) = f(\tilde{x})]/2 \end{aligned} \;\middle|\; (x, y) \in S, (\tilde{x}, \tilde{y}) \in D, y \neq \tilde{y} \right] \]

where \(S\) is the segment, \(D\) is the entire data split, \(\delta\) is the indicator function, and \(f\) is the ML model's score function. By doing this, instead of computing a segment's AUC in a vacuum, we contextualize it in terms of the overall data split. This is especially useful, for example, in determining whether the segment in question is contributing negatively to the overall AUC and to what extent.
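
As a rough sketch of how this formula can be read (an illustration, not TruEra's implementation; the function and variable names are hypothetical), the following Python function averages pairwise comparisons between each segment point and every oppositely labeled point in the full split:

```python
# Illustrative sketch of the segment generalized AUC formula above.
# Not TruEra's implementation; names and the toy data are hypothetical.
import numpy as np

def segment_generalized_auc(scores_split, labels_split, segment_mask):
    """Average pairwise win rate of segment points against oppositely
    labeled points drawn from the entire data split."""
    scores_split = np.asarray(scores_split, dtype=float)
    labels_split = np.asarray(labels_split)
    segment_mask = np.asarray(segment_mask, dtype=bool)

    total, count = 0.0, 0
    for f_x, y in zip(scores_split[segment_mask], labels_split[segment_mask]):
        # Pair (x, y) in S with every (x~, y~) in D such that y != y~.
        opposite = scores_split[labels_split != y]
        if opposite.size == 0:
            continue
        if y == 1:
            wins = np.sum(f_x > opposite)   # delta[f(x) > f(x~), y = 1]
        else:
            wins = np.sum(f_x < opposite)   # delta[f(x) < f(x~), y = 0]
        ties = np.sum(f_x == opposite)      # delta[f(x) = f(x~)] / 2
        total += wins + 0.5 * ties
        count += opposite.size
    return total / count if count else float("nan")

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
# When the "segment" is the entire split, this reduces to the ordinary AUC.
print(segment_generalized_auc(scores, labels, [True] * 4))                   # 0.75
# A segment containing only label-0 points still gets a well-defined score.
print(segment_generalized_auc(scores, labels, [True, True, False, False]))   # 0.75
```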

The Gini coefficient and accuracy ratio are derived from the AUC, so to generalize them to segments we simply use the segment generalized AUC in their place.
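
As a sketch, writing \(\text{AUC}_S\) for the segment generalized AUC above and using the standard identity \(\text{Gini} = 2 \cdot \text{AUC} - 1\) together with the accuracy ratio formula from the table, the segment generalized variants are:

\[ \text{Gini}_S = 2 \cdot \text{AUC}_S - 1, \qquad \text{Accuracy ratio}_S = \frac{\text{Gini}_S}{1 - P[Y = 1]} \]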

Regression Models

| Name | Range | Interpretation | Notes |
| --- | --- | --- | --- |
| MSE (mean squared error) | [0, +inf] | lower is better | |
| RMSE (root mean squared error) | [0, +inf] | lower is better | Equivalent to the square root of the MSE. |
| MAE (mean absolute error) | [0, +inf] | lower is better | |
| MSLE (mean squared log error) | [0, +inf] | lower is better | Computes a risk metric equivalent to the expected value of the squared logarithmic error/loss; see here for a formal definition. This may be useful if targets grow exponentially. |
| R^2 | (-inf, 1] | higher is better | Measures the coefficient of determination of the regression model. |
| Explained variance | [0, 1] | higher is better | Computes the explained variance regression score. |
| MAPE (mean absolute percentage error) | [0, +inf] | lower is better | |
| WMAPE (weighted mean absolute percentage error) | [0, +inf] | lower is better | A variant of MAPE in which the absolute percentage errors are combined as a weighted arithmetic mean; used to measure the performance of regression and forecasting models. |
| MPE (mean percentage error) | [0, +inf] | lower is better | |
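
As with the classification metrics, most of these can be illustrated with scikit-learn and NumPy. The sketch below is illustrative only (not TruEra's internal implementation); the toy data are assumptions, and `mean_absolute_percentage_error` requires scikit-learn 0.24 or later.

```python
# Illustrative only: recomputes several of the regression metrics above with
# scikit-learn and NumPy. Not TruEra's internal implementation; the toy data
# are assumptions for the example.
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, mean_squared_log_error,
    r2_score, explained_variance_score, mean_absolute_percentage_error,
)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE:", mse)
print("RMSE:", np.sqrt(mse))                                   # square root of MSE
print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSLE:", mean_squared_log_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
print("Explained variance:", explained_variance_score(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))

# WMAPE: absolute errors weighted by the magnitude of the true values.
print("WMAPE:", np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))
```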

Ranking Models

| Name | Range | Interpretation | Notes |
| --- | --- | --- | --- |
| NDCG (normalized discounted cumulative gain) | [0, 1] | higher is better | Sums the true scores ranked in the order induced by the predicted scores after applying a logarithmic discount, then divides by the best possible score (ideal DCG, obtained for a perfect ranking) to obtain a score between 0 and 1. |
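
For reference, scikit-learn's `ndcg_score` follows the same recipe (DCG of the predicted ordering divided by the ideal DCG). The snippet below is illustrative only, not TruEra's internal implementation; the toy relevance scores are assumptions.

```python
# Illustrative only: NDCG as described above, via scikit-learn's ndcg_score.
# The toy relevance scores and predicted scores are assumptions for the example.
import numpy as np
from sklearn.metrics import ndcg_score

true_relevance = np.asarray([[10, 0, 0, 1, 5]])           # true scores for one query
predicted_scores = np.asarray([[0.1, 0.2, 0.3, 4, 70]])   # model ranking scores

# DCG of the predicted ordering divided by the ideal DCG (perfect ordering).
print(ndcg_score(true_relevance, predicted_scores))
```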
