# Exploring Dashboard Panels¶

As discussed in Creating a New Dashboard, panels are organized into distinct categories, comprising:

- Model Output – Regression | Model Output – Classification
- Labels – Regression | Labels – Classification
- Model Performance – Regression | Model Performance – Classification
- Data Input
- Data Quality
- Custom Panels

Each is discussed next in the context of model type: Regression, Classification, or Common (applied to both regression and classification analyses).

## Regression Model Panels¶

Regression models describe the relationship between one or more independent variables and a target variable.

A brief description and example of each regression model panel by panel type is covered next.

Tip

You can inspect the selected model for a panel in TruEra Diagnostics by clicking **DIAGNOSTICS ↗** at the top-right of the panel.

By clicking the export icon, you can export the panel's content to a CSV file for download and import into a spreadsheet or other third-party application.

### Model Output¶

#### Mean¶

Tracks the average model score for each model over time.

#### Volume¶

Tracks the volume of output for each model over time.

#### Predictions and Labels¶

Tracks model results against the respective ground truth label, when available, over time.

#### Distribution¶

Tracks the distribution of scores for each model and for the labels for those models over the user-defined time range.

#### Drift: Difference of Means¶

Tracks the absolute difference between means (production vs. baseline or known good time window) for each model over time.
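
As a rough illustration of the statistic this panel tracks, the sketch below computes the absolute difference between a baseline mean and a production mean. The function name and the sample scores are hypothetical, and the way the product windows and aggregates scores is not shown here.

```python
from statistics import fmean


def difference_of_means(baseline_scores, production_scores):
    """Absolute difference between the mean production and mean baseline score."""
    return abs(fmean(production_scores) - fmean(baseline_scores))


# Hypothetical scores for illustration only.
baseline = [0.2, 0.4, 0.6]
production = [0.5, 0.7, 0.9]
drift = difference_of_means(baseline, production)
print(round(drift, 3))
```

A value near zero suggests the average score has not moved relative to the baseline; it says nothing about changes in the shape of the distribution.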

#### Drift: Wasserstein¶

Earth Mover's Distance: the minimum amount of work (distance) needed to match two distributions, normalized by the total weight of the lighter distribution.

### Labels¶

Labels, when ingested and available, indicate the ground truth or some other meaningful measure to track your model's results against.

#### Label Volume¶

Tracks the volume of ground truth labels over time.

#### Label Distribution¶

Reports the percentage of records over time with ground truth labels assigned.

#### Label Drift¶

Calculates the distance between the baseline (training split or known good production time window) and the selected time window using the specified distance metric.

Also known as "annotation drift" and "target drift," label drift is a problem that occurs when the labels or categories associated with a dataset change over time. This can happen for a variety of reasons ranging from changes in human judgment to the introduction of new categories to the merging or splitting of existing categories. It can also be caused by the target population distributions changing over time.

### Model Performance¶

Regression model performance — good or bad, acceptable vs. unacceptable — is reflected in the error rate of the model's predictions. Knowing how well the regression line fits the dataset is another indicator of performance.

A "good" regression model is one for which the difference between the actual or observed values and predicted values for the data introduced is small and unbiased. Hence, to determine "acceptable" performance, the important questions are: Which errors does the model make? Are there specific data segments where the model performs differently? Did anything change compared to training?

Should these and other potential issues arise in production, TruEra Monitoring gives you the option of capturing a subset of your production data time window for root cause analysis (RCA) in TruEra Diagnostics.

#### RMSE¶

Root Mean Square Error (RMSE), also known as root mean square deviation, is a commonly used regression metric for evaluating the quality of predictions. It measures the Euclidean distance between the prediction and the ground truth, providing an estimation of how accurately the model is able to predict the target value.

Because mean squared error (MSE) is expressed in squared units of the target, its values can be too large for easy comparison. Taking the square root returns the error to the same units as the target, making RMSE easier to interpret.

Because RMSE is not scale invariant, however, model comparisons using this measure can be affected by the scale of the data, so it’s generally wise to apply RMSE over standardized data only.

RMSE is helpful when you need a single number with which to judge a model’s performance, whether during training and cross-validation or when monitoring a production deployment. Keep in mind that squaring the errors before averaging means the result can be heavily affected by a few predictions that deviate from the rest; i.e., outliers that significantly distort overall model output.
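
The sensitivity to outliers can be seen in a minimal sketch of the standard RMSE formula. The function name and the data are hypothetical; both prediction sets have the same total absolute error, yet the one that concentrates its error in a single prediction scores worse.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between ground truth and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels and two sets of predictions.
labels = [10.0, 12.0, 14.0, 16.0]
steady = [11.0, 13.0, 13.0, 15.0]   # every prediction is off by 1
outlier = [10.0, 12.0, 14.0, 20.0]  # one prediction is off by 4

print(rmse(labels, steady))   # 1.0
print(rmse(labels, outlier))  # 2.0
```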

#### WMAPE¶

Weighted mean absolute percentage error (WMAPE, sometimes written wMAPE) is another metric for evaluating the performance of regression models. A variant of MAPE, it weights absolute percentage errors by volume for a more rigorous and reliable metric.

WMAPE is typically used to investigate the average error of your model predictions over time compared to what really happens.
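
The sketch below assumes the common formulation in which each percentage error is weighted by its actual value, which reduces to total absolute error divided by total absolute actuals. The exact weighting used by the panel is not spelled out in this page, so treat the formula, the function name, and the data as illustrative.

```python
def wmape(y_true, y_pred):
    """Sum of absolute errors divided by the sum of absolute actual values."""
    total_actual = sum(abs(t) for t in y_true)
    if total_actual == 0:
        raise ValueError("WMAPE is undefined when all actual values are zero")
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total_error / total_actual


# Hypothetical actuals and predictions.
actuals = [100.0, 10.0, 50.0]
predictions = [90.0, 15.0, 50.0]
print(wmape(actuals, predictions))  # 15 / 160 = 0.09375
```

Under this formulation a 5-unit miss on a small actual (10) counts the same as a 5-unit miss on a large one, rather than dominating the average as it would in plain MAPE.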

## Classification Model Panels¶

Classification models predict a category, classifying a data point into a specific category/class (Yes/No, Spam/Not spam, Eligible/Ineligible, Qualified/Unqualified, etc.). The model outputs a score, often a probability within the range [0, 1], after which a decision threshold is applied to yield a decision in {0, 1}, generally mapped to classes: {fraud, not fraud}, {spam, not spam}, and so forth. Certain metrics look at the raw model score, while others look at the decision; i.e., after the threshold is applied.

For example, a classification model for a lender might be designed to predict whether a customer is likely to default on a loan based on data contained in the customer's credit report/payment history when compared to the track record of other borrowers. Or, the prediction could be influenced by factors other than credit score, such as education, income, length of current employment, time living at the same address, age, marital status, number of dependents, and so forth.

The simplest metric for model evaluation is performance accuracy — the ratio of the number of correct predictions to the total number of predictions made for a given dataset aggregated over time.

A brief description and example of each classification model panel by panel type is covered next.

Remember, you can inspect the selected model for a panel in TruEra Diagnostics by clicking **DIAGNOSTICS ↗** at the top-right of the panel.

Also, by clicking the export icon, you can export the panel's content to a CSV file for download and import into a spreadsheet or other third-party application.

### Model Output¶

The following panels tracking model output can be configured (see Creating a Dashboard).

#### Mean¶

Tracks the average model score for each model over time.

#### Volume¶

Tracks the volume of output for each model over time.

#### Drift: Difference of Means¶

Tracks the absolute difference between means (production vs. baseline or known good time window) for each model over time.

#### Drift: Wasserstein¶

Earth Mover's Distance: the minimum amount of work (distance) needed to match two distributions, normalized by the total weight of the lighter distribution.

#### Distribution¶

Tracks the distribution of scores for each model and for the labels for those models over time.

#### Model Decisions and Labels By Class¶

Tracks the distribution of model decisions (post-decision threshold) for all models, and the distribution of labels for each model.

#### Class Distribution¶

Tracks the percentage of model decisions assigned to the target class.
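
A minimal sketch of that calculation follows: model scores are turned into decisions with a threshold, and the share of decisions equal to the target class is reported. The function name, the default threshold of 0.5, and the `>=` convention are assumptions for illustration, not the product's documented behavior.

```python
def target_class_share(scores, threshold=0.5):
    """Percentage of scores whose decision (score >= threshold) is the target class."""
    if not scores:
        raise ValueError("scores must be non-empty")
    decisions = [1 if s >= threshold else 0 for s in scores]
    return 100.0 * sum(decisions) / len(decisions)


# Hypothetical model scores.
scores = [0.1, 0.4, 0.6, 0.9]
print(target_class_share(scores))                 # 50.0
print(target_class_share(scores, threshold=0.3))  # 75.0
```

Lowering the threshold assigns more records to the target class, so a shift in this panel can come from a threshold change as well as from a change in the scores themselves.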

#### Model Score Distribution¶

The **Model Score Distribution** panel compares labels and model output to check model accuracy along the distribution: **Min** (minimum), **5th Pctl** (percentile), **25th Pctl**, **Median** (50th percentile), **75th Pctl**, **95th Pctl**, and **Max** (maximum). For classification, this means tracking the percentage of model decisions assigned to the target class.

Under the empirical rule, nearly all data (about 99.7%) in a *normal* distribution fall within three standard deviations of the mean, which in a normal distribution equals the median. A percentile expresses the percentage of the population that a given score ranks above, and data near the mean occur more frequently than data far from the mean. Graphically, this results in a bell-shaped curve, the precise shape of which can vary according to the distribution of the values within the population.

The population is the entire set of data points included in the distribution. The **5th Pctl** reflects values ranking higher than 5% and lower than 95% of the population. Conversely, the **95th Pctl** reflects scores ranking higher than 95% and lower than 5% of the distribution. The **Median** marks the score that is higher than half of the population and lower than the other half. The same reading applies to the **25th Pctl** (higher than 25% of the population, lower than 75%) and the **75th Pctl** (higher than 75% of the population, lower than 25%).
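
The seven summary points can be reproduced with Python's standard library, as sketched below. The function name is hypothetical, and the interpolation method (`"inclusive"`) is an assumption; the panel may compute percentiles differently.

```python
from statistics import quantiles


def score_summary(scores):
    """Min, selected percentiles, and max of a population of scores."""
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(scores, n=100, method="inclusive")
    return {
        "Min": min(scores),
        "5th Pctl": cuts[4],
        "25th Pctl": cuts[24],
        "Median": cuts[49],
        "75th Pctl": cuts[74],
        "95th Pctl": cuts[94],
        "Max": max(scores),
    }


# Hypothetical population: the integers 0 through 100.
summary = score_summary(list(range(101)))
print(summary["5th Pctl"], summary["Median"], summary["95th Pctl"])
```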

When a model score distribution bears closer investigation, you can inspect a listed model in TruEra Diagnostics by clicking **DIAGNOSTICS ↗** at the top-right of the panel.

### Labels¶

Labels, when ingested and available, indicate the ground truth or some other meaningful measure to track your model's results against.

#### Label Volume¶

Tracks the volume of labels over time.

#### Label Class Distributions¶

Reports the percentage of records over time with ground truth labels assigned to the target class.

### Model Performance¶

Performance tracking for classification models currently supports AUC measurements. Additional metrics for classification monitoring are road-mapped for support soon.

#### AUC¶

Classification model performance is commonly summarized by the area under the ROC curve (AUC), which represents the degree or measure of separability between classes. In other words, the higher the AUC, the better the model is at predicting that a data point properly belonging to the 0 class is classified as 0 and a point belonging to the 1 class is classified as 1.

For instance, a model with a high AUC that classifies whether or not a patient has a certain medical condition is better at distinguishing between "has" and "doesn't have" than a different model running the same data with a lower AUC.

Monitoring production model AUC values over time can reveal performance trends that may bear closer scrutiny with respect to the data, the model or both.
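
One way to build intuition for AUC as a measure of separability is its pairwise interpretation: the ROC AUC equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below uses that equivalence. The function name and the data are hypothetical, and this quadratic-time loop is for illustration only.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one record of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical ground truth labels and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

Note that the calculation uses raw scores rather than thresholded decisions, and it is undefined for a window that contains labels of only one class.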

Important

Aggregated accuracy from label and model scores is computed over the specified time range *before* the classification decision. Because labels may be offset from model output, always check that labels exist in the selected time range.

A model's AUC score consistently trending lower may merit closer investigation. Inspect a listed model in TruEra Diagnostics by clicking **DIAGNOSTICS ↗** at the top-right of the panel.

Likewise, by clicking the export icon, you can export the panel's content to a CSV file for download and import into a spreadsheet or other third-party application.

## Common Panels¶

These panel categories are shared by both regression and classification models, although there are some differences in panel visualizations.

### Data Input¶

The following panels tracking model input can be configured.

#### Input Volume¶

Tracks the average input volume for each model over time.

#### Data Drift: Difference of Means¶

Tracks drift statistics for each feature, calculated as the distance between the baseline (training split or known good production time window) and the production time window using the Difference of Means distance.

#### Data Drift: Wasserstein¶

Tracks drift statistics for each feature, calculated as the distance between the baseline (training split or known good production time window) and the production time window using the Wasserstein (Earth Mover’s) distance.
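
For a single numerical feature, the Wasserstein distance between two samples can be sketched as the area between their empirical cumulative distribution functions. The implementation below is a plain-Python illustration under that assumption; the function name and data are hypothetical, and the product's own computation (including any normalization) may differ.

```python
from bisect import bisect_right


def wasserstein_1d(baseline, production):
    """1-D Wasserstein distance: area between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(production)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Each empirical CDF is constant on the interval [left, right).
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Hypothetical feature values: production is the baseline shifted by 5.
baseline_values = [0.0, 1.0, 3.0]
production_values = [5.0, 6.0, 8.0]
print(wasserstein_1d(baseline_values, production_values))
```

Unlike the difference of means, this distance also reacts to changes in spread or shape when the two means happen to coincide.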

#### Out of Range Values¶

Counts Out of Range errors by numerical feature and per model. A production numerical feature value is considered “Out of Range” if it lies outside the [min, max] range observed in the baseline.

Remember, to more closely inspect one of the listed models in TruEra Diagnostics, click **DIAGNOSTICS ↗** at the top-right of the panel.

Click the export icon to export the panel's content to a CSV file for download and import into a spreadsheet or other third-party application.
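
The Out of Range rule can be sketched for one feature as follows. The function name and the sample ages are hypothetical.

```python
def count_out_of_range(baseline_values, production_values):
    """Count production values outside the [min, max] seen in the baseline."""
    low, high = min(baseline_values), max(baseline_values)
    return sum(1 for v in production_values if v < low or v > high)


# Hypothetical "age" feature.
baseline_ages = [18, 35, 64]
production_ages = [17, 20, 64, 70]
print(count_out_of_range(baseline_ages, production_ages))  # 2
```

Values equal to the baseline minimum or maximum are treated as in range, consistent with the closed interval [min, max].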

### Data Quality¶

Data Quality, or DQ for short, measures the condition of data processed by a model based on factors of accuracy, completeness, and consistency.

The following panels track the aspect indicated.

#### Unrecognized Categories¶

Tracks the count of unrecognized category errors by categorical feature per model. A production categorical feature is considered to have this error when its value was not observed in the baseline.
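
For a single categorical feature, the check amounts to a set-membership test against the baseline, as in this sketch. The function name and the sample values are hypothetical.

```python
def count_unrecognized(baseline_values, production_values):
    """Count production values whose category never appeared in the baseline."""
    known = set(baseline_values)
    return sum(1 for v in production_values if v not in known)


# Hypothetical "color" feature.
baseline_colors = ["red", "green", "red"]
production_colors = ["red", "blue", "blue", "green"]
print(count_unrecognized(baseline_colors, production_colors))  # 2
```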

#### Numerical Issues¶

The Numerical Issues panel counts the instances in which numerical features exhibit `NAN`, `NULL`, `Inf`, or `-Inf` for the time range in question.
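
A minimal sketch of such a count for one feature is shown below. It assumes `NULL` arrives in Python as `None` and the other cases as floats; the function name and the sample values are hypothetical.

```python
import math


def count_numerical_issues(values):
    """Tally NAN, NULL, Inf, and -Inf occurrences in a list of feature values."""
    counts = {"NAN": 0, "NULL": 0, "Inf": 0, "-Inf": 0}
    for v in values:
        if v is None:            # assumed representation of NULL
            counts["NULL"] += 1
        elif math.isnan(v):
            counts["NAN"] += 1
        elif v == math.inf:
            counts["Inf"] += 1
        elif v == -math.inf:
            counts["-Inf"] += 1
    return counts


# Hypothetical feature values for one time range.
feature_values = [1.0, None, float("nan"), math.inf, -math.inf, math.inf]
print(count_numerical_issues(feature_values))
```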

#### Schema Mismatch¶

Schema mismatch occurs when the data source has different column (feature) names than those for which the model is configured. In other words, the model cannot match an input value to the data type and/or range it expects for the value — that is, the model and data are out-of-sync.

This panel tracks the count of schema mismatch errors across all features and per model.
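
The column-name part of this check can be sketched as a comparison of two sets of names. This illustration covers names only, not data types or ranges, and the function and column names are hypothetical.

```python
def schema_mismatches(expected_columns, actual_columns):
    """Report columns the model expects but the data lacks, and vice versa."""
    expected, actual = set(expected_columns), set(actual_columns)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


# Hypothetical model schema and incoming data columns.
model_columns = ["age", "income", "tenure"]
data_columns = ["age", "income_usd", "tenure"]
print(schema_mismatches(model_columns, data_columns))
```

A renamed column typically shows up as one entry in each list, as `income` and `income_usd` do here.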

#### Missing Values¶

Tracks errors caused by the absence of expected records during model processing of ingested data.

#### DQ Exploration¶

Tracks data quality errors per feature per model across the time range specified by the panel’s time selector.

### Custom Panels¶

These are panels you configure to monitor particular segments or to track and report your own user-defined metrics.

#### Segment Performance¶

These custom panels report on the segment(s) you configure for the selected model and time range.
