Important Concepts¶

A core set of foundational concepts underpins the TruEra Platform and the products and services built upon it. Here is a partial list of the most important ones.

Data Collection/Dataset¶

A data collection, or dataset, is a particular instance of data used for analysis or model building at a given time. Data comes in many forms, shapes, and sizes: numerical, categorical, and text data, as well as image, voice, and video data. A data collection can be static (unchanging) or dynamic (changing over time). It can be structured, in the form of a table (tabular data) relating observations (rows) to features (columns), or unstructured, which covers anything that cannot be organized into a table: unparsed document-based data, network or graph data, image data, video data, audio data, web-based logs, sensor data, and so forth.

Data Visualization¶

Data visualization includes scatter plots, line graphs, bar plots, histograms, Q-Q plots, smooth densities, box plots, pair plots, heat maps, and more. It is the basis for the descriptive analytics used in machine learning for data preprocessing and analysis, feature selection, model building, model testing, and model evaluation.

Data Wrangling¶

Data wrangling is the process of converting data from its raw form into an analysis-ready form. It is an important step in data preprocessing and includes several processes, such as data importing, data cleaning, data structuring, string processing, HTML parsing, handling dates and times, handling missing data, and text mining.

Data Quality¶

Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability, and timeliness. Measuring data quality issues such as bias and drift helps you identify data errors that need to be resolved and assess whether the data is fit for its intended purpose.

Data Bias¶

Data bias (also known as sample selection bias) is the bias introduced into an analysis or machine learning model when the training data is selected in a way that is not representative of the true population. Because the data is not representative, the model's predictive accuracy may not hold in production. To fix this, either the training methodology must take the bias into account or the sampling methodology must be corrected when the splits are created. There are many forms of data bias, including sampling bias, time interval bias, susceptibility bias, data inclusion bias, and attrition or survivorship bias (Resource: Wikipedia).

Data Drift¶

Data drift has occurred if the distribution of the data has changed between the initial baseline dataset (the in-sample dataset) and either an out-of-sample dataset (a holdout dataset) or an out-of-time dataset (a dataset drawn from another time period). For example, data may follow one distribution during the period in which a model is trained and then shift to a different distribution later.

The drift can be described as consequential if it shifts in a way that “matters” to the model. We define two metrics that capture aspects of consequential data drift: Model Score Instability (MSI) and Feature Influence Instability (FII). These metrics form the basis of TruEra's stability analysis workflow.

Fairness¶

While models are becoming crucial time savers in decision making, the results produced by the governing algorithms can also show bias in favor of specific groups. The need to measure and maintain "fairness" is therefore of critical importance.

Disparate Impact¶

Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people sharing a protected characteristic more than another, even though the rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect race, color, religion, national origin, and sex, and some laws also cover disability status and other traits¹.

Each country has its own regulatory guidance on how to avoid disparate impact. In the US, one such source of guidance is the 1978 EEOC Uniform Guidelines, which provided the basis for the 80% disparate impact rule. Under this rule, the following process is generally followed²:

1. Calculate the selection rate for each group that makes up more than 2% of the applicant pool.
2. Identify the group with the highest selection rate. This is not always the white, male, or “majority” group.
3. Calculate impact ratios by dividing the selection rate of each group by that of the group with the highest rate.
4. Determine whether the selection rates are substantially different (i.e., impact ratio < 0.80, or 80%). Groups whose impact ratio falls below this threshold are defined to be disparately impacted under this guidance.
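
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than TruEra's implementation: the function name `impact_ratios`, its parameters, and the applicant pool are all hypothetical, and the 2% and 80% values are simply the defaults quoted in the process above.

```python
from collections import Counter


def impact_ratios(applicants, min_share=0.02, threshold=0.80):
    """Return {group: (impact_ratio, is_flagged)} from (group, selected) pairs."""
    applicants = list(applicants)
    if not applicants:
        raise ValueError("applicant pool is empty")
    totals = Counter(group for group, _ in applicants)
    chosen = Counter(group for group, selected in applicants if selected)

    # Step 1: selection rate for each group above the minimum pool share.
    rates = {
        group: chosen[group] / count
        for group, count in totals.items()
        if count / len(applicants) > min_share
    }
    if not rates or max(rates.values()) == 0:
        raise ValueError("no eligible group has a positive selection rate")

    # Step 2: the reference rate is the highest selection rate of any group.
    best_rate = max(rates.values())

    # Steps 3 and 4: impact ratio per group, flagged when below the threshold.
    return {
        group: (rate / best_rate, rate / best_rate < threshold)
        for group, rate in rates.items()
    }


# Hypothetical pool: 100 applicants in group A, 50 in B, and 2 in C.
# Group C is under 2% of the pool, so it is left out of the calculation.
pool = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 20 + [("B", False)] * 30
    + [("C", True)] * 2
)
ratios = impact_ratios(pool)
for group, (ratio, flagged) in sorted(ratios.items()):
    print(f"{group}: impact ratio {ratio:.2f}, flagged={flagged}")
```

In this made-up pool, group A has the highest selection rate (0.60), so group B's impact ratio is 0.40 / 0.60, or about 0.67, which falls below the 0.80 threshold.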

Once a group is identified as being disparately impacted, it is important to determine whether the impact is justified and what can be done to alleviate it.

Note that while the principle of disparate impact is generally accepted, different countries and regulatory bodies may provide different guidance. Please refer to your governance team and the regulators in your jurisdiction for specific guidance.

Stability¶

An ML model is stable if it produces consistent predictions with respect to minute changes in the training dataset. In other words, model output doesn’t change much when the training dataset is modified.

Feature Influence Instability (FII)¶

Informally, FII captures how different the distribution of feature influences is for a given feature as we go from the in-sample to the out-of-sample distribution of inputs (train/test vs. eval/production, for example). The higher this metric, the more the feature’s influence on the model has changed.

Formally, the FII metric computes the first Wasserstein distance (aka the Earth Mover Distance) between two distributions of feature influences: one for the in-sample data splits (i.e., train/test) and the other for the out-of-sample split (i.e., holdout) or the out-of-time split (i.e., a dataset from a later period of time or from a production time window).

The metric may be interpreted as the average change in feature influences in going from in-sample to out-of-sample data. It can be useful in determining which features are not generalizing well for the model, or are changing in influence over time and causing a degradation in model performance.

The FII metric serves a role for ML model stability analysis that is analogous to the Characteristic Stability Index (CSI) metric for scorecard models.
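
As a rough illustration of the formal definition, the sketch below computes the first Wasserstein distance between two one-dimensional samples of feature influences by integrating the absolute difference of their empirical CDFs. The function name `wasserstein_1d`, the feature names, and the influence values are assumptions made up for this example; it is not TruEra's implementation.

```python
def wasserstein_1d(u, v):
    """First Wasserstein distance between two 1-D empirical distributions."""
    u, v = sorted(u), sorted(v)
    if not u or not v:
        raise ValueError("both samples must be non-empty")
    points = sorted(set(u) | set(v))
    distance, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance the counters so i/len(u) and j/len(v) are the CDFs at `left`.
        while i < len(u) and u[i] <= left:
            i += 1
        while j < len(v) and v[j] <= left:
            j += 1
        # Both CDFs are constant on [left, right), so the area is a rectangle.
        distance += abs(i / len(u) - j / len(v)) * (right - left)
    return distance


# Hypothetical per-feature influences for an in-sample and an out-of-sample split.
in_sample = {
    "income": [0.10, 0.20, 0.30, 0.40],
    "age": [-0.05, 0.00, 0.05, 0.10],
}
out_of_sample = {
    "income": [0.30, 0.40, 0.50, 0.60],  # every influence shifted up by 0.2
    "age": [-0.05, 0.00, 0.05, 0.10],    # unchanged
}
fii = {
    name: wasserstein_1d(in_sample[name], out_of_sample[name])
    for name in in_sample
}
for name, value in fii.items():
    print(f"{name}: FII = {value:.3f}")
```

Here the `income` influences all move by 0.2, so the distance is 0.2, matching the "average change" interpretation, while the unchanged `age` feature scores 0. The two samples do not need to be the same size. If SciPy is available, `scipy.stats.wasserstein_distance` computes the same quantity for one-dimensional samples.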

Model Score Instability (MSI)¶

One aspect of consequential data drift is captured by the TruEra Model Score Instability (MSI) metric. Informally, it captures how different the distribution of model scores is as we go from the baseline in-sample data (train/test) to out-of-sample data (for holdout sets) or to out-of-time data (for data sampled during different time periods). The higher the absolute value of this metric, the larger the difference and the stronger the signal that the model may have become unstable and requires careful examination. A high MSI value does not, however, necessarily mean that the model is unstable.

The MSI metric serves a role for ML model stability analysis that is analogous to the Population Stability Index (PSI) metric for scorecard models.

Formally, the MSI metric computes the first Wasserstein distance (aka the Earth Mover Distance) between the two distributions of model scores: in-sample vs. out-of-sample or out-of-time. The metric may be interpreted as the average change in model scores between the two datasets.
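
For reference, the standard one-dimensional form of the first Wasserstein distance is written out below. The symbols are assumptions introduced here for illustration: P and Q denote the in-sample and out-of-sample (or out-of-time) score distributions, and F_P and F_Q are their cumulative distribution functions.

```latex
W_1(P, Q) = \int_{-\infty}^{\infty} \left| F_P(s) - F_Q(s) \right| \, ds
```

When both datasets contain the same number n of scores, this reduces to the mean absolute difference between the sorted scores p_(1) <= ... <= p_(n) and q_(1) <= ... <= q_(n), which is where the "average change in model scores" reading comes from:

```latex
W_1(P, Q) = \frac{1}{n} \sum_{i=1}^{n} \left| p_{(i)} - q_{(i)} \right|
```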

In addition to the MSI, the TruEra product also provides a visual presentation of the two model score distributions. While the MSI metric summarizes a key aspect of the distance between the two distributions, the visual presentation can be useful to qualitatively understand the nature of the shift in the distributions.

Gini Coefficient¶

The Gini coefficient (also known as Somers’ D) is a rank statistic that measures the ordinal association between two possibly dependent variables X and Y. It varies between -1 (when all pairs of X and Y disagree) and +1 (when all pairs agree). It is used as a quality measure for binary choice, logistic regression, and credit risk models³.

Schechtman & Schechtman showed that Gini is linearly related to the AUC metric⁴:

Gini = 2 * AUC - 1

Idan Schatz has published a good tutorial on Gini’s origins, usage, and pitfalls⁵.
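
The relationship Gini = 2 * AUC - 1 can be checked on a small example. The sketch below computes AUC directly as the fraction of (positive, negative) pairs that the scores rank correctly, counting ties as half. The function names and the scores and labels are made up for illustration.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def gini(scores, labels):
    """Gini coefficient obtained from AUC via Gini = 2 * AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0


# Hypothetical model scores and binary outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(f"AUC  = {auc(scores, labels):.4f}")
print(f"Gini = {gini(scores, labels):.4f}")
```

Of the nine positive/negative pairs in this example, eight are ordered correctly and one is not, so AUC is 8/9 and Gini is 7/9. That is the same as the share of concordant pairs minus the share of discordant pairs, (8 - 1) / 9.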

Feature Influence¶

The Feature Influence (FI) is the contribution a particular feature makes to the final model score. The influences of all features sum to the model score.

The theoretical underpinnings are based on a paper entitled "Algorithmic Transparency via Quantitative Input Influence", which introduces Quantitative Input Influence (QII). Additionally, a treatise on the application of QII in Default Risk Analysis for finance is available from SSRN.

At its core, QII measures feature influences by intervening on inputs and estimating their Shapley values, representing the features’ average marginal contributions over all possible feature combinations.

This method accounts for the nonlinearity of machine learning models, and can be applied to any class of models without knowing how those models work internally. It also applies to models with probability, log-odds, or regression outputs (unlike many alternative methods in circulation).
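
To make the idea of intervening on inputs and averaging marginal contributions concrete, here is a minimal sketch that computes exact Shapley values for a toy model by enumerating every feature combination. It is an illustration under stated assumptions, not the QII implementation: the names `shapley_influences`, `toy_model`, and `background` are hypothetical, and exhaustive enumeration is only feasible for a handful of features, whereas practical methods estimate these values by sampling.

```python
from itertools import combinations
from math import factorial


def shapley_influences(model, x, background):
    """Exact Shapley value of each feature of x, treating the model as a black box."""
    n = len(x)

    def value(coalition):
        # Keep x's values for features in the coalition and intervene on the
        # rest by substituting each background row's values, then average.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in coalition else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    influences = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        influences.append(phi)
    return influences


def toy_model(features):
    # A deliberately nonlinear toy model with an interaction term.
    return 2.0 * features[0] + features[1] * features[2]


point = [1.0, 2.0, 3.0]
background = [[0.0, 0.0, 0.0]]  # a single all-zero baseline row
influences = shapley_influences(toy_model, point, background)
print(influences)
```

With the all-zero baseline, the first feature receives an influence of 2 and the interaction term's contribution of 6 is split evenly between the other two features. In this sketch the influences sum to the model score at `point` minus the average score over `background` (8 - 0 here); how a given product sets that baseline is outside what this example shows.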

Metrics Used in Root Cause Analysis¶

Difference of Means: If the output of a model on a set of points for a protected group G is given by S_G = {u_1, ..., u_m}, and for the remainder of the population P by S_P = {v_1, ..., v_n}, then the difference of means bias metric is given by DM(S_G, S_P).

Wasserstein Distance: If the output of a model on a set of points for a protected group G is given by S_G = {u_1, ..., u_m}, and for the remainder of the population P by S_P = {v_1, ..., v_n}, then the Wasserstein distance bias metric is given by WS(S_G, S_P).

Contribution to Difference of Means: If the influence of a feature f of a model on a set of points for a protected group G is given by I_G = {u_1, ..., u_m}, and for the remainder of the population P by I_P = {v_1, ..., v_n}, then the contribution to the difference of means bias metric is given by DM(I_G, I_P).

Contribution to Wasserstein Distance: If the influence of a feature f of a model on a set of points for a protected group G is given by I_G = {u_1, ..., u_m}, and for the remainder of the population P by I_P = {v_1, ..., v_n}, then the contribution to the Wasserstein distance bias metric is given by WS(I_G, I_P).
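
The text does not spell out DM, so the sketch below assumes the natural reading: the mean over the protected group minus the mean over the rest of the population (the sign convention is an assumption). It also assumes, as described under Feature Influence, that each point's score is the sum of its feature influences. All names and numbers are hypothetical.

```python
from statistics import fmean


def difference_of_means(group_values, rest_values):
    """Assumed definition of DM: mean over the group minus mean over the rest."""
    return fmean(group_values) - fmean(rest_values)


# Hypothetical feature influences, one list entry per data point.
group_influences = {"income": [0.1, 0.2], "age": [0.3, 0.1]}
rest_influences = {"income": [0.4, 0.5, 0.6], "age": [0.1, 0.2, 0.0]}

# Under the additivity assumption, a point's score is the sum of its influences.
group_scores = [sum(vals) for vals in zip(*group_influences.values())]
rest_scores = [sum(vals) for vals in zip(*rest_influences.values())]

dm_scores = difference_of_means(group_scores, rest_scores)
contributions = {
    feature: difference_of_means(group_influences[feature], rest_influences[feature])
    for feature in group_influences
}
print(f"DM(S_G, S_P) = {dm_scores:.2f}")
for feature, value in contributions.items():
    print(f"DM(I_G, I_P) for {feature} = {value:.2f}")
```

Because means are linear, the per-feature contributions in this sketch (-0.35 for `income` and 0.10 for `age`) add up to the difference of means of the scores (-0.25), which is what makes them usable for attributing a score gap to individual features.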

Outliers¶

Data points that are very different from the rest of a dataset are called outliers. Outliers may have been included by error due to a malfunctioning sensor, a contaminated experiment, human error in recording data, and the like.

¹ Wikipedia: https://en.wikipedia.org/wiki/Disparate_impact
² Source: http://annex.ipacweb.org/library/conf/08/brink.pdf
³ Wikipedia: https://en.wikipedia.org/wiki/Somers%27_D
⁴ Schechtman, E., & Schechtman, G. (2016). The Relationship between Gini Methodology and the ROC curve (SSRN Scholarly Paper No. ID 2739245). Rochester, NY: Social Science Research Network.
⁵ https://towardsdatascience.com/using-the-gini-coefficient-to-evaluate-the-performance-of-credit-score-models-59fe13ef420