Data scientists like to organize things into tables, where a row is an instance, record, observation, or trial representing a single datapoint on a graph, i.e., the person or thing (a single unit thereof) being measured. Each column is an applicable feature or variable (a characteristic) of the row entry: name, age, weight, education, temperature, viscosity, voltage, salinity ... you name it.
But what does it mean?
It means that features constitute what your dataset is comparing, row by row.
It also means that the quality of the features in your dataset affects the quality of the insights gained when the dataset is used for machine learning. Business problems within the same industry do not necessarily require the same features. Not everything that can be measured is needed to answer the business problem posed, so it is important to understand the business goals and priorities of your modeling project before simply including every available feature, i.e., every measurable attribute. Certain attributes simply are not relevant to the analysis being undertaken. In other words, it's often a fine line that separates "too much" from "not enough."
Hence, it's important to weigh the quality of your dataset’s features and their relevance to the business problem at hand. Feature selection and feature engineering, notoriously difficult and tedious, are therefore vital to the validity of your model's results. Done well, your continually optimized dataset should contain only those features that have a bearing on the problem you're trying to solve. Including more could exert undue influence on the model. Including fewer could skew the results in an entirely different way.
TruEra's Features diagnostics can help you determine the "Goldilocks Zone" for your model — not too much, not too little — just right.
Assessing Feature Behavior
To begin, click Features under Diagnostics in the nav panel on the left. This displays a page headed by three tabs: Overview, All features, and Feature Groups.
As discussed next, the Overview tab opens by default, showing feature information for the last Model and Split name selected.
- Importance – shows the top five features ranked by highest average influence, measured as the L1 norm: the sum of the absolute values of the vector, i.e., its Manhattan distance from the origin of the vector space. (By contrast, the L2 norm is the square root of the sum of the squared vector values.)
- Outliers – presents the top five outlier features in terms of:
  - Influence – the highest influence of a single point. This metric helps identify features which, for certain points, drive extreme model decisions.
  - Category – the normalized ratio of the highest and second-highest absolute influence categories. This surfaces features for which a single category stands out from all other categories in terms of influence.
  - Trend-line – the average residual for the top 10 percent of points, as measured by distance from the trend line, compared to the average residual for the rest of the points. This surfaces features with irregular interactions that diverge from a typical trend, i.e., features with points that lie far from the best-fit spline.
- Trends – ranks the top five features with the highest R-squared value of the best-fit spline (influence variances well-explained by the spline).
See Influence Sensitivity Plots for additional guidance on outliers and trends.
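As a rough illustration of the Importance ranking described above, the following sketch computes the L1 and L2 norms of a list of influences and ranks features by mean absolute influence. It is not TruEra's implementation: the function names and sample values are hypothetical, and it assumes that "average influence measured as L1 norm" means the L1 norm of a feature's per-point influences divided by the number of points.

```python
import math


def l1_norm(values):
    """Sum of absolute values (Manhattan distance from the origin)."""
    return sum(abs(v) for v in values)


def l2_norm(values):
    """Square root of the sum of squared values (Euclidean length)."""
    return math.sqrt(sum(v * v for v in values))


def rank_by_average_influence(influences, top_n=5):
    """Rank features by mean absolute influence (assumed: L1 norm / point count)."""
    scores = {name: l1_norm(vals) / len(vals) for name, vals in influences.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical per-point influences for three features.
influences = {
    "age": [0.5, -0.25, 0.75],
    "income": [-1.0, 1.0, -1.0],
    "tenure": [0.125, 0.0, -0.125],
}
ranking = rank_by_average_influence(influences)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Note that positive and negative influences both count toward importance here, because the L1 norm takes absolute values before summing.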
Insight gained at this level of analysis involves the influence of input variables on outputs, measuring the model's behavior in terms of input value change, noise tolerance, data quality, internal structure, and more, as a means of uncovering abnormal or unexpected behavior.
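The Trends and Trend-line metrics listed earlier can be sketched in the same spirit. The example below substitutes an ordinary least-squares straight line for the best-fit spline to keep the code short, so it only approximates the idea. All function names are hypothetical, and returning the two average residuals side by side (rather than a single score) is an assumption, since the exact comparison is not specified here.

```python
def fit_line(xs, ys):
    """Least-squares straight line (a stand-in for the best-fit spline)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def trend_residual_means(xs, ys, slope, intercept, top_fraction=0.10):
    """Average residual of the farthest points versus that of the rest."""
    residuals = sorted(
        (abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)), reverse=True
    )
    k = max(1, int(len(residuals) * top_fraction))
    top, rest = residuals[:k], residuals[k:]
    rest_mean = sum(rest) / len(rest) if rest else 0.0
    return sum(top) / len(top), rest_mean


# Hypothetical feature values (xs) and influences (ys) with one irregular point.
xs = list(range(10))
ys = [float(x) for x in xs]
ys[5] = 20.0
slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
top_mean, rest_mean = trend_residual_means(xs, ys, slope, intercept)
```

A feature whose `top_mean` is much larger than its `rest_mean` has a few points far from the trend, while a high `r2` indicates that the trend explains most of the influence variance.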
To reveal additional feature detail, click a plot in the overview. This switches the view to the All features tab.
Pictured above, the left-side panel under the All features tab contains the current list of features for the selected Model and Split name. (Remember, you can change the model and/or split at any time.) The feature you clicked in Overview is highlighted.
Influence Sensitivity Plots
Here, the right-side panel shows greater detail for the selected feature in the form of an Influence Sensitivity Plot (ISP). An ISP shows the relationship between a feature’s value and its contribution to the model output (its influence). It is a point-level visualization that allows you to examine any feature in detail. To contextualize the ISP by revealing particularly sparse or dense regions, the plot is overlaid with the distribution of the feature values (shown at the top) and the distribution of influences (shown on the right).
When enabled within an ISP, you can also examine (click a link for additional detail):
- Best-fit spline – shows a polynomial trend line that is fit to the ISP to tease out a clear relationship between a feature's value and its contribution to the model score.
- Overfitting records – identifies influential points of the ISP occurring in low-density regions. Points that drive model scores despite occurring in low-density regions can indicate overfitting.
- Error rate – shows the model's error for bucketized ranges of the feature's value, which can help contextualize the ISP by sanity checking regions against the error rate graph. Error is measured using either mean absolute error or mean squared error, depending on your project settings.
- Ground truth rate – shows the ground truth rate of the given data split for bucketized ranges of the feature's value. This helps contextualize the ISP with a sanity check of feature behavior for regions associated with ground truth labels.
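To make the Error rate overlay more concrete, here is a minimal sketch of mean absolute error computed over bucketized ranges of a feature's value. The bucket edges, function name, and sample data are hypothetical, and the product's actual bucketing scheme may differ.

```python
def bucketized_mae(feature_values, y_true, y_pred, edges):
    """Mean absolute error per half-open bucket [edges[i], edges[i + 1])."""
    n_buckets = len(edges) - 1
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for x, truth, pred in zip(feature_values, y_true, y_pred):
        for i in range(n_buckets):
            if edges[i] <= x < edges[i + 1]:
                sums[i] += abs(truth - pred)
                counts[i] += 1
                break
    # Empty buckets have no defined error, so report None for them.
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical feature values, labels, and model scores.
feature_values = [1, 2, 11, 12, 25]
y_true = [1.0, 0.0, 1.0, 1.0, 0.0]
y_pred = [0.5, 0.5, 1.0, 0.5, 0.0]
errors = bucketized_mae(feature_values, y_true, y_pred, edges=[0, 10, 20, 30])
print(errors)
```

Swapping `abs(truth - pred)` for `(truth - pred) ** 2` would give the mean squared error variant.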
How is overfitting calculated?
The overfitting diagnostic identifies influential points which occur in low-density or sparse regions of data. Data points with feature values in regions containing less than 3% of the data population and with influences at or above the 95th percentile are classified as overfitting. This indicates that the feature value of the point in question is driving the model's prediction despite lying in a low-density region of data. If the model overfits on these low-density regions of data, it can fail to generalize, leading to poor test and production performance.
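The rule above can be sketched as follows. This is an illustration under stated assumptions, not the product's algorithm: "regions" are taken to be equal-width bins over the feature's range, the percentile is computed on absolute influences with the nearest-rank method, and all names and data are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def overfitting_points(feature_values, influences, n_bins=10,
                       max_density=0.03, influence_pct=95):
    """Indices of highly influential points that fall in sparse bins."""
    lo, hi = min(feature_values), max(feature_values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_of(x):
        return min(int((x - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for x in feature_values:
        counts[bin_of(x)] += 1

    total = len(feature_values)
    threshold = percentile([abs(v) for v in influences], influence_pct)
    return [
        i
        for i, (x, v) in enumerate(zip(feature_values, influences))
        if counts[bin_of(x)] / total < max_density and abs(v) >= threshold
    ]


# Hypothetical data: 98 points in a dense range, 2 points in a sparse region.
feature_values = [i % 50 for i in range(98)] + [100, 100]
influences = [0.01] * 98 + [0.9, 0.0]
flagged = overfitting_points(feature_values, influences)
print(flagged)
```

Only the sparse-region point with a large influence is flagged. The other sparse-region point has a negligible influence, and the dense-region points fail the density test.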
Click SHOW ALL FEATURES to return to the ISPs for all features. You can then sort by submetrics for the primary measurements reported in the Overview tab.
As shown next, the All features tab contains the complete list of features being modeled.
You can control the number of features displayed per page by setting a new value at the bottom of the left-side panel.
Use the toggle to switch between ISP and distribution graphs.
To sort the list:
- Click in the Sort by drop-down and select your criteria.
- Click the arrow on the right to toggle the sort direction.
To focus on a single feature:
- Click the desired feature in the left-side panel, or
- Begin typing in the Find features... search box until you see a match, then select a search result by clicking it.
The right-side panel is then refreshed with the corresponding information (defined above) for the feature you want to see.
Feature groups comprise individual features that are closely related or which capture similar information (e.g., "demographic features"). When ingested with the model (optional), these groups of features are displayed by group importance under the Feature Groups tab. Hence, if your model has features which are closely related or otherwise reflect shared characteristics, the computed influences in the final model outcome are scored for each feature group rather than for each individual feature. This is especially valuable when the selected split includes a high number of features.
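One plausible way to picture group-level scoring is shown below. The aggregation used here is an assumption: it sums the member features' influences for each point and then averages the absolute values. How groups are actually scored is not stated here, and the function and feature names are hypothetical.

```python
def group_importance(influences, groups):
    """Mean absolute per-point influence of each group (assumed aggregation)."""
    result = {}
    for group, members in groups.items():
        # Sum member influences point by point, then average the magnitudes.
        per_point = [sum(vals) for vals in zip(*(influences[m] for m in members))]
        result[group] = sum(abs(v) for v in per_point) / len(per_point)
    return result


# Hypothetical per-point influences and group definitions.
influences = {
    "age": [0.5, -0.5],
    "gender": [0.25, 0.25],
    "income": [1.0, -1.0],
}
groups = {"demographic": ["age", "gender"], "financial": ["income"]}
scores = group_importance(influences, groups)
print(scores)
```

Under this assumption, opposing influences within a group can cancel out for a given point, which is one reason a group score can differ from the sum of its members' individual scores.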
Filtering by Segments
Accessible under all three feature diagnostic tabs is a drop-down control called Filter by segments.
In order to apply segment filters, you must first create and define segments (see Managing Segments for guidance).
Assuming you've already defined segments for your model, click Filter by segments (located just below the Overview tab). Here, you'll find your defined Segment groups for the selected model. To reset your filtering criteria at any time, click reset.
Otherwise, take the following steps:
To filter by feature value:
- Click Feature and select one from the list.
- Click Filter and select the desired operator. Your choices are as follows:

| Operation | Operator |
| --- | --- |
| equal to | == |
| not equal to | != |
| less than | < |
| less than or equal to | <= |
| greater than | > |
| greater than or equal to | >= |

- Click Value and enter a number within the defined range.
- Click the search icon to see the results.
To filter by single segment or combined segments:
- Click a single segment and the current display will update accordingly.
- Click one or more additional segments to see the combined results (one OR the other OR both/all).
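The filtering logic described above can be mirrored in a few lines of Python. This is only a conceptual sketch with hypothetical names and data: it maps each operator in the list above to its comparison function, and treats combined segments as a logical OR, matching the "one OR the other OR both/all" behavior.

```python
import operator

# Comparison operators offered by the Filter control.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def filter_by_feature(rows, feature, op, value):
    """Keep rows whose feature value satisfies the chosen comparison."""
    compare = OPERATORS[op]
    return [row for row in rows if compare(row[feature], value)]


def combine_segments(rows, segments):
    """Keep rows that match at least one segment predicate (logical OR)."""
    return [row for row in rows if any(pred(row) for pred in segments)]


# Hypothetical records and segment definitions.
rows = [
    {"age": 25, "income": 40},
    {"age": 40, "income": 90},
    {"age": 67, "income": 30},
]
older = filter_by_feature(rows, "age", ">=", 40)
young_or_high_income = combine_segments(
    rows,
    [lambda r: r["age"] < 30, lambda r: r["income"] > 80],
)
```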