Data Ingestion

Ingesting data means importing large data files from multiple sources into a single, cloud-based storage medium — a data warehouse, data mart, or database — from which it can be accessed and analyzed by TruEra.

First things first

Remember to create a project before attempting to ingest data. Review the Quickstart for a brief overview of the ingestion process using sample data.

To ingest project data, TruEra supports the following methods:

In terms of task breakout, these comprise, at minimum:

Pre-deployment

Feature Development – transforming raw data into features that better represent the underlying problem, resulting in improved model accuracy on unseen data.
Model Training – fitting the best combination of weights and bias to minimize loss functions over the prediction range.
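As an illustration of what "fitting weights and bias to minimize a loss function" means, here is a minimal sketch in plain Python. It assumes a one-feature linear model and a mean squared error loss; the function name `fit_linear`, the learning rate, and the sample data are hypothetical and are not part of TruEra.

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on mean squared error.

    Hypothetical helper for illustration only; real training code would
    normally rely on an ML library rather than a hand-written loop.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Sample data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
print(f"weight={w:.4f} bias={b:.4f}")
```

With noise-free data like this, the fitted weight and bias should land very close to 2 and 1.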

Post-deployment

Logging Inputs and Predictions – classifying whether a particular log event, or set of events, is causing a real incident that requires attention.
Logging Additional Metrics – score tracking to determine real accuracy and improvement, such as F1, F2, and brier_loss; iteration-level metrics (learning curves); predictions after every epoch; and updated experiment metrics, among many others.
F1 – the harmonic mean of a model's precision and recall scores, giving equal weight to each.
F2 – a weighted harmonic mean of precision and recall (given a threshold value). Unlike the F1 score, the F2 score gives more weight to recall than to precision.
brier_loss – the mean squared difference between the predicted probability and the actual outcome.
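The F1, F2, and brier_loss metrics can be computed directly from labels and predicted probabilities. The following is a minimal sketch in plain Python, not TruEra's implementation; the function names, the 0.5 threshold, and the sample values are assumptions chosen for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def fbeta(y_true, y_pred, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta=1 gives F1 (equal weight); beta=2 gives F2 (recall weighted more).
    """
    precision, recall = precision_recall(y_true, y_pred)
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def brier_loss(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 1]
y_prob = [0.9, 0.4, 0.6, 0.2, 0.3, 0.8]
threshold = 0.5  # assumed decision threshold
y_pred = [1 if p >= threshold else 0 for p in y_prob]

f1 = fbeta(y_true, y_pred, beta=1.0)
f2 = fbeta(y_true, y_pred, beta=2.0)
brier = brier_loss(y_true, y_prob)
print(f"F1={f1:.3f} F2={f2:.3f} brier_loss={brier:.3f}")
```

In this sample, precision (1.0) is higher than recall (0.75), so F2 comes out lower than F1: the missed positive costs more when recall is weighted more heavily.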

Normalize – by applying a mathematical function to every data point to handle highly skewed data.
Standardize – by converting the data to a uniform format as a way of handling data with differing units (e.g., converting US Custom and British Imperial measurements to metric).
Encode – by converting categorical variables into numerical values so that they can be more easily fitted to the model.
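The three preparation steps above can be sketched in a few lines of plain Python. This is an illustrative example under stated assumptions: a log transform stands in for "a mathematical function" applied to skewed data, length units stand in for "differing units", and all function names and sample values are hypothetical.

```python
import math


def normalize_log(values):
    """Normalize: compress a right-skewed range with log(1 + x)."""
    return [math.log1p(v) for v in values]


# Conversion factors from common length units to meters.
TO_METERS = {"m": 1.0, "ft": 0.3048, "in": 0.0254, "yd": 0.9144}


def standardize_length(value, unit):
    """Standardize: convert a length in the given unit to meters."""
    return value * TO_METERS[unit]


def encode_labels(categories):
    """Encode: map each distinct category to an integer code.

    Codes follow sorted order, which does not imply a meaningful ranking.
    """
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping


incomes = [0, 20_000, 45_000, 3_000_000]  # highly skewed sample
print(normalize_log(incomes))

lengths = [(10, "ft"), (2, "m"), (36, "in")]
print([standardize_length(v, u) for v, u in lengths])

codes, mapping = encode_labels(["red", "green", "red", "blue"])
print(codes, mapping)
```

Integer codes are the simplest encoding, but some models read them as ordered values; one-hot encoding is a common alternative when the categories have no natural order.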

Most ML models require numeric input and output variables. Even so, the simpler the encoding the better; for example, encode by a shared or common characteristic where feasible and meaningful.
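One way to read "encoding by shared characteristic" is to group fine-grained categories into a smaller set before turning them into numbers. The sketch below assumes a hypothetical mapping from US state codes to regions and then one-hot encodes the regions; the mapping, names, and data are illustrative only.

```python
# Hypothetical grouping of state codes by a shared characteristic (region).
REGION_OF = {
    "CA": "west",
    "WA": "west",
    "NY": "east",
    "MA": "east",
    "TX": "south",
}


def one_hot(values):
    """Return (rows, columns): one 0/1 column per distinct value, sorted."""
    columns = sorted(set(values))
    rows = [[1 if v == c else 0 for c in columns] for v in values]
    return rows, columns


states = ["CA", "NY", "WA", "TX", "MA"]
regions = [REGION_OF[s] for s in states]  # 5 distinct states -> 3 regions
encoded, columns = one_hot(regions)

print(columns)
for state, row in zip(states, encoded):
    print(state, row)
```

Grouping first keeps the encoded width at three columns instead of five. Whether the grouping is meaningful depends on the problem, so treat the region mapping as a modeling decision rather than a default.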

Above all, have a plan for tracking and handling your model's results, both expected and unexpected, so you can keep refining your data and model over time.
