Ingesting Production Data

For most real-time monitoring use cases, a data pipeline delivers the data to be processed by your model. Data can be pushed in batch or pulled into the pipeline on a defined schedule.

Batch ingestion collects and transfers data to TruEra in batches, either at scheduled intervals or asynchronously on demand. This is especially useful when you need to model specific data points on a daily basis or when the ingested data can be assembled in microbatches. Historically, most near real-time monitoring has been done via microbatches.
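As a rough illustration of the microbatch idea, the Python sketch below groups a stream of records into small fixed-size batches and hands each batch to a push function. It is not TruEra's API: `microbatches`, `push_in_microbatches`, and the `push_batch` callable are hypothetical names, and `push_batch` stands in for whatever ingestion client call your pipeline actually uses.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def microbatches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull up to batch_size items; an empty list means the stream is done.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def push_in_microbatches(
    records: Iterable[T],
    batch_size: int,
    push_batch: Callable[[List[T]], None],
) -> int:
    """Send each microbatch through `push_batch` and return the batch count.

    `push_batch` is a hypothetical placeholder for a real ingestion call.
    """
    sent = 0
    for batch in microbatches(records, batch_size):
        push_batch(batch)
        sent += 1
    return sent
```

In a scheduled job, `records` would typically be the rows collected since the previous run, and a smaller `batch_size` trades throughput for lower latency.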

By contrast, a SageMaker Scheduled Pull assumes a trained model deployed at a SageMaker endpoint. Data Capture then writes both the incoming inference data and the model predictions to a specified S3 location. In this context, you can set up TruEra Monitoring for scheduled ingestion, which reads and parses the Data Capture logs at defined intervals. The parsed data is then ingested and appended to a data split, which downstream services can consume.
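To make the parsing step more concrete, here is a minimal Python sketch of reading one Data Capture log. It is not TruEra's implementation. It assumes the JSON Lines layout that SageMaker Data Capture is documented to write (a `captureData` object holding `endpointInput` and `endpointOutput`, plus `eventMetadata`), and it assumes CSV payloads. The function names and the output keys (`features`, `prediction`) are illustrative choices only.

```python
import base64
import csv
import io
import json
from typing import Dict, List


def _decode_csv_payload(part: Dict) -> List[str]:
    """Return the first CSV row of a captured payload (assumes CSV content)."""
    data = part.get("data", "")
    if part.get("encoding") == "BASE64":
        data = base64.b64decode(data).decode("utf-8")
    return next(csv.reader(io.StringIO(data)), [])


def parse_capture_line(line: str) -> Dict:
    """Turn one JSON line of a capture log into a flat record."""
    record = json.loads(line)
    capture = record["captureData"]
    metadata = record.get("eventMetadata", {})
    return {
        "event_id": metadata.get("eventId"),
        "inference_time": metadata.get("inferenceTime"),
        "features": _decode_csv_payload(capture["endpointInput"]),
        "prediction": _decode_csv_payload(capture["endpointOutput"]),
    }


def parse_capture_log(text: str) -> List[Dict]:
    """Parse every non-empty line of a capture log file's contents."""
    return [parse_capture_line(ln) for ln in text.splitlines() if ln.strip()]
```

The resulting records pair each input row with its prediction, which is the shape needed before rows can be appended to a data split. Fetching the log files from S3 and scheduling the job are left out of the sketch.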

Data for both types of production data ingestion — batch push or scheduled pull — come in essentially two flavors:

Structured (tabular) data: data held in a database or data warehouse, organized so that it can be easily searched, changed, and analyzed.
Unstructured data: typically rich media such as long-form text, audio, and video. It accounts for an estimated 80% of enterprise data and is often difficult to manage, store, and analyze because it has no predefined format or structure, which prevents the information from being organized automatically.

In short, TruEra monitoring dashboards can handle models of any type when monitoring focuses on model output, labels, custom metrics, and segment tags. Tracking data drift and data quality, however, is currently supported only for tabular inputs.

Click a link below to explore your options and determine what's suitable for your particular use case.
