Batch Ingestion Reliability

TruEra's batch ingestion APIs support idempotency, so you can retry an ingestion job without running it twice and duplicating your data. When executing a batch ingestion, supply an idempotency id to add_data() or add_production_data(). If a connection error occurs during the ingestion, retry the job with the same idempotency id. If that id has already been saved, TruEra fails the duplicate request rather than ingesting the data again. You can then query the job's status by its idempotency id to determine whether the first ingestion succeeded.

This is especially useful when TruEra batch ingestion jobs are integrated into automated model training and batch model inference pipelines. Users can build workflows that retry ingestions and check ingestion statuses, so entire pipelines can be re-run safely. It also protects users who accidentally run the same ingestion job twice.

Idempotency ids are scoped to projects. Ingestion jobs in two different TruEra projects can use the same idempotency id, but two ingestions associated with the same project cannot share one.
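
The scoping rule can be pictured with a toy in-memory model. This is only an illustration of the rule described above, not TruEra's implementation; the class, exception, and project names are hypothetical.

```python
class DuplicateIdempotencyId(Exception):
    """Hypothetical error: the id was already used within the same project."""


class ToyIdempotencyRegistry:
    """In-memory stand-in that tracks ids per project (illustration only)."""

    def __init__(self):
        self._seen = set()  # holds (project, idempotency_id) pairs

    def register(self, project, idempotency_id):
        key = (project, idempotency_id)
        if key in self._seen:
            raise DuplicateIdempotencyId(
                f"{idempotency_id!r} already used in project {project!r}"
            )
        self._seen.add(key)


registry = ToyIdempotencyRegistry()
registry.register("project_a", "job-001")
registry.register("project_b", "job-001")  # allowed: different project

try:
    registry.register("project_a", "job-001")  # same project, same id
    reused_in_same_project_ok = True
except DuplicateIdempotencyId:
    reused_in_same_project_ok = False
```

Because uniqueness is keyed on the (project, id) pair, the same id can appear once in each project but never twice in one.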

Choosing an idempotency ID

There's no one-size-fits-all approach to choosing an idempotency id, as it depends heavily on your integration's patterns. Customers have been successful using:

  • GUIDs for training/inference jobs
  • Start timestamps for training/inference jobs with a suffix representing a unique model id
  • Names of local or remote files being ingested
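
The three patterns above could be sketched as small helper functions. The function names, the timestamp format, and the example path are assumptions made for illustration; none of them are part of the TruEra SDK.

```python
import uuid
from datetime import datetime, timezone
from pathlib import PurePosixPath


def guid_id():
    """A random GUID. Generate it once per job and store it so retries reuse it."""
    return str(uuid.uuid4())


def timestamp_id(job_start, model_id):
    """Job start time (converted to UTC) with a model-id suffix.

    Pass the recorded start time of the job, not the current time,
    so that a retry produces the same id.
    """
    stamp = job_start.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}_{model_id}"


def file_id(path_or_uri):
    """The final path component of a local path or remote URI."""
    return PurePosixPath(path_or_uri).name


start = datetime(2024, 1, 15, 8, 30, 0, tzinfo=timezone.utc)
example_timestamp_id = timestamp_id(start, "model_v3")
example_file_id = file_id("s3://bucket/predictions/2024-01-15.parquet")
```

Whichever pattern you choose, the id has to be stable across retries of the same job and unique within the project. A bare file name, for example, only works if the same name is never ingested twice into that project.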

How to supply an idempotency ID and query job status

Supply an idempotency id:

# The idempotency id must be unique within the project.
tru.add_production_data(
    data=...,
    column_spec=...,
    idempotency_id="batch_inference_job_guid"
)

Retrieve an ingestion job's status by idempotency id:

status = tru.get_ingestion_operation_status(
    idempotency_id="batch_inference_job_guid"
)
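
The two calls above can be combined into a retry workflow. The following is a minimal sketch, not an official TruEra utility. It assumes that a lost connection surfaces as Python's built-in ConnectionError, that the rejected duplicate request surfaces as an exception whose type you pass in as duplicate_error, and that statuses are plain strings. The helper name, its parameters, and the in-memory fakes are all hypothetical.

```python
import time


def ingest_with_retry(ingest, get_status, idempotency_id,
                      duplicate_error=ValueError,
                      max_attempts=3, backoff_seconds=1.0):
    """Retry an ingestion with one idempotency id, then report its status.

    `ingest` and `get_status` are callables taking the idempotency id.
    `duplicate_error` is the (assumed) exception type raised when the id
    has already been saved.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            ingest(idempotency_id)
            return get_status(idempotency_id)
        except duplicate_error:
            # An earlier attempt already registered this id: do not resend,
            # just find out how that earlier attempt ended.
            return get_status(idempotency_id)
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # simple linear backoff


# Demo with in-memory fakes: the first request is saved by the "server",
# but the response is lost on the way back.
saved_ids = set()
calls = []


def fake_ingest(idempotency_id):
    calls.append(idempotency_id)
    if idempotency_id in saved_ids:
        raise ValueError("duplicate idempotency id")
    saved_ids.add(idempotency_id)
    if len(calls) == 1:
        raise ConnectionError("response lost after the job was saved")


def fake_get_status(idempotency_id):
    # Status strings here are placeholders, not TruEra's actual values.
    return "SUCCEEDED" if idempotency_id in saved_ids else "UNKNOWN"


result = ingest_with_retry(fake_ingest, fake_get_status,
                           "batch_inference_job_guid", backoff_seconds=0)
```

In a real pipeline, ingest would wrap tru.add_production_data(...) and get_status would wrap tru.get_ingestion_operation_status(...), both called with the same idempotency id. Check which exception types and status values your SDK version actually produces before relying on this pattern.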