
Reading Remote Files

TruEra supports ingesting records from remote CSV and Parquet files stored in object storage.

Data Ingestion Basics

TruEra's remote file ingestion relies on the user registering their file as a Table data source. Once registered, a table can be ingested with add_data() or add_production_data(), just like a dataframe.

The steps to ingest remote files into TruEra are:

  1. Configure the credentials needed, if any, to access the source (see add_credential()).
  2. Add the data source containing the Table object (see add_data_source()).
  3. Add the data from the table, either to create a data split or to ingest into a production data stream (see add_data() and add_production_data()).

AWS credentials are needed by TruEra when accessing AWS resources on your behalf. One option is to provide access keys to your IAM user, but this approach is not recommended. Instead, we recommend using IAM roles to manage TruEra's access to your AWS resources. Visit the AWS documentation to read more about delegating access across AWS accounts using IAM roles.

Using AWS IAM Roles in TruEra

To get started using IAM Roles in TruEra, we first need to create a credential in the system.

>>> credential = tru.add_credential(
...     name="iam_role_credential",
...     secret=None,
...     identity=None,
...     is_aws_iam_role=True
... )
INFO:truera.client.remote_truera_workspace:Principal AWS Account for trust policy: 123456789012.
>>> credential.id
'cc503ab0-ffa2-4fdc-bcd2-1afe62d749a7'

We need to note the Principal AWS Account (123456789012) as well as the credential id (cc503ab0-ffa2-4fdc-bcd2-1afe62d749a7), as these will be used in the trust policy of the IAM role. The credential id will be provided as the external id by TruEra when assuming the role in AWS, so it is recommended that you limit access based on external id.

Here is what an example trust policy would look like.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "123456789012"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "cc503ab0-ffa2-4fdc-bcd2-1afe62d749a7"
        }
      }
    }
  ]
}
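
If the trust policy is generated by a script rather than written by hand, it could be built from the two values noted above. The following is a minimal sketch using only the Python standard library; build_trust_policy is a hypothetical helper name and is not part of the TruEra client or of any AWS SDK.

```python
import json


def build_trust_policy(principal_account: str, external_id: str) -> dict:
    """Build a trust policy that lets principal_account assume the role,
    but only when it presents external_id (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_account},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


# Values taken from the add_credential() output shown earlier.
policy = build_trust_policy(
    principal_account="123456789012",
    external_id="cc503ab0-ffa2-4fdc-bcd2-1afe62d749a7",
)
print(json.dumps(policy, indent=2))
```

The printed JSON can then be pasted into the role's trust relationship in the AWS console or passed to whatever tooling you use to create the role.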

Assuming an IAM role has been created with the appropriate permissions and the trust policy above, we can now update the credential. Note the role's ARN, which will look like arn:aws:iam::123456789012:role/ExampleRole. We will need to update our credential's secret to be this ARN.

>>> credential = tru.get_ingestion_client().update_credential(
...     name="iam_role_credential",
...     secret="arn:aws:iam::123456789012:role/ExampleRole",
...     identity=None
... )
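
A mistyped ARN is an easy error to make in the call above, so it may be worth checking the string's shape before storing it as the secret. The sketch below is an illustration only: parse_role_arn and the regular expression are assumptions of this example, not part of the TruEra client, and the pattern checks format only, not whether the role exists or can be assumed.

```python
import re
from typing import Tuple

# Assumed shape: a known AWS partition, a 12-digit account ID, then an
# optional path and role name using the characters IAM permits.
ROLE_ARN_PATTERN = re.compile(
    r"arn:(?:aws|aws-cn|aws-us-gov):iam::(\d{12}):role/([\w+=,.@/-]+)",
    re.ASCII,
)


def parse_role_arn(arn: str) -> Tuple[str, str]:
    """Return (account_id, role_path_and_name), or raise ValueError
    if the string does not look like an IAM role ARN."""
    match = ROLE_ARN_PATTERN.fullmatch(arn)
    if match is None:
        raise ValueError(f"Not an IAM role ARN: {arn!r}")
    return match.group(1), match.group(2)


account_id, role_name = parse_role_arn(
    "arn:aws:iam::123456789012:role/ExampleRole"
)
print(account_id, role_name)
```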

Now that our credential is updated, we are ready to use it!

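from truera.client.ingestion import ColumnSpec

# Register the remote file as a Table data source, using the IAM role credential.
data_source = tru.add_data_source(
    "s3_data_source",
    uri="s3://bucket/path/to/data.csv",
    credential=credential
)

# Ingest the table as a new data split.
tru.add_data(
    data_source,
    data_split_name="split_1",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=["feature1", "feature2"]
    )
)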
