Reading Data from Data Stores
TruEra's Python SDK also supports upload and manipulation of data local to the caller's machine. This is useful for larger local inputs, as well as for learning the programming model without having any data in cloud storage.
Data Ingestion Basics
TruEra's remote data ingestion functionality provides a lightweight, interactive framework with automated tools supporting three primary operations — data sourcing, filtering, and output — each involving tables.
A table is a logical set of rows. A row is an ordered collection of fields. Tables can be of any size, comprise zero or more rows, and have a schema. They can come directly from a data store or be derived from another table. A table exists in one of three states — currently available, still processing, or failed. A table in a failed state throws an error. A table cannot be used by TruEra analytics — diagnostic, monitoring, and reports — until it is used to create a split.
Data read from a data source creates a table. A data source can be any supported input type. Adding a data source creates a new table referenced by ID or the name given to the data source. A table created from a data source becomes a root table, unmodified by a filter or any other operation.
When reading from a data store requires credentials, you can provide them to TruEra either when the read is initiated or by creating them ahead of time for that read. During TruEra platform setup, credential storage can be configured to use either vault or an encryption key.
A filter is applied to a table to return a new table limited to rows meeting the filtering condition. For example, filtering Table-A for rows where Column-1 < 5 returns Table-B containing all rows in A for which the filtering condition (Column-1 < 5) is true.
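The semantics of this example can be sketched in plain Python. This is an illustration only, not the TruEra API: the table is modeled as a list of row dictionaries, and the names filter_rows, table_a, and table_b are hypothetical.

```python
# Illustration only: a table modeled as a list of row dictionaries.
# This is not the TruEra API; every name here is hypothetical.

def filter_rows(table, column, threshold):
    """Return a new table holding the rows where row[column] < threshold."""
    return [row for row in table if row[column] < threshold]


table_a = [
    {"Column-1": 2, "Column-2": "x"},
    {"Column-1": 7, "Column-2": "y"},
    {"Column-1": 4, "Column-2": "z"},
]

# Table-B holds only the rows of Table-A for which Column-1 < 5.
table_b = filter_rows(table_a, "Column-1", 5)
print(table_b)
```

Note that the filter leaves Table-A unchanged and produces a separate table, which mirrors the description above of a filter returning a new table.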
TruEra-supported SQL expressions for filtering are listed in the following table.
| Operator | Comparison | Example Filter Expression |
|---|---|---|
| | operand EQUALS literal | |
| | operand NOT EQUAL to literal | |
| | operand LESS THAN literal | |
| | operand LESS THAN OR EQUAL to literal | |
| | operand GREATER THAN literal | |
| | operand GREATER THAN OR EQUAL to literal | |
| | operand in both expressions must be TRUE | |
| | operand in either or both expressions is TRUE | |
| | reverses logical value of combined expressions | |
A Boolean operator logically compares an operand to a literal producing a true/false result. The operand is always a column name (without quotation marks), is always on the left side of the expression, and represents the value of the table cell at coordinates row:col. A literal is a constant/unchanging value and is always on the right side of an expression. String literals must be enclosed in quotation marks ("). Do not use quotation marks for numeric literals.
For example, amount < salary is invalid and will throw an exception because both the left and right sides of the operator are column names.
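The rules for a single comparison (column name on the left, literal on the right, quoted strings, unquoted numbers) can be checked before submitting a filter. The following is a minimal sketch, not TruEra's own parser. It rests on assumptions that are not confirmed by the text above: the comparison symbols =, !=, <, <=, >, and >=, and column names made up of letters, digits, and underscores. The helper name is_simple_filter is hypothetical.

```python
import re

# Assumed comparison symbols; they are NOT confirmed by the documentation.
_COMPARISON = r"(<=|>=|!=|=|<|>)"
# Assumed shape of an unquoted column name.
_COLUMN = r"[A-Za-z_][A-Za-z0-9_]*"
# Numeric literals are unquoted; string literals are wrapped in double quotes.
_NUMBER = r"-?\d+(\.\d+)?"
_STRING = r'"[^"]*"'

_SIMPLE_FILTER = re.compile(
    rf"^\s*({_COLUMN})\s*{_COMPARISON}\s*({_NUMBER}|{_STRING})\s*$"
)


def is_simple_filter(expression: str) -> bool:
    """Return True when the expression is '<column> <operator> <literal>'."""
    return _SIMPLE_FILTER.match(expression) is not None


print(is_simple_filter("amount < 5"))        # column on the left, numeric literal
print(is_simple_filter('name = "Bob"'))      # quoted string literal
print(is_simple_filter("amount < salary"))   # right side is a column name
```

This sketch covers only one comparison at a time; expressions that combine comparisons with logical operators would need a real parser.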
Filtering returns an operation ID that you can use to query the status of the filtering operation; options are also available for handling this synchronously. Once a table is ready for use by TruEra, it can be used to create a split.
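Waiting for an operation to finish can be sketched as a polling loop over the three table states described earlier. This is a generic illustration, not the TruEra SDK: TableState, wait_until_available, and the get_state callable are hypothetical stand-ins for whatever status query the SDK or CLI provides.

```python
import time
from enum import Enum


class TableState(Enum):
    # Hypothetical names for the three states a table can be in.
    AVAILABLE = "available"
    PROCESSING = "processing"
    FAILED = "failed"


def wait_until_available(get_state, timeout_s=60.0, poll_interval_s=1.0):
    """Poll get_state() until the table is available.

    Raises RuntimeError if the table fails and TimeoutError if it is
    still processing once timeout_s seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state is TableState.AVAILABLE:
            return
        if state is TableState.FAILED:
            raise RuntimeError("table operation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("table still processing after timeout")
        time.sleep(poll_interval_s)


# Simulated status source standing in for a real status query.
_states = iter([TableState.PROCESSING, TableState.PROCESSING, TableState.AVAILABLE])
wait_until_available(lambda: next(_states), timeout_s=5.0, poll_interval_s=0.01)
print("table is available")
```

Passing the status query in as a callable keeps the loop independent of any particular client; in practice the callable would wrap the status lookup for the returned operation ID.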
To summarize, the best-practice flow for remote data ingestion entails:
- Creating a root table from your data source.
- Applying filters to the table and/or making queries about its rows/schema to formulate filters.
- Creating one or more splits from a filtered table.
The specific process steps for remote data ingestion vary between the CLI, Python SDK, and the TruEra Web App. Each is explained next.
Remote Ingestion from a Data Store using the Python SDK
The basic ingestion classes used to create and filter tables, then generate splits, comprise:
This class contains most of the functionality for creating tables and materializing splits from them. The ingestion client is returned from a call to get_ingestion_client() in the TruEra workspace. The complete list of functions can be found in the Python client reference. The ingestion client also supports creating and deleting credentials.
This class represents the table itself in TruEra. Table objects support calls to filtering, getting the schema, and fetching sample rows returned in a data frame. See the Python client reference for complete details on supported Table class functions.
This class is returned by a call to add_credential() and contains a Credential object containing an identity correlating which TruEra projects are permitted to use the credential to access the data source.
To ingest remote data:
Create an initial root table using TrueraWorkspace.add_data_source(). You can optionally provide a Credential object when necessary.
With the resulting Table object, you can call get_sample_rows().
Use the TrueraWorkspace.add_data() method to add a data split.
Remote Ingestion from a Data Store using the TruEra Web App
The Web App supports remote ingestion from three data source types — Microsoft Azure storage blob, Amazon S3 bucket, and MySQL database. File types can be either parquet or CSV.
To ingest remote data from the Web App:
Select Data Management from the Web App's left-side nav menu, then click NEW DATA SOURCE.
Click the Data source type drop-down and select the appropriate source.
Enter a Data source name, URI, and the necessary Credentials to access the source.
From the File type drop-down, select a file format, either parquet or csv.
If File type is csv, enter its Text extraction parameters.
Click NEW DATA SOURCE.
Click Next below to continue.