Reading Data from Data Stores
TruEra supports a range of data stores, from remote databases to data warehouses to a file or folder in a blob store. Currently, supported sources include:
TruEra's Python SDK also supports upload and manipulation of data local to the caller's machine. This is useful for larger local inputs, as well as for learning the programming model without having any data in cloud storage.
Data Ingestion Basics
TruEra's remote data ingestion functionality provides a lightweight, interactive framework with automated tools supporting three primary operations — data sourcing, filtering, and output — each involving tables.
Placed into context, the stages of remote data ingestion involve:
- Configuring the credentials needed, if any, to access the source (see add_credential()).
- Adding/identifying the data source containing the Table object (see add_data_source()).
- Adding the data in the table in order to create a data split (see add_data()).
Before going further, let's quickly review the basic constructs:
TABLES
A table is a logical set of rows. A row is an ordered collection of fields. Tables can be of any size, comprise zero or more rows, and have a schema. They can come directly from a data store or be derived from another table. A table exists in one of three states — currently available, still processing, or failed. A table in a failed state throws an error. A table cannot be used by TruEra analytics — diagnostic, monitoring, and reports — until it is used to create a split.
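The three table states can be pictured with a small in-memory model. This is only a sketch: `SketchTable`, `TableState`, and `get_rows()` are hypothetical names invented for illustration and are not part of the TruEra SDK.

```python
from dataclasses import dataclass, field
from enum import Enum


class TableState(Enum):
    """The three states described above (names are illustrative)."""
    AVAILABLE = "available"
    PROCESSING = "processing"
    FAILED = "failed"


@dataclass
class SketchTable:
    """Illustrative stand-in for a table; not the TruEra Table class."""
    schema: tuple                              # ordered field names
    rows: list = field(default_factory=list)   # zero or more rows
    state: TableState = TableState.PROCESSING
    error: str = ""

    def get_rows(self):
        # A failed table raises an error; a processing table is not ready yet.
        if self.state is TableState.FAILED:
            raise RuntimeError(f"table failed: {self.error}")
        if self.state is TableState.PROCESSING:
            raise RuntimeError("table is still processing")
        return list(self.rows)


ready = SketchTable(schema=("id", "score"), rows=[(1, 0.4), (2, 0.9)],
                    state=TableState.AVAILABLE)
broken = SketchTable(schema=("id", "score"), state=TableState.FAILED,
                     error="source unreachable")

print(ready.get_rows())
try:
    broken.get_rows()
except RuntimeError as exc:
    print(exc)
```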
DATA SOURCES
Data read from a data source creates a table. A data source can be any supported input type. Adding a data source creates a new table referenced by ID or the name given to the data source. A table created from a data source becomes a root table, unmodified by a filter or any other operation.
When reading from a data store requires credentials, you can either provide them to TruEra when the read is initiated or create them ahead of time for that read. During TruEra platform setup, credential storage can be configured to use either a vault or an encryption key.
FILTERS
A filter is applied to a table to return a new table limited to rows meeting the filtering condition. For example, filtering Table-A for rows where Column-1 < 5 returns Table-B containing all rows in A for which the filtering condition (Column-1 < 5) is true.
For the list of TruEra-supported SQL expressions for filtering, click here.
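The Table-A/Table-B example can be sketched in plain Python. This is an assumption-laden illustration: `filter_rows` is a hypothetical helper, and a Python predicate stands in for the SQL filter expression that TruEra actually expects.

```python
def filter_rows(table, predicate):
    """Return a new list holding only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


# Table-A: a root table represented as a list of rows (dicts keyed by column name).
table_a = [
    {"Column-1": 2, "Column-2": "x"},
    {"Column-1": 7, "Column-2": "y"},
    {"Column-1": 4, "Column-2": "z"},
]

# Table-B: every row of Table-A where Column-1 < 5.
table_b = filter_rows(table_a, lambda row: row["Column-1"] < 5)

print(table_b)
print(len(table_a))  # Table-A itself is left unchanged by the filter
```

The point of the sketch is that filtering produces a new table and leaves the source table intact, which is what allows several filtered tables to be derived from one root table.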
OUTPUT
A table ready for use by TruEra can be used to create a split. An operation ID is returned to query the status of the filtering operation. Options are also available for handling this synchronously.
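The operation-ID pattern described here amounts to "start, then poll until done". The sketch below simulates it locally; `start_operation`, `get_status`, and `wait_for` are hypothetical names, and the status strings are assumptions rather than values returned by TruEra.

```python
import itertools
import time

_status_sequences = {}
_ids = itertools.count(1)


def start_operation():
    """Hypothetical: begin an asynchronous operation and return its ID."""
    op_id = f"op-{next(_ids)}"
    # Simulate two 'processing' polls followed by completion.
    _status_sequences[op_id] = iter(["processing", "processing", "available"])
    return op_id


def get_status(op_id):
    """Hypothetical status lookup by operation ID."""
    return next(_status_sequences[op_id], "available")


def wait_for(op_id, poll_seconds=0.01, max_polls=10):
    """Poll until the operation leaves 'processing' (the synchronous option)."""
    for _ in range(max_polls):
        status = get_status(op_id)
        if status != "processing":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{op_id} still processing after {max_polls} polls")


operation_id = start_operation()
final_status = wait_for(operation_id)
print(operation_id, final_status)
```

Wrapping the polling loop in a helper such as `wait_for` is one way to get synchronous behavior on top of an asynchronous operation; the bounded `max_polls` keeps the caller from blocking indefinitely.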
To summarize, the best-practice flow for remote data ingestion entails:
- Creating a root table from your data source.
- Applying filters to the table and/or making queries about its rows/schema to formulate filters.
- Creating one or more splits from a filtered table.
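The flow above can be walked through with plain Python data structures. Everything here is illustrative: the column names, the year-based filters, and the `splits` dictionary are assumptions, not TruEra constructs.

```python
# 1. Root table: rows exactly as read from the data source.
root_table = [
    {"id": 1, "year": 2020, "label": 0},
    {"id": 2, "year": 2021, "label": 1},
    {"id": 3, "year": 2022, "label": 0},
    {"id": 4, "year": 2022, "label": 1},
]

# 2. Inspect the schema and a few sample rows to decide how to filter.
schema = list(root_table[0].keys())
sample = root_table[:2]
print(schema, sample)

# 3. Derive filtered tables; the root table is not modified.
train_table = [row for row in root_table if row["year"] < 2022]
test_table = [row for row in root_table if row["year"] >= 2022]

# 4. Create one split per filtered table.
splits = {"train": train_table, "test": test_table}
print({name: len(rows) for name, rows in splits.items()})
```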
The specific process steps for remote data ingestion vary between the CLI, Python SDK, and the TruEra Web App. Each is explained next.
Remote Ingestion from a Data Store using the Python SDK
The basic ingestion classes used to create and filter tables, then generate splits, comprise:
INGESTION CLIENT
This class contains most of the functionality for creating tables and materializing splits from them. The ingestion client is returned from a call to get_ingestion_client() in the TruEra workspace. The complete list of functions can be found in the Python client reference. The ingestion client also supports creating and deleting credentials.
TABLE
This class represents the table itself in TruEra. Table objects support calls to filtering, getting the schema, and fetching sample rows returned in a data frame. See the Python client reference for complete details on supported Table class functions.
CREDENTIAL
A Credential object is returned by a call to add_credential(). It contains a name and an identity, which together determine which TruEra projects are permitted to use the credential to access the data source.
To ingest remote data:
- Create an initial root table using TrueraWorkspace.add_data_source(). You can optionally provide a Credential object when necessary.
- With the resulting Table object, you can filter() and get_sample_rows().
- Use the TrueraWorkspace.add_data() method to add a data split.