
NLP Setup, Training, Ingestion, and Analysis via TensorFlow 2

This notebook covers how to use TruEra's Python SDK to develop NLP models with the TensorFlow 2 framework.

First, a sample TensorFlow 2 BERT model is trained on the COVID-19 Tweets sample dataset. The SDK then computes explanations for each record. Finally, the model, data, and explanations are ingested into a TruEra workspace, enabling analysis via TruEra's notebook widgets and web interface.

Each step in these instructions easily adapts to your own model or dataset.

Step 1. General Setup

Setup entails installing the required dependencies, installing the TruEra Python SDK (if not already installed), and setting custom size limits.

A. Install Dependencies

Install the following dependencies to your Python environment:

# Install visualization dependencies
pip3 install "plotly>5.5.0" "ipywidgets>7.7.0" "notebook>6.4.0"

You may need to restart your notebook after installing these dependencies.

B. Install the TruEra Python SDK and Download Sample Data

First, install the TruEra Python SDK from the command line: pip3 install "truera[nlp]"

Don't have access to pip?

Download and install the TruEra Python package from app.truera.net (see Download the TruEra SDK).

Next, download the NLP Quickstart Data in TAR or ZIP from the Account Settings > Resources page (see Downloading Samples).

Now extract the NLP Quickstart Data using either tar -xvf nlp_quickstart_data.tgz or unzip nlp_quickstart_data.zip.

Then install the wheel in your Python environment using pip install 'truera-*.whl[nlp]'.

Careful

This will only work from the directory in which the .whl file is located.

C. Customizations

Although a GPU is recommended whenever possible for NLP models, this demo uses small samples running on a CPU.

# Using small size limits for demo purposes only
demo_train_records = 100
demo_test_records = 100
truera_batch_size = 16
demo_influences = 4
train_epochs = 1

Step 2. Train your NLP Model

The example training code that follows is strictly for demonstration purposes. Adapt or replace it with your own model and code when ready.

A. Import Dependencies

import os
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

B. Prep the Data

QUICKSTART_DATA_DIRECTORY = "<QUICKSTART_DATA_DIRECTORY>"
BATCH_SIZE = 32
train_df_path = os.path.join(QUICKSTART_DATA_DIRECTORY, 'covid_tweets', 'train', 'dataset.csv')
test_df_path = os.path.join(QUICKSTART_DATA_DIRECTORY, 'covid_tweets', 'val_test', 'dataset.csv')

train_data = pd.read_csv(train_df_path, encoding='latin-1')
test_data = pd.read_csv(test_df_path, encoding='latin-1')

train_data.head()

def clean_data(df: pd.DataFrame, stopwords: set[str], n_records: int = None):
    """Clean the data by removing stopwords and converting sentiment labels to binary class IDs."""

    # Apply record limit
    if n_records is not None:
        df = df[:n_records].copy()

    # Remove rows with missing values
    df.dropna(subset=['OriginalTweet', 'Sentiment'], inplace=True)

    # Remove stopwords
    stopwords_check = lambda word: word.lower() not in stopwords
    text_preprocess_fn = lambda text: " ".join(filter(stopwords_check, str(text).split()))
    df['OriginalTweet'] = df['OriginalTweet'].apply(text_preprocess_fn)

    # Convert sentiment labels to binary class IDs
    sentiment_to_class_id = {
        'Extremely Negative': 0,
        'Negative': 0,
        'Neutral': 0,
        'Positive': 1,
        'Extremely Positive': 1
    }
    process_labels_fn = lambda sentiments: [sentiment_to_class_id[sentiment] for sentiment in sentiments]
    df['Sentiment'] = process_labels_fn(df['Sentiment'])
    return df

# Preprocess data
sw_nltk = set(stopwords.words('english'))
train_data = clean_data(train_data, sw_nltk, demo_train_records)
test_data = clean_data(test_data, sw_nltk, demo_test_records)
train_df = train_data.sample(frac=0.7, random_state=0)  # random state is a seed value
val_df = train_data.drop(train_df.index)
test_df = test_data

# Convert text to numpy arrays
train_texts = np.array(train_df['OriginalTweet'])
val_texts = np.array(val_df['OriginalTweet'])
test_texts = np.array(test_df['OriginalTweet'])

# Convert labels to numpy arrays
train_labels = np.array(train_df['Sentiment'])
val_labels = np.array(val_df['Sentiment'])
test_labels = np.array(test_df['Sentiment'])

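As a quick sanity check before training, you can print the split sizes and label balance (a minimal sketch using the arrays defined above):

# Sanity check: split sizes and label balance
print(f"train: {len(train_texts)}  val: {len(val_texts)}  test: {len(test_texts)}")
print(f"positive fraction (train): {train_labels.mean():.2f}")
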
C. Create a Tokenizer and a Model

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers operations for NLP models on TensorFlow Hub
EMBEDDING_DIM = 256
SEQ_LENGTH = 128
MODEL_NAME = "covid_tweets_sentiment"

tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/2'

def get_tokenizer():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessor = hub.load(tfhub_handle_preprocess)

    tokenize = hub.KerasLayer(preprocessor.tokenize)
    tokenized_input = tokenize(text_input)

    bert_pack_inputs = hub.KerasLayer(
        preprocessor.bert_pack_inputs,
        arguments=dict(seq_length=SEQ_LENGTH))  # Optional argument.
    encoder_inputs = bert_pack_inputs([tokenized_input])
    return tf.keras.Model(text_input, encoder_inputs)

tokenizer = get_tokenizer()

def bert_text_classifier():

    # - text input -
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')

    # - encoding -
    encoder_inputs = tokenizer(text_input)

    # - bert encoder -
    bert_encoder = hub.KerasLayer(
        tfhub_handle_encoder, trainable=True, name='BERT_encoder'
    )

    # - bert output -
    outputs = bert_encoder(encoder_inputs)

    # - classifier layer -
    net = outputs['sequence_output']
    conv_1 = tf.keras.layers.Conv1D(
        16, 3, activation='relu', input_shape=(SEQ_LENGTH, EMBEDDING_DIM)
    )(net)
    max_pool = tf.keras.layers.MaxPool1D(pool_size=2)(conv_1)
    flatten = tf.keras.layers.Flatten()(max_pool)
    lin1 = tf.keras.layers.Dense(128, activation="relu",
                                    name='lin1')(flatten)
    lin2 = tf.keras.layers.Dense(1, activation=None,
                                    name='lin2')(lin1)
    sigmoid = tf.keras.activations.sigmoid(lin2)

    model = tf.keras.Model(text_input, sigmoid)
    embedder = tf.keras.Model(text_input, lin1)
    return model, embedder

model, embedder = bert_text_classifier()
model.summary()

D. Define a Sentence Embedding Function (optional)

from typing import Callable, Iterable
def get_embeddings(embedder: Callable, texts: Iterable[str]):
    embedding_batches = []
    for start_idx in range(0, len(texts), BATCH_SIZE):
        batch_texts = texts[start_idx:start_idx+BATCH_SIZE]
        embedding_batches.append(embedder(batch_texts).numpy().tolist())
    return np.concatenate(embedding_batches, axis=0).tolist()

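For example, a short usage sketch that embeds a few validation tweets (each row is the 128-dimensional output of the lin1 layer defined above):

# Embed a handful of validation tweets and check the output shape
sample_embeddings = get_embeddings(embedder, val_texts[:4])
print(len(sample_embeddings), len(sample_embeddings[0]))
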
E. Train

from official.nlp import optimization
LEARNING_RATE = 1e-05
# -- Optimizer --
# Use the same optimizer BERT was originally trained with: AdamW ("Adaptive Moments" with weight decay).
train_data_size = len(train_texts)
steps_per_epoch = int(train_data_size / BATCH_SIZE)
num_train_steps = steps_per_epoch * train_epochs
num_warmup_steps = int(0.1 * num_train_steps)

optimizer = optimization.create_optimizer(
    init_lr=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    optimizer_type='adamw'
)

# -- compile the model --
loss = tf.keras.losses.BinaryCrossentropy(
    from_logits=False,
    name='binary_crossentropy'
)

model.compile(
    loss=loss, optimizer=optimizer, metrics=['accuracy']
)
history = model.fit(
    x=train_texts,
    y=train_labels,
    validation_data=(val_texts, val_labels),
    epochs=train_epochs,
    validation_steps=2,
    verbose=1,
    batch_size=BATCH_SIZE
)
for key in history.history:
    print(f"{key}: {history.history[key][-1]}")

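Optionally, evaluate the trained model on the held-out test split. This is a minimal sketch; metrics will be noisy at these demo sample sizes.

# Evaluate on the test split (loss and accuracy, per the compiled metrics)
test_loss, test_accuracy = model.evaluate(test_texts, test_labels, batch_size=BATCH_SIZE, verbose=0)
print(f"test loss: {test_loss:.4f}, test accuracy: {test_accuracy:.4f}")
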
Step 3. Generate Explanations with the TruEra Explainer

The Explainer object computes feature influences from your dataset, model, and tokenizer.

After initializing, call the following methods to ingest the relevant data into the explainer:

  • NLPExplainer.set_model() – provide the model and tokenizer objects. TensorFlow, PyTorch, and Hugging Face models are supported.
  • NLPExplainer.set_data() – provide the text dataset (iterable of strings), labels (iterable of integers), and a unique data_split_name (str).
  • NLPExplainer.config() – set parameters that configure influence computation.

The Explainer will attempt to infer other attributes from the user-supplied arguments, but may request additional arguments depending on what is inferable.

NLPExplainer.explain() computes feature influences.

from uuid import uuid4
from truera.client.nn.explain import NLPExplainer

e = NLPExplainer()
e.set_model(
    model, 
    tokenizer=tokenizer,
    token_embeddings_layer='BERT_encoder/bert_encoder/word_embeddings',
    model_name=MODEL_NAME
)

e.set_data(    
    val_texts, 
    labels=val_labels, 
    data_split_name=f"covid_{uuid4()}"
)
e.config(n_metrics_records=len(val_texts))
df = e.explain(start=0, stop=len(val_texts), token_type="token")
df.head()

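As an optional sketch, you can inspect the token-level influences for a single record. This assumes the explanation dataframe exposes the tokens and influences columns referenced by the NLPColumnSpec in Step 4.

# Print each token of the first record alongside its influence score
record = df.iloc[0]
for token, influence in zip(record["tokens"], record["influences"]):
    print(f"{token}\t{influence:.4f}")
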
Step 4. Setup a TruEra Project

A. Connect to the TruEra Endpoint

  • Provide your TruEra deployment URL (http://app.truera.net).
  • Provide your generated token.
  • TruEra Workspace creation will verify your connectivity to TruEra services.
TRUERA_URL = "<TRUERA_URL>"
TOKEN = "<TRUERA_TOKEN>"
from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

# Replace with your URL and credentials
auth = TokenAuthentication(TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

B. Create a TruEra Project

Add an NLP Project to TruEra with a data collection and a model.

PROJECT_NAME = "COVID Tweets Sentiment Classification"
DATA_COLLECTION_NAME = "data_collection"
SPLIT_NAME = "test_nlp_split_tf"
# Setup remote environment
tru.set_model_execution("remote")
tru.add_project(PROJECT_NAME, score_type="probits", input_type="text")
tru.add_data_collection(DATA_COLLECTION_NAME)
tru.add_model(MODEL_NAME)
tru

C. Prepare Split Data

Add sentence embeddings to the dataframe:

df['embeddings'] = get_embeddings(embedder, val_texts)
df.head()

Specify an NLPColumnSpec to map the columns used by TruEra to the columns of your ingested dataframe:

from truera.client.ingestion import NLPColumnSpec

column_spec = NLPColumnSpec(
    id_col_name="original_index",
    text_col_name="text",
    prediction_col_name="preds",
    label_col_name="labels",
    token_influence_col_name="influences",
    tokens_col_name="tokens",
    sentence_embeddings_col_name="embeddings"
)

Specify ModelOutputContext to store metadata about the model and, if feature influences are being ingested, the influence type:

from truera.client.ingestion import ModelOutputContext

model_output_context = ModelOutputContext(
    model_name=MODEL_NAME,
    score_type="probits",
    influence_type="integrated-gradients"
)

D. Ingest via tru.add_data()

tru.add_data(
    data=df,
    data_split_name=SPLIT_NAME,
    column_spec=column_spec,
    model_output_context=model_output_context,
)

Step 5. Run the Analysis

The performance metrics available for the project's score type can be listed with list_performance_metrics(). Pass a specific metric to compute_performance(); if nothing is passed, all metrics are computed.

tru.get_explainer().list_performance_metrics()
tru.get_explainer().compute_performance()

If the following widgets do not load, check the required dependencies.

  • plotly ≥ 5.5
  • ipywidgets ≥ 7.7
  • notebook ≥ 6.4

Important

Having these dependencies installed in the kernel alone is insufficient. Make sure the Jupyter server's environment also has them installed.

The first time the server is started with these dependencies installed, you may need to close and reopen this notebook for the changes to take effect.

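To confirm the installed versions from within the notebook, a quick sketch:

# Print the installed versions of the widget dependencies
import importlib.metadata as importlib_metadata

for package in ("plotly", "ipywidgets", "notebook"):
    print(package, importlib_metadata.version(package))
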
For Global Explanations

tru.get_explainer().global_token_summary()

For Record Explanations

tru.get_explainer().record_explanations_attribution_tab()