Text Classification

A Simple Example

The first step is to prepare your data. Here we use the IMDB dataset as an example.

import numpy as np
from tensorflow.keras.datasets import imdb

# Load the integer sequences from the IMDB dataset with Keras.
index_offset = 3  # word index offset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000,
                                                      index_from=index_offset)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
# Prepare the dictionary of index to word.
word_to_id = imdb.get_word_index()
word_to_id = {k: (v + index_offset) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value: key for key, value in word_to_id.items()}
# Convert the word indices to words.
x_train = list(map(lambda sentence: ' '.join(
    id_to_word[i] for i in sentence), x_train))
x_test = list(map(lambda sentence: ' '.join(
    id_to_word[i] for i in sentence), x_test))
x_train = np.array(x_train, dtype=str)
x_test = np.array(x_test, dtype=str)
print(x_train.shape)  # (25000,)
print(y_train.shape)  # (25000, 1)
print(x_train[0][:50])  # <START> this film was just brilliant casting <UNK>

The second step is to run the TextClassifier.

import autokeras as ak

# Initialize the text classifier.
clf = ak.TextClassifier(max_trials=10)  # It tries 10 different models.
# Feed the text classifier with training data.
clf.fit(x_train, y_train)
# Predict with the best model.
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))
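
When the search finishes, you may want to keep the best pipeline as a plain Keras model. The snippet below is a minimal sketch: export_model is available in recent AutoKeras versions, and the save paths are just examples.

# Export the best model found during the search as a Keras model.
model = clf.export_model()
model.summary()
# Save the exported model (the paths below are just examples).
try:
    model.save('model_autokeras', save_format='tf')
except Exception:
    model.save('model_autokeras.h5')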

Validation Data

By default, AutoKeras uses the last 20% of the training data as validation data. As shown in the example below, you can use validation_split to specify the percentage.

clf.fit(x_train,
        y_train,
        # Split the training data and use the last 15% as validation data.
        validation_split=0.15)

You can also use your own validation set instead of splitting it from the training data, by passing it as validation_data.

split = 5000
x_val = x_train[split:]
y_val = y_train[split:]
x_train = x_train[:split]
y_train = y_train[:split]
clf.fit(x_train,
        y_train,
        # Use your own validation set.
        validation_data=(x_val, y_val))

Customized Search Space

For advanced users, you may customize your search space by using AutoModel instead of TextClassifier. You can configure the TextBlock for some high-level configurations, e.g., vectorizer for the type of text vectorization method to use. You can use 'sequence', which uses TextToIntSequence to convert the words to integers and EmbeddingBlock to embed the integer sequences, or you can use 'ngram', which uses TextToNgramVector to vectorize the sentences. You can also leave these arguments unspecified, which lets the different choices be tuned automatically. See the following example for details.

import autokeras as ak

input_node = ak.TextInput()
output_node = ak.TextBlock(vectorizer='ngram')(input_node)
output_node = ak.ClassificationHead()(output_node)
clf = ak.AutoModel(inputs=input_node, outputs=output_node, max_trials=10)
clf.fit(x_train, y_train)
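
If you prefer to let the tuner decide between the 'sequence' and 'ngram' paths, you can simply leave vectorizer unspecified, as in this minimal sketch.

import autokeras as ak

input_node = ak.TextInput()
# Leave vectorizer unspecified so the tuner chooses between
# 'sequence' and 'ngram' automatically.
output_node = ak.TextBlock()(input_node)
output_node = ak.ClassificationHead()(output_node)
clf = ak.AutoModel(inputs=input_node, outputs=output_node, max_trials=10)
clf.fit(x_train, y_train)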

The usage of AutoModel is similar to the functional API of Keras. Basically, you are building a graph whose edges are blocks and whose nodes are the intermediate outputs of blocks. To add an edge from input_node to output_node, use output_node = ak.[some_block]([block_args])(input_node).

You can also use more fine-grained blocks to customize the search space even further. See the following example.

import autokeras as ak

input_node = ak.TextInput()
output_node = ak.TextToIntSequence()(input_node)
output_node = ak.EmbeddingBlock()(output_node)
# Use separable Conv layers in Keras.
output_node = ak.ConvBlock(separable=True)(output_node)
output_node = ak.ClassificationHead()(output_node)
clf = ak.AutoModel(inputs=input_node, outputs=output_node, max_trials=10)
clf.fit(x_train, y_train)
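
Similarly, you can assemble the 'ngram' path from fine-grained blocks. The sketch below assumes DenseBlock is available in your AutoKeras version for searching over fully connected layers.

import autokeras as ak

input_node = ak.TextInput()
# Vectorize the sentences with n-gram counts instead of integer sequences.
output_node = ak.TextToNgramVector()(input_node)
# DenseBlock is an assumption: check that your AutoKeras version provides it.
output_node = ak.DenseBlock()(output_node)
output_node = ak.ClassificationHead()(output_node)
clf = ak.AutoModel(inputs=input_node, outputs=output_node, max_trials=10)
clf.fit(x_train, y_train)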

Data Format

The AutoKeras TextClassifier is quite flexible about the data format.

For the text, the input data should be one-dimensional. For the classification labels, AutoKeras accepts both plain labels, i.e., strings or integers, and one-hot encoded labels, i.e., vectors of 0s and 1s.
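
For example, both of the following label formats are acceptable. This is a minimal sketch using to_categorical from Keras to illustrate the one-hot encoding.

import numpy as np
from tensorflow.keras.utils import to_categorical

# Plain labels: one integer (or string) per sample.
plain_labels = np.array([0, 1, 1, 0])
# One-hot encoded labels: one row of 0s and 1s per sample.
one_hot_labels = to_categorical(plain_labels, num_classes=2)
print(one_hot_labels[:2])  # [[1. 0.] [0. 1.]]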

We also support using the tf.data.Dataset format for the training data. For multi-class classification, the labels have to be one-hot encoded before being wrapped into a tensorflow Dataset. Since IMDB is a binary classification dataset, the labels should not be one-hot encoded.

import tensorflow as tf
train_set = tf.data.Dataset.from_tensor_slices(((x_train, ), (y_train, )))
test_set = tf.data.Dataset.from_tensor_slices(((x_test, ), (y_test, )))

clf = ak.TextClassifier(max_trials=10)
# Feed the tensorflow Dataset to the classifier.
clf.fit(train_set)
# Predict with the best model.
predicted_y = clf.predict(test_set)
# Evaluate the best model with testing data.
print(clf.evaluate(test_set))

Reference

TextClassifier, AutoModel, TextBlock, TextToIntSequence, EmbeddingBlock, TextToNgramVector, ConvBlock, TextInput, ClassificationHead.