Text Classification
!pip install autokeras
import os
import keras
import numpy as np
import tensorflow as tf
from sklearn.datasets import load_files
import autokeras as ak
A Simple Example
The first step is to prepare your data. Here we use the IMDB dataset as an example.
dataset = keras.utils.get_file(
fname="aclImdb.tar.gz",
origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
extract=True,
)
# set path to dataset
IMDB_DATADIR = os.path.join(os.path.dirname(dataset), "aclImdb")
classes = ["pos", "neg"]
train_data = load_files(
os.path.join(IMDB_DATADIR, "train"), shuffle=True, categories=classes
)
test_data = load_files(
os.path.join(IMDB_DATADIR, "test"), shuffle=False, categories=classes
)
x_train = np.array(train_data.data)[:100]
y_train = np.array(train_data.target)[:100]
x_test = np.array(test_data.data)[:100]
y_test = np.array(test_data.target)[:100]
print(x_train.shape)  # (100,)
print(y_train.shape)  # (100,)
print(x_train[0][:50])  # first 50 bytes of the first review
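When no encoding is passed, scikit-learn's load_files returns each document as raw bytes, so the entries of x_train above are bytes objects rather than str. If plain strings are more convenient for inspection or preprocessing, decoding is one option. The snippet below is a minimal sketch of that step; the sample reviews are made up for illustration (they are not taken from IMDB), and UTF-8 is an assumption about the file encoding.

```python
import numpy as np

# Hypothetical stand-ins for entries of train_data.data (bytes, as returned
# by load_files when no encoding is given). These are not real IMDB reviews.
raw_reviews = [b"A surprisingly good film.", b"Two hours I will not get back."]

# Decode to str, assuming the files are UTF-8. errors="replace" substitutes a
# placeholder character instead of raising on bytes that are not valid UTF-8.
decoded = np.array([r.decode("utf-8", errors="replace") for r in raw_reviews])

print(decoded.shape)  # (2,)
print(decoded[0][:10])
```

The same list comprehension could be applied to train_data.data before the slicing step above, should decoded strings be wanted.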
The second step is to run the TextClassifier. As a quick demo, we set epochs to 1. You can also leave epochs unspecified for an adaptive number of epochs.
# Initialize the text classifier.
clf = ak.TextClassifier(
overwrite=True, max_trials=1
) # It only tries 1 model as a quick demo.
# Feed the text classifier with training data.
clf.fit(x_train, y_train, epochs=1, batch_size=2)
# Predict with the best model.
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))
Validation Data
By default, AutoKeras uses the last 20% of the training data as validation data. As
shown in the example below, you can use validation_split to specify the percentage.
clf.fit(
x_train,
y_train,
# Split the training data and use the last 15% as validation data.
validation_split=0.15,
epochs=1,
batch_size=2,
)
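To make the phrase "the last 15%" concrete, the sketch below reproduces a tail split on toy arrays with plain NumPy. It only illustrates the idea; the helper name split_last_fraction is hypothetical, and the rounding of the cut point is an assumption that may differ from what AutoKeras does internally.

```python
import numpy as np


def split_last_fraction(x, y, validation_split):
    """Hold out the trailing fraction of (x, y) as validation data."""
    # Assumption: the validation size is rounded to the nearest integer.
    n_val = int(round(len(x) * validation_split))
    n_train = len(x) - n_val
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])


# Toy data standing in for 100 samples and their binary labels.
x = np.arange(100)
y = np.arange(100) % 2

(train_x, train_y), (val_x, val_y) = split_last_fraction(x, y, 0.15)
print(len(train_x), len(val_x))  # 85 15
```

Because the split takes the tail without reshuffling, the order of the training data matters: if the samples were sorted by class, the validation part could end up containing a single class.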
You can also use your own validation set instead of splitting it from the
training data by passing validation_data.
split = 5
x_val = x_train[split:]
y_val = y_train[split:]
x_train = x_train[:split]
y_train = y_train[:split]
clf.fit(
x_train,
y_train,
epochs=1,
# Use your own validation set.
validation_data=(x_val, y_val),
batch_size=2,
)
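The slice above takes the validation set from a fixed position in the array. If a randomly drawn hold-out set is preferred, one possible approach is to permute the indices first. This is a sketch on toy data, not part of AutoKeras; the helper shuffled_holdout and its parameters are hypothetical names introduced here.

```python
import numpy as np


def shuffled_holdout(x, y, val_fraction=0.2, seed=0):
    """Randomly split (x, y) into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))  # random ordering of sample indices
    n_val = int(round(len(x) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])


# Toy texts and labels standing in for the IMDB arrays.
texts = np.array([f"review {i}" for i in range(10)])
labels = np.arange(10) % 2

(x_tr, y_tr), (x_va, y_va) = shuffled_holdout(texts, labels, val_fraction=0.3)
print(len(x_tr), len(x_va))  # 7 3
```

The resulting pair (x_va, y_va) has the same form as the tuple passed to validation_data in the example above.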
Customized Search Space
For advanced users, you may customize your search space by using AutoModel instead of TextClassifier. You can configure the TextBlock for some high-level configurations. You can also leave these arguments unspecified, in which case the different choices are tuned automatically. See the following example for details.
input_node = ak.TextInput()
output_node = ak.TextBlock()(input_node)
output_node = ak.ClassificationHead()(output_node)
clf = ak.AutoModel(
inputs=input_node, outputs=output_node, overwrite=True, max_trials=1
)
clf.fit(x_train, y_train, epochs=1, batch_size=2)
Data Format
The AutoKeras TextClassifier is quite flexible about the data format.
For the text, the input data should be one-dimensional. For the classification labels, AutoKeras accepts both plain labels, i.e. strings or integers, and one-hot encoded labels, i.e. vectors of 0s and 1s.
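The label forms described above can be illustrated with a few toy arrays. The sketch below uses only NumPy; the sample texts and the "neg"/"pos" label names are made up for illustration, and the mapping of 0 to "neg" and 1 to "pos" is an assumption of this example.

```python
import numpy as np

# One-dimensional text input: one string per sample.
texts = np.array(["great film", "dull plot", "loved it", "not for me"])
print(texts.shape)  # (4,)

# Plain labels, as integers or as strings.
int_labels = np.array([1, 0, 1, 0])
str_labels = np.array(["pos", "neg", "pos", "neg"])

# One-hot encoded labels: row i has a 1 in the column of sample i's class.
one_hot_labels = np.eye(2, dtype=int)[int_labels]
print(one_hot_labels)
# [[0 1]
#  [1 0]
#  [0 1]
#  [1 0]]
```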
AutoKeras also supports the tf.data.Dataset format for the training data.
train_set = tf.data.Dataset.from_tensor_slices(((x_train,), (y_train,))).batch(
2
)
test_set = tf.data.Dataset.from_tensor_slices(((x_test,), (y_test,))).batch(2)
clf = ak.TextClassifier(overwrite=True, max_trials=1)
# Feed the tensorflow Dataset to the classifier.
clf.fit(train_set.take(2), epochs=1)
# Predict with the best model.
predicted_y = clf.predict(test_set.take(2))
# Evaluate the best model with testing data.
print(clf.evaluate(test_set.take(2)))
Reference
TextClassifier, AutoModel, TextBlock, TextInput, ClassificationHead.