Binary Classifier for Spam Detection
Overview
The goal of this tutorial is to use the SliceX Trainer to train a binary text classifier for spam detection. Once training is completed, we will use the model for prediction.
Dataset Processing
For this tutorial we use a public Kaggle dataset called the SMS Spam Collection Dataset. As described on Kaggle, the SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam. It can be downloaded at this URL:
https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
The first step is to convert this dataset to the format required by the SliceX AI Trainer. Please refer to the Dataset Requirements Section for specific instructions.
The following code converts the spam dataset to the required format:
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the Kaggle CSV with pandas
df = pd.read_csv("spam.csv", sep=',', encoding='latin-1')
# Rename the columns, drop the three unused ones, and put "sentence" before "label"
df.columns = ["label", "sentence", 'a', 'b', 'c']
df = df.drop(columns=['a', 'b', 'c'])
columns_names = ["sentence", "label"]
df = df.reindex(columns=columns_names)
# Map the string labels to integers (the same mapping is written to label_id.json).
# Using map() avoids chained assignment, which may silently fail to update df.
df['label'] = df['label'].map({'ham': 0, 'spam': 1}).astype(int)
# Remove duplicate rows, then split the data and save the train/test splits in TSV format
df = df.drop_duplicates(keep='first')
train, test = train_test_split(df, test_size=0.2)
train.to_csv("train.tsv", sep="\t", index=False)
test.to_csv("test.tsv", sep="\t", index=False)
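Before packaging the dataset, it can help to confirm that the generated files have the expected shape. The sketch below is one possible check using only the standard library. Note that check_tsv is a hypothetical helper, not part of the SliceX tooling, and it assumes the two-column header and the 0/1 labels produced by the code above.

```python
import csv


def check_tsv(path, expected_header=("sentence", "label"), allowed_labels=("0", "1")):
    """Return the number of data rows in a TSV file, or raise ValueError.

    Hypothetical helper: it assumes a header row followed by
    (sentence, label) rows, as written by the conversion code.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader, None)
        if header is None or tuple(header) != tuple(expected_header):
            raise ValueError(f"unexpected header: {header!r}")
        count = 0
        for row in reader:
            # Each row needs exactly two fields and a label of "0" or "1".
            if len(row) != 2 or row[1] not in allowed_labels:
                raise ValueError(f"bad row near line {reader.line_num}: {row!r}")
            count += 1
    return count
```

For example, check_tsv("train.tsv") returns the number of training examples when the file is well formed, and raises ValueError otherwise.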
Place the train.tsv and test.tsv files generated by the code above in a folder; you can pick any name for it. In the same folder, add two JSON files named label_id.json and config.json, with the following contents (label_id.json first, then config.json):
{
"0": "ham",
"1": "spam"
}
{
"desc": "Spam Classification",
"DATASET_NAME": "spam",
"DATA_DIR": "spam",
"TRAIN_FILE": "train.tsv",
"DEV_FILE": "test.tsv",
"field_names": ["sentence", "label"],
"field_header": 0,
"NUM_CLASS_LABELS": 1
}
Here, NUM_CLASS_LABELS is 1 because this is a binary classification task.
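If you prefer to generate the two JSON files from Python rather than by hand, a minimal sketch could look like the following. The values are copied from the JSON shown above; the function name write_metadata and the folder argument are assumptions made for illustration.

```python
import json
from pathlib import Path

# Contents copied from the tutorial's label_id.json and config.json.
LABEL_ID = {"0": "ham", "1": "spam"}
CONFIG = {
    "desc": "Spam Classification",
    "DATASET_NAME": "spam",
    "DATA_DIR": "spam",
    "TRAIN_FILE": "train.tsv",
    "DEV_FILE": "test.tsv",
    "field_names": ["sentence", "label"],
    "field_header": 0,
    "NUM_CLASS_LABELS": 1,
}


def write_metadata(folder):
    """Write label_id.json and config.json into `folder` (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "label_id.json").write_text(json.dumps(LABEL_ID, indent=2), encoding="utf-8")
    (folder / "config.json").write_text(json.dumps(CONFIG, indent=2), encoding="utf-8")
```

Calling write_metadata with the path of your dataset folder creates both files next to train.tsv and test.tsv.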
Finally, zip the dataset folder.
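The zipping step can also be scripted. The sketch below uses shutil.make_archive from the standard library. It assumes the archive should contain the dataset folder itself as its top-level entry; this tutorial does not state the exact layout the Trainer expects, so check the Dataset Requirements Section if in doubt. The helper name zip_dataset_folder is hypothetical.

```python
import shutil
from pathlib import Path


def zip_dataset_folder(folder):
    """Create <folder>.zip next to `folder` and return the archive path.

    Assumption: the folder itself is kept as the top-level entry of the
    archive (e.g. spam/train.tsv), not just its contents.
    """
    folder = Path(folder).resolve()
    archive = shutil.make_archive(
        str(folder),                  # archive path without the .zip extension
        "zip",
        root_dir=str(folder.parent),  # paths in the archive are relative to this
        base_dir=folder.name,         # only this folder is archived
    )
    return Path(archive)
```

To store the files at the root of the archive instead, pass root_dir=str(folder) and omit base_dir.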
If you just want a dataset to try a training run, you can download a processed version here: https://drive.google.com/file/d/1FURyD1sXOPSjgQin_V5TB2MzZpLNefVT/view?usp=sharing. Please note that this URL will not work directly as the dataset URL for training, as it does not meet the requirements described in this tutorial.
Launching the training job
To start the training job, we first need to upload the dataset and retrieve its URL. The URL must not be protected: visiting it should start the download immediately.
To test that it’s working correctly, you can use the following code with the urllib Python library:
import urllib.request
# Replace DATASET_URL with the URL of your uploaded dataset
url_path = "DATASET_URL"
urllib.request.urlretrieve(url_path, "download.zip")
If you are using Google Drive to upload the dataset, make sure it can be opened by anyone (right click on it, then “Get Link”).
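A successful download does not guarantee that the file is the archive you uploaded: a protected or non-direct link may return a web page instead. As a rough sanity check, the sketch below verifies that the downloaded file is a zip archive and lists any dataset files missing from it. The helper name missing_dataset_files is hypothetical, and the set of required files is taken from the dataset steps earlier in this tutorial.

```python
import zipfile
from pathlib import PurePosixPath

# File names taken from the dataset preparation steps in this tutorial.
REQUIRED_FILES = {"train.tsv", "test.tsv", "label_id.json", "config.json"}


def missing_dataset_files(zip_path):
    """Return the sorted list of required files absent from the archive.

    Raises ValueError if the file is not a zip archive at all, which can
    happen when the URL serves an HTML page instead of the dataset.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a zip archive")
    with zipfile.ZipFile(zip_path) as zf:
        # Compare base names only, so files inside a top-level folder still count.
        present = {PurePosixPath(name).name for name in zf.namelist()}
    return sorted(REQUIRED_FILES - present)
```

After running the download code, missing_dataset_files("download.zip") should return an empty list if the archive is complete.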
You can now use the SliceX Trainer CLI to start the training job.
curl -X 'POST' \
'https://api.slicex.ai/trainer/language/training-jobs' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json' \
-d '{
"training_config": {
"batch_size": 64,
"learning_rate": 0.001,
"num_epochs": 2,
"language": "english",
"data_augmentation": false,
"use_pretrained": true
},
"model_config": {
"name": "spam-detection",
"type": "binary-text-classification",
"family": "Papaya",
"size": "mini"
},
"dataset_url": "DATASET_URL"
}'
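If you would rather launch the job from Python, the same request can be sketched with the standard library. The endpoint, headers, and payload are copied from the curl command above; the function names build_training_request and start_training are assumptions made for illustration, not part of an official SliceX client.

```python
import json
import urllib.request

TRAINING_JOBS_URL = "https://api.slicex.ai/trainer/language/training-jobs"


def build_training_request(api_key, dataset_url):
    """Build (but do not send) the POST request shown in the curl command."""
    payload = {
        "training_config": {
            "batch_size": 64,
            "learning_rate": 0.001,
            "num_epochs": 2,
            "language": "english",
            "data_augmentation": False,
            "use_pretrained": True,
        },
        "model_config": {
            "name": "spam-detection",
            "type": "binary-text-classification",
            "family": "Papaya",
            "size": "mini",
        },
        "dataset_url": dataset_url,
    }
    return urllib.request.Request(
        TRAINING_JOBS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "x-api-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_training(api_key, dataset_url):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_training_request(api_key, dataset_url)) as resp:
        return json.load(resp)
```

Calling start_training with your API key and dataset URL sends the request; building the request separately makes it easy to inspect the payload before anything is sent.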
After you POST the training request, the response will look like this:
{"id":"XXX"}
This is your model ID. You will need it to get the training stats and status, and to run inference with the model, so make sure to save it! In our case it is XXX.
Monitoring the training job
Depending on our servers’ status and utilization rate, your job may not start immediately. You can monitor a training job at any time using the model ID.
If you lose the model ID, you can always retrieve it with the following command:
curl -X GET \
"https://api.slicex.ai/trainer/language/training-jobs" \
-H "accept: application/json" \
-H "x-api-key: API_KEY"
To get the job status, during or after training, you can use this command (here XXX is the model ID):
curl -X 'GET' \
'https://api.slicex.ai/trainer/language/training-jobs/XXX' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json'
In our case, we get the following response:
{"data":{"model":{"id":"XXX","app_id":"API Key","name":"test-hamspam","training_status":"READY","type":"binary-text-classification","modality":"language","created":"2022-08-04T17:55:55.943368+00:00"},"training_stats":{"batch_size":64,"main_eval_metric":"accuracy","best_eval_metric_value":0.9795,"epoch":2,"metrics":{"elapsed_time":[13.890244007110596,13.330918788909912],"training":{"loss":[0.23012849842863423,0.08260126678006989]},"validation":{"accuracy":[0.9697,0.9795],"f1":[0.8119,0.8456]}}},"status":"READY","created":"2022-08-04T17:55:55.943368+00:00"}}
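Since a job may not start immediately, you may want to poll the status endpoint until the model is ready. The following is a minimal sketch, not an official client: fetch_job and wait_until_ready are hypothetical names, and "READY" is the only status value shown in this tutorial. Other status values, including any failure status, are not documented here, so this loop simply stops after a fixed number of polls.

```python
import json
import time
import urllib.request


def fetch_job(model_id, api_key):
    """GET the training job description for `model_id` as a dict."""
    req = urllib.request.Request(
        f"https://api.slicex.ai/trainer/language/training-jobs/{model_id}",
        headers={"accept": "application/json", "x-api-key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_until_ready(fetch, interval_s=30.0, max_polls=120, sleep=time.sleep):
    """Call `fetch()` repeatedly until the job status is "READY".

    `fetch` is any zero-argument callable returning the job dict, which
    keeps the loop testable without network access.
    """
    for _ in range(max_polls):
        job = fetch()
        if job["data"]["status"] == "READY":
            return job
        sleep(interval_s)
    raise TimeoutError(f"job not READY after {max_polls} polls")
```

A typical call would be wait_until_ready(lambda: fetch_job("XXX", "API_KEY")), where XXX is the model ID.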
This response can be used to plot the validation accuracy per epoch in Python:
import matplotlib.pyplot as plt
import numpy as np
response = {"data":{"model":{"id":"XXX","app_id":"API Key","name":"test-hamspam","training_status":"READY","type":"binary-text-classification","modality":"language","created":"2022-08-04T17:55:55.943368+00:00"},"training_stats":{"batch_size":64,"main_eval_metric":"accuracy","best_eval_metric_value":0.9795,"epoch":2,"metrics":{"elapsed_time":[13.890244007110596,13.330918788909912],"training":{"loss":[0.23012849842863423,0.08260126678006989]},"validation":{"accuracy":[0.9697,0.9795],"f1":[0.8119,0.8456]}}},"status":"READY","created":"2022-08-04T17:55:55.943368+00:00"}}
fig, ax = plt.subplots()
# Epoch 0 is a placeholder point with accuracy 0, so the curve starts before training
epochs = np.arange(0, response['data']['training_stats']['epoch'] + 1, 1, dtype=int)
accuracies = [0] + response['data']['training_stats']['metrics']['validation']['accuracy']
ax.plot(epochs, accuracies, label='Accuracy per epoch', marker=".", markersize=5)
ax.set_ylim(0, 1)
ax.set_title("Spam Detection - Accuracy per epoch")
ax.set_xticks(epochs)
ax.legend()
fig.savefig('figure.png')
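The same response also carries the training loss, the validation F1 score, and the elapsed time for each epoch. As a small sketch, the helpers below (epoch_rows and best_epoch are hypothetical names) flatten these lists into one row per epoch, which makes it easier to compare metrics side by side. The sample data is a trimmed copy of the response shown above. In that response the F1 score is noticeably lower than the accuracy, so it may be worth looking at both.

```python
# Trimmed copy of the training_stats metrics from the response above.
RESPONSE = {"data": {"training_stats": {"epoch": 2, "metrics": {
    "elapsed_time": [13.890244007110596, 13.330918788909912],
    "training": {"loss": [0.23012849842863423, 0.08260126678006989]},
    "validation": {"accuracy": [0.9697, 0.9795], "f1": [0.8119, 0.8456]},
}}}}


def epoch_rows(response):
    """Return one dict per epoch with loss, accuracy, f1, and elapsed time."""
    metrics = response["data"]["training_stats"]["metrics"]
    columns = zip(
        metrics["training"]["loss"],
        metrics["validation"]["accuracy"],
        metrics["validation"]["f1"],
        metrics["elapsed_time"],
    )
    return [
        {"epoch": i, "loss": loss, "accuracy": acc, "f1": f1, "elapsed_s": secs}
        for i, (loss, acc, f1, secs) in enumerate(columns, start=1)
    ]


def best_epoch(rows, metric="accuracy"):
    """Return the 1-based epoch number with the highest value of `metric`."""
    return max(rows, key=lambda row: row[metric])["epoch"]


for row in epoch_rows(RESPONSE):
    print(row)
```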
Use the model for prediction
To use the model in the SliceX AI Predictor, you will need the model ID, which in our case was XXX. Prediction is then a single POST request to the api.slicex.ai endpoint:
curl -X 'POST' \
'https://api.slicex.ai/predictor/language/model/XXX' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json' \
-d '{
"query": "You have 1 new voicemail. Please call 08719181513."
}'
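The prediction call can be sketched in Python in the same way. The URL, headers, and body mirror the curl command above; build_prediction_request and predict are hypothetical names. The response format is not shown in this tutorial, so the sketch returns the decoded JSON as is, without assuming any particular fields.

```python
import json
import urllib.request


def build_prediction_request(model_id, api_key, query):
    """Build (but do not send) the prediction POST request."""
    return urllib.request.Request(
        f"https://api.slicex.ai/predictor/language/model/{model_id}",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "accept": "application/json",
            "x-api-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def predict(model_id, api_key, query):
    """Send the query to the model and return the decoded JSON response."""
    request = build_prediction_request(model_id, api_key, query)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

For example, predict("XXX", "API_KEY", "You have 1 new voicemail. Please call 08719181513.") sends the same query as the curl command.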