Binary Classifier for Spam Detection

Overview

The goal of this tutorial is to use the SliceX Trainer to train a binary text classifier for spam detection. Once training is completed, we will use the model for prediction.

Dataset Processing

This tutorial uses a public Kaggle dataset, the SMS Spam Collection Dataset. As described on Kaggle, the SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as ham (legitimate) or spam. It can be downloaded at this URL:

https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

The first step is to convert this dataset to the format required by the SliceX AI Trainer. Please refer to the Dataset Requirements section for specific instructions.

Here is code that converts the spam dataset to the required format:

Dataset Preprocessing in Python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle CSV with pandas
df = pd.read_csv("spam.csv", sep=',', encoding='latin-1')

# Rename the columns, drop the three unused ones, and put "sentence" first
df.columns = ["label", "sentence", 'a', 'b', 'c']
df = df.drop(columns=['a', 'b', 'c'])
columns_names = ["sentence", "label"]
df = df.reindex(columns=columns_names)

# Map the text labels to integer ids (the same mapping goes in label_id.json)
df['label'] = df['label'].map({'ham': 0, 'spam': 1}).astype(int)

# Remove duplicate rows, split the data, and save the train/test splits as TSV
df = df.drop_duplicates(keep='first')
train, test = train_test_split(df, test_size=0.2)
train.to_csv("train.tsv", sep="\t", index=False)
test.to_csv("test.tsv", sep="\t", index=False)
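
Before packaging the files, it can help to confirm that each split has the expected shape. The following is a minimal sketch using only the standard library. The helper name check_split is hypothetical, and the checks it performs (a sentence/label header row and labels restricted to 0 and 1) are assumptions drawn from this tutorial, not from the Dataset Requirements section.

```python
import csv
from collections import Counter


def check_split(path):
    """Validate a TSV split and return the number of records per label."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        if header != ["sentence", "label"]:
            raise ValueError(f"unexpected header in {path}: {header}")
        for record_number, row in enumerate(reader, start=1):
            if len(row) != 2:
                raise ValueError(
                    f"{path}, record {record_number}: expected 2 fields, got {len(row)}"
                )
            label = row[1]
            # Assumption: binary labels are stored as the integers 0 and 1.
            if label not in ("0", "1"):
                raise ValueError(
                    f"{path}, record {record_number}: unexpected label {label!r}"
                )
            counts[int(label)] += 1
    return dict(counts)
```

Calling check_split("train.tsv") and check_split("test.tsv") returns the number of ham (0) and spam (1) records in each file, which also shows how balanced the two splits are.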

Take the train.tsv and test.tsv files generated by the code above and place them in a folder. You can give this folder any name. In the same folder, add the following two JSON files, named label_id.json and config.json:

label_id.json
{
"0": "ham",
"1": "spam"
}
config.json
{
"desc": "Spam Classification",
"DATASET_NAME": "spam",
"DATA_DIR": "spam",
"TRAIN_FILE": "train.tsv",
"DEV_FILE": "test.tsv",
"field_names": ["sentence", "label"],
"field_header": 0,
"NUM_CLASS_LABELS": 1
}
info

Here NUM_CLASS_LABELS is 1 because this is a binary classification task.

Finally, zip the dataset folder.
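
The packaging steps above can also be scripted. This is a minimal sketch using the standard library. The helper name package_dataset is hypothetical, and it assumes the archive should contain the folder itself as its top-level entry, which this tutorial does not specify.

```python
import json
import shutil
from pathlib import Path


def package_dataset(folder):
    """Write label_id.json and config.json into `folder`, then zip the folder."""
    folder = Path(folder).resolve()
    label_id = {"0": "ham", "1": "spam"}
    config = {
        "desc": "Spam Classification",
        "DATASET_NAME": "spam",
        "DATA_DIR": "spam",
        "TRAIN_FILE": "train.tsv",
        "DEV_FILE": "test.tsv",
        "field_names": ["sentence", "label"],
        "field_header": 0,
        "NUM_CLASS_LABELS": 1,
    }
    # Fail early if the splits named in the config are not in the folder.
    for name in (config["TRAIN_FILE"], config["DEV_FILE"]):
        if not (folder / name).is_file():
            raise FileNotFoundError(f"{name} not found in {folder}")
    (folder / "label_id.json").write_text(json.dumps(label_id, indent=2), encoding="utf-8")
    (folder / "config.json").write_text(json.dumps(config, indent=2), encoding="utf-8")
    # Assumption: keep the folder as the top-level entry of the archive.
    archive = shutil.make_archive(
        base_name=str(folder),
        format="zip",
        root_dir=str(folder.parent),
        base_dir=folder.name,
    )
    return Path(archive)
```

For a folder named spam, this writes both JSON files next to the TSV splits and creates spam.zip beside the folder.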

note

If you just want a dataset to try a training run, you can download a processed version here: https://drive.google.com/file/d/1FURyD1sXOPSjgQin_V5TB2MzZpLNefVT/view?usp=sharing. Please note that this URL will not work directly for training, because it does not meet the specifications described in this tutorial.

Launching the training job

To start the training job, we first need to upload the dataset and retrieve its URL. The URL must not be protected: visiting it should start the download immediately.

To check that it works correctly, you can use the following code, which relies on the urllib Python library:

Test the dataset URL
import urllib.request

# Replace DATASET_URL with the public URL of your zipped dataset
url_path = "DATASET_URL"
urllib.request.urlretrieve(url_path, "download.zip")
tip

If you are using Google Drive to host the dataset, make sure it can be opened by anyone (right-click it, then “Get Link”).
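
A download that completes is not necessarily the dataset: a sharing link can return a web page instead of the archive. The following is a minimal sketch of a follow-up check using the standard zipfile module. The helper name missing_dataset_files is hypothetical, and the list of expected files is an assumption based on the four files created in this tutorial.

```python
import zipfile

# Assumption: these are the four files this tutorial places in the dataset folder.
EXPECTED_FILES = {"train.tsv", "test.tsv", "label_id.json", "config.json"}


def missing_dataset_files(path):
    """Return the expected file names that are absent from the archive at `path`."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a zip archive; the URL may have returned a web page")
    with zipfile.ZipFile(path) as archive:
        # Compare base names so the check works whether or not the files
        # sit inside a top-level folder.
        present = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return sorted(EXPECTED_FILES - present)
```

An empty list from missing_dataset_files("download.zip") means all four files were found in the downloaded archive.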

You can now call the SliceX Trainer API with curl to start the training job.

Launching the training job
curl -X 'POST' \
'https://api.slicex.ai/trainer/language/training-jobs' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json' \
-d '{
"training_config": {
"batch_size": 64,
"learning_rate": 0.001,
"num_epochs": 2,
"language": "english",
"data_augmentation": false,
"use_pretrained": true
},
"model_config": {
"name": "spam-detection",
"type": "binary-text-classification",
"family": "Papaya",
"size": "mini"
},
"dataset_url": "DATASET_URL"
}'

After you POST the training request, the response will look like this: {"id":"XXX"}

This is your model ID, which you will need to get the training stats and status, or to run inference with the model. Make sure to save it! In our case it is XXX.
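
The same request can be prepared from Python. This is a minimal sketch that mirrors the curl command above using only the standard library. The helper names build_training_request and extract_model_id are hypothetical, and the payload values are copied from the curl example.

```python
import json
import urllib.request

TRAINING_JOBS_URL = "https://api.slicex.ai/trainer/language/training-jobs"


def build_training_request(api_key, dataset_url):
    """Build (but do not send) the POST request that starts a training job."""
    payload = {
        "training_config": {
            "batch_size": 64,
            "learning_rate": 0.001,
            "num_epochs": 2,
            "language": "english",
            "data_augmentation": False,
            "use_pretrained": True,
        },
        "model_config": {
            "name": "spam-detection",
            "type": "binary-text-classification",
            "family": "Papaya",
            "size": "mini",
        },
        "dataset_url": dataset_url,
    }
    return urllib.request.Request(
        TRAINING_JOBS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "x-api-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_model_id(response_body):
    """Return the model ID from a response body such as {"id":"XXX"}."""
    return json.loads(response_body)["id"]
```

To send the request, pass the result of build_training_request to urllib.request.urlopen, read the response body, and hand it to extract_model_id. Keeping the request construction separate from the network call makes it easy to inspect the payload before submitting a job.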

Monitoring the training job

note

Depending on our servers’ status and utilization rate, your job may not start immediately. You can monitor a training job at any time using the model ID.

If you lose the model ID, you can always find it again with the following command:

Get model list
curl -X GET \
"https://api.slicex.ai/trainer/language/training-jobs" \
-H "accept: application/json" \
-H "x-api-key: API_KEY"

To get the job status, during or after training, you can use this command (here XXX is the model ID):

Get job status
curl -X 'GET' \
'https://api.slicex.ai/trainer/language/training-jobs/XXX' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json'

In our case, we get the following response:

Job status response
{
  "data": {
    "model": {
      "id": "XXX",
      "app_id": "API Key",
      "name": "test-hamspam",
      "training_status": "READY",
      "type": "binary-text-classification",
      "modality": "language",
      "created": "2022-08-04T17:55:55.943368+00:00"
    },
    "training_stats": {
      "batch_size": 64,
      "main_eval_metric": "accuracy",
      "best_eval_metric_value": 0.9795,
      "epoch": 2,
      "metrics": {
        "elapsed_time": [13.890244007110596, 13.330918788909912],
        "training": {
          "loss": [0.23012849842863423, 0.08260126678006989]
        },
        "validation": {
          "accuracy": [0.9697, 0.9795],
          "f1": [0.8119, 0.8456]
        }
      }
    },
    "status": "READY",
    "created": "2022-08-04T17:55:55.943368+00:00"
  }
}
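
Because a job may not start right away, the status request is often repeated until training finishes. The following is a minimal polling sketch. The helper name wait_until_ready is hypothetical, and fetch_status stands for any zero-argument function that performs the status request and returns the parsed JSON. READY is the only status value shown in this tutorial, so every other value is treated as "not finished yet", which is an assumption.

```python
import time


def wait_until_ready(fetch_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call `fetch_status` until the job reports READY, then return that response.

    Raises TimeoutError if the job is still not READY after `max_polls` attempts.
    """
    for attempt in range(max_polls):
        response = fetch_status()
        # Assumption: any status other than "READY" means the job is still
        # queued or running; no other status names are documented here.
        if response["data"]["status"] == "READY":
            return response
        # Wait between attempts, but not after the last one.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job not READY after {max_polls} polls")
```

The upper bound on the number of polls matters because this sketch cannot distinguish a failed job from a running one; without it, a job that never becomes READY would be polled forever.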

We can use this response to plot graphs in Python:

Plot graphs with the response data
import matplotlib.pyplot as plt
import numpy as np

response = {"data":{"model":{"id":"XXX","app_id":"API Key","name":"test-hamspam","training_status":"READY","type":"binary-text-classification","modality":"language","created":"2022-08-04T17:55:55.943368+00:00"},"training_stats":{"batch_size":64,"main_eval_metric":"accuracy","best_eval_metric_value":0.9795,"epoch":2,"metrics":{"elapsed_time":[13.890244007110596,13.330918788909912],"training":{"loss":[0.23012849842863423,0.08260126678006989]},"validation":{"accuracy":[0.9697,0.9795],"f1":[0.8119,0.8456]}}},"status":"READY","created":"2022-08-04T17:55:55.943368+00:00"}}

fig, ax = plt.subplots()

# x axis: epoch 0 (before training) through the last completed epoch
epochs = np.arange(0, response['data']['training_stats']['epoch'] + 1, 1, dtype=int)
# Prepend 0 so the curve starts from zero accuracy at epoch 0
accuracies = [0] + response['data']['training_stats']['metrics']['validation']['accuracy']
ax.plot(epochs, accuracies, label='Validation accuracy', marker=".", markersize=5)

ax.set_ylim(0, 1)
ax.set_title("Spam Detection - Accuracy per epoch")
ax.set_xlabel("Epoch")
ax.set_ylabel("Accuracy")
ax.set_xticks(epochs)
ax.legend()
plt.savefig('figure.png')
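
The training stats can also be summarized as text instead of a plot. This is a minimal sketch; the helper names summarize_epochs and best_epoch are hypothetical, and the code assumes the metric lists in the job status response are aligned by epoch.

```python
def summarize_epochs(response):
    """Return one dict per epoch with the training loss and validation metrics."""
    metrics = response["data"]["training_stats"]["metrics"]
    per_epoch = zip(
        metrics["training"]["loss"],
        metrics["validation"]["accuracy"],
        metrics["validation"]["f1"],
    )
    # Epochs are numbered from 1, matching the "epoch" counter in the response.
    return [
        {"epoch": number, "loss": loss, "accuracy": accuracy, "f1": f1}
        for number, (loss, accuracy, f1) in enumerate(per_epoch, start=1)
    ]


def best_epoch(rows, metric="accuracy"):
    """Return the epoch number with the highest value of `metric`."""
    return max(rows, key=lambda row: row[metric])["epoch"]
```

For the response shown in this tutorial, the best epoch by accuracy is epoch 2, whose accuracy of 0.9795 matches best_eval_metric_value. The F1 score is lower than the accuracy in both epochs (0.8119 and 0.8456), so it is worth reading both metrics.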

Use the model for prediction

To use the model with the SliceX AI Predictor, you need the model ID, which in our case is XXX. Prediction is then a single POST request to the api.slicex.ai endpoint:

Custom prediction with the model
curl -X 'POST' \
'https://api.slicex.ai/predictor/language/model/XXX' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json' \
-d '{
"query": "You have 1 new voicemail. Please call 08719181513."
}'
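
The prediction request can be prepared from Python in the same way. This is a minimal sketch that mirrors the curl command above using only the standard library. The helper name build_prediction_request is hypothetical. The format of the prediction response is not shown in this tutorial, so the sketch stops at building the request.

```python
import json
import urllib.request

PREDICTOR_URL = "https://api.slicex.ai/predictor/language/model/{model_id}"


def build_prediction_request(model_id, api_key, query):
    """Build (but do not send) the POST request that classifies one message."""
    return urllib.request.Request(
        PREDICTOR_URL.format(model_id=model_id),
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "accept": "application/json",
            "x-api-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen sends it; the response body can then be parsed with json.loads.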