
Question Answering over Wikipedia

Overview

Have you ever wanted to build an algorithm that is as smart as you are (or probably smarter!)? Well, you are one step closer with this tutorial on training a custom Question-Answering (QA) model using our lightweight, high-performance SliceX AI Trainer.

Question answering (QA) models are NLP models that answer questions based on text passages. Given an input question, these models automatically produce an answer using what they have learnt from the passages they were trained on. One thing to note is that the answering knowledge of a QA model is limited to the data it was trained on, just like our own answering skills are limited to the knowledge we’ve gained over time.

Example Inference Input-Output
Input: "How does a heart attack occur?"
Top result from the SliceX AI trained model:
"Myocardial infarction (MI) or acute myocardial infarction (AMI), commonly known as a heart attack, occurs when blood flow stops to a part of the heart causing damage to the heart muscle. The most common symptom is chest pain or discomfort which may travel into the shoulder, arm, back, neck, or jaw. Often it is in the center or left side of the chest and lasts for more than a few minutes. The discomfort may occasionally feel like heartburn. Other symptoms may include shortness of breath, nausea, feeling faint, a cold sweat, or feeling tired. About 30% of people have atypical symptoms, with women more likely than men to present atypically. Among those over 75 years old, about 5% have had an MI with little or no history of symptoms. An MI may cause heart failure, an irregular heartbeat, or cardiac arrest."

Input: "What is an alloy?"
Second-ranked result from the SliceX AI trained model:
"The term alloy is used to describe a mixture of atoms in which the primary constituent is a metal. The primary metal is called the base, the matrix, or the solvent. The secondary constituents are often called solutes. If there is a mixture of only two types of atoms, not counting impurities, such as a copper-nickel alloy, then it is called a binary alloy. If there are three types of atoms forming the mixture, such as iron, nickel and chromium, then it is called a ternary alloy. An alloy with four constituents is a quaternary alloy, while a five-part alloy is termed a quinary alloy. Because the percentage of each constituent can be varied, with any mixture the entire range of possible variations is called a system. In this respect, all of the various forms of an alloy containing only two constituents, like iron and carbon, is called a binary system, while all of the alloy combinations possible with a ternary alloy, such as alloys of iron, carbon and chromium, is called a ternary system."

Now that you’ve seen the task at hand, let’s start building your QA-algorithm using SliceX AI Trainer.

note

Note that the QA models are available only with the Pro tier membership. To try this out, please upgrade your membership. See the Pricing section.

Dataset Preprocessing

The Stanford Question Answering Dataset (SQuAD), introduced in 2016 to facilitate the training of QA models, comprises questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question may be unanswerable. We chose SQuAD for this tutorial because of its size, a whopping 100,000+ questions, and because it is probably one of the most popular question answering datasets (it has been cited over 2,000 times): it is carefully constructed and improves on many aspects that other datasets fail to address.

You can download the dataset from here: https://rajpurkar.github.io/SQuAD-explorer/.

Having downloaded SQuAD, we need to convert the dataset into the format required by the SliceX AI Trainer. Please refer to the Dataset Requirements section for specific instructions on what is required to get started with this task.

Here’s some code that will get you started with the pre-processing of the SQuAD.

Dataset Preprocessing in Python
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# function to parse the train.json file downloaded from the SQuAD website into a pandas df
def squad_json_to_df(input_file_path, record_path=['data', 'paragraphs', 'qas', 'answers'],
                     verbose=1):
    """
    input_file_path: path to the SQuAD json file.
    record_path: path to the deepest level in the json file; default value is
                 ['data', 'paragraphs', 'qas', 'answers']
    verbose: 0 to suppress output; default is 1
    """
    if verbose:
        print("Reading the json file")
    file = json.loads(open(input_file_path).read())
    if verbose:
        print("Processing...")
    # parsing different levels in the json file
    js = pd.json_normalize(file, record_path)
    m = pd.json_normalize(file, record_path[:-1])
    r = pd.json_normalize(file, record_path[:-2])

    # combining it into a single dataframe
    idx = np.repeat(r['context'].values, r.qas.str.len())
    ndx = np.repeat(m['id'].values, m['answers'].str.len())
    m['context'] = idx
    js['q_idx'] = ndx
    main_df = pd.concat([m[['id', 'question', 'context']].set_index('id'),
                         js.set_index('q_idx')], axis=1, sort=False).reset_index()
    main_df['c_id'] = main_df['context'].factorize()[0]
    if verbose:
        print("shape of the data-frame is {}".format(main_df.shape))
        print("Done")
    return main_df


# creating a baseline df from the train.json downloaded from the website;
# make sure input_file_path points to your local copy of train.json
input_file_path = 'train.json'
record_path = ['data', 'paragraphs', 'qas', 'answers']
baseline_df = squad_json_to_df(input_file_path=input_file_path, record_path=record_path)
# creating a df with only the question-context pairs
squad_df = baseline_df[['question', 'context']]
# removing duplicate records, using the question as the filter
squad_df = squad_df.drop_duplicates(subset=['question'])
squad_df = squad_df.reset_index(drop=True)
# shuffling the dataset with a random seed
squad_df = squad_df.sample(
    frac=1,
    random_state=1
).reset_index(drop=True)
# making sure no duplicates have trickled down into the final dataset
print(f"Number of unique records in the SQuAD: {len(squad_df['question'].drop_duplicates())}")

We need to put our data into the following format:

General dataset format for Q&A
{
  "0": {
    "question": "<question text>",
    "context": "<passage or document text>"
  },
  "1": {
    "question": "<question text>",
    "context": "<passage or document text>"
  },

}

The key is an index for the training example and the value contains a question-context pair. The “question” refers to the query text & “context” refers to the passage (or document) text that contains the answer to the question.

Using the following code, let’s split our original df into train and dev sets, which are then dumped into train.json and dev.json respectively, as required by the SliceX AI Trainer.

Dataset formatting
# code to create the train.json and dev.json files
# 80/20 train/dev split; squad_df was already shuffled above with a fixed seed,
# so we keep shuffle=False here
train_ratio = 0.8
# train and dev split
train_df, dev_df = train_test_split(squad_df, test_size=1 - train_ratio, shuffle=False)
# resetting the index of dev_df so that its keys start from zero
dev_df = dev_df.reset_index(drop=True)
# dumping the dataframes into train.json and dev.json
train_df[['question', 'context']].to_json(path_or_buf='squad-qa-files/train.json', orient='index')
dev_df[['question', 'context']].to_json(path_or_buf='squad-qa-files/dev.json', orient='index')
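Optionally, you can sanity-check the files you just wrote. The following is a minimal sketch (assuming the squad-qa-files folder already exists and contains the dumps from above) that reloads train.json and confirms each record carries the two required keys.

Sanity-checking train.json
import json

# reload the dumped file and inspect the first record
with open('squad-qa-files/train.json') as f:
    train_records = json.load(f)

print(f"train.json contains {len(train_records)} records")
first_key = next(iter(train_records))
# expected output: dict_keys(['question', 'context'])
print(train_records[first_key].keys())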

Finally, here is what our train.json and dev.json should look like after all the previous steps.

Dataset format for train.json and dev.json
{
  "0": {
    "question": "What Boston investment firm ... in the 1980's?",
    "context": "Other important industries ... higher education."
  },
  "1": {
    "question": "When was there a vote ... action in Michigan?",
    "context": "Some opponents further ... the individual person."
  },
  "2": {
    "question": "When did Britain end slavery?",
    "context": "Also, other human ... founded the Red Cross."
  },
  "3": {
    "question": "Why was the Quianjiang Municipality formed?",
    "context": "From 1955 until 1997 ... Three Gorges Dam project."
  },
  "4": {
    "question": "The business ... have been praised?",
    "context": "Ben Goldacre ... be particularly effective."
  }
}

Now that we have our question-context pairs ready in the train.json and dev.json files, let’s prepare everything else we need for training. Both of these files are written into our folder called squad-qa-files.

We will need to package the following files into a ZIP folder called squad-qa-files: train.json, dev.json, config.json and label_id.json, using the formats below. Now let’s tackle the config.json and label_id.json files.

Let’s prepare a config.json format with required fields as shown below:

config.json format
{
"desc": "SQUAD QA",
"DATASET_NAME": "squad",
"DATA_DIR": "./data/squad/",
"TRAIN_FILE": "train.json",
"DEV_FILE": "dev.json",
"TEST_FILE": ""
}
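You can write this config into the squad-qa-files folder with a small snippet like the one below; the field values simply mirror the format above.

Writing config.json
import json

# config fields as described in the format above
config = {
    "desc": "SQUAD QA",
    "DATASET_NAME": "squad",
    "DATA_DIR": "./data/squad/",
    "TRAIN_FILE": "train.json",
    "DEV_FILE": "dev.json",
    "TEST_FILE": ""
}

# dump it next to train.json and dev.json in the squad-qa-files folder
with open("squad-qa-files/config.json", "w") as outfile:
    json.dump(config, outfile, indent=2)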

Lastly, we need a label_id.json file containing all the passages (or documents). This is the universal collection of passages (or documents) from which the answer to any input question is retrieved. The format for label_id.json is as follows:

label_id.json
{
"0": "<target-passage-text>",
"1": "<target-passage-text>",

}

The key is a passage (or document) index and the value is the passage text. Here is a snippet showing how to generate label_id.json and what it should look like.

label_id.json generation
# extract all the unique context values to dump them into a label_id.json file
context_unique = squad_df['context'].unique()
index = []
# creating key-context pairs like "0":"<context-1>", "1":"<context-2>" and so on.
for i in range(len(context_unique)):
    index.append(str(i))
dict_labels = dict(zip(index, context_unique))
# dumping the dict of labels into a json file stored in the squad-qa-files folder
with open("squad-qa-files/label_id.json", "w") as outfile:
    json.dump(dict_labels, outfile)

The snippets from the label_id.json should look like this:

Final label_id.json
{
"0": "Other important industries are ... higher education",
"1": "Some opponents further claim ... individual person.",
"2": "Also, other human rights were ... He also founded the Red Cross.",
"3": "From 1955 until 1997 Sichuan had been ... areas of the Three Gorges Dam project.",
"4": "Ben Goldacre has argued that regulators ... particularly effective.",
"5": "Asphalt/bitumen also ... oil refineries in Canada and the United States.",
"6": "Greek drama exemplifies ... electronic media."
}

Finally, zip the dataset folder and upload it somewhere to get a URL for it. For Google Drive, please refer to the Google Drive Tutorial. The important aspect is that the URL is not protected: visiting it should start the download directly.
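If you want to create the ZIP from Python rather than from your file manager, a one-liner with shutil works; this sketch assumes the squad-qa-files folder sits in the current working directory.

Zipping the dataset folder
import shutil

# creates squad-qa-files.zip containing the squad-qa-files folder
# (train.json, dev.json, config.json and label_id.json)
shutil.make_archive("squad-qa-files", "zip", root_dir=".", base_dir="squad-qa-files")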

Training the QA model

Launching the training job

We launch the training job with the following command:

Launch the training job
curl -X 'POST' \
'https://api.slicex.ai/trainer/language/training-jobs' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json' \
-d '{
  "training_config": {
    "batch_size": 32,
    "learning_rate": 0.001,
    "num_epochs": 15,
    "language": "english",
    "data_augmentation": false,
    "use_pretrained": true
  },
  "model_config": {
    "name": "question-answering-squad",
    "type": "question-answering",
    "family": "Dragonfruit",
    "size": "base"
  },
  "dataset_url": "DATASET_URL"
}'

In the response you will get the model ID, which you will need to retrieve the training stats and status, or to run inference with the model.

"{"id":MODEL_ID}"

In case you lose the model ID, you can retrieve it using this endpoint:

List models
curl -X 'GET' \
"https://api.slicex.ai/trainer/language/training-jobs" \
-H "accept: application/json" \
-H "x-api-key: API_KEY"

Monitoring the training job

To get the status and training stats, we use the following command with the model ID:

Get job status
curl -X 'GET' \
'https://api.slicex.ai/trainer/language/training-jobs/MODEL_ID' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json'

We get the following response:

Response example
{"data":{"model":{"id":"MODEL_ID","app_id":"API_KEY","name":"question-answering-squad","training_status":"READY","type":"question-answering","modality":"language","created":"2022-08-05T17:50:57.699864+00:00"},"training_stats":{"batch_size":32,"epoch":15,"metrics":{"elapsed_time":[839.8228869438171,783.0923538208008,776.2446513175964,777.7096049785614,775.3144462108612,777.3247337341309,772.0683574676514,771.1185009479523,770.2326591014862,769.8967583179474,766.1156423091888,778.0096859931946,774.3462924957275,771.4510288238525,773.1044895648956],"training":{"loss":[5.323059407660065,2.996610439471736,1.7483903126246514,1.200965930929714,0.8971506557919323,0.7002060867260217,0.546549862978353,0.43722077768469914,0.35012616948922526,0.28137949741095375,0.22462407819908434,0.18272861086875566,0.14775136901816258,0.11842509240589359,0.10307786259251013]},"validation":{"topk":[{"1":0.006,"3":0.0135,"5":0.0189,"10":0.0293,"20":0.0448,"100":0.116},{"1":0.0751,"3":0.1378,"5":0.174,"10":0.2302,"20":0.2968,"100":0.4696},{"1":0.1292,"3":0.2225,"5":0.2751,"10":0.3576,"20":0.442,"100":0.6302},{"1":0.1594,"3":0.2695,"5":0.3269,"10":0.4142,"20":0.5018,"100":0.6885},{"1":0.1881,"3":0.3051,"5":0.3682,"10":0.458,"20":0.547,"100":0.7272},{"1":0.2039,"3":0.3291,"5":0.3944,"10":0.486,"20":0.5766,"100":0.7503},{"1":0.218,"3":0.3452,"5":0.4113,"10":0.5034,"20":0.5951,"100":0.7677},{"1":0.2219,"3":0.3539,"5":0.42,"10":0.5141,"20":0.6062,"100":0.7748},{"1":0.2348,"3":0.3682,"5":0.435,"10":0.5307,"20":0.6226,"100":0.7894},{"1":0.2383,"3":0.3723,"5":0.4402,"10":0.5389,"20":0.631,"100":0.7962},{"1":0.2443,"3":0.3839,"5":0.453,"10":0.5488,"20":0.6384,"100":0.802},{"1":0.2518,"3":0.3883,"5":0.4598,"10":0.5545,"20":0.6479,"100":0.8072},{"1":0.2558,"3":0.3946,"5":0.4643,"10":0.5586,"20":0.6516,"100":0.8117},{"1":0.2573,"3":0.3998,"5":0.4677,"10":0.5643,"20":0.6542,"100":0.8151},{"1":0.2599,"3":0.4025,"5":0.47,"10":0.568,"20":0.6574,"100":0.8159}]}}},"status":"READY","created":"2022-08-05T17:50:57.699864+00:00"}}

Using the following script, we can plot the training stats to visualize the training process.

Plotting the response with Python
import matplotlib.pyplot as plt
import numpy as np

# Function to extract the top-k (k=100) accuracy from each epoch
def extract_top_accuracy(accuracies):
    aggregate = []
    for accuracy in accuracies:
        aggregate.append(accuracy['100'])
    return aggregate


# Paste your training stats here
training_data = *INSERT_YOUR_TRAINING_STATS*
fig, ax = plt.subplots()

# Reading all the info from the training stats
stats = training_data['data']['training_stats']
training_loss = stats['metrics']['training']['loss']
accuracies = stats['metrics']['validation']['topk']

# accuracy and epochs
accuracies_plot = extract_top_accuracy(accuracies)
epochs = np.arange(1, stats['epoch'] + 1, 1, dtype=int)

# Plotting the accuracy per epoch vs epoch
ax.plot(epochs, accuracies_plot, 'r', label='Accuracy per epoch', marker=".", markersize=5)

plt.ylim(0, 1)
plt.title("SQuAD QA - Top K Accuracy per epoch")
plt.xlabel("Epoch")
plt.ylabel("Accuracy per epoch")
ax.legend()
plt.savefig('figure.png')

The current model was trained for 15 epochs, but it already reached around 80% top-100 accuracy within 12 epochs. In the above code, make sure to paste your training stats, formatted just like the response shown above, to generate the plot.
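If you also want to see how the loss behaves, you can append a few lines to the script above; this sketch reuses the epochs and training_loss variables that are already defined there.

Plotting the training loss
# appended to the plotting script above; reuses epochs and training_loss
fig2, ax2 = plt.subplots()
ax2.plot(epochs, training_loss, 'b', label='Training loss', marker=".", markersize=5)
plt.title("SQuAD QA - Training loss per epoch")
plt.xlabel("Epoch")
plt.ylabel("Loss")
ax2.legend()
plt.savefig('loss.png')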

note

If your training crashed for some reason, please double-check your SliceX API key and your training JSON files, or shoot us an email at support@slicex.ai.

Tweaking and customizing the training job

You can experiment with the hyperparameters exposed to you, such as the batch size, number of epochs, and learning rate; an illustrative example follows below. Note that you can also train custom models on any dataset of question-context pairs of your own, as long as it meets the dataset requirements.
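For example, a tweaked training_config might look like the one below; the values are purely illustrative, not recommendations. You would drop it into the -d payload of the launch command in place of the original training_config.

Example of a tweaked training_config
"training_config": {
  "batch_size": 16,
  "learning_rate": 0.0005,
  "num_epochs": 10,
  "language": "english",
  "data_augmentation": false,
  "use_pretrained": true
}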

Custom inference with the model

Now, let’s test the model that you just created.

To use the model in the SliceX AI Predictor, you need the model ID. Then it’s only a POST request to the SliceX AI endpoint:

Custom Inference
curl -X 'POST' \
'https://api.slicex.ai/predictor/language/model/MODEL_ID' \
-H 'accept: application/json' \
-H 'x-api-key: API_KEY' \
-H 'Content-Type: application/json' \
-d '{
"query": "How does a heart attack occur?"
}'

The above query should give you the top-10 responses as shown below along with the scores for each of them:

Response example
{"data":{"labels":["Myocardial infarction (MI) or acute myocardial infarction (AMI), commonly known as a heart attack, occurs ... an irregular heartbeat, or cardiac arrest.",
"At common law, ... for example driving a car or flying an airplane.",
"The last three years of Eisenhower's second term in office ... suffered seven heart attacks in total from 1955 until his death.",
"Solar water disinfection (SODIS) ... for their daily drinking water.",
"Prevention of infectious diarrhea... in those with severe disease.",
"In the US Air Force, ... strategic bombers.",
"Special guest referees ... with the match.",
"Complications may occur immediately following the heart attack ... an increased risk of a second MI.",
"If impaired blood flow to the heart lasts long enough ... catastrophic consequences.",
"There is some controversy surrounding it. ... instead of saturated fat."],
"Scores":[0.7028814554214478,0.6771444082260132,0.6721974015235901,0.6709581613540649,0.6660434007644653,0.6640133857727051,0.6638041138648987,0.6589502692222595,0.6583992838859558,0.6568634510040283]},\
"metadata":{"model_inference_time_ms":84.98291015625}}


The responses have been truncated here to fit the page. You can see that there is a mix of answers, but the ones with the highest scores are the most relevant to the query. One thing to note is that we trained on question-passage pairs rather than question-answer pairs, so the model has learnt to identify the passages most related to the input query and to surface other related passages as well. The third response pertains to Eisenhower’s heart attacks, which explains why it was returned.

Feel free to try as many queries as you like within the scope of SQuAD, or of your custom dataset, to evaluate the model that you just trained, for example with a short script like the one below.
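The following is a minimal sketch that sends several queries in a row using the requests library; MODEL_ID and API_KEY are placeholders, and the endpoint and response fields (labels, Scores) match the inference example above.

Batch inference with Python
import requests

API_KEY = "YOUR_API_KEY"    # placeholder
MODEL_ID = "YOUR_MODEL_ID"  # placeholder

queries = [
    "How does a heart attack occur?",
    "What is an alloy?",
]

for query in queries:
    response = requests.post(
        f"https://api.slicex.ai/predictor/language/model/{MODEL_ID}",
        headers={"accept": "application/json", "x-api-key": API_KEY},
        json={"query": query},
    )
    data = response.json()["data"]
    # print the top-ranked passage (truncated) and its score for each query
    print(query)
    print(data["labels"][0][:100], "...", data["Scores"][0])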

Wrapping it up

Congrats! Give yourself a pat on the back: you’ve deployed your first QA model using the SliceX AI Trainer. 🚀 Now that you’ve mastered the SliceX AI Trainer for training your own QA algorithm, you can try it out with a custom dataset suited to your application.

If you liked this tutorial, become a master of the SliceX AI Trainer by trying out our other tutorials.