
Dataset Requirements

The dataset requirements depend on the task. We advise reviewing this section carefully before starting a training job to ensure a smooth custom training experience.

Text classification

In a ZIP folder, please include train.tsv, dev.tsv, config.json & label_id.json files using the formats below.

train.tsv & dev.tsv:

For binary and multiclass classification, the header (first) row in each TSV file should be "sentence TAB label", where TAB stands for a tab character. For example:

train.tsv example
sentence TAB label

This is the first test sentence TAB 0

This is the second test sentence TAB 3
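The layout above can be produced with Python's standard csv module. The following is a minimal sketch, not a required tool: the helper name to_tsv is hypothetical, and it assumes that sentences contain no tab characters.

```python
import csv
import io


def to_tsv(rows):
    """Render (sentence, label) pairs as TSV text with the required header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["sentence", "label"])  # header (first) row
    for sentence, label in rows:
        # int() guards against labels that were accidentally kept as strings
        writer.writerow([sentence, int(label)])
    return buffer.getvalue()


tsv_text = to_tsv([
    ("This is the first test sentence", 0),
    ("This is the second test sentence", 3),
])
print(tsv_text)
```

Writing tsv_text to train.tsv (and the held-out rows to dev.tsv) gives the two files in the expected shape.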

For multilabel classification, the header (first) row in each TSV file should be:

train.tsv multilabel first row configuration
sentence TAB label_1 TAB label_2… TAB label_k

For the row entries, specify 0 or 1 as the value in the column corresponding to each class. For example, for multilabel classification with 3 classes:

train.tsv multilabel example
sentence TAB label_1 TAB label_2 TAB label_3
This is the first test sentence TAB 0 TAB 1 TAB 1
This is the second test sentence TAB 1 TAB 1 TAB 0
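As an illustration, the 0/1 columns can be derived from the set of classes that apply to each sentence. This is a sketch under assumptions: the function name multilabel_rows is hypothetical, and classes are numbered from 1 to match the label_1 … label_k header.

```python
def multilabel_rows(examples, num_classes):
    """Build multilabel TSV text from (sentence, set_of_active_class_numbers) pairs."""
    class_numbers = range(1, num_classes + 1)
    header = ["sentence"] + [f"label_{i}" for i in class_numbers]
    lines = ["\t".join(header)]
    for sentence, active in examples:
        # one column per class: 1 if the class applies to the sentence, else 0
        flags = ["1" if i in active else "0" for i in class_numbers]
        lines.append("\t".join([sentence] + flags))
    return "\n".join(lines) + "\n"


multilabel_tsv = multilabel_rows(
    [
        ("This is the first test sentence", {2, 3}),
        ("This is the second test sentence", {1, 2}),
    ],
    num_classes=3,
)
print(multilabel_tsv)
```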

config.json

config.json
{
"desc": "dataset-description",
"DATASET_NAME": "my-dataset",
"TRAIN_FILE": "train.tsv",
"DEV_FILE": "dev.tsv",
"field_names": ["sentence", "label"],
"field_header": 0,
"NUM_CLASS_LABELS": "number-of-classes"
}
info

NUM_CLASS_LABELS should be 1 for binary classification, and the number of classes otherwise. For example, for a multiclass classification task with 5 classes, it should be 5.
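The rule for NUM_CLASS_LABELS can be encoded when generating config.json programmatically. The sketch below covers binary and multiclass tasks only. The helper names and the task argument are hypothetical, and writing the value as a JSON integer (rather than a string) is an assumption.

```python
import json


def num_class_labels(task, num_classes):
    """Apply the rule above: 1 for binary classification, the class count otherwise."""
    return 1 if task == "binary" else num_classes


def build_config(dataset_name, description, task, num_classes):
    """Assemble the config.json fields shown in the template above."""
    return {
        "desc": description,
        "DATASET_NAME": dataset_name,
        "TRAIN_FILE": "train.tsv",
        "DEV_FILE": "dev.tsv",
        "field_names": ["sentence", "label"],
        "field_header": 0,
        "NUM_CLASS_LABELS": num_class_labels(task, num_classes),  # assumed integer
    }


config_text = json.dumps(
    build_config("my-dataset", "dataset-description", "multiclass", 5),
    indent=2,
)
print(config_text)
```

Because json.dumps emits double quotes and commas itself, generating the file this way avoids hand-editing mistakes.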

label_id.json

This should include the mapping from label IDs to label names.

label_id.json example
{
"0": "negative",
"1": "positive"
}
caution
  • Please note that the JSON file must be correctly formatted. One common error is using single quotes ' ' instead of double quotes " ". Another is a missing comma at the end of a field entry.

  • If you use pandas to generate the TSV files, please make sure that the type of the labels is int and not str.
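Both JSON mistakes mentioned above can be caught before uploading by parsing the file with Python's standard json module. This is a minimal sketch, and the function name check_json is hypothetical.

```python
import json


def check_json(text):
    """Return None if text is valid JSON, otherwise a short description of the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


good = '{"0": "negative", "1": "positive"}'
single_quotes = "{'0': 'negative', '1': 'positive'}"
missing_comma = '{"0": "negative"\n"1": "positive"}'

for name, text in [("good", good), ("single quotes", single_quotes), ("missing comma", missing_comma)]:
    print(name, "->", check_json(text) or "OK")
```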

Sequence Labeling

In a ZIP folder, please include two folders, called train and val, and one JSON file, called label_id.json, using the formats below.

train & val folders:

The train and val folders should each contain two TXT files called sentences.txt and tags.txt.

sentences.txt

The sentences.txt file contains the sentences, one per line. e.g.,

sentences.txt
This is an example PersonFirstName PersonLastName .
This is an example Location .

tags.txt

The tags.txt file contains the tags in BIO format. Each line holds the tag sequence for the line at the same position in sentences.txt. For example:

tags.txt
O O O O B-PER I-PER O
O O O O B-LOC O

info

The tag count in each line of tags.txt should exactly match the word (or token) count in the corresponding line of sentences.txt. The tags should follow the BIO tagging format (short for Beginning, Inside, Outside), where:

  • The B-prefix before a tag indicates that the tag is the beginning of an entity, and an I-prefix before a tag indicates that the tag is inside an entity.
  • An O tag indicates that a token belongs to no entity.
  • The I-tag is used only when a tag is followed by a tag of the same entity without O tokens between them. For example:
      New York is a city => B-LOC I-LOC O O O
      Paris is a city => B-LOC O O O
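The rules above can be checked line by line before packaging the data. The following is a sketch under assumptions: tokens and tags are separated by whitespace, and the function name check_bio is hypothetical.

```python
def check_bio(sentence, tags):
    """Return a list of problems found for one sentences.txt / tags.txt line pair."""
    tokens = sentence.split()
    labels = tags.split()
    problems = []
    if len(tokens) != len(labels):
        problems.append(f"{len(tokens)} tokens but {len(labels)} tags")
    previous = "O"
    for position, label in enumerate(labels):
        if label.startswith("I-"):
            entity = label[2:]
            # an I- tag must directly follow a B- or I- tag of the same entity
            if previous not in (f"B-{entity}", f"I-{entity}"):
                problems.append(
                    f"tag {position} ({label}) does not continue a {entity} entity"
                )
        previous = label
    return problems


print(check_bio("New York is a city", "B-LOC I-LOC O O O"))
print(check_bio("Paris is a city", "O I-LOC O O"))
```

An empty list means the pair passed both checks.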

label_id.json

This should contain the mapping from tag IDs to tag names. For example, for the PER, MISC, ORG and LOC tags:

label_id.json example
{
"0": "B-LOC",
"1": "B-MISC",
"2": "B-ORG",
"3": "B-PER",
"4": "I-LOC",
"5": "I-MISC",
"6": "I-ORG",
"7": "I-PER",
"8": "O"
}
caution
  • Please note that the JSON file must be correctly formatted. One common error is using single quotes ' ' instead of double quotes " ". Another is a missing comma between lines.
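One way to keep label_id.json consistent with the data is to derive it from the tags that actually occur in tags.txt. The sketch below is illustrative only: the function name build_label_map is hypothetical, and assigning IDs in alphabetical order is an assumption that happens to match the ordering of the example above.

```python
import json


def build_label_map(tag_lines):
    """Map string IDs to the distinct tags found in the given tags.txt lines."""
    tags = sorted({tag for line in tag_lines for tag in line.split()})
    return {str(index): tag for index, tag in enumerate(tags)}


label_map = build_label_map([
    "O O O O B-PER I-PER O",
    "O O O O B-LOC O",
])
label_json = json.dumps(label_map, indent=2)  # double quotes and commas are handled for us
print(label_json)
```

Tags that never occur in the data (for instance I-LOC in this small sample) will not appear in the result, so add them by hand if they are needed.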

Question Answering

In a ZIP folder, please include train.json, dev.json, config.json & label_id.json files using the formats below.

train.json & dev.json

train.json & dev.json
{
"0":
{
"question": "question text",
"context": "passage or document text"
},
"1":
{
"question": "question text",
"context": "passage or document text"
},
...
}

The key is an index for the training example, and the value contains a question-context pair. The "question" refers to the query text and "context" refers to the passage (or document) text that contains the answer to the question.

Here is an example train.json file.

train.json file example
{
"0":
{
"question": "When was the Compass-M1 satellite launched?",
"context": "The first satellite of the second-generation system, Compass-M1 was launched in 2007. It was followed by further nine satellites during 2009-2011, achieving functional regional coverage. A total of 16 satellites were launched during this phase."
},
"1":
{
"question": "How many satellites were launched after Compass-M1?",
"context": "The first satellite of the second-generation system, Compass-M1 was launched in 2007. It was followed by further nine satellites during 2009-2011, achieving functional regional coverage. A total of 16 satellites were launched during this phase."
}
}
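A file in this shape can be generated from a list of question-context pairs. This is a minimal sketch, the function name build_qa_examples is hypothetical, and the context string is shortened here for readability.

```python
import json


def build_qa_examples(pairs):
    """Turn (question, context) pairs into the indexed structure shown above."""
    return {
        str(index): {"question": question, "context": context}
        for index, (question, context) in enumerate(pairs)
    }


passage = "The first satellite of the second-generation system, Compass-M1 was launched in 2007."
qa_examples = build_qa_examples([
    ("When was the Compass-M1 satellite launched?", passage),
    ("How many satellites were launched after Compass-M1?", passage),
])
qa_json = json.dumps(qa_examples, indent=2)
print(qa_json)
```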

config.json

config.json
{
"desc": "dataset-description",
"DATASET_NAME": "my-dataset",
"DATA_DIR": "path-to-parent-directory-for-data",
"TRAIN_FILE": "train.json",
"DEV_FILE": "dev.json"
}

label_id.json

label_id.json contains all the passages (or documents). This file contains the universal collection of passages (or documents) that is used to retrieve the answer to any input question. The format for label_id.json is as follows:

label_id.json format
{
"0": "target-passage-text",
"1": "target-passage-text",
...
}

The key is a passage (or document) index and the value is the passage text.

Here is an example label_id.json file.

label_id.json example
{
"0": "The first satellite of the second-generation system, Compass-M1 was launched in 2007. It was followed by further nine satellites during 2009-2011, achieving functional regional coverage. A total of 16 satellites were launched during this phase.",
"1": "Three-phase AC railway electrification was used in Italy, Switzerland and the United States in the early twentieth century. Italy was the major user, for lines in the mountainous regions of northern Italy from 1901 until 1976. The first lines were the Burgdorf-Thun line in Switzerland (1899), and the lines of the Ferrovia Alta Valtellina from Colico to Chiavenna and Tirano in Italy, which were electrified in 1901 and 1902. Other lines where the three-phase system were used were the Simplon Tunnel in Switzerland from 1906 to 1930, and the Cascade Tunnel of the Great Northern Railway in the United States from 1909 to 1927."
}
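The passage collection can be assembled in the same way as the other JSON files. The sketch below rests on assumptions: the function name build_passage_collection is hypothetical, duplicate passages are dropped so that each one is indexed once, and the passage strings are shortened placeholders.

```python
import json


def build_passage_collection(passages):
    """Index each distinct passage once, preserving first-seen order."""
    unique = list(dict.fromkeys(passages))  # dict keys keep insertion order
    return {str(index): passage for index, passage in enumerate(unique)}


collection = build_passage_collection([
    "Compass-M1 was launched in 2007.",
    "Compass-M1 was launched in 2007.",  # repeated context, indexed only once
    "Three-phase AC railway electrification was used in Italy.",
])
collection_json = json.dumps(collection, indent=2)
print(collection_json)
```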