Training a Multilingual Model

Let us see how to configure a binary text classification training job for a Japanese dataset, using the multilingual-friendly job customization options supported by the SliceX AI Trainer API.

Japanese is a non-segmented language: unlike English, its text has no spaces between words. Adding explicit delimiters can improve the quality of the model's semantic parsing. To do this, we will set the add_tokenization parameter under preprocessing to " " (a single space), which adds a space between characters.

This converts every sample in the dataset as illustrated below with a dummy sample:

金曜日の午前九時に起こしてください → 金 曜 日 の 午 前 九 時 に 起 こ し て く だ さ い
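Under the hood this is a simple character-level split and rejoin. The short Python sketch below illustrates the transformation; the function name and implementation here are illustrative assumptions, since the actual preprocessing is performed server-side by the Trainer API:

def add_tokenization(text: str, delimiter: str = " ") -> str:
    """Insert an explicit delimiter between every character of the sample."""
    return delimiter.join(text)

sample = "金曜日の午前九時に起こしてください"
print(add_tokenization(sample))
# -> 金 曜 日 の 午 前 九 時 に 起 こ し て く だ さ い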

Pre-trained SliceX AI models optimized for Japanese are also available for initializing our custom model. To use one, set the language field under training_config to japanese, making sure the use_pretrained field is set to true.

An example REQUEST body for this job with the above customizations would look like:

Multilingual Training Job Configuration Example
{
  "preprocessing": {
    "lowercase": true,
    "add_tokenization": " "
  },
  "training_config": {
    "batch_size": 64,
    "learning_rate": 0.001,
    "num_epochs": 5,
    "language": "japanese",
    "data_augmentation": false,
    "use_pretrained": true
  },
  "model_config": {
    "name": "model_name",
    "type": "binary-text-classification",
    "family": "Papaya",
    "size": "mini"
  },
  "dataset_url": "DATASET_URL"
}
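
For reference, here is one way this request might be submitted from Python. This is a minimal sketch only: the endpoint URL, the x-api-key header, and the use of the requests library are assumptions, not part of the documented API, so consult the SliceX AI Trainer API reference for the actual endpoint and authentication scheme.

import requests

# The same request body as above, expressed as a Python dict.
payload = {
    "preprocessing": {
        "lowercase": True,
        "add_tokenization": " "
    },
    "training_config": {
        "batch_size": 64,
        "learning_rate": 0.001,
        "num_epochs": 5,
        "language": "japanese",
        "data_augmentation": False,
        "use_pretrained": True
    },
    "model_config": {
        "name": "model_name",
        "type": "binary-text-classification",
        "family": "Papaya",
        "size": "mini"
    },
    "dataset_url": "DATASET_URL"  # replace with your dataset's URL
}

response = requests.post(
    "https://api.slicex.ai/trainer/jobs",   # hypothetical endpoint
    headers={"x-api-key": "YOUR_API_KEY"},  # hypothetical auth header
    json=payload,
)
response.raise_for_status()
print(response.json())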

See the Customize training section for more details.