Training a Multilingual Model
Let us see how to configure a binary text classification training job for a Japanese dataset. We will use the multilingual-friendly job customization options supported by the SliceX AI Trainer API.
Japanese is a non-segmented language: unlike English, words are not separated by spaces. Adding explicit delimiters can improve the quality of the model's semantic parsing. To do this, we will set the add_tokenization parameter under preprocessing to " " (a single space), which adds spaces between characters.
This converts every sample in the dataset as shown in the following dummy sample:
金曜日の午前九時に起こしてください → 金 曜 日 の 午 前 九 時 に 起 こ し て く だ さ い
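For illustration, the transformation is conceptually equivalent to the following Python snippet. This is a sketch of the preprocessing behavior, not the Trainer's actual implementation:

    def add_tokenization(text: str, delimiter: str = " ") -> str:
        """Insert an explicit delimiter between every character,
        mimicking the add_tokenization preprocessing option."""
        return delimiter.join(text)

    sample = "金曜日の午前九時に起こしてください"
    print(add_tokenization(sample))
    # 金 曜 日 の 午 前 九 時 に 起 こ し て く だ さ い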
Pre-trained SliceX AI models optimized for Japanese are also available, and we can use one to initialize our custom model. To do this, simply set the language field under training_config to japanese, making sure the use_pretrained field is set to true.
A suitable example REQUEST body for this job with the above customizations would look like:
{
  "preprocessing": {
    "lowercase": true,
    "add_tokenization": " "
  },
  "training_config": {
    "batch_size": 64,
    "learning_rate": 0.001,
    "num_epochs": 5,
    "language": "japanese",
    "data_augmentation": false,
    "use_pretrained": true
  },
  "model_config": {
    "name": "model_name",
    "type": "binary-text-classification",
    "family": "Papaya",
    "size": "mini"
  },
  "dataset_url": "DATASET_URL"
}
See the Customize training section for more details.
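To launch the job, you could POST this body to the Trainer API. Below is a minimal sketch using Python's requests library; the endpoint URL and authentication scheme are placeholders, not actual SliceX AI values, so consult the API reference for the real ones:

    import requests

    # Placeholder values -- substitute the actual SliceX AI Trainer
    # endpoint and your API key from the API reference.
    TRAINER_API_URL = "https://TRAINER_API_ENDPOINT/train"
    API_KEY = "YOUR_API_KEY"

    request_body = {
        "preprocessing": {"lowercase": True, "add_tokenization": " "},
        "training_config": {
            "batch_size": 64,
            "learning_rate": 0.001,
            "num_epochs": 5,
            "language": "japanese",
            "data_augmentation": False,
            "use_pretrained": True,
        },
        "model_config": {
            "name": "model_name",
            "type": "binary-text-classification",
            "family": "Papaya",
            "size": "mini",
        },
        "dataset_url": "DATASET_URL",
    }

    # Submit the training job and inspect the response.
    response = requests.post(
        TRAINER_API_URL,
        json=request_body,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    print(response.status_code, response.json())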