Effect of Tokenisation Strategies for Low-Resourced Southern African Languages

Abstract

Research into machine translation for African languages is very limited and low-resourced in terms of datasets and model evaluations. This work aims to add to the field of neural machine translation research for four low-resourced Southern African languages. The effect of two byte pair encoding tokenisation algorithms (subword-nmt and SentencePiece), with various parameters, is evaluated. The paper builds upon previous research in the field for comparison, using an optimised transformer architecture and pre-cleaned data to translate English to Northern Sotho, Setswana, Xitsonga and isiZulu. The results show improvements over the previously reported BLEU scores for Setswana and isiZulu.
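For readers unfamiliar with byte pair encoding, the core idea can be illustrated with a minimal sketch. This is a toy implementation written for illustration only; it is not the code of subword-nmt or SentencePiece, and the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made here, not details taken from the paper.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by tuple order
        # (an arbitrary choice made here so the result is deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by applying learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

In this sketch, the number of merges plays the role of the vocabulary-size parameter: fewer merges leave words split into shorter units, while more merges produce longer, more word-like units. Calling `learn_bpe` on a word list and then `segment` on a new word shows how an unseen word is broken into known subword pieces.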

Publication
3rd Workshop on African Natural Language Processing
Jenalea Rajab

Currently completing my MSc research in Addressing Ambiguity in Human Robot Interaction using Compositional Reinforcement Learning for Adaptive Task Inference