Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English

Grant Amount:
$35,000

Grant Source:
Google

Grant Program:
Google Award for Inclusion Research

Grant Code:
Google/2022

Grant Duration:
2022/08 - 2023/08

Principal Investigator: Surangika Ranathunga

Co-Investigators: Nisansa de Silva


Keywords: Neural Machine Translation | Low Resource Languages | Multilingual Machine Translation | English Language | Tamil Language | Sinhala Language | Natural Language Processing


We propose to build a multi-domain Neural Machine Translation (NMT) system for Sinhala, Tamil, and English. Sinhala and Tamil are the official languages of Sri Lanka, and English serves as a link language. Such a system will give the Sri Lankan population access to information written in any of the three languages. It would mainly benefit the marginalized Tamil-speaking minority, who could then read Sinhala content. We plan to estimate the quality of the existing parallel corpora for these language pairs and to denoise a few of those found to be reasonably accurate. We also plan to build new parallel corpora for additional domains. Our aim is to have parallel corpora for at least 7 different domains, each containing at least 25,000 parallel sentences. Finally, we plan to build a multi-domain NMT model that is robust to domain differences. This model will be built on top of mBART50/mT5/NLLB models and will employ knowledge distillation methods.
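The data target stated above (at least 7 domains, each with at least 25,000 parallel sentences) can be checked mechanically once per-domain counts are known. The following is a minimal sketch of such a check; the function names and the shape of the input dictionary are illustrative assumptions, not part of the project.

```python
from typing import Dict, List

# Targets taken from the project description.
MIN_DOMAINS = 7
MIN_PAIRS_PER_DOMAIN = 25_000


def domains_meeting_target(
    pair_counts: Dict[str, int],
    min_pairs: int = MIN_PAIRS_PER_DOMAIN,
) -> List[str]:
    """Return the (sorted) domains whose corpus has at least `min_pairs` sentence pairs."""
    return sorted(domain for domain, n in pair_counts.items() if n >= min_pairs)


def corpus_target_met(
    pair_counts: Dict[str, int],
    min_domains: int = MIN_DOMAINS,
    min_pairs: int = MIN_PAIRS_PER_DOMAIN,
) -> bool:
    """True if at least `min_domains` domains each reach `min_pairs` sentence pairs."""
    return len(domains_meeting_target(pair_counts, min_pairs)) >= min_domains
```

Here `pair_counts` is a hypothetical mapping from a domain name to the number of parallel sentences collected for one language pair; the check would be run once per language pair.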


Objectives:

  • Estimate the quality of the existing parallel datasets available for these three languages. We plan to take the existing datasets and manually evaluate them. Our evaluation results will help future researchers decide which datasets to use for these language pairs.
  • Select existing datasets with an acceptable level of quality (as determined in the first objective) and manually denoise them.
  • Create parallel corpora for new domains using semi-automated techniques.
  • Build a Neural Machine Translation (NMT) model for Sinhala-Tamil, Sinhala-English, and Tamil-English that is robust to domain differences.
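The first two objectives (manual evaluation, then selection of datasets with acceptable quality) could be supported by a sampling step like the one sketched below. This is only an illustration under stated assumptions: the sample size, the 0.7 threshold, and all function names are hypothetical, and the project description does not specify how the manual evaluation is actually carried out.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def sample_for_review(pairs: Sequence[Pair], sample_size: int, seed: int = 0) -> List[Pair]:
    """Draw a reproducible random sample of sentence pairs for manual evaluation."""
    rng = random.Random(seed)
    k = min(sample_size, len(pairs))  # never ask for more pairs than exist
    return rng.sample(list(pairs), k)


def estimated_accuracy(labels: Sequence[bool]) -> float:
    """Fraction of reviewed pairs that annotators marked as correct translations."""
    if not labels:
        raise ValueError("at least one label is required")
    return sum(labels) / len(labels)


def is_worth_denoising(labels: Sequence[bool], threshold: float = 0.7) -> bool:
    """Hypothetical selection rule: keep a dataset if its sampled accuracy reaches the threshold."""
    return estimated_accuracy(labels) >= threshold
```

A fixed seed keeps the reviewed sample reproducible, so that different annotators judge the same sentence pairs.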