
Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English

Principal Investigator: Surangika Ranathunga

We propose to create a multi-domain Neural Machine Translation (NMT) system for Sinhala and Tamil, the official languages of Sri Lanka, and English, its link language.

We propose to build a multi-domain Neural Machine Translation (NMT) system for Sinhala, Tamil, and English. Sinhala and Tamil are the official languages of Sri Lanka, and English serves as a link language. Such a system would let the Sri Lankan population access information written in any of the three languages. It would mainly benefit the marginalized Tamil-speaking minority, who would gain access to Sinhala content. We plan to estimate the quality of the existing parallel corpora for these language pairs and to denoise a few of those that are reasonably accurate. We also plan to build new parallel corpora for additional domains. Our aim is to have parallel corpora for at least 7 different domains, each containing at least 25,000 parallel sentences. Finally, we plan to build a multi-domain NMT model that is robust to domain differences. This model will be built on top of mBART50/mT5/NLLB models and will employ knowledge distillation methods.
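The description does not say which knowledge distillation variant the project uses. As a rough illustration only, the sketch below shows a generic token-level distillation loss: the KL divergence between temperature-softened teacher and student distributions at one output position. The function names, the temperature value, and the use of plain Python lists instead of a tensor library are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) for one token position, scaled by T^2.

    Hypothetical helper: a generic formulation, not the project's
    confirmed training objective.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher, student)
        if p > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2
```

In practice such a term is usually averaged over all target positions and combined with the ordinary cross-entropy loss against the reference translation; how the two are weighted is a design choice the description leaves open.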

Objectives:

  • Estimate the quality of the existing parallel datasets available for these three languages. We plan to take the existing datasets and evaluate them manually. Our evaluation results will help future researchers decide which datasets to select for these language pairs.
  • Select the existing datasets with an acceptable level of quality (as determined under the first objective) and manually denoise them.
  • Create parallel corpora for new domains using semi-automated techniques.
  • Build a Neural Machine Translation (NMT) model for Sinhala-Tamil, Sinhala-English, and Tamil-English that is robust to domain differences.
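The objectives describe manual evaluation and manual denoising. The sketch below is not that process; it is a simple automatic pre-pass, of the kind sometimes run before manual review, that flags sentence pairs with obvious problems so annotators can look at them first. The function name, the whitespace tokenization, and the length-ratio threshold are assumptions for illustration, and a sensible threshold would likely differ between Sinhala, Tamil, and English.

```python
def is_suspicious(source, target, max_length_ratio=3.0):
    """Flag a sentence pair that deserves a closer manual look.

    Hypothetical heuristic with an assumed threshold; it only
    prioritizes pairs for review and does not decide their quality.
    """
    source, target = source.strip(), target.strip()
    # An empty side means the pair cannot be a valid translation.
    if not source or not target:
        return True
    # Identical sides usually indicate untranslated, copied text.
    if source == target:
        return True
    # Very different token counts often signal misalignment.
    source_len = len(source.split())
    target_len = len(target.split())
    ratio = max(source_len, target_len) / min(source_len, target_len)
    return ratio > max_length_ratio


def split_for_review(pairs):
    """Separate pairs into (flagged, unflagged) lists, keeping order."""
    flagged = [p for p in pairs if is_suspicious(*p)]
    unflagged = [p for p in pairs if not is_suspicious(*p)]
    return flagged, unflagged
```

Flagged pairs are not necessarily noisy, and unflagged pairs are not necessarily clean, so a pass like this can only order the manual work, not replace it.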


Keywords: Natural Language Processing | Machine Learning / Deep Learning | Sinhala





Team


Faculty

Nisansa de Silva

Senior Lecturer
University of Moratuwa

MSc Students

Charitha Rathnayake

Lecturer on Contract
University of Moratuwa

Alumni-PhD Students

Aloka Fernando

Researcher / Visiting Lecturer
Informatics Institute of Technology

Alumni-MSc Students

Velayuthan Menan

AI Research Engineer
University of Moratuwa

Alumni-Undergraduates

Dilith Jayakody

Graduate Student
Dalhousie University

Grants

2022-08 to 2023-08
Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English
$35,000 - Google/2022