
Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English

Principal Investigator: Surangika Ranathunga

We propose to create a multi-domain Neural Machine Translation (NMT) system for Sinhala and Tamil, the official languages of Sri Lanka, and English, its link language.

We propose to build a multi-domain Neural Machine Translation (NMT) system for Sinhala, Tamil, and English. Sinhala and Tamil are the official languages of Sri Lanka, with English serving as a link language. Such a system will enable the Sri Lankan population to access information written in all three languages. This would mainly benefit the marginalized Tamil-speaking minority, who could then refer to Sinhala content. We plan to estimate the quality of the existing parallel corpora for these language pairs and to denoise a few of those that have reasonable accuracy. We also plan to build new parallel corpora for additional domains. Our aim is to have parallel corpora for at least 7 different domains, with each corpus containing at least 25,000 parallel sentences. Finally, we plan to build a multi-domain NMT model that is robust to domain differences. This model will be built on top of mBART50/mT5/NLLB models and will employ knowledge distillation methods.

Objectives:

1. Estimate the quality of the existing parallel datasets available for these three languages. We plan to take the existing datasets and manually evaluate them. Our evaluation results will help future researchers decide which datasets to select for these language pairs.
2. Select existing datasets with an acceptable level of quality (determined in step 1) and manually denoise them.
3. Create parallel corpora for new domains using semi-automated techniques.
4. Build a Neural Machine Translation (NMT) model for Sinhala-Tamil, Sinhala-English, and Tamil-English that is robust to domain differences.
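
The second objective describes manual denoising. As a rough illustration only, the sketch below shows the kind of automatic rule-based pre-filter that could be run before manual review. It is not the project's method. The function name `filter_parallel_pairs` and both thresholds are hypothetical, chosen for the example.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def filter_parallel_pairs(
    pairs: Iterable[Pair],
    max_length_ratio: float = 3.0,  # hypothetical threshold, not from the project
    min_chars: int = 2,             # hypothetical threshold, not from the project
) -> List[Pair]:
    """Keep sentence pairs that pass a few simple rule-based noise checks."""
    seen = set()
    kept: List[Pair] = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Drop pairs where either side is empty or too short.
        if len(source) < min_chars or len(target) < min_chars:
            continue
        # Drop untranslated pairs (source text copied into the target).
        if source == target:
            continue
        # Drop pairs whose character lengths differ too much.
        longer = max(len(source), len(target))
        shorter = min(len(source), len(target))
        if longer / shorter > max_length_ratio:
            continue
        # Drop exact duplicates of a pair that was already kept.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        kept.append((source, target))
    return kept
```

Character-length ratios are a crude signal across scripts as different as Sinhala, Tamil, and Latin, so any real threshold would need tuning per language pair, and pairs that survive such a filter would still need the manual checks described above.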

Keywords: Natural Language Processing | Machine Learning / Deep Learning | Sinhala

Publications

Dissertations

W.A.S.A Fernando, "Data Augmentation to Induce High Quality Parallel Data for Low-Resource Neural Machine Translation", University of Moratuwa, 2025

Theses: MSc Major Component Research

Menan Velayuthan, "Multi-Domain Neural Machine Translation with Knowledge Distillation For Low Resource Languages", University of Moratuwa, 2025

Conference Papers

Aloka Fernando, Nisansa de Silva, Menan Velayuthan, Charitha Rathnayake, and Surangika Ranathunga, "Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics", in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 28252–28269. doi: 10.18653/v1/2025.emnlp-main.1435
Surangika Ranathunga, Nisansa de Silva, Dilith Jayakody, and Aloka Fernando, "Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research", in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, and Charitha Rathnayake, "Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora", in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian's, Malta: Association for Computational Linguistics, Mar. 2024, pp. 860–880.

Workshop Papers

Menan Velayuthan, Nisansa de Silva, and Surangika Ranathunga, "Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation", in Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), 2025, pp. 161–170.
Menan Velayuthan, Dilith Jayakody, Nisansa de Silva, Aloka Fernando, and Surangika Ranathunga, "Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods", in Proceedings of the Ninth Conference on Machine Translation, 2024, pp. 901–907. doi: 10.18653/v1/2024.wmt-1.87

Team

Faculty

Nisansa de Silva

Senior Lecturer

University of Moratuwa

MSc Students

Charitha Rathnayake

Lecturer on Contract

University of Moratuwa

Alumni-PhD Students

Aloka Fernando

Researcher / Visiting Lecturer

Informatics Institute of Technology

Alumni-MSc Students

Menan Velayuthan

AI Research Engineer

University of Moratuwa

Alumni-Undergraduates

Dilith Jayakody

Graduate Student

Dalhousie University

Grants

2022

2023

Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English

$35,000 - Google/2022

We propose to create a multi-domain Neural Machine Translation (NMT) system for Sinhala and Tamil, the official languages of Sri Lanka, and English, its link language.