Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English
Principal Investigator: Surangika Ranathunga
We propose to build a multi-domain Neural Machine Translation (NMT) system for Sinhala, Tamil, and English. Sinhala and Tamil are the official languages of Sri Lanka, and English serves as a link language. Such a system will enable the Sri Lankan population to access information written in all three languages. This would mainly benefit the marginalized Tamil-speaking minority, who would gain access to Sinhala content. We plan to estimate the quality of the existing parallel corpora for these language pairs and denoise those that show reasonable accuracy. We also plan to build new parallel corpora for additional domains. Our aim is to have parallel corpora for at least seven domains, each containing at least 25,000 parallel sentences. Finally, we plan to build a multi-domain NMT model that is robust to domain differences. This model will be built on top of pre-trained multilingual models such as mBART50, mT5, or NLLB, and will employ knowledge distillation.
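As a rough illustration of the intended starting point (a sketch under our own assumptions, not the project's final system), the snippet below loads a pre-trained NLLB checkpoint with the Hugging Face transformers library and translates an English sentence into Sinhala; the checkpoint choice and example sentence are illustrative.

```python
# Minimal sketch: baseline translation with a pre-trained NLLB checkpoint.
# The distilled 600M model is chosen purely for illustration; the project
# may instead use larger NLLB variants, mBART50, or mT5.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"

# NLLB uses FLORES-200 language codes: eng_Latn (English),
# sin_Sinh (Sinhala), tam_Taml (Tamil).
tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

inputs = tokenizer("The school will open at eight tomorrow.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("sin_Sinh"),  # force Sinhala output
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```

Fine-tuning such a checkpoint on the cleaned multi-domain corpora, with a larger model acting as a knowledge-distillation teacher, is one way the multi-domain model described above could be realised.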
Objectives:
- Estimate the quality of the existing parallel datasets available for these three languages. We plan to manually evaluate the existing datasets; the evaluation results will help future researchers decide which datasets to select for these language pairs.
- Select the existing datasets with an acceptable level of quality (as determined in the first objective) and manually denoise them (a sketch of one automatic filtering heuristic that could complement this follows the list).
- Create parallel corpora for new domains using semi-automated techniques.
- Build a Neural Machine Translation (NMT) model for Sinhala-Tamil, Sinhala-English, and Tamil-English that is robust to domain differences.
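Manual evaluation is the primary quality signal in the first two objectives, but a common automatic complement (an assumption on our part, not a committed project method) is to score each sentence pair by the cosine similarity of multilingual sentence embeddings and flag low-scoring pairs. The sketch below uses LaBSE via the sentence-transformers library; the 0.7 threshold and the example pairs are illustrative.

```python
# Hedged sketch: scoring parallel sentence pairs with LaBSE similarity.
# The threshold (0.7) and the example data are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(src_sents, tgt_sents, threshold=0.7):
    """Keep pairs whose embedding cosine similarity >= threshold."""
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    # Embeddings are unit-normalized, so the dot product is the cosine.
    scores = (src_emb * tgt_emb).sum(axis=1)
    return [
        (s, t, float(score))
        for s, t, score in zip(src_sents, tgt_sents, scores)
        if score >= threshold
    ]

# Illustrative data: the second pair is deliberately misaligned.
src = ["The school opens at eight.", "Download the full report here."]
tgt = ["පාසල අටට විවෘත වේ.", "අද කාලගුණය හොඳයි."]
print(filter_pairs(src, tgt))
```

Pairs that fall below the threshold could then be routed to manual denoising rather than discarded outright.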
Keywords: Natural Language Processing | Machine Learning / Deep Learning | Sinhala
Publications
Dissertations
W.A.S.A Fernando, "Data Augmentation to Induce High Quality Parallel Data for Low-Resource Neural Machine Translation", University of Moratuwa, 2025
Theses: MSc Major Component Research
Menan Velayuthan, "Multi-Domain Neural Machine Translation with Knowledge Distillation For Low Resource Languages", University of Moratuwa, 2025
Conference Papers
Aloka Fernando, Nisansa de Silva, Menan Velayuthan, Charitha Rathnayaka, and Surangika Ranathunga, "Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics", in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 28252--28269. doi: 10.18653/v1/2025.emnlp-main.1435
Surangika Ranathunga, Nisansa de Silva, Dilith Jayakody, and Aloka Fernando, "Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research", in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, and Charitha Rathnayake, "Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora", in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian's, Malta: Association for Computational Linguistics, Mar. 2024, pp. 860--880.

Workshop Papers
Menan Velayuthan, Nisansa de Silva, and Surangika Ranathunga, "Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation", in Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), 2025, pp. 161--170.
Menan Velayuthan, Dilith Jayakody, Nisansa de Silva, Aloka Fernando, and Surangika Ranathunga, "Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods", in Proceedings of the Ninth Conference on Machine Translation, 2024, pp. 901--907. doi: 10.18653/v1/2024.wmt-1.87
Team
Faculty
MSc Students
Alumni-PhD Students
Alumni-MSc Students
Alumni-Undergraduates
Grants