
Sinhala Language Resources

Principal Investigator: Nisansa de Silva

We propose to build and analyze Sinhala language resources, datasets, and models to strengthen NLP research for this low-resource language, while connecting it to English and Tamil through bilingual lexicons, embeddings, and cross-lingual methods.

Sinhala, spoken by over 17 million people, remains an under-resourced language in NLP. This project aims to address this gap by systematically creating and analyzing resources and models tailored to Sinhala and its interaction with English and Tamil. We propose to compile and curate large-scale corpora from diverse domains such as news, social media, and YouTube comments, while building structured resources such as bilingual lexicons, WordNet expansions, and diachronic corpora.
On the modeling side, we will explore headline generation, transliteration, multi-document summarization, sentiment prediction, and classification methods. Special emphasis will be given to embedding alignment between Sinhala and English, enabling transfer learning and benchmarking for cross-lingual tasks. We will also investigate zero-shot OCR for Sinhala and Tamil, and experiment with adapting multilingual encoders and LLMs (e.g., SinLlama) for Sinhala.
The outcome of this work will be a comprehensive ecosystem of datasets, tools, and models that will not only advance Sinhala NLP but also benefit multilingual research across low-resource languages globally.

Objectives:

Build foundational linguistic resources for Sinhala, such as corpora and lexicons.
Conduct large-scale corpus studies on Sinhala text.
Explore effective neural and statistical models for Sinhala text tasks such as sentiment analysis, headline generation, transliteration, and summarization.
Evaluate and benchmark Sinhala datasets to guide future NLP work for low-resource languages.
Advance Sinhala-inclusive multilingual and cross-lingual NLP.

Publications

Dissertations

W.A.S.A Fernando, "Data Augmentation to Induce High Quality Parallel Data for Low-Resource Neural Machine Translation", University of Moratuwa, 2025

Theses: MSc Major Component Research

Menan Velayuthan, "Multi-Domain Neural Machine Translation with Knowledge Distillation For Low Resource Languages", University of Moratuwa, 2025

Theses: MSc Minor Component Research

Pubudu Cooray, "Headline Generation for Sinhala Newspaper Articles using Pre-trained Language Models", University of Moratuwa, 2025
Kasun Wickramasinghe, "Bilingual Lexicon Induction for the Sinhala-English Language Pair", University of Moratuwa, 2024

Theses: BSc

Isuranga Iniyage, Pathum Mihiranga, and Mithun Wijethunga, "Advanced Interactive Sinhala Dictionary", University of Moratuwa, 2022
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, and Lahiru Lasandun, "Sinmin - Sinhala Corpus Project", University of Moratuwa, 2015

Journal Papers

W M Yomal De Mel and Nisansa de Silva, "Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study", ICTer, vol. 18, no. 2, pp. 121-130, 2025. doi: 10.4038/icter.v18i2.7299
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Facebook for Sentiment Analysis: Baseline Models to Predict Facebook Reactions of Sinhala Posts", The International Journal on Advances in ICT for Emerging Regions, vol. 15, no. 2, 2022. doi: 10.4038/icter.v15i2.7248
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, and others, "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets", Transactions of the Association for Computational Linguistics, vol. 10, pp. 50-72, 2022. doi: 10.1162/tacl_a_00447

Conference Papers

Nevidu Jayatilleke and Nisansa de Silva, "SiDiaC: Sinhala Diachronic Corpus", in Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation, 2025, pp. 511-527.
Nevidu Jayatilleke and Nisansa de Silva, "Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil", in Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, 2025, pp. 471-480.
Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, and Mokanarangan Thayaparan, "Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages", in Moratuwa Engineering Research Conference (MERCon), 2025. doi: 10.1109/MERCon67903.2025.11216992
H W K Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, and Rishemjit Kaur, "SinLlama - A Large Language Model for Sinhala", in Moratuwa Engineering Research Conference (MERCon), 2025. doi: 10.1109/MERCon67903.2025.11217094
Aloka Fernando, Nisansa de Silva, Menan Velayuthan, Charitha Rathnayaka, and Surangika Ranathunga, "Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics", in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 28252-28269. doi: 10.18653/v1/2025.emnlp-main.1435
Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, and Surangika Ranathunga, "Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches", in Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, 2025, pp. 166-173.
Kushan Hewapathirana, Nisansa de Silva, and C D Athuraliya, "M2DS: Multilingual Dataset for Multi-document Summarisation", in International Conference on Computational Collective Intelligence, 2024, pp. 219-231. doi: 10.1007/978-3-031-70248-8_17
Surangika Ranathunga, Nisansa de Silva, Dilith Jayakody, and Aloka Fernando, "Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research", in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, and Charitha Rathnayake, "Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora", in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian's, Malta: Association for Computational Linguistics, Mar. 2024, pp. 860-880.
Kasun Wickramasinghe and Nisansa de Silva, "Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language", in Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, pp. 424-435.
Kasun Wickramasinghe and Nisansa de Silva, "Sinhala-English Parallel Word Dictionary Dataset", in 2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS), 2023, pp. 61-66. doi: 10.1109/ICIIS58898.2023.10253560
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages", in Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, 2022, pp. 325-336. doi: 10.48550/ARXIV.2210.14472
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data", in 2022 The 3rd International Conference on Artificial Intelligence in Electronics Engineering, Association for Computing Machinery, 2022, pp. 16-22. doi: 10.1145/3512826.3512829
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Seeking Sinhala Sentiment: Predicting Facebook Reactions of Sinhala Posts", in 2021 21st International Conference on Advances in ICT for Emerging Regions (ICter), 2021, pp. 177-182. doi: 10.1109/ICter53630.2021.9774796
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva, and Gihan Dias, "Comparison Between Performance of Various Database Systems for Implementing a Language Corpus", in International Conference: Beyond Databases, Architectures and Structures, May 2015, pp. 82-91. doi: 10.1007/978-3-319-18422-7_7
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva, and Gihan Dias, "Implementing a Corpus for Sinhala Language", in Symposium on Language Technology for South Asia 2015, 2015. doi: 10.13140/RG.2.2.23035.11047
Indeewari Wijesiri, Malaka Gallage, Buddhika Gunathilaka, Madhuranga Lakjeewa, Daya Wimalasuriya, Gihan Dias, Rohini Paranavithana, and Nisansa de Silva, "Building a WordNet for Sinhala", in Proceedings of the Seventh Global WordNet Conference, Jan. 2014, pp. 100-108.

Workshop Papers

Menan Velayuthan, Nisansa de Silva, and Surangika Ranathunga, "Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation", in Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), 2025, pp. 161-170.
Menan Velayuthan, Dilith Jayakody, Nisansa de Silva, Aloka Fernando, and Surangika Ranathunga, "Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods", in Proceedings of the Ninth Conference on Machine Translation, 2024, pp. 901-907. doi: 10.18653/v1/2024.wmt-1.87

White Papers

Yudhanjaya Wijeratne and Nisansa de Silva, "Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook", arXiv preprint arXiv:2007.07884, 2020. doi: 10.2139/ssrn.3650976
Yudhanjaya Wijeratne, Nisansa de Silva, and Yashothara Shanmugarajah, "Natural Language Processing for Government: Problems and Potential", LIRNEasia, 2019. doi: 10.13140/RG.2.2.34297.31845

Preprints

Akesh Gunathilake, Nadil Karunarathne, Tharusha Bandaranayake, Nisansa de Silva, and Surangika Ranathunga, "LMSpell: Neural Spell Checking for Low-Resource Languages", arXiv preprint arXiv:2512.05414, 2025
Nisansa de Silva, "Survey on Publicly Available Sinhala Natural Language Processing Tools and Research", arXiv preprint arXiv:1906.02358, 2019. doi: 10.48550/arXiv.1906.02358
Nisansa de Silva, "Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language", 2015

Team

Faculty

Nisansa de Silva

Senior Lecturer

University of Moratuwa

MSc Students

Charitha Rathnayake

Lecturer on Contract

University of Moratuwa

Nevidu Jayatilleke

Research Assistant (Assistant Lecturer Grade)

Informatics Institute of Technology

Yomal De Mel

Manager Finance

MAS Active

Undergraduates

Imalsha Puranegedara

Student

University of Moratuwa

Alumni-PhD Students

Aloka Fernando

Researcher / Visiting Lecturer

Informatics Institute of Technology

Alumni-MSc Students

Kasun Wickramasinghe

AI Research Engineer

Analog Inference

Kushan Hewapathirana

Machine Learning Engineer

ConscientAI

Pubudu Cooray

Lead Software Engineer

Insighture

Menan Velayuthan

AI Research Engineer

University of Moratuwa

Alumni-Undergraduates

Aravinda Kankanamge

Software Engineer Fellow

Lanka Software Foundation

Buddhika Gunathilaka

Software Engineer

Harlem Next

Chamila Wijayarathna

Software Engineer

Trovio

Dilith Jayakody

Graduate Student

Dalhousie University

Dimuthu Upeksha

Director of Engineering

Folia

Gihan Weeraprameshwara

Ph.D. Student

Michigan State University

Indeewari Wijesiri

Associate Technical Lead

WSO2

Isuranga Iniyage

Senior Software Engineer

Cut+Dry

Lahiru Lasandun

Senior Technical Lead

SenzMate

Madhuranga Lakjeewa

Software Engineer

Automic Group

Maduranga Siriwardena

Associate Director / Architect

WSO2

Malaka Gallage

Senior Full Stack Developer

Hitachi Energy

Mithun Wijethunga

Software Engineer III

UST

Nadil Karunarathne

Software Engineer

HYVOR

Nisal Ranathunga

Software Engineer

Yaala Labs

Pathum Mihiranga

Senior Software Engineer

Sysco LABS Sri Lanka

Rashad Sirajudeen

Software Engineer

WSO2

Samith Karunathilake

Software Engineer

WSO2

Tharusha Bandaranayake

Data Scientist

jahan.ai

Themira Chathumina

Software Engineer

Sysco Labs

Vihanga Jayawickrama

Lecturer (on Contract)

University of Moratuwa

Grants

This project was partially supported by the following grants:

2022

2025

Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English

$35,000 - Google/2022

We propose to create a multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English, the official languages of Sri Lanka.
