Sinhala Language Resources
Principal Investigator: Nisansa de Silva
We propose to build and analyze Sinhala language resources, datasets, and models to strengthen NLP research for this low-resource language, while connecting it to English and Tamil through bilingual lexicons, embeddings, and cross-lingual methods.
Sinhala, spoken by over 17 million people, remains an under-resourced language in NLP. This project aims to address this gap by systematically creating and analyzing resources and models tailored to Sinhala and its interaction with English and Tamil. We propose to compile and curate large-scale corpora from diverse domains such as news, social media, and YouTube comments, while building structured resources such as bilingual lexicons, WordNet expansions, and diachronic corpora.
On the modeling side, we will explore methods for headline generation, transliteration, multi-document summarization, sentiment prediction, and text classification. We will place special emphasis on aligning Sinhala and English embeddings, which enables transfer learning and benchmarking on cross-lingual tasks. We will also investigate zero-shot OCR for Sinhala and Tamil, and experiment with adapting multilingual encoders and LLMs (e.g., SinLlama) for Sinhala.
The outcome of this work will be an ecosystem of datasets, tools, and models that advances Sinhala NLP and supports multilingual research on other low-resource languages.
Objectives:
- Build foundational linguistic resources for Sinhala, such as corpora and lexicons.
- Conduct large-scale corpus studies on Sinhala text.
- Explore effective neural and statistical models for Sinhala text tasks such as sentiment analysis, headline generation, transliteration, and summarization.
- Evaluate and benchmark Sinhala datasets to guide future NLP work for low-resource languages.
- Advance Sinhala-inclusive multilingual and cross-lingual NLP.
Keywords: Natural Language Processing | Machine Learning / Deep Learning | Sinhala | Big Data | Ontologies | Low-resource Languages | Social Media | Multilingual | Sentiment Analysis | LLM | Text Classification | Word Embedding Alignment | Corpus | Neural Machine Translation | Word Vectorization | Measure Alignment | Inflected Languages
Publications
Dissertations
W.A.S.A Fernando, "Data Augmentation to Induce High Quality Parallel Data for Low-Resource Neural Machine Translation", University of Moratuwa, 2025
Theses: MSc Major Component Research
Menan Velayuthan, "Multi-Domain Neural Machine Translation with Knowledge Distillation For Low Resource Languages", University of Moratuwa, 2025
Theses: MSc Minor Component Research
Pubudu Cooray, "Headline Generation for Sinhala Newspaper Articles using Pre-trained Language Models", University of Moratuwa, 2025
Kasun Wickramasinghe, "Bilingual Lexicon Induction for the Sinhala-English Language Pair", University of Moratuwa, 2024
Journal Papers
W M Yomal De Mel and Nisansa de Silva, "Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study", ICTer, vol. 18, no. 2, pp. 121-130, 2025. doi: 10.4038/icter.v18i2.7299
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Facebook for Sentiment Analysis: Baseline Models to Predict Facebook Reactions of Sinhala Posts", The International Journal on Advances in ICT for Emerging Regions, vol. 15, no. 2, 2022. doi: 10.4038/icter.v15i2.7248
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, and others, "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets", Transactions of the Association for Computational Linguistics, vol. 10, pp. 50-72, 2022. doi: 10.1162/tacl_a_00447
Conference Papers
Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, and Mokanarangan Thayaparan, "Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages", in Moratuwa Engineering Research Conference (MERCon), 2025. doi: 10.1109/MERCon67903.2025.11216992
H W K Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, and Rishemjit Kaur, "SinLlama - A Large Language Model for Sinhala", in Moratuwa Engineering Research Conference (MERCon), 2025. doi: 10.1109/MERCon67903.2025.11217094
Aloka Fernando, Nisansa de Silva, Menan Velayuthan, Charitha Rathnayaka, and Surangika Ranathunga, "Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics", in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 28252-28269. doi: 10.18653/v1/2025.emnlp-main.1435
Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, and Surangika Ranathunga, "Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches", in Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, 2025, pp. 166-173.
Kushan Hewapathirana, Nisansa de Silva, and C D Athuraliya, "M2DS: Multilingual Dataset for Multi-document Summarisation", in International Conference on Computational Collective Intelligence, 2024, pp. 219-231. doi: 10.1007/978-3-031-70248-8_17
Surangika Ranathunga, Nisansa de Silva, Dilith Jayakody, and Aloka Fernando, "Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research", in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, and Charitha Rathnayake, "Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora", in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian's, Malta: Association for Computational Linguistics, Mar. 2024, pp. 860-880.
Kasun Wickramasinghe and Nisansa de Silva, "Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language", in Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, pp. 424-435.
Kasun Wickramasinghe and Nisansa de Silva, "Sinhala-English Parallel Word Dictionary Dataset", in 2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS), 2023, pp. 61-66. doi: 10.1109/ICIIS58898.2023.10253560
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages", in Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, 2022, pp. 325-336. doi: 10.48550/ARXIV.2210.14472
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data", in 2022 The 3rd International Conference on Artificial Intelligence in Electronics Engineering, Association for Computing Machinery, 2022, pp. 16-22. doi: 10.1145/3512826.3512829
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Seeking Sinhala Sentiment: Predicting Facebook Reactions of Sinhala Posts", in 2021 21st International Conference on Advances in ICT for Emerging Regions (ICter), 2021, pp. 177-182. doi: 10.1109/ICter53630.2021.9774796
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva, and Gihan Dias, "Comparison Between Performance of Various Database Systems for Implementing a Language Corpus", in International Conference: Beyond Databases, Architectures and Structures, May 2015, pp. 82-91. doi: 10.1007/978-3-319-18422-7_7
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva, and Gihan Dias, "Implementing a Corpus for Sinhala Language", in Symposium on Language Technology for South Asia 2015, 2015. doi: 10.13140/RG.2.2.23035.11047
Indeewari Wijesiri, Malaka Gallage, Buddhika Gunathilaka, Madhuranga Lakjeewa, Daya Wimalasuriya, Gihan Dias, Rohini Paranavithana, and Nisansa de Silva, "Building a WordNet for Sinhala", in Proceedings of the Seventh Global WordNet Conference, Jan. 2014, pp. 100-108.
Workshop Papers
Menan Velayuthan, Nisansa de Silva, and Surangika Ranathunga, "Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation", in Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), 2025, pp. 161-170.
Menan Velayuthan, Dilith Jayakody, Nisansa de Silva, Aloka Fernando, and Surangika Ranathunga, "Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods", in Proceedings of the Ninth Conference on Machine Translation, 2024, pp. 901-907. doi: 10.18653/v1/2024.wmt-1.87
White Papers
Yudhanjaya Wijeratne and Nisansa de Silva, "Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook", arXiv preprint arXiv:2007.07884, 2020. doi: 10.2139/ssrn.3650976
Yudhanjaya Wijeratne, Nisansa de Silva, and Yashothara Shanmugarajah, "Natural Language Processing for Government: Problems and Potential", LIRNEasia, 2019. doi: 10.13140/RG.2.2.34297.31845
Preprints
Nevidu Jayatilleke and Nisansa de Silva, "SiDiaC: Sinhala Diachronic Corpus", arXiv preprint arXiv:2509.17912, 2025
Nevidu Jayatilleke and Nisansa de Silva, "Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil", arXiv preprint arXiv:2507.18264, 2025
Nisansa de Silva, "Survey on Publicly Available Sinhala Natural Language Processing Tools and Research", arXiv preprint arXiv:1906.02358, 2019. doi: 10.48550/ARXIV.1906.02358
Nisansa de Silva, "Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language", 2015
Team
External Collaborators: Yudhanjaya Wijeratne | Surangika Ranathunga | Mokanarangan Thayaparan | Rishemjit Kaur | C D Athuraliya | Chinthana Wimalasuriya | Gihan Dias
Faculty
MSc Students
Undergraduates
Alumni-PhD Students
Alumni-MSc Students
Alumni-Undergraduates
Grants