
Low-Resource Adaptation for NLP

Principal Investigator: Nisansa de Silva

We propose to advance natural language processing for low-resource settings by leveraging data augmentation, transfer learning, and model adaptation techniques to overcome the scarcity of annotated data and linguistic resources.

Low-resource natural language processing (NLP) remains a key challenge in enabling equitable access to language technologies across diverse linguistic communities. The lack of high-quality annotated data and robust resources often limits the applicability of modern NLP techniques to many languages and tasks. This project seeks to address these challenges by systematically developing methods for data creation, augmentation, and model adaptation.
We explore strategies such as data augmentation, denoising, and debiasing of parallel corpora to improve training material. Cross-lingual lexicon induction, embedding alignment, and resource development are employed to establish strong linguistic foundations. On the modeling side, we focus on knowledge distillation, adapter-based fine-tuning, and the efficient adaptation of large pre-trained models so that they perform well in low-resource contexts.
Additionally, we emphasize building reproducible datasets and benchmarks, supporting multi-domain learning, and comparatively evaluating approaches across tasks such as sentiment analysis, summarization, classification, and generation. We also investigate zero-shot and few-shot settings, extending the benefits of large-scale multilingual encoders and large language models to under-represented languages.
Through this integrated approach, the project contributes towards reducing the performance gap between resource-rich and low-resource NLP, ensuring broader accessibility and inclusivity in language technologies.

Objectives:

Investigate data augmentation and denoising strategies to enhance the quality of training data for low-resource NLP tasks.
Develop cross-lingual lexicon induction, embedding alignment, and resource creation methods to bridge gaps between resource-rich and low-resource languages.
Explore multi-domain and multi-task approaches that improve model generalization in low-resource contexts.
Apply transfer learning, knowledge distillation, and adapter-based fine-tuning to adapt large pre-trained models efficiently.
Build datasets, benchmarks, and evaluation frameworks to support reproducibility and scalability in low-resource NLP research.
Examine zero-shot and few-shot methods to extend model capabilities to languages or tasks with minimal supervision.
Advance sentiment analysis, summarization, and generation methods that can perform effectively despite limited resources.

Publications

Dissertations

W.A.S.A Fernando, "Data Augmentation to Induce High Quality Parallel Data for Low-Resource Neural Machine Translation", University of Moratuwa, 2025

Theses: MSc Major Component Research

Menan Velayuthan, "Multi-Domain Neural Machine Translation with Knowledge Distillation For Low Resource Languages", University of Moratuwa, 2025

Theses: MSc Minor Component Research

Pubudu Cooray, "Headline Generation for Sinhala Newspaper Articles using Pre-trained Language Models", University of Moratuwa, 2025
Kushan Hewapathirana, "Towards Multi-document Summarisation in Low-resource Settings", University of Moratuwa, 2025
Sadeep Gunathilaka, "Automated User Review Analysis To Facilitate Potential Mobile Application Evolution", University of Moratuwa, 2025
Kasun Wickramasinghe, "Bilingual Lexicon Induction for the Sinhala-English Language Pair", University of Moratuwa, 2024

Journal Papers

W M Yomal De Mel and Nisansa de Silva, "Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study", ICTer, vol. 18, no. 2, pp. 121-130, 2025. doi: 10.4038/icter.v18i2.7299
Dineth Jayakody, A V A Malkith, Koshila Isuranda, Vishal Thenuwara, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, and K L K Sudheera, "Instruct-DeBERTa: A Hybrid Approach for Aspect-based Sentiment Analysis on Textual Reviews", ICTer, vol. 18, no. 2, pp. 39-50, 2025. doi: 10.4038/icter.v18i2.7290
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Facebook for Sentiment Analysis: Baseline Models to Predict Facebook Reactions of Sinhala Posts", The International Journal on Advances in ICT for Emerging Regions, vol. 15, no. 2, 2022. doi: 10.4038/icter.v15i2.7248
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, and others, "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets", Transactions of the Association for Computational Linguistics, vol. 10, pp. 50-72, 2022. doi: 10.1162/tacl_a_00447

Conference Papers

Kavindu Warnakulasuriya, Prabhash Dissanayake, Navindu De Silva, Stephen Cranefield, Bastin Tony Roy Savarimuthu, Surangika Ranathunga, and Nisansa de Silva, "Evolution of Cooperation in LLM-Agent Societies: A Preliminary Study Using Different Punishment Strategies", in Coordination, Organizations, Institutions, Norms, and Ethics for Governance of Multi-Agent Systems XVIII, Springer Nature Switzerland, 2026, pp. 115-133. doi: 10.1007/978-3-032-17542-7_7
Nevidu Jayatilleke and Nisansa de Silva, "SiDiaC: Sinhala Diachronic Corpus", in Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation, 2025, pp. 511-527.
Nevidu Jayatilleke and Nisansa de Silva, "Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil", in Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing-Natural Language Processing in the Generative AI Era, 2025, pp. 471-480.
Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, and Mokanarangan Thayaparan, "Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages", in Moratuwa Engineering Research Conference (MERCon), 2025. doi: 10.1109/MERCon67903.2025.11216992
H W K Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, and Rishemjit Kaur, "SinLlama-A Large Language Model for Sinhala", in Moratuwa Engineering Research Conference (MERCon), 2025. doi: 10.1109/MERCon67903.2025.11217094
Aloka Fernando, Nisansa de Silva, Menan Velayuthan, Charitha Rathnayaka, and Surangika Ranathunga, "Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics", in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 28252-28269. doi: 10.18653/v1/2025.emnlp-main.1435
Sadeep Gunathilaka and Nisansa de Silva, "Automatic Analysis of App Reviews Using LLMs", in Proceedings of the Conference on Agents and Artificial Intelligence, 2025, pp. 828-839. doi: 10.5220/0013375600003890
Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, and Surangika Ranathunga, "Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches", in Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, 2025, pp. 166-173.
Dineth Jayakody, Koshila Isuranda, A V A Malkith, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, K L K Sudheera, and Kashnika Gimhani Sarathchandra, "Enhanced Aspect-Based Sentiment Analysis with Integrated Category Extraction for Instruct-DeBERTa", in Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, 2024, pp. 665-674.
Dineth Jayakody, Koshila Isuranda, A V A Malkith, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, and K L K Sudheera, "Aspect-Based Sentiment Analysis Techniques: A Comparative Study", in 2024 Moratuwa Engineering Research Conference (MERCon), 2024, pp. 205-210. doi: 10.1109/MERCon63886.2024.10688631
Kushan Hewapathirana, Nisansa de Silva, and C D Athuraliya, "M2DS: Multilingual Dataset for Multi-document Summarisation", in International Conference on Computational Collective Intelligence, 2024, pp. 219-231. doi: 10.1007/978-3-031-70248-8_17
Surangika Ranathunga, Nisansa de Silva, Dilith Jayakody, and Aloka Fernando, "Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research", in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, and Charitha Rathnayake, "Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora", in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian's, Malta: Association for Computational Linguistics, Mar. 2024, pp. 860-880.
Kasun Wickramasinghe and Nisansa de Silva, "Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language", in Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, pp. 424-435.
Kasun Wickramasinghe and Nisansa de Silva, "Sinhala-English Parallel Word Dictionary Dataset", in 2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS), 2023, pp. 61-66. doi: 10.1109/ICIIS58898.2023.10253560
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages", in Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, 2022, pp. 325-336. doi: 10.48550/ARXIV.2210.14472
Sadeep Gunathilaka and Nisansa de Silva, "Aspect-based Sentiment Analysis on Mobile Application Reviews", in 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), 2022, pp. 183-188. doi: 10.1109/ICTer58063.2022.10024070
Surangika Ranathunga and Nisansa de Silva, "Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World", in Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, 2022, pp. 823-848. doi: 10.48550/ARXIV.2210.08523
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data", in 2022 The 3rd International Conference on Artificial Intelligence in Electronics Engineering, Association for Computing Machinery, 2022, pp. 16-22. doi: 10.1145/3512826.3512829
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Seeking Sinhala Sentiment: Predicting Facebook Reactions of Sinhala Posts", in 2021 21st International Conference on Advances in ICT for Emerging Regions (ICter), 2021, pp. 177-182. doi: 10.1109/ICter53630.2021.9774796
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva, and Gihan Dias, "Comparison Between Performance of Various Database Systems for Implementing a Language Corpus", in International Conference: Beyond Databases, Architectures and Structures, May 2015, pp. 82-91. doi: 10.1007/978-3-319-18422-7_7
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva, and Gihan Dias, "Implementing a Corpus for Sinhala Language", in Symposium on Language Technology for South Asia 2015, 2015. doi: 10.13140/RG.2.2.23035.11047
U. L. D. N. Gunasinghe, W. A. M. De Silva, N. H. N. D. de Silva, A. S. Perera, W. A. D. Sashika, and W. D. T. P. Premasiri, "Sentence Similarity Measuring by Vector Space Model", in 2014 International Conference on Advances in ICT for Emerging Regions (ICTer), December 2014, pp. 185-189. doi: 10.1109/ICTER.2014.7083899
Indeewari Wijesiri, Malaka Gallage, Buddhika Gunathilaka, Madhuranga Lakjeewa, Daya Wimalasuriya, Gihan Dias, Rohini Paranavithana, and Nisansa de Silva, "Building a WordNet for Sinhala", in Proceedings of the Seventh Global WordNet Conference, January 2014, pp. 100-108.
E. L. Karannagoda, H. M. T. C. Herath, K. N. J. Fernando, M. W. I. D. Karunarathne, N. H. N. D. de Silva, and A. S. Perera, "Document Analysis Based Automatic Concept Map Generation for Enterprises", in 2013 International Conference on Advances in ICT for Emerging Regions (ICTer), December 2013, pp. 154-159. doi: 10.1109/ICTer.2013.6761171

Workshop Papers

Menan Velayuthan, Nisansa de Silva, and Surangika Ranathunga, "Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation", in Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), 2025, pp. 161-170.
Dineth Jayakody, Koshila Isuranda, A V A Malkith, Nisansa de Silva, G G N Sandamali, K L K Sudheera, and Sachintha Rajith, "Instruct-DeBERTa: A Hybrid Approach for Enhanced Aspect-Based Sentiment Analysis with Category Extraction", in Eighth Widening NLP Workshop (WiNLP 2024) Phase II, 2024.
Menan Velayuthan, Dilith Jayakody, Nisansa de Silva, Aloka Fernando, and Surangika Ranathunga, "Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods", in Proceedings of the Ninth Conference on Machine Translation, 2024, pp. 901-907. doi: 10.18653/v1/2024.wmt-1.87

Extended Abstracts

Kushan Hewapathirana, Nisansa de Silva, and C. D. Athuraliya, "Adapter-based Fine-tuning for PRIMERA", in 1st Applied Data Science & Artificial Intelligence Symposium, April 2025. doi: 10.31705/ADScAI.2025.57

White Papers

Yudhanjaya Wijeratne and Nisansa de Silva, "Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook", arXiv preprint arXiv:2007.07884, 2020. doi: 10.2139/ssrn.3650976
Yudhanjaya Wijeratne, Nisansa de Silva, and Yashothara Shanmugarajah, "Natural Language Processing for Government: Problems and Potential", LIRNEasia, 2019. doi: 10.13140/RG.2.2.34297.31845

Preprints

Nisansa de Silva, "Survey on Publicly Available Sinhala Natural Language Processing Tools and Research", arXiv preprint arXiv:1906.02358, 2019. doi: 10.48550/arXiv.1906.02358
Nisansa de Silva, "Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language", 2015

Team

Faculty

Nisansa de Silva

Senior Lecturer

University of Moratuwa

MSc Students

Charitha Rathnayake

Lecturer (on Contract)

University of Moratuwa

Nevidu Jayatilleke

Research Assistant (Assistant Lecturer Grade)

Informatics Institute of Technology

Vishal Thenuwara

Software Engineer

Amused Group

Yomal De Mel

Manager Finance

MAS Active

Undergraduates

Imalsha Puranegedara

Student

University of Moratuwa

Kavindu Warnakulasuriya

Student

University of Moratuwa

Prabhash Dissanayake

Student

University of Moratuwa

Alumni-PhD Students

Aloka Fernando

Researcher / Visiting Lecturer

Informatics Institute of Technology

Alumni-MSc Students

Kasun Wickramasinghe

AI Research Engineer

Analog Inference

Kushan Hewapathirana

Machine Learning Engineer

ConscientAI

Pubudu Cooray

Lead Software Engineer

Insighture

Sadeep Gunathilaka

Software Engineer

Inexis Consulting

Velayuthan Menan

AI Research Engineer

University of Moratuwa

Alumni-Undergraduates

Amanda Malkith

Software Engineer

Cut+Dry

Anushka Mahesh

Senior Fullstack Engineer

Healthcare Clarity

Aravinda Kankanamge

Software Engineer Fellow

Lanka Software Foundation

Buddhika Gunathilaka

Software Engineer

Harlem Next

Chamila Wijayarathna

Software Engineer

Trovio

Dilith Jayakody

Graduate Student

Dalhousie University

Dimuthu Upeksha

Director of Engineering

Folia

Dineth Jayakody

Ph.D. Student

Old Dominion University

Dulanga Sashika

Senior Consultant

Visa

Eranda Karannagoda

Software Engineer

Huubap PTE Ltd

Gihan Weeraprameshwara

Ph.D. Student

Michigan State University

Indeewari Wijesiri

Associate Technical Lead

WSO2

Jayaprabath Fernando

R&D Engineer

Syntax Genie (Pvt) Ltd

Koshila Isuranda

Software Engineer

Emojot

Lahiru Lasandun

Senior Technical Lead

SenzMate

Madhuranga Lakjeewa

Software Engineer

Automic Group

Maduranga Siriwardena

Associate Director / Architect

WSO2

Malaka Gallage

Senior Full Stack Developer

Hitachi Energy

Nadeeshaan Gunasinghe

Expert Software Engineer

Zühlke Group

Navindu De Silva

Graduate Student

National University of Singapore

Nisal Ranathunga

Software Engineer

Yaala Labs

Rashad Sirajudeen

Software Engineer

WSO2

Sachintha Rajith

Co-founder and Chief Technology Officer

Emojot

Samith Karunathilake

Software Engineer

WSO2

Themira Chathumina

Software Engineer

Sysco Labs

Thilina Chathuranga

Principal Software Engineer

Fleetwise New Zealand

Thilina Premasiri

Senior Technical Artist

Skybox Labs

Vihanga Jayawickrama

Lecturer (on Contract)

University of Moratuwa

Grants

This project was partially supported by the following grants:

2022

2025

Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English

$35,000 - Google/2022

We propose to create a multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English, the official languages of Sri Lanka.
