Low-Resource Adaptation for NLP
Principal Investigator: Nisansa de Silva
We propose to advance natural language processing for low-resource settings by leveraging data augmentation, transfer learning, and model adaptation techniques to overcome the scarcity of annotated data and linguistic resources.
Low-resource natural language processing (NLP) remains a key challenge in enabling equitable access to language technologies across diverse linguistic communities. The lack of high-quality annotated data and robust resources often limits the applicability of modern NLP techniques to many languages and tasks. This project seeks to address these challenges by systematically developing methods for data creation, augmentation, and model adaptation.
On the data side, we explore augmentation, denoising, and debiasing of parallel corpora to improve training material, and we use cross-lingual lexicon induction, embedding alignment, and resource development to establish strong linguistic foundations. On the modeling side, we focus on knowledge distillation, adapter-based fine-tuning, and the efficient adaptation of large pre-trained models so that they perform well in low-resource contexts.
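As a rough illustration of what denoising a parallel corpus can involve, the sketch below applies a few generic heuristics (empty-side, copied-source, length-ratio, and duplicate checks). It is a minimal example under stated assumptions, not the project's actual pipeline: the function name `filter_parallel_pairs`, the whitespace tokenization, and the default threshold of 2.0 are all hypothetical choices made for illustration.

```python
def filter_parallel_pairs(pairs, max_length_ratio=2.0, min_tokens=1):
    """Keep (source, target) sentence pairs that pass simple noise heuristics.

    Illustrative only: thresholds and whitespace tokenization are assumptions.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        src_len, tgt_len = len(src.split()), len(tgt.split())

        # Drop pairs where either side is empty or shorter than min_tokens.
        if min(src_len, tgt_len) < max(1, min_tokens):
            continue
        # Drop untranslated pairs where the source was copied to the target.
        if src.lower() == tgt.lower():
            continue
        # Drop pairs whose token-length ratio is implausibly large.
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue
        # Drop exact duplicates of pairs already kept.
        if (src, tgt) in seen:
            continue

        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

In practice such rule-based filters are usually only a first pass; a real pipeline would typically add language identification and a learned similarity score on top.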
Additionally, we emphasize building reproducible datasets and benchmarks, supporting multi-domain learning, and comparing approaches across tasks such as sentiment analysis, summarization, classification, and generation. We also investigate zero-shot and few-shot settings, extending the benefits of large-scale multilingual encoders and large language models to under-represented languages.
Through this integrated approach, the project contributes towards reducing the performance gap between resource-rich and low-resource NLP, ensuring broader accessibility and inclusivity in language technologies.
Objectives:
- Investigate data augmentation and denoising strategies to enhance the quality of training data for low-resource NLP tasks.
- Develop cross-lingual lexicon induction, embedding alignment, and resource creation methods to bridge gaps between resource-rich and low-resource languages.
- Explore multi-domain and multi-task approaches that improve model generalization in low-resource contexts.
- Apply transfer learning, knowledge distillation, and adapter-based fine-tuning to adapt large pre-trained models efficiently.
- Build datasets, benchmarks, and evaluation frameworks to support reproducibility and scalability in low-resource NLP research.
- Examine zero-shot and few-shot methods to extend model capabilities to languages or tasks with minimal supervision.
- Advance sentiment analysis, summarization, and generation methods that can perform effectively despite limited resources.
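To make the knowledge distillation objective above more concrete, here is a minimal sketch of the classic soft-target formulation, in which a student is trained to match a teacher's temperature-softened output distribution. This is a generic textbook variant and an assumption on our part; it is not the sequence-level or encoder-aware method used in the project's publications. The function names and the default temperature of 2.0 are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(p_teacher, p_student)
        if p > 0
    )
    return (temperature ** 2) * kl
```

During training this term is commonly combined with the ordinary cross-entropy loss on gold labels, with a weighting that is tuned per task.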
Keywords: Natural Language Processing | Machine Learning / Deep Learning | Sinhala | Big Data | Ontologies | LLM | Low-resource Languages | Sentiment Analysis | Aspect-based Sentiment Analysis | Social Media | Multilingual | Multi-document summarization | Mobile App Review Analysis | Software Evolution | Word Embeddings | Word Vectorization | Fine-tuning | Word Embedding Alignment | BLI | Alignment Dictionaries | WordNet | Language-agnostic Processing | Textual Reviews | Multilingual Embedding | Inflected Languages | Measure Alignment | Corpus | Aspect Extraction | DeBERTa | Text Classification | GPT | InstructABSA | Text Generation
Publications
Dissertations
W.A.S.A Fernando, "Data Augmentation to Induce High Quality Parallel Data for Low-Resource Neural Machine Translation", University of Moratuwa, 2025
Theses: MSc Major Component Research
Menan Velayuthan, "Multi-Domain Neural Machine Translation with Knowledge Distillation For Low Resource Languages", University of Moratuwa, 2025
Theses: MSc Minor Component Research
Pubudu Cooray, "Headline Generation for Sinhala Newspaper Articles using Pre-trained Language Models", University of Moratuwa, 2025
Kushan Hewapathirana, "Towards Multi-document Summarisation in Low-resource Settings", University of Moratuwa, 2025
Sadeep Gunathilaka, "Automated User Review Analysis To Facilitate Potential Mobile Application Evolution", University of Moratuwa, 2025
Kasun Wickramasinghe, "Bilingual Lexicon Induction for the Sinhala-English Language Pair", University of Moratuwa, 2024
Journal Papers
W M Yomal De Mel and Nisansa de Silva, "Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study", ICTer, vol. 18, no. 2, pp. 121-130, 2025. doi: 10.4038/icter.v18i2.7299
Dineth Jayakody, A V A Malkith, Koshila Isuranda, Vishal Thenuwara, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, and K L K Sudheera, "Instruct-DeBERTa: A Hybrid Approach for Aspect-based Sentiment Analysis on Textual Reviews", ICTer, vol. 18, no. 2, pp. 39-50, 2025. doi: 10.4038/icter.v18i2.7290
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Facebook for Sentiment Analysis: Baseline Models to Predict Facebook Reactions of Sinhala Posts", The International Journal on Advances in ICT for Emerging Regions, vol. 15, no. 2, 2022. doi: 10.4038/icter.v15i2.7248
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, and others, "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets", Transactions of the Association for Computational Linguistics, vol. 10, pp. 50--72, 2022. doi: 10.1162/tacl_a_00447
Conference Papers
Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, and Mokanarangan Thayaparan, "Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages", in Moratuwa Engineering Research Conference (MERCon), 2025. doi: 10.1109/MERCon67903.2025.11216992
H W K Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, and Rishemjit Kaur, "SinLlama-A Large Language Model for Sinhala", in Moratuwa Engineering Research Conference (MERCon), 2025. doi: 10.1109/MERCon67903.2025.11217094
Aloka Fernando, Nisansa de Silva, Menan Velayuthan, Charitha Rathnayaka, and Surangika Ranathunga, "Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics", in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 28252--28269. doi: 10.18653/v1/2025.emnlp-main.1435
Sadeep Gunathilaka and Nisansa de Silva, "Automatic Analysis of App Reviews Using LLMs", in Proceedings of the Conference on Agents and Artificial Intelligence, 2025, pp. 828-839. doi: 10.5220/0013375600003890
Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, and Surangika Ranathunga, "Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches", in Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, 2025, pp. 166--173.
Dineth Jayakody, Koshila Isuranda, A V A Malkith, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, K L K Sudheera, and Kashnika Gimhani Sarathchandra, "Enhanced Aspect-Based Sentiment Analysis with Integrated Category Extraction for Instruct-DeBERTa", in Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, 2024, pp. 665--674.
Dineth Jayakody, Koshila Isuranda, A V A Malkith, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, and K L K Sudheera, "Aspect-Based Sentiment Analysis Techniques: A Comparative Study", in 2024 Moratuwa Engineering Research Conference (MERCon), 2024, pp. 205--210. doi: 10.1109/MERCon63886.2024.10688631
Kushan Hewapathirana, Nisansa de Silva, and C D Athuraliya, "M2DS: Multilingual Dataset for Multi-document Summarisation", in International Conference on Computational Collective Intelligence, 2024, pp. 219--231. doi: 10.1007/978-3-031-70248-8_17
Surangika Ranathunga, Nisansa de Silva, Dilith Jayakody, and Aloka Fernando, "Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research", in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, and Charitha Rathnayake, "Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora", in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian's, Malta: Association for Computational Linguistics, Mar. 2024, pp. 860--880.
Kasun Wickramasinghe and Nisansa de Silva, "Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language", in Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, pp. 424--435.
Kasun Wickramasinghe and Nisansa de Silva, "Sinhala-English Parallel Word Dictionary Dataset", in 2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS), 2023, pp. 61--66. doi: 10.1109/ICIIS58898.2023.10253560
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages", in Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, 2022, pp. 325--336. doi: 10.48550/ARXIV.2210.14472
Sadeep Gunathilaka and Nisansa de Silva, "Aspect-based Sentiment Analysis on Mobile Application Reviews", in 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), 2022, pp. 183--188. doi: 10.1109/ICTer58063.2022.10024070
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data", in 2022 The 3rd International Conference on Artificial Intelligence in Electronics Engineering, Association for Computing Machinery, 2022, pp. 16-22. doi: 10.1145/3512826.3512829
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Seeking Sinhala Sentiment: Predicting Facebook Reactions of Sinhala Posts", in 2021 21st International Conference on Advances in ICT for Emerging Regions (ICter), 2021, pp. 177-182. doi: 10.1109/ICter53630.2021.9774796
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva, and Gihan Dias, "Comparison Between Performance of Various Database Systems for Implementing a Language Corpus", in International Conference: Beyond Databases, Architectures and Structures, May 2015, pp. 82--91. doi: 10.1007/978-3-319-18422-7_7
Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva, and Gihan Dias, "Implementing a Corpus for Sinhala Language", in Symposium on Language Technology for South Asia 2015, 2015. doi: 10.13140/RG.2.2.23035.11047
U. L. D. N. Gunasinghe, W. A. M. De Silva, N. H. N. D. de Silva, A. S. Perera, W. A. D. Sashika, and W. D. T. P. Premasiri, "Sentence Similarity Measuring by Vector Space Model", in Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on, December 2014, pp. 185--189. doi: 10.1109/ICTER.2014.7083899
Indeewari Wijesiri, Malaka Gallage, Buddhika Gunathilaka, Madhuranga Lakjeewa, Daya Wimalasuriya, Gihan Dias, Rohini Paranavithana, and Nisansa de Silva, "Building a WordNet for Sinhala", in Proceedings of the Seventh Global WordNet Conference, January 2014, pp. 100--108.
E. L. Karannagoda, H. M. T. C. Herath, K. N. J. Fernando, M. W. I. D. Karunarathne, N. H. N. D. de Silva, and A. S. Perera, "Document Analysis Based Automatic Concept Map Generation for Enterprises", in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on, December 2013, pp. 154--159. doi: 10.1109/ICTer.2013.6761171
Workshop Papers
Menan Velayuthan, Nisansa de Silva, and Surangika Ranathunga, "Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation", in Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), 2025, pp. 161--170.
Dineth Jayakody, Koshila Isuranda, A V A Malkith, Nisansa de Silva, G G N Sandamali, K L K Sudheera, and Sachintha Rajith, "Instruct-DeBERTa: A Hybrid Approach for Enhanced Aspect-Based Sentiment Analysis with Category Extraction", in Eighth Widening NLP Workshop (WiNLP 2024) Phase II, 2024.
Menan Velayuthan, Dilith Jayakody, Nisansa de Silva, Aloka Fernando, and Surangika Ranathunga, "Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods", in Proceedings of the Ninth Conference on Machine Translation, 2024, pp. 901--907. doi: 10.18653/v1/2024.wmt-1.87
Extended Abstracts
Kushan Hewapathirana, Nisansa de Silva, and C. D. Athuraliya, "Adapter-based Fine-tuning for PRIMERA", in 1st Applied Data Science & Artificial Intelligence Symposium, April 2025.
White Papers
Yudhanjaya Wijeratne and Nisansa de Silva, "Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook", arXiv preprint arXiv:2007.07884, 2020. doi: 10.2139/ssrn.3650976
Yudhanjaya Wijeratne, Nisansa de Silva, and Yashothara Shanmugarajah, "Natural Language Processing for Government: Problems and Potential", LIRNEasia, 2019. doi: 10.13140/RG.2.2.34297.31845
Preprints
Nevidu Jayatilleke and Nisansa de Silva, "SiDiaC: Sinhala Diachronic Corpus", arXiv preprint arXiv:2509.17912, 2025
Nevidu Jayatilleke and Nisansa de Silva, "Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil", arXiv preprint arXiv:2507.18264, 2025
Kavindu Warnakulasuriya, Prabhash Dissanayake, Navindu De Silva, Stephen Cranefield, Bastin Tony Roy Savarimuthu, Surangika Ranathunga, and Nisansa de Silva, "Evolution of Cooperation in LLM-Agent Societies: A Preliminary Study Using Different Punishment Strategies", arXiv preprint arXiv:2504.19487, 2025
Nisansa de Silva, "Survey on Publicly Available Sinhala Natural Language Processing Tools and Research", arXiv preprint arXiv:1906.02358, 2019. doi: 10.48550/ARXIV.1906.02358
Nisansa de Silva, "Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language", 2015
Team
External Collaborators: G G N Sandamali | K L K Sudheera | Yudhanjaya Wijeratne | Surangika Ranathunga | Mokanarangan Thayaparan | Rishemjit Kaur | C D Athuraliya | Chinthana Wimalasuriya | Gihan Dias | Shehan Perera | Stephen Cranefield | Bastin Tony Roy Savarimuthu
Faculty
MSc Students
Undergraduates
Alumni-PhD Students
Alumni-MSc Students
Alumni-Undergraduates
Grants