
Sinhala Language Resources

Principal Investigator: Nisansa de Silva

We propose to build and analyze Sinhala language resources, datasets, and models to strengthen NLP research for this low-resource language, while connecting it to English and Tamil through bilingual lexicons, embeddings, and cross-lingual methods.

Sinhala, spoken by over 17 million people, remains an under-resourced language in NLP. This project aims to address this gap by systematically creating and analyzing resources and models tailored to Sinhala and its interaction with English and Tamil. We propose to compile and curate large-scale corpora from diverse domains such as news, social media, and YouTube comments, while building structured resources such as bilingual lexicons, WordNet expansions, and diachronic corpora.
On the modeling side, we will explore headline generation, transliteration, multi-document summarization, sentiment prediction, and classification methods. Special emphasis will be given to embedding alignment between Sinhala and English, enabling transfer learning and benchmarking for cross-lingual tasks. We will also investigate zero-shot OCR for Sinhala and Tamil, and experiment with adapting multilingual encoders and LLMs (e.g., SinLlama) for Sinhala.
The outcome of this work will be a comprehensive ecosystem of datasets, tools, and models that will not only advance Sinhala NLP but also benefit multilingual research across low-resource languages globally.

Objectives:

  • Build foundational linguistic resources for Sinhala, such as corpora and lexicons.
  • Conduct large-scale corpus studies on Sinhala text.
  • Explore effective neural and statistical models for Sinhala text tasks such as sentiment analysis, headline generation, transliteration, and summarization.
  • Evaluate and benchmark Sinhala datasets to guide future NLP work for low-resource languages.
  • Advance Sinhala-inclusive multilingual and cross-lingual NLP.


Keywords: Natural Language Processing | Machine Learning / Deep Learning | Sinhala | Big Data | Ontologies | Low-resource Languages | Social Media | Multilingual | Sentiment Analysis | LLM | Text Classification | Word Embedding Alignment | Corpus | Neural Machine Translation | Word Vectorization | Measure Alignment | Inflected Languages




Team

External Collaborators: Yudhanjaya Wijeratne | Surangika Ranathunga | Mokanarangan Thayaparan | Rishemjit Kaur | C D Athuraliya | Chinthana Wimalasuriya | Gihan Dias


Faculty

Nisansa de Silva

Senior Lecturer
University of Moratuwa

MSc Students

Charitha Rathnayake

Lecturer (on Contract)
University of Moratuwa

Nevidu Jayatilleke

Research Assistant (Assistant Lecturer Grade)
Informatics Institute of Technology

Yomal De Mel

Manager Finance
MAS Active

Undergraduates

Imalsha Puranegedara

Student
University of Moratuwa

Nisal Ranathunga

Student
University of Moratuwa

Rashad Sirajudeen

Student
University of Moratuwa

Samith Karunathilake

Software Engineer
WSO2

Themira Chathumina

Student
University of Moratuwa

Alumni-PhD Students

Aloka Fernando

Researcher / Visiting Lecturer
Informatics Institute of Technology

Alumni-MSc Students

Kasun Wickramasinghe

AI Research Engineer
Analog Inference

Kushan Hewapathirana

Machine Learning Engineer
ConscientAI

Pubudu Cooray

Lead Software Engineer
Insighture

Velayuthan Menan

AI Research Engineer
University of Moratuwa

Alumni-Undergraduates

Aravinda Kankanamge

Software Engineer Fellow
Lanka Software Foundation

Buddhika Gunathilaka

Software Engineer
Harlem Next

Dilith Jayakody

Graduate Student
Dalhousie University

Dimuthu Upeksha

Director of Engineering
Folia

Gihan Weeraprameshwara

Ph.D. Student
Michigan State University

Indeewari Wijesiri

Associate Technical Lead
WSO2

Lahiru Lasandun

Senior Technical Lead
SenzMate

Madhuranga Lakjeewa

Software Engineer
Automic Group

Maduranga Siriwardena

Associate Director / Architect
WSO2

Malaka Gallage

Senior Full Stack Developer
Hitachi Energy

Vihanga Jayawickrama

Lecturer (on Contract)
University of Moratuwa

Grants

2022-08 to 2023-08
Multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English
$35,000 - Google/2022
We propose to create a multi-domain Neural Machine Translation (NMT) System for Sinhala, Tamil, and English, the official languages of Sri Lanka.