Sinhala Diachronic Corpus

Nisansa de Silva
NeurIPS 2025 AI for Science Workshop

We propose to build a comprehensive diachronic corpus for the Sinhala language. Sinhala is the native language of Sri Lanka, where it has official language status. The corpus aims to address the notable gap in available linguistic resources for the language, which has a rich literary heritage yet remains under-represented in research. This work would mainly benefit the Sinhala-speaking community of approximately 16 million people, an underserved minority in the global language ecosystem. Further, researchers studying the historical evolution of related languages, such as Dhivehi, Marathi, Sanskrit, and Hindi, may use this corpus as an auxiliary source in their work. We hope to compile a diverse dataset that spans various time periods and genres, drawing from a wide array of sources, including online content, books, articles, and newspapers. Given that Sinhala has passed through a significant number of evolutionary eras since its origins in the 3rd to 2nd centuries BCE, we intend to collaborate with Sri Lankan institutions and language scholars to annotate texts by their original writing dates rather than their publication dates. Ultimately, following its open-access release, this corpus will not only advance the understanding of the linguistic evolution of Sinhala but also contribute to the preservation of its literary legacy.

Keywords: Sinhala | Natural Language Processing