
Sinmin - Sinhala Corpus Project

Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun
Advised by: Daya Chinthana Wimalasuriya, Gihan Dias, Nisansa de Silva
University of Moratuwa

Today, the corpus-based approach is regarded as the state-of-the-art methodology for studying languages, both prominent and lesser-known. It derives new knowledge about a language by answering two main questions:

  • What particular patterns are associated with lexical or grammatical features of the language?
  • How do these patterns differ within varieties and registers?
A language corpus is a collection of authentic texts stored electronically. It captures language patterns across different genres, time periods, and social variants. Most major languages have their own corpora, but the corpora built so far for the Sinhala language have many limitations.
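
As an illustration of the two questions above, the following minimal sketch counts how often one word pattern occurs in each register of a corpus. It is not taken from SinMin: the toy corpus, its English placeholder tokens, and the helper names `bigram_counts` and `relative_frequency` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus: a few pre-tokenized sentences per register.
# English placeholder tokens are used; this is not SinMin data.
toy_corpus = {
    "news": [
        ["the", "minister", "said", "that", "the", "bill", "passed"],
        ["the", "court", "said", "that", "the", "case", "closed"],
    ],
    "blog": [
        ["i", "think", "that", "the", "film", "was", "great"],
        ["i", "said", "it", "was", "great"],
    ],
}


def bigram_counts(sentences):
    """Count adjacent word pairs, never crossing sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def relative_frequency(sentences, bigram):
    """Share of all bigrams in these sentences that equal `bigram`."""
    counts = bigram_counts(sentences)
    total = sum(counts.values())
    return counts[bigram] / total if total else 0.0


# Question 1: which pattern? Here, the bigram ("said", "that").
# Question 2: how does its frequency differ between registers?
by_register = {
    name: relative_frequency(sentences, ("said", "that"))
    for name, sentences in toy_corpus.items()
}
```

Comparing the values in `by_register` shows whether the pattern is more typical of one register than another; a real corpus would apply the same idea to far larger, properly tokenized Sinhala text.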

SinMin is a corpus for the Sinhala language that
  • is continuously updated
  • is dynamic (scalable)
  • covers a wide range of language (structured and unstructured)
  • provides a better interface for users to interact with the corpus
This report presents a comprehensive literature review, followed by the research, design, and implementation details of the SinMin corpus. The implementation details are organized by platform component. Testing and future work are discussed towards the end of the report.

Keywords: Natural Language Processing | Sinhala | Big Data