Professional Experience

  • Present 2020

    Senior Lecturer

    Department of Computer Science & Engineering, University of Moratuwa,
    Sri Lanka

  • 2021 2020

    Research Fellow

    LIRNEasia,
    Sri Lanka

  • 2020 2014

    Graduate Research/Teaching Fellow

    University of Oregon, Department of Computer and Information Science,
    USA

  • 2018 2018

    Givens Associate

    Argonne National Laboratory,
    USA

  • 2020 2011

    Lecturer

    Department of Computer Science & Engineering, University of Moratuwa,
    Sri Lanka

  • 2014 2013

    Researcher

    LIRNEasia,
    Sri Lanka

  • 2014 2013

    Visiting Lecturer

    Northshore College of Business and Technology,
    Sri Lanka

Education

  • Ph.D. 2020

    Ph.D. in Computer & Information Science

    University of Oregon, USA

  • MS 2016

    MS in Computer & Information Science

    University of Oregon, USA

  • BSc 2011

    B.Sc. Engineering (Hons) in Computer Science & Engineering

    University of Moratuwa, Sri Lanka

Featured Research

Sinhala-English Parallel Word Dictionary Dataset


K. Wickramasinghe and N. de Silva

2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS), IEEE, 2023, pp. 61-66.

Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, in the cases where one of the considered language pairs is a low-resource language, the existing top-down parallel data such as corpora are lacking in both tally and quality due to the dearth of human annotation. Therefore, for low-resource languages, it is more feasible to move in the bottom-up direction where finer granular pairs such as dictionary datasets are developed first. They may then be used for mid-level tasks such as supervised multilingual word embedding alignment. These in turn can later guide higher-level tasks in the order of aligning sentence or paragraph text corpora used for Machine Translation (MT). Even though more approachable than generating and aligning a massive corpus for a low-resource language, for the same reason of apathy from larger research entities, even these finer granular data sets are lacking for some low-resource languages. We have observed that there is no free and open dictionary data set for the low-resource language, Sinhala. Thus, in this work, we introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages. In this paper, we explain the dataset creation pipeline as well as the experimental results of the tests we have carried out to verify the quality of the data sets. The data sets and the related scripts are available at https://github.com/kasunw22/sinhala-para-dict.