Nisansa de Silva

Nisansa Dilushan de Silva

I am a lecturer at the Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka. I obtained my Ph.D. (2020, advisor: Dejing Dou) and MS (2016) degrees in Computer and Information Science from the University of Oregon, USA, and my BSc (Hons) degree in Computer Science & Engineering (2011) from the University of Moratuwa. I joined the University of Moratuwa as a lecturer in 2011. From 2013 to 2014, I worked as a researcher at LIRNEasia. In 2018, I worked as a Givens Associate at Argonne National Laboratory, USA. I have over 40 peer-reviewed publications in computer science, mostly in the subfields of Natural Language Processing and Artificial Intelligence, which have earned more than 400 citations. I am also an associate member of the IESL.

Nisansa de Silva (Dr. Nisansa Dilushan de Silva, Ph.D.), NisansaDdS

Computer Scientist

Professional Experience

2020 to Present

Senior Lecturer

Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka

2020 to 2021

Research Fellow

LIRNEasia, Sri Lanka

2014 to 2020

Graduate Research/Teaching Fellow

Department of Computer and Information Science, University of Oregon, USA

2018

Givens Associate

Argonne National Laboratory, USA

2011 to 2020

Lecturer

Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka

2013 to 2014

Researcher

LIRNEasia, Sri Lanka

2013 to 2014

Visiting Lecturer

Northshore College of Business and Technology, Sri Lanka

Education

2020

Ph.D. in Computer & Information Science

University of Oregon, USA

2016

MS in Computer & Information Science

University of Oregon, USA

2011

B.Sc. Engineering (Hons) in Computer Science & Engineering

University of Moratuwa, Sri Lanka

Featured Research

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

I. Caswell, J. Kreutzer, L. Wang, A. Wahab, D. van Esch, N. Ulzii-Orshikh, A. Tapo, N. Subramani, A. Sokolov, C. Sikasote, et al.

arXiv preprint arXiv:2103.12028, 2021

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
