Social Media Text Analysis
Principal Investigator: Nisansa de Silva
We propose to advance social media text analysis by combining linguistic study, preprocessing, and deep learning models to better capture sentiment, structure, and meaning in informal, large-scale user-generated text.
Social media platforms produce vast amounts of user-generated text that is often informal, noisy, and linguistically diverse. Analyzing this text presents unique challenges for natural language processing, including spelling variations, non-standard grammar, and rapid shifts in language use. This project seeks to address these challenges by developing methods and resources for effective social media text analysis.
Our research investigates the linguistic properties of social media text and applies normalization techniques to reduce noise and improve downstream model performance. We explore sentiment analysis and reaction prediction methods, from baseline classifiers to deep learning models, to capture user opinions at a fine-grained level. We use embedding-based techniques to represent short, informal text effectively, with particular attention to low-resource settings.
We also emphasize dataset creation and large-scale corpus studies, including temporal analyses of user-generated content, to provide insights into evolving linguistic and sentiment patterns. By integrating linguistic analysis, preprocessing, and modern modeling approaches, this work contributes to more robust, accurate, and scalable systems for social media text analysis.
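To make the normalization step mentioned above concrete, the following is a minimal sketch of a dictionary-based normalizer for noisy text. It is not the project's implementation: the `SLANG` lookup table, the function names, and the rule of collapsing repeated characters to two are all assumptions chosen for illustration, and the toy example covers English only.

```python
import re

# Hypothetical lookup table, for illustration only; a real system would
# use a much larger, curated dictionary for the target language.
SLANG = {"u": "you", "gr8": "great", "pls": "please", "thx": "thanks"}


def squeeze_repeats(token: str) -> str:
    """Collapse runs of three or more identical characters down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def normalize(text: str) -> str:
    """Lowercase, split on whitespace, squeeze elongations, apply the dictionary."""
    normalized = []
    for token in text.lower().split():
        token = squeeze_repeats(token)
        # Fall back to the token itself when it is not in the dictionary.
        normalized.append(SLANG.get(token, token))
    return " ".join(normalized)


if __name__ == "__main__":
    print(normalize("Thx u are gr8"))
    print(normalize("sooooo goood"))
```

This sketch deliberately ignores punctuation handling, spell correction, and script-specific issues such as combining characters, all of which a practical normalizer for social media text would need to address.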
Objectives:
- Investigate linguistic properties and stylistic patterns in social media text to better understand its unique characteristics.
- Develop preprocessing and normalization techniques tailored for noisy, user-generated content.
- Explore sentiment analysis and reaction prediction models to capture opinions and attitudes expressed in social media.
- Design and evaluate embedding-based approaches for representing short, informal, and multilingual text.
- Build datasets and benchmarks derived from large-scale social media corpora to facilitate reproducible research.
- Examine temporal and large-scale patterns in social media data to support longitudinal linguistic and sentiment studies.
Keywords: Natural Language Processing | Sinhala | Big Data | Machine Learning / Deep Learning
Publications
Journal Papers
W M Yomal De Mel and Nisansa de Silva, "Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study", The International Journal on Advances in ICT for Emerging Regions, vol. 18, no. 2, pp. 121-130, 2025. doi: 10.4038/icter.v18i2.7299
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Facebook for Sentiment Analysis: Baseline Models to Predict Facebook Reactions of Sinhala Posts", The International Journal on Advances in ICT for Emerging Regions, vol. 15, no. 2, 2022. doi: 10.4038/icter.v15i2.7248
Eranga Mapa, Lasitha Wattaladeniya, Chiran Chathuranga, Samith Dassanayake, Nisansa de Silva, Upali Kohomban, and Danaja Maldeniya, "Text Normalization in Social Media by using Spell Correction and Dictionary Based Approach", Systems Learning, vol. 1, pp. 1-6, 2012.
Conference Papers
Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages", in Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, 2022, pp. 325-336. doi: 10.48550/ARXIV.2210.14472

Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, and Yudhanjaya Wijeratne, "Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data", in 2022 The 3rd International Conference on Artificial Intelligence in Electronics Engineering, Association for Computing Machinery, 2022, pp. 16-22. doi: 10.1145/3512826.3512829
Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, and Yudhanjaya Wijeratne, "Seeking Sinhala Sentiment: Predicting Facebook Reactions of Sinhala Posts", in 2021 21st International Conference on Advances in ICT for Emerging Regions (ICter), 2021, pp. 177-182. doi: 10.1109/ICter53630.2021.9774796

White Papers
Yudhanjaya Wijeratne and Nisansa de Silva, "Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook", arXiv preprint arXiv:2007.07884, 2020. doi: 10.2139/ssrn.3650976
Preprints
Nisansa de Silva, Danaja Maldeniya, and Chamilka Wijeratne, "Subject Specific Stream Classification Preprocessing Algorithm for Twitter Data Stream", arXiv preprint arXiv:1705.09995, 2017. doi: 10.48550/ARXIV.1705.09995
Team
External Collaborators: Yudhanjaya Wijeratne | Upali Kohomban | Danaja Maldeniya | Chamilka Wijeratne








