Research Talks › 112 · 25 09 2024

Improving cross-lingual representations of multilingual LLMs

Aravinda Kankanamge, Samith Karunathilake, Isbahan Rashad
Slides · Video

Natural language processing has advanced significantly with the development of Multilingual Large Language Models (MLLMs), but these gains have not been evenly distributed: high-resource languages such as English and Chinese receive the most attention, while low-resource languages such as Sinhala, Punjabi, and Tamil remain underrepresented, making it difficult for models to perform well in them. This work aims to close that gap by examining several methods for improving MLLM performance on low-resource languages, with particular emphasis on preference optimization, tokenizer retraining, and fine-tuning. Focusing on South Asian languages, our study seeks to build more inclusive models that are better equipped to handle the linguistic challenges of the region. Using the LLaMA 3 model, we have so far benchmarked Sinhala and Punjabi, which has yielded important insights into the obstacles and prospects for improving model performance in low-resource settings; extending the evaluation to the other approaches we identified is ongoing work. We anticipate that the findings of this study will contribute to more equitable AI technologies, ensuring that advances in language modeling benefit and remain accessible to speakers of all languages, especially those in underrepresented linguistic communities.
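To make the tokenizer-retraining direction mentioned above concrete, here is a minimal, hypothetical sketch of one common recipe: training additional subword tokens on low-resource-language text, merging them into a pretrained model's vocabulary, and resizing the embedding matrix so the new tokens can be learned during subsequent fine-tuning. The checkpoint name, corpus file, and vocabulary size are illustrative assumptions, not the exact setup used in this work.

```python
# Sketch only: extend a pretrained tokenizer with tokens learned from
# Sinhala/Punjabi text and resize the model's embeddings to match.
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def low_resource_corpus():
    # Hypothetical iterator over raw text lines in the target language(s).
    with open("low_resource_corpus.txt", encoding="utf-8") as f:
        for line in f:
            yield line.strip()

# Train a tokenizer of the same type on the low-resource corpus, then add
# only the tokens that are not already in the original vocabulary.
new_tokenizer = tokenizer.train_new_from_iterator(
    low_resource_corpus(), vocab_size=16_000  # assumed target vocab size
)
base_vocab = tokenizer.get_vocab()
novel_tokens = [t for t in new_tokenizer.get_vocab() if t not in base_vocab]
tokenizer.add_tokens(novel_tokens)

# Add embedding rows for the new tokens; their weights start untrained and
# are learned during continued pre-training or fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

The design choice here is to augment rather than replace the vocabulary, so existing high-resource tokens and their learned embeddings are preserved while the model gains efficient coverage of scripts (e.g., Sinhala, Gurmukhi) that the original tokenizer otherwise splits into many byte-level pieces.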
