Multilingual Word Embedding Alignment for Sinhala (Defence)
Apart from a dwindling number of monolingual embedding studies, originating predominantly in low-resource domains, it is evident that multilingual embedding has become the de facto choice, owing to its adaptability to code-mixed language usage and its ability to process multilingual documents in a language-agnostic manner. Yet monolingual embedding alignment techniques are still used for low-resource languages like Sinhala. Our main focus is to improve Sinhala word embedding alignment. In this research, we experiment with the available monolingual embedding alignment techniques to obtain the best Sinhala-English embedding alignment achieved so far. To that end, we first introduce a large-scale Sinhala-English parallel dictionary that facilitates word-level cross-lingual tasks. Next, we align the Sinhala and English embedding spaces using the available embedding alignment techniques. During the experiments, we developed a novel alignment dataset creation technique and two novel extensions to Bilingual Lexicon Induction (BLI) that give a more informative measure of the degree of alignment. We extend our experiments to eight other languages and demonstrate the validity of our methods. In addition, using all of these languages, we conduct a comparative study of the quality of aligned monolingual embeddings, multilingual embeddings, and hybrid aligned embeddings. We have published two conference papers during this research so far, and we have released all the resources on GitHub for open use by the research community.
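For readers unfamiliar with the pipeline, the sketch below illustrates the two core steps named above: supervised alignment of two monolingual embedding spaces using a seed dictionary, and a BLI evaluation of the result. It is a minimal sketch assuming orthogonal Procrustes as the alignment technique and nearest-neighbour precision@1 as the BLI measure; the function names, toy data, and synthetic seed dictionary are illustrative assumptions, not the resources or exact methods of this thesis.

import numpy as np

def procrustes_align(X, Y):
    # Closed-form solution of min_W ||X W - Y||_F over orthogonal W:
    # if X^T Y = U S V^T (SVD), then W = U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def bli_precision_at_1(W, src_emb, tgt_emb, pairs):
    # pairs: (source_row, target_row) index pairs from a test dictionary.
    # A pair counts as a hit when the mapped source vector's nearest
    # target neighbour (by cosine similarity) is the gold translation.
    tgt_norm = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    hits = 0
    for s, t in pairs:
        mapped = src_emb[s] @ W
        sims = tgt_norm @ (mapped / np.linalg.norm(mapped))
        hits += int(np.argmax(sims) == t)
    return hits / len(pairs)

rng = np.random.default_rng(0)
tgt = rng.normal(size=(500, 50))                # 500 "target-language" vectors, d = 50
R = np.linalg.qr(rng.normal(size=(50, 50)))[0]  # hidden orthogonal rotation
src = tgt @ R.T                                 # "source language" = rotated target space
W = procrustes_align(src[:200], tgt[:200])      # align on a 200-pair toy seed dictionary
test = [(i, i) for i in range(200, 500)]        # held-out toy test dictionary
print(bli_precision_at_1(W, src, tgt, test))    # 1.0 on this synthetic case

On the synthetic rotated space the mapping is recovered exactly, so precision@1 is 1.0; on real cross-lingual embeddings the score degrades, which is what makes finer-grained extensions of the BLI measure informative.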