
Learning Sentence Embeddings in the Legal Domain with Low Resource Settings

Sahan Jayasinghe, Lakith Rambukkanage, Ashan Silva, Nisansa de Silva, Shehan Perera, Madhavi Perera
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

As Natural Language Processing evolves rapidly, it is increasingly used to analyze large domain-specific text corpora. Applying Natural Language Processing in a domain with uncommon vocabulary and unique semantics requires techniques designed specifically for that domain. The legal domain is such an area, with its own vocabulary and semantic interpretations. In this paper, we conduct research to develop sentence embeddings specifically for the legal domain, addressing these domain needs. We carry out this research under two approaches. First, taking advantage of a large corpus of raw court case documents, we train an Auto-Encoder model that reconstructs the input sentence in a self-supervised manner. Word embeddings pretrained on general corpora, as well as word embeddings trained specifically on legal corpora, are incorporated into the Auto-Encoder. As the second approach, we design a multitask model with noise discrimination and Semantic Textual Similarity tasks. We expect these embeddings and the insights gained to help vectorize legal domain corpora, enabling further application of Machine Learning in the legal domain.
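
The sketch below illustrates the general shape of the first approach: a self-supervised sentence Auto-Encoder whose encoder output serves as the sentence embedding. The abstract does not specify the architecture, so the use of PyTorch, the LSTM encoder/decoder, and all names and dimensions here are illustrative assumptions rather than the authors' exact model.

```python
# A minimal sketch (assumptions, not the authors' exact architecture) of a
# self-supervised sentence Auto-Encoder: an LSTM encoder compresses a sentence
# into a fixed-size vector, and an LSTM decoder reconstructs the token sequence.
import torch
import torch.nn as nn

class SentenceAutoEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 pretrained_embeddings=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if pretrained_embeddings is not None:
            # e.g. general-purpose or legal-domain word vectors (assumption)
            self.embedding.weight.data.copy_(pretrained_embeddings)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def encode(self, token_ids):
        # The encoder's final hidden state serves as the sentence embedding.
        _, (h_n, _) = self.encoder(self.embedding(token_ids))
        return h_n[-1]                               # (batch, hidden_dim)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)         # (batch, seq, embed_dim)
        _, (h_n, c_n) = self.encoder(embedded)
        # Teacher forcing: condition the decoder on the encoder state and
        # predict the original token sequence back.
        decoded, _ = self.decoder(embedded, (h_n, c_n))
        return self.output(decoded)                  # (batch, seq, vocab_size)

model = SentenceAutoEncoder(vocab_size=50_000)
token_ids = torch.randint(0, 50_000, (8, 20))        # a dummy batch
logits = model(token_ids)
reconstruction_loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), token_ids.reshape(-1))
sentence_vectors = model.encode(token_ids)           # (8, 512)
```

For the second approach, one would share the same encoder across two heads: a classifier that discriminates noised sentences from clean ones and a similarity scorer trained against Semantic Textual Similarity labels; again, these details are assumptions, as the abstract leaves the multitask architecture unspecified.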

Keywords: Natural Language Processing | Machine Learning / Deep Learning | Law