Data Sets
SigmaLaw Data Sets
Large Legal Text Corpus and Word Embeddings Data Set
This data set comprises data gathered for and created in the process of the paper Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity. It contains a large legal text corpus, several word2vec embedding models of the words in that corpus, and a set of legal domain gazetteer lists.
The entire data set is hosted at OSF. Direct links to the files are as follows:
- Legal Case Corpus: This corpus contains 39,155 legal cases, including 22,776 taken from the United States Supreme Court. For the convenience of future researchers, we have also included 29,404 cases after some preprocessing. A map (key) for the folder numbering is included in the provided zip file.
- Legal Domain Word2Vec models: Two word2vec models trained on the above corpus are included: one trained on raw legal text and one trained on the same text after lemmatization.
- Legal Domain gazetteer lists: A number of gazetteer lists built by a legal professional to indicate domain-specific semantic groupings are included.
- Word2Vec results: Finally, the results obtained in this paper using the trained word2vec models are included. [100x100] [100x200] [100x210] [100x500]
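Once one of the word2vec models above is loaded (e.g. with gensim's KeyedVectors), similarity queries reduce to cosine similarity between word vectors. A minimal stdlib-only sketch of that computation, using toy three-dimensional vectors as stand-ins for the real embeddings (the words and vector values below are illustrative assumptions, not taken from the released models):

```python
import math

# Toy stand-ins for word vectors; in practice these would be looked up
# in one of the released legal-domain word2vec models.
embeddings = {
    "plaintiff": [0.9, 0.1, 0.3],
    "petitioner": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.0],
}

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_legal = cosine_similarity(embeddings["plaintiff"], embeddings["petitioner"])
sim_other = cosine_similarity(embeddings["plaintiff"], embeddings["banana"])
# Related legal terms should score higher than unrelated words.
```

This is the same measure a library call such as gensim's `similarity` computes under the hood; the domain-specific behaviour comes entirely from the vectors the legal corpus produces.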
Citing this DataSet:
If you are using our Large Legal Text Corpus and Word Embeddings data set, please cite this paper:
K. Sugathadasa, B. Ayesha, N. de Silva, A. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity," in 2017 IEEE International Conference on Industrial and Information Systems (ICIIS), IEEE, 2017, pp. 1--6.
Legal Information Retrieval Data Set
This data set comprises data gathered for and created in the process of the paper Legal Document Retrieval using Document Vector Embeddings and Deep Learning. Other than the files provided here, it uses the large legal text corpus data set mentioned above, out of which it takes a set of raw cases. In addition, this data set contains a mention map, an edge list, and the output of legal text ranking.
The entire data set is hosted at OSF. Direct links to the files are as follows:
- Text Corpus and Map: This corpus contains 2,500 cases extracted from the large legal text corpus data set given above. Along with these case files, a mention map is provided indicating which cases have cited which other cases within the corpus.
- Edge List: This is the edge list of the citation graph generated by the above mention map.
- Outputs: Finally, the results obtained by this paper are included in text rank form and as a serialized file.
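The edge list above defines a citation graph, over which graph-ranking algorithms can be run. A minimal stdlib-only PageRank sketch over a toy edge list (the `(citing, cited)` pair format and the case IDs are assumptions for illustration; the released edge list and the paper's actual ranking method may differ):

```python
# Toy citation edge list: (citing_case, cited_case) pairs.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "B")]

def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a directed edge list; returns node -> score."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or list(nodes)  # dangling: spread evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

rank = pagerank(edges)
# Case "C" is cited by three other cases, so it should outrank the
# uncited cases "A" and "D".
```

Graph libraries such as networkx provide this directly (`networkx.pagerank`); the sketch just makes the mechanics of ranking the citation graph explicit.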
Citing this DataSet:
If you are using our Legal Information Retrieval data set, please cite this paper:
K. Sugathadasa, B. Ayesha, N. de Silva, A. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Legal Document Retrieval using Document Vector Embeddings and Deep Learning," in Science and Information Conference, Springer, 2018, pp. 160--175.
Legal Ontology Building Data Set
This data set comprises data gathered for and created in the process of the paper Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings. Other than the files provided here, it uses the large legal text corpus data set mentioned above, out of which it takes a set of raw cases. In addition, this data set contains the created ontology, a gazetteer list, and the result vectors.
The entire data set is hosted at OSF. Direct links to the files are as follows:
- Legal Ontology: This is the limited legal ontology built for the purpose of this study.
- Case Files: This corpus contains X cases extracted from the large legal text corpus data set given above.
- Legal Domain gazetteer lists: A set of gazetteer lists built by a legal professional and by data collection are included.
- Results: Finally, the result vectors obtained by this paper are included.
Citing this DataSet:
If you are using our Legal Ontology Building data set, please cite this paper:
V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, and B. Ayesha, "Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings," in 2017 Seventh International Conference on Innovative Computing Technology (INTECH), IEEE, 2017, pp. 79--84.
Legal Ontology Population Data Set
This data set comprises data gathered for and created in the process of the paper Word Vector Embeddings and Domain Specific Semantic based Semi-Supervised Ontology Instance Population. Other than the files provided here, it uses the large legal text corpus data set, out of which it takes a set of raw cases, and the small legal ontology from the Legal Ontology Building data set. However, we do not include that ontology in this data set; please download it from above. The domain-specific semantics are based on the result models built in the large legal text corpus study. This data set contains the class instances produced by the proposed models, a gazetteer list of legal words, and the result vectors.
The entire data set is hosted at OSF. Direct links to the files are as follows:
- Class instances by 5 models: These are the instances to be used to populate the classes in the ontology according to the 5 proposed models.
- Legal words: This is a set of gazetteer lists of words in the legal domain prepared with the help of a legal professional.
- Results: Finally, the result vectors obtained by this paper are included.
Citing this DataSet:
If you are using our Legal Ontology Population data set, please cite one or both of these papers:
V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, B. Ayesha, and M. Perera, "Word Vector Embeddings and Domain Specific Semantic based Semi-Supervised Ontology Instance Population," ICTer, vol. 11, no. 1, 2018.
V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, B. Ayesha, and M. Perera, "Semi-Supervised Instance Population of an Ontology using Word Vector Embeddings," in Advances in ICT for Emerging Regions (ICTer), 2017 Seventeenth International Conference on, IEEE, Sep. 2017.
Automatic Generation of Legal Arguments Data Set
This data set comprises data gathered for and created in the process of the student thesis Ontology-Based Information Extraction for Automatic Generation of Legal Arguments. It contains an annotated list of arguments, sentiment tagged phrases, discourse annotated sentence pairs, and domain similarity annotated verb pairs. All the raw data was taken from the Large Legal Text Corpus and Word Embeddings Data Set above.
The entire data set is hosted at OSF. Direct links to the files are as follows:
- Annotated list of arguments: These are 77 sentences annotated by a domain expert, denoting whether or not each sentence constitutes a legal argument.
- Sentiment tagged phrases: This is a set of 394 sentences annotated by a domain expert with the tags: positive, neutral, and negative.
- Discourse annotated sentence pairs: This is a set of 87 sentence pairs annotated by a domain expert with the tags: Elaboration, No Relation, and Shift-in-View.
- Domain similarity annotated verb pairs: Finally, this is a set of 1000 verb pairs annotated by a domain expert to indicate whether or not the given verb pair is similar in the legal domain.
Citing this DataSet:
If you are using our annotated list of arguments, please cite this paper:
G. Ratnayaka, T. Rupasinghe, N. de Silva, M. Warushavithana, V. Gamage, and A. Perera, "Identifying Relationships Among Sentences in Court Case Transcripts Using Discourse Relations," in 2018 18th International Conference on Advances in ICT for Emerging Regions (ICTer), IEEE, 2018, pp. 13--20.
If you are using our sentiment tagged phrases or domain similarity annotated verb pairs, please cite this paper:
V. Gamage, M. Warushavithana, N. de Silva, A. Perera, G. Ratnayaka, and T. Rupasinghe, "Fast Approach to Build an Automatic Sentiment Annotator for Legal Domain using Transfer Learning," in Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2018, pp. 260--265.
If you are using our discourse annotated sentence pairs, please cite this paper:
G. Ratnayaka, T. Rupasinghe, N. de Silva, V. Gamage, M. Warushavithana, and A. Perera, "Shift-of-Perspective Identification Within Legal Cases," in Proceedings of the 3rd Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, 2019, to appear.
Legal Party Extraction Data Set
This data set comprises data and models gathered for and created in the process of the student thesis Party Identification of Legal Documents. It contains party annotated sentences, processed data files, and trained GRU models. All the raw data was taken from the Large Legal Text Corpus and Word Embeddings Data Set above.
The entire data set is hosted at OSF. Direct links to the files are as follows:
- Party annotated sentences: These are 1000 sentences annotated by a domain expert at the word level denoting whether the relevant word is part of a legal party or not.
- Data 750: This is a numpy file of 750 sentences annotated by a domain expert and processed to be loaded into our model.
- Labeled 1000: This is a numpy file of the 1000 party annotated sentences above, labeled by a domain expert as petitioner or defendant and processed to be loaded into our model.
- GRU 512: This is our GRU (512) model trained using the above party annotated sentence data.
- GRU 512 (P|D): This is our GRU (512) model trained using the above Labeled 1000 data.
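Word-level party annotations of the kind described above can be turned into party spans by grouping consecutive tokens that share a tag. A stdlib-only sketch, where the `(token, tag)` format and the `PETITIONER`/`DEFENDANT`/`O` tag names are hypothetical illustrations, not the data set's actual file format:

```python
# Hypothetical word-level annotations: each token carries a party tag,
# with "O" marking tokens outside any party mention.
annotated = [
    ("John", "PETITIONER"), ("Doe", "PETITIONER"),
    ("sued", "O"),
    ("Acme", "DEFENDANT"), ("Corp", "DEFENDANT"),
]

def extract_parties(tokens):
    """Group consecutive tokens sharing a non-O tag into (tag, phrase) spans."""
    spans, current_tag, current_words = [], None, []
    for word, tag in tokens:
        if tag == current_tag and tag != "O":
            current_words.append(word)
        else:
            if current_tag not in (None, "O"):
                spans.append((current_tag, " ".join(current_words)))
            current_tag, current_words = tag, [word]
    if current_tag not in (None, "O"):
        spans.append((current_tag, " ".join(current_words)))
    return spans

parties = extract_parties(annotated)
# -> [("PETITIONER", "John Doe"), ("DEFENDANT", "Acme Corp")]
```

A sequence model such as the GRU (512) above predicts these per-token tags; the grouping step then recovers the multi-word party names.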
Citing this DataSet:
If you are using our party annotated sentences or Data 750, please cite this paper:
M. de Almeida, C. Samarawickrama, N. de Silva, G. Ratnayaka, and A. Perera, "Legal Party Extraction from Legal Opinion Text with Sequence to Sequence Learning," in 2020 20th International Conference on Advances in ICT for Emerging Regions (ICTer), IEEE, 2020, pp. 143--148.
If you are using our Labeled 1000 data set, our GRU 512 model, or our GRU 512 (P|D) model, please cite this paper:
C. Samarawickrama, M. de Almeida, N. de Silva, G. Ratnayaka, and A. Perera, "Party Identification of Legal Documents using Co-reference Resolution and Named Entity Recognition," in 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), IEEE, 2020, pp. 494--499.
Winning Party Prediction Data Set
This data set comprises data and models gathered for and created in the process of the student thesis SigmaLaw - Predicting Winning Party of a Legal Case Using Legal Opinion Texts. It contains a data set of sentences extracted from criminal cases, labeled critical sentences, embedded legal vocabulary, and expert tagged legal causality sentence pairs. All the raw data was taken from the Large Legal Text Corpus and Word Embeddings Data Set above. The code for this project is available on GitHub.
The entire data set is hosted at OSF. Direct links to the files are as follows:
- Criminal Case Sentence Dataset: These are CSV files of sentences extracted from criminal cases, reported at token counts of 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, and 10,000.
- Critical Sentence Identification: These are 1844 sentences annotated with information on the ultimate victory (petitioner perspective), overall sentiment, and relative sentiment (petitioner perspective).
- Embedded Legal Vocabulary: minimum frequency 3, BERT (10000), BERT (30000), and Tensorflow (30000).
- Legal Sentence Dataset: These are raw sentence pairs tagged by a legal expert for legal causality, plus a merged set.
Citing this DataSet:
If you are using our Criminal Case Sentence Dataset, please cite this paper:
S. Jayasinghe, L. Rambukkanage, A. Silva, N. de Silva, and A. Perera, "Party-based Sentiment Analysis Pipeline for the Legal Domain," in 2021 21st International Conference on Advances in ICT for Emerging Regions (ICTer), 2021, pp. 171--176.
If you are using our Critical Sentence Identification or Embedded Legal Vocabulary, please cite this paper:
S. Jayasinghe, L. Rambukkanage, A. Silva, N. de Silva, and A. Perera, "Critical Sentence Identification in Legal Cases Using Multi-Class Classification," arXiv preprint arXiv:2111.05721, 2021.
Other Data Sets
PubMed DataSet
PubMed data set contains the data collected for and produced by the project Discovering Inconsistencies in PubMed Abstracts Through Ontology-Based Information Extraction. This includes three levels of data: source files, intermediate outputs, and output files. The source files contain a list of PubMed IDs and 36,877 PubMed abstracts. The intermediate outputs contain full Stanford CoreNLP results and OLLIE triple extractor results for all the above abstracts, and a finalized set of triples compatible with the OMIT Ontology. Finally, the output files consist of a medical dictionary built out of the above PubMed abstract corpus, the discovered raw inconsistencies, and the discovered final inconsistencies in expanded form.
The entire data set is hosted at OSF. Direct links to the files are as follows:
Source Files (collected data):
- PubMed IDs: tar.gz file (111kB)
- Collected PubMed abstracts: tar.gz file (18.5MB)
Intermediate Outputs:
- PubMed abstracts parsed with Stanford Parser: tar.gz files
- OLLIE triples created from PubMed abstracts: tar.gz file (27.3MB)
- Created final triples: tar.gz file (2.9MB)
Output Files:
- Created final triples: tar.gz file (2.9MB)
- Created medical dictionary: text file (4.8MB)
- Discovered raw inconsistencies: text file (65kB)
- Discovered Final inconsistencies (expanded): text file (63kB)
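At its simplest, inconsistency discovery over extracted triples means detecting the same claim asserted with opposite polarity. A toy stdlib-only sketch of that idea (the 4-tuple format, the negation flag, and the example entities are illustrative assumptions; the project's actual method reasons against the OMIT Ontology rather than over raw tuples):

```python
# Toy extracted triples: (subject, relation, object, positive?), where the
# final flag marks whether the claim was asserted or negated in the text.
triples = [
    ("mir-21", "upregulates", "gene-X", True),
    ("mir-21", "upregulates", "gene-X", False),   # negated claim elsewhere
    ("mir-155", "targets", "gene-Y", True),
]

def find_inconsistencies(triples):
    """Return (subject, relation, object) keys asserted both positively and negatively."""
    seen = {}
    inconsistent = set()
    for subj, rel, obj, positive in triples:
        key = (subj, rel, obj)
        if key in seen and seen[key] != positive:
            inconsistent.add(key)
        seen[key] = positive
    return inconsistent

conflicts = find_inconsistencies(triples)
# -> {("mir-21", "upregulates", "gene-X")}
```

The discovered raw and final inconsistency files above are the project-scale outcome of this kind of check, applied to the ontology-aligned triples.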
Source Code:
Java was used as the implementation language for the entire project. The Java source files are available from the GitHub organization OMIT-PubMed-Inconsistencies. It contains source code for the following projects:
- Abstract Extractor
- Extra PubMed Id Finder
- Dictionary Creator
- PubMed Splitter
- Sentence Breaker
- OMIT connector
- MeSH Term Extractor
- Triplet Creator
- Consistency Checker
- Result Simplifier
Citing PubMed DataSet paper:
If you are using our PubMed DataSet, please cite this paper:
N. de Silva, D. Dou, and J. Huang, "Discovering Inconsistencies in PubMed Abstracts Through Ontology-Based Information Extraction," in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017, pp. 362--371.
SinMin DataSet
SinMin contains texts of different genres and styles of the modern and old Sinhala language. The main sources of electronic copies of texts for the corpus are online Sinhala newspapers, online Sinhala news sites, Sinhala school textbooks available online, online Sinhala magazines, Sinhala Wikipedia, Sinhala fiction available online, the Mahawansa, Sinhala blogs, Sinhala subtitles, and Sri Lankan gazettes.
The entire Sinhala text corpus is hosted at OSF in compressed and uncompressed versions. Direct links to the compressed files are as follows:
Citing SinMin DataSet paper:
If you are using our SinMin DataSet, please cite this paper:
D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. De Silva, and G. Dias, "Implementing a Corpus for Sinhala Language," in Symposium on Language Technology for South Asia 2015, 2015.
SiClaEn DataSet
SiClaEn data set contains a Reuters English News DataSet and a Sinhala News DataSet. The Sinhala News DataSet was collected from bi-lingual Sinhala and English news sources such as AdaDerana and NewsFirst. The Reuters English News DataSet has 7103 sentences in 383 posts, and the Sinhala News DataSet has 5221 sentences in 471 posts. All data sets are categorized under the following topics: business, entertainment, politics, science & technology, and sports.
The entire Sinhala and English text corpus is hosted at OSF. Direct links to the files are as follows:
English Data (Reuters):
Sinhala Data:
Citing SiClaEn DataSet paper:
If you are using our SiClaEn DataSet, please cite this paper:
N. de Silva, "Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language," 2015.