
Indian Legal Corpus (ILC)

A Dataset for Summarizing Indian Legal Proceedings

NLP · Legal AI · Dataset · Research

Overview

Indian Legal Corpus (ILC) is a curated dataset of 3,000+ Indian legal judgments, each paired with a summary. It addresses the shortage of resources for legal NLP research in the Indian context.

Published Research: "Indian Legal Corpus (ILC): A Dataset for Summarizing Indian Legal Proceedings Using Natural Language"

Dataset

We scraped and compiled the corpus ourselves, pairing each of the 3,000+ judgments with its summary. It is available on the Hugging Face Hub as d0r1h/ILC with train and test splits.

Loading the dataset:

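# Requires: pip install datasets pandas
from datasets import load_dataset
import pandas as pd

# Download the ILC dataset from the Hugging Face Hub
dataset = load_dataset("d0r1h/ILC")

# Convert each split to a pandas DataFrame for easier inspection
train_set = pd.DataFrame(dataset['train'])
test_set = pd.DataFrame(dataset['test'])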
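
Once the splits are loaded, a quick check of document and summary lengths helps when choosing input and output limits for a model. The sketch below is illustrative only: the helper name length_stats is hypothetical, the column names "Case" and "Summary" are assumed from the LED training command further down, and the two sample records are made up so the code runs offline.

```python
from statistics import mean

def length_stats(records, text_col="Case", summary_col="Summary"):
    """Mean word counts and mean summary/text length ratio over a list of dicts."""
    text_lens = [len(r[text_col].split()) for r in records]
    summary_lens = [len(r[summary_col].split()) for r in records]
    # Skip empty documents to avoid dividing by zero
    ratios = [s / t for s, t in zip(summary_lens, text_lens) if t > 0]
    return {
        "mean_text_words": mean(text_lens),
        "mean_summary_words": mean(summary_lens),
        "mean_compression": mean(ratios),
    }

# Made-up sample records (not taken from ILC), used only to exercise the helper
sample = [
    {"Case": "The appellant filed a suit for recovery of dues",
     "Summary": "Suit for recovery filed"},
    {"Case": "The court dismissed the appeal",
     "Summary": "Appeal dismissed"},
]
stats = length_stats(sample)
print(stats)
```

With the real data, the same helper could be called as length_stats(train_set.to_dict("records")), provided the column names match.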

Installation & Usage

Clone and set up:

git clone https://github.com/d0r1h/ILC.git
cd ILC
pip install -r requirement.txt

Extractive Summarization:

python Code/Models/extractive.py \
--output_dir dir_name \
--text_column text \
--summary_column summary \
--data_file data.csv \
--sentence_count 3
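
To illustrate what an extractive summarizer with a --sentence_count setting does, here is a minimal word-frequency sketch. It is not the implementation in Code/Models/extractive.py, whose internals are not shown here; the function names and the scoring rule are assumptions chosen for brevity.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extractive_summary(text, sentence_count=3):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average document-level frequency of the words in the sentence
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Stable sort: ties keep the earlier sentence first
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])
    return " ".join(sentences[i] for i in keep)

text = ("The court heard the appeal. The weather was pleasant. "
        "The court allowed the appeal.")
summary = extractive_summary(text, sentence_count=2)
print(summary)
```

Legal text is full of abbreviations such as "v." and "Sec." that defeat a naive splitter like this one, so a real pipeline would need a more careful sentence tokenizer.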

Training LED (Abstractive):

python Code/Models/led_summarization.py \
--model_name allenai/led-base-16384 \
--text_column Case \
--summary_column Summary \
--max_input_length 8192 \
--max_output_length 600 \
--batch_size 2 \
--num_beams 2 \
--output_dir output_dir_name
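
LED handles long inputs by combining local windowed attention with global attention on a few chosen tokens, and a common recipe for summarization puts global attention on the first token only. The sketch below shows that input preparation in plain Python so it runs without any model download. It is an assumption about how such a script typically prepares batches, not a copy of led_summarization.py; the function name build_led_inputs and the default pad_id of 1 are illustrative.

```python
def build_led_inputs(token_ids, max_input_length=8192, pad_id=1):
    """Truncate or pad token ids and build LED-style attention masks."""
    # Plain slicing for brevity; a real tokenizer also keeps the
    # end-of-sequence token when it truncates.
    ids = list(token_ids)[:max_input_length]
    n_real = len(ids)
    n_pad = max_input_length - n_real

    input_ids = ids + [pad_id] * n_pad
    # 1 marks real tokens, 0 marks padding
    attention_mask = [1] * n_real + [0] * n_pad
    # Global attention on the first token only; all others attend locally
    global_attention_mask = [0] * max_input_length
    if n_real > 0:
        global_attention_mask[0] = 1

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "global_attention_mask": global_attention_mask,
    }

# Hypothetical token ids and a tiny length limit, for illustration only
batch = build_led_inputs([0, 11, 12, 2], max_input_length=6)
print(batch)
```

In the command above, --max_input_length 8192 would play the role of max_input_length here; the tiny value of 6 is used only to keep the printed output readable.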

Pre-trained Model

We provide a fine-tuned LED model for Indian legal document summarization:

  • Model: led-base-ilc on Hugging Face
  • Base: allenai/led-base-16384
  • Training: Fine-tuned on ILC dataset

Results

Transformer-based models and extractive methods are evaluated on the test set. The dataset enables research on document summarization specific to Indian legal proceedings.
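
This page does not name the evaluation metric. Summarization work commonly reports ROUGE, so the sketch below assumes a simplified ROUGE-1 F1 (unigram overlap on whitespace tokens, no stemming) to show how a generated summary might be scored against a reference. The function name and example sentences are illustrative, and the scores are not results from ILC.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap, lowercase, no stemming."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection clips each word to its minimum count
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up sentences used only to exercise the function
score = rouge1_f1("the appeal was dismissed", "the appeal was allowed")
print(score)
```

For reported numbers, an established ROUGE implementation is a safer choice than a hand-rolled version, since tokenization and stemming choices change the scores.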