Legal AI Document Summarisation and Clause Classification
This project builds a Legal AI system to support automated contract analysis. It explores clause classification and document summarisation using real-world legal contracts from the Contract Understanding Atticus Dataset (CUAD).
The system identifies and classifies key contractual clauses and generates concise summaries to reduce the time and effort required for manual contract review. It uses transformer-based NLP models to handle complex legal language while maintaining accuracy and consistency.
Problem Statement
Legal contracts are often lengthy, complex, and time-consuming to review due to dense legal language and varied clause structures. Manual document analysis requires significant domain expertise and is prone to inconsistency, especially when identifying critical clauses across large volumes of documents.
There is a need for automated methods that can accurately identify key contractual clauses and produce concise summaries while preserving legal meaning. Existing approaches often struggle with domain specificity and long document contexts, motivating the exploration of transformer-based models for legal document understanding.
Approach
The project was implemented as a multi-stage NLP pipeline tailored to legal contracts. After preprocessing and analysing the CUAD dataset, clause classification was formulated as a multi-label task, since a single contract passage can fall under more than one clause category. A LegalBERT model was fine-tuned for this task to capture domain-specific legal context.
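The multi-label formulation can be illustrated with a minimal sketch of the decoding step: each clause category gets an independent sigmoid probability, and every category above a threshold is predicted. The label names, the logit values, the 0.5 threshold, and the function name `decode_multilabel` are illustrative assumptions, not details taken from the project; in practice the logits would come from the fine-tuned model.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(
    logits: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.5,  # assumed cut-off; would normally be tuned per label
) -> List[str]:
    """Return every label whose independent sigmoid probability meets the threshold.

    Unlike softmax-based single-label decoding, zero, one, or several
    labels can be returned for the same passage.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    return [
        label
        for label, logit in zip(labels, logits)
        if sigmoid(logit) >= threshold
    ]


# Hypothetical clause categories and logits for one contract passage.
LABELS = ["Governing Law", "Anti-Assignment", "Non-Compete"]
example_logits = [2.1, -1.3, 0.4]

predicted = decode_multilabel(example_logits, LABELS)
strict = decode_multilabel(example_logits, LABELS, threshold=0.7)
```

With these made-up logits, two categories pass the default threshold and only one passes the stricter 0.7 threshold, which shows how the threshold trades recall against precision.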
For document summarisation, a hybrid extractive–abstractive approach was used so that important legal information was retained while producing concise summaries. Classification was evaluated with precision, recall, and F1-score, and summarisation with automatic metrics and qualitative inspection, prioritising semantic accuracy and reliability in legal text processing.
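One way a hybrid pipeline of this kind can be wired together is sketched below, under stated assumptions: an extractive stage first selects the highest-scoring sentences, and an abstractive stage then rewrites only that selection. The word-frequency scoring is a simple stand-in for whatever extractive method the project used, and `abstractive` is a hypothetical placeholder callable where a transformer summariser would be plugged in.

```python
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; real contracts need a more robust one."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_top_sentences(text: str, k: int = 2) -> List[str]:
    """Extractive stage: keep the k sentences with the highest average
    content-word frequency, returned in their original document order."""
    sentences = split_sentences(text)
    tokenised = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # Treat words longer than three letters as content words (crude stop-word filter).
    freq = Counter(w for toks in tokenised for w in toks if len(w) > 3)

    def score(tokens: List[str]) -> float:
        content = [w for w in tokens if len(w) > 3]
        return sum(freq[w] for w in content) / len(content) if content else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(tokenised[i]), reverse=True
    )
    return [sentences[i] for i in sorted(ranked[:k])]


def hybrid_summarise(
    text: str, abstractive: Callable[[str], str], k: int = 2
) -> str:
    """Hybrid pipeline: extract first, then pass the shortened text to an
    abstractive model (any callable mapping text to text)."""
    return abstractive(" ".join(extract_top_sentences(text, k)))


contract = (
    "This Agreement is governed by the laws of England. "
    "The Supplier shall deliver the goods. "
    "Either party may terminate this Agreement on notice. "
    "Notices must be in writing."
)
selected = extract_top_sentences(contract, k=2)
# Identity function used as a placeholder for the abstractive model.
summary = hybrid_summarise(contract, abstractive=lambda s: s, k=2)
```

Running the extractive stage first keeps the input to the abstractive model short, which is one common motivation for hybrid designs on long documents.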
Results
The clause classification model performed consistently across common contractual clauses, and recall improved after domain-specific fine-tuning. Evaluation using precision, recall, and F1-score showed that the model captured key legal clauses with reasonable accuracy, particularly in high-frequency categories.
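For reference, per-category precision, recall, and F1 in a multi-label setting can be computed as in the sketch below. The label names and the toy gold and predicted label sets are invented for illustration and are not results from this project; `per_label_prf` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple


def per_label_prf(
    y_true: Sequence[Set[str]],
    y_pred: Sequence[Set[str]],
    labels: Iterable[str],
) -> Dict[str, Tuple[float, float, float]]:
    """Compute (precision, recall, F1) separately for each label.

    y_true[i] and y_pred[i] are the gold and predicted label sets
    for the i-th passage.
    """
    results = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if (precision + recall)
            else 0.0
        )
        results[label] = (precision, recall, f1)
    return results


# Toy data: four passages with gold and predicted clause labels.
gold = [{"Governing Law"}, {"Governing Law", "Non-Compete"}, {"Non-Compete"}, set()]
pred = [{"Governing Law"}, {"Governing Law"}, {"Non-Compete", "Governing Law"}, set()]

scores = per_label_prf(gold, pred, ["Governing Law", "Non-Compete"])
```

Reporting scores per category rather than a single average makes it visible when high-frequency categories perform well and rare ones lag behind.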
For summarisation, the hybrid extractive–abstractive approach generated summaries that were concise and semantically faithful to the original contracts. Automatic evaluation metrics and qualitative inspection indicated that important legal information was retained, although performance varied with document length and clause complexity.
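The text does not name the automatic metrics that were used. As an assumption, the sketch below shows ROUGE-1, a unigram-overlap score that is a common choice for summarisation; the simple tokenisation and the example sentences are illustrative only.

```python
import re
from collections import Counter
from typing import Tuple


def rouge1(reference: str, candidate: str) -> Tuple[float, float, float]:
    """Return ROUGE-1 (precision, recall, F1) based on clipped unigram overlap.

    Tokenisation is a plain lowercase alphanumeric split, with no stemming.
    """
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


reference_summary = "The supplier shall deliver the goods within thirty days"
generated_summary = "The supplier must deliver goods within thirty days"

p, r, f = rouge1(reference_summary, generated_summary)
```

Overlap scores of this kind reward shared wording rather than shared meaning, which is one reason to pair them with qualitative inspection for legal text.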
Tech Stack
Python • PyTorch • Hugging Face Transformers • LegalBERT • Extractive & Abstractive Summarisation • CUAD Dataset • NLP