LegalEvalHub offers curated leaderboards that aggregate performance across multiple tasks. An aggregate leaderboard may group tasks from the same benchmark, tasks over the same type of document, or tasks that exercise the same type of legal reasoning.
Models are evaluated on each individual task, and the per-task scores are then aggregated to produce the overall leaderboard performance. We calculate three metrics (a sketch of the aggregation follows the list):
- Wins: The number of tasks on which the model ranked first, indicating dominance across individual tasks. Models with identical scores share the same rank, so every model tied for first receives a win.
- Average Rank: The model's mean rank across all tasks (lower is better), indicating consistency of performance.
- Raw Metric Average: The average of the raw metric values (e.g., accuracy, F1), preserving the original scale. For each task, we use the first metric listed in its metrics field. You can see the task configs here.
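
As a rough illustration, the snippet below shows one way these three metrics could be computed from per-task scores. The data layout is an assumption for the example (a `scores` mapping of model name to task name to metric value, higher-is-better metrics, and every model scored on every task); it is not LegalEvalHub's actual implementation.

```python
from collections import defaultdict

def aggregate_leaderboard(scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Aggregate per-task scores into Wins, Average Rank, and Raw Metric Average.

    Illustrative assumptions: `scores` maps model name -> {task name -> metric
    value}, higher values are better, and every model has a score for every task.
    """
    models = list(scores)
    tasks = {task for per_task in scores.values() for task in per_task}

    wins: dict[str, int] = defaultdict(int)
    ranks: dict[str, list[int]] = defaultdict(list)

    for task in tasks:
        # Order models by their score on this task, best first.
        ordered = sorted(models, key=lambda m: scores[m][task], reverse=True)
        # Competition ranking: models with identical scores share the same rank.
        rank, prev_score = 0, None
        for position, model in enumerate(ordered, start=1):
            score = scores[model][task]
            if score != prev_score:
                rank, prev_score = position, score
            ranks[model].append(rank)
            if rank == 1:
                wins[model] += 1  # every model tied for first gets a win

    return {
        model: {
            "wins": wins[model],
            "average_rank": sum(ranks[model]) / len(ranks[model]),
            "raw_metric_average": sum(scores[model].values()) / len(scores[model]),
        }
        for model in models
    }
```

Ties are handled with competition ranking, so two models sharing the top score on a task each count a win for that task, matching the Wins definition above.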
Available Aggregate Leaderboards
- LegalBench (Full): 161 LegalBench tasks.
- LegalBench (Reasoning): A subset of LegalBench focused on more complex reasoning.
- HousingQA (Knowledge): Tests a model's knowledge of housing law.
- HousingQA (Statute Comprehension): Tests a model's ability to read and interpret housing law statutes.