LegalEvalHub offers curated leaderboards that aggregate performance across multiple tasks. An aggregate leaderboard may group tasks from the same benchmark, tasks over the same type of document, or tasks that exercise the same type of legal reasoning.
Models are evaluated on each individual task, and the per-task scores are then aggregated to produce the overall leaderboard performance. We calculate three metrics (a sketch of the aggregation follows the list):
- Wins: The number of tasks on which the model ranked first, indicating dominance across individual tasks. Models with identical scores share the same rank, so every model tied for first receives a win.
- Average Rank: The model's mean rank across all tasks (lower is better), indicating consistency of performance.
- Raw Metric Average: The average of the raw metric values (e.g., accuracy, F1), preserving the original scale. For each task, we use the first metric listed in its metrics field. You can see the task configs here.
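
As a rough illustration, the snippet below shows one way these three metrics could be computed from per-task scores. The data layout is an assumption for the example (a `scores` mapping of model name to task name to metric value, higher-is-better metrics, and every model scored on every task); it is not LegalEvalHub's actual implementation.

```python
from collections import defaultdict

def aggregate_leaderboard(scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Aggregate per-task scores into Wins, Average Rank, and Raw Metric Average.

    Illustrative assumptions: `scores` maps model name -> {task name -> metric
    value}, higher values are better, and every model has a score for every task.
    """
    models = list(scores)
    tasks = {task for per_task in scores.values() for task in per_task}

    wins: dict[str, int] = defaultdict(int)
    ranks: dict[str, list[int]] = defaultdict(list)

    for task in tasks:
        # Order models by their score on this task, best first.
        ordered = sorted(models, key=lambda m: scores[m][task], reverse=True)
        # Competition ranking: models with identical scores share the same rank.
        rank, prev_score = 0, None
        for position, model in enumerate(ordered, start=1):
            score = scores[model][task]
            if score != prev_score:
                rank, prev_score = position, score
            ranks[model].append(rank)
            if rank == 1:
                wins[model] += 1  # every model tied for first gets a win

    return {
        model: {
            "wins": wins[model],
            "average_rank": sum(ranks[model]) / len(ranks[model]),
            "raw_metric_average": sum(scores[model].values()) / len(scores[model]),
        }
        for model in models
    }
```

Ties are handled with competition ranking, so two models sharing the top score on a task each count a win for that task, matching the Wins definition above.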
Available Aggregate Leaderboards
- LegalBench (Full): 161 LegalBench tasks.
- LegalBench (Reasoning): A subset of LegalBench focused on more complex reasoning.
- HousingQA (Knowledge): Tests a model's knowledge of housing law.
- HousingQA (Statute Comprehension): Tests a model's ability to read and interpret housing law statutes.