LegalBench (Reasoning)

This leaderboard focuses on a subset of LegalBench's more complex, reasoning-heavy tasks. Within LegalBench, these are referred to as 'rule-application' tasks because they require models to apply a multi-step rule to a fact pattern. You can read more about LegalBench here. We have found that performance on these tasks generally correlates with underlying model reasoning capabilities.

Note: Only models that have been evaluated on all 11 tasks in this preset are included in the leaderboard.


| Rank | Model | Wins | Average Rank | Raw Metric Avg |
|------|-------|------|--------------|----------------|
| 1 | grok-4-0709 | 9 | 1.64 | 0.9319 |
| 2 | claude-opus-4-1-20250805 | 8 | 1.91 | 0.9228 |
| 3 | gpt-5-2025-08-07 | 8 | 2.00 | 0.9239 |
| 4 | claude-opus-4-20250514 | 7 | 2.27 | 0.9249 |
| 5 | o3 | 7 | 2.82 | 0.9192 |
| 6 | grok-3-mini | 7 | 3.36 | 0.8901 |
| 7 | o4-mini | 4 | 5.18 | 0.9059 |
| 8 | claude-sonnet-4-20250514 | 4 | 6.36 | 0.8952 |
| 9 | o3-mini | 6 | 7.00 | 0.8566 |
| 10 | openai/gpt-oss-120b | 3 | 8.45 | 0.8884 |
| 11 | claude-3-5-haiku-20241022 | 2 | 9.00 | 0.8803 |
| 12 | deepseek-ai/DeepSeek-V3 | 2 | 10.55 | 0.8384 |
| 13 | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 1 | 10.64 | 0.8457 |
| 14 | deepseek-ai/DeepSeek-R1 | 2 | 13.27 | 0.6825 |
| 15 | gpt-4o-mini | 0 | 14.09 | 0.7972 |
| 16 | meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo | 0 | 14.18 | 0.7707 |
| 17 | google/gemma-2-27b-it | 0 | 14.64 | 0.7673 |
| 18 | claude-3-haiku-20240307 | 0 | 16.55 | 0.6789 |
| 19 | gpt-4.1-nano | 0 | 17.82 | 0.5911 |
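
The page does not spell out how the table columns are derived. The sketch below is one plausible aggregation, assuming that "Wins" counts the tasks on which a model places first, "Average Rank" is the mean of its per-task ranks (lower is better), and "Raw Metric Avg" is the unweighted mean of its per-task scores. The model names, task names, scores, and tie-handling in this snippet are placeholders for illustration, not the actual LegalBench results or scoring code.

```python
from statistics import mean

# Hypothetical per-task scores: {model_name: {task_name: metric}}.
# Task names and values are placeholders, not real LegalBench results.
scores = {
    "model-a": {"task-1": 0.93, "task-2": 0.88, "task-3": 0.91},
    "model-b": {"task-1": 0.90, "task-2": 0.92, "task-3": 0.89},
    "model-c": {"task-1": 0.85, "task-2": 0.80, "task-3": 0.95},
}

tasks = sorted({t for per_task in scores.values() for t in per_task})


def leaderboard(scores, tasks):
    rows = []
    for model, per_task in scores.items():
        # Per-task rank: 1 = best metric on that task (assumes higher is better;
        # ties share the same rank).
        ranks = [
            1 + sum(other[t] > per_task[t] for other in scores.values())
            for t in tasks
        ]
        rows.append({
            "model": model,
            # "Wins" assumed to mean: number of tasks where the model ranks first.
            "wins": sum(r == 1 for r in ranks),
            "average_rank": mean(ranks),
            "raw_metric_avg": mean(per_task[t] for t in tasks),
        })
    # The leaderboard appears to be ordered by average rank, lower first.
    return sorted(rows, key=lambda r: r["average_rank"])


for position, row in enumerate(leaderboard(scores, tasks), start=1):
    print(position, row["model"], row["wins"],
          f'{row["average_rank"]:.2f}', f'{row["raw_metric_avg"]:.4f}')
```

Under these assumptions, a model can rank above another despite fewer wins if its ranks are more consistent across the 11 tasks, which matches the ordering seen in the table.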

Tasks in This Benchmark