HousingQA (Statute Comprehension)

This task evaluates a model's ability to read and interpret housing law statutes. The model is given a set of statutes from a single state and asked a yes/no question that can be answered from those statutes. To learn more about HousingQA, see here.

Rank  Model                      accuracy  f1_macro  Date        Results
1     claude-sonnet-4-20250514   0.799     0.785     2025-08-04  View
2     claude-opus-4-20250514     0.788     0.780     2025-08-04  View
3     o3-mini                    0.775     0.756     2025-08-04  View
4     gpt-5-2025-08-07           0.773     0.766     2025-08-08  View
5     o4-mini                    0.762     0.751     2025-08-04  View
6     o3                         0.760     0.754     2025-08-05  View
7     claude-3-5-haiku-20241022  0.727     0.715     2025-08-04  View
8     gpt-4o-mini-2024-07-18     0.720     0.714     2025-08-04  View
9     openai/gpt-oss-120b        0.542     0.528     2025-08-07  View
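
The leaderboard does not spell out how the accuracy and f1_macro columns are computed. The sketch below assumes the standard definitions for a two-class yes/no task: accuracy is the fraction of questions answered correctly, and macro F1 is the unweighted mean of the per-class F1 scores. The label strings "Yes" and "No", the function names, and the sample answers are all hypothetical and are not taken from the benchmark's code or data.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when the label never appears
    # in either the gold answers or the predictions.
    return 2 * tp / denom if denom else 0.0


def f1_macro(
    gold: Sequence[str],
    pred: Sequence[str],
    labels: Sequence[str] = ("Yes", "No"),  # assumed label strings
) -> float:
    """Unweighted mean of per-label F1 scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Made-up answers for five questions, for illustration only.
gold_answers = ["Yes", "Yes", "No", "No", "Yes"]
model_answers = ["Yes", "No", "No", "Yes", "Yes"]

print(f"accuracy = {accuracy(gold_answers, model_answers):.3f}")
print(f"f1_macro = {f1_macro(gold_answers, model_answers):.3f}")
```

Because macro F1 weights both classes equally, it can fall below accuracy when a model does noticeably worse on the less frequent answer, which is one plausible reason the two columns differ slightly for every model in the table.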

Tasks in This Benchmark