HousingQA (Statute Comprehension)

This task evaluates a model's ability to read and interpret housing law statutes. The model is given a set of statutes from a state and prompted with a yes/no question that can be answered from those statutes. To learn more about HousingQA, see here.

Rank	Model	accuracy	f1_macro	Date	Results
1	claude-sonnet-4-20250514	0.799	0.785	2025-08-04	View
2	claude-opus-4-20250514	0.788	0.780	2025-08-04	View
3	o3-mini	0.775	0.756	2025-08-04	View
4	gpt-5-2025-08-07	0.773	0.766	2025-08-08	View
5	o4-mini	0.762	0.751	2025-08-04	View
6	o3	0.760	0.754	2025-08-05	View
7	claude-3-5-haiku-20241022	0.727	0.715	2025-08-04	View
8	gpt-4o-mini-2024-07-18	0.720	0.714	2025-08-04	View
9	openai/gpt-oss-120b	0.542	0.528	2025-08-07	View
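
The accuracy and f1_macro columns can be read as the standard binary-classification metrics over the yes/no answers. The sketch below shows how such scores are typically computed. It is an illustration only: the leaderboard does not state its exact scoring code, and the label strings "Yes" and "No" and the function names are assumptions made for this example.

```python
def accuracy(gold, pred):
    """Fraction of questions where the predicted answer matches the gold answer."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_macro(gold, pred, labels=("Yes", "No")):
    """Unweighted mean of the per-label F1 scores (label names are assumed)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the label never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with four hypothetical questions
gold = ["Yes", "Yes", "No", "No"]
pred = ["Yes", "No", "No", "No"]
print(round(accuracy(gold, pred), 3))  # 0.75
print(round(f1_macro(gold, pred), 3))  # 0.733
```

Macro averaging weights the "Yes" and "No" classes equally, so a model that favors one answer can score lower on f1_macro than on accuracy when the classes are imbalanced.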

Tasks in This Benchmark

housing_qa_statute_comprehension