HousingQA (Statute Comprehension)
This task evaluates a model's ability to read and interpret housing law statutes. Models are given a set of statutes from a state and prompted with a yes/no question that can be answered from those statutes. To learn more about HousingQA, see here.

Rank | Model | accuracy | f1_macro | Date | Results |
---|---|---|---|---|---|
1 | claude-sonnet-4-20250514 | 0.799 | 0.785 | 2025-08-04 | View |
2 | claude-opus-4-20250514 | 0.788 | 0.780 | 2025-08-04 | View |
3 | o3-mini | 0.775 | 0.756 | 2025-08-04 | View |
4 | gpt-5-2025-08-07 | 0.773 | 0.766 | 2025-08-08 | View |
5 | o4-mini | 0.762 | 0.751 | 2025-08-04 | View |
6 | o3 | 0.760 | 0.754 | 2025-08-05 | View |
7 | claude-3-5-haiku-20241022 | 0.727 | 0.715 | 2025-08-04 | View |
8 | gpt-4o-mini-2024-07-18 | 0.720 | 0.714 | 2025-08-04 | View |
9 | openai/gpt-oss-120b | 0.542 | 0.528 | 2025-08-07 | View |
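
The table reports both accuracy and f1_macro, but this page does not say how they are computed. The sketch below assumes the standard definitions for a binary yes/no task: accuracy is the fraction of questions answered correctly, and macro F1 is the unweighted mean of the per-class F1 scores. The label strings `"Yes"` and `"No"`, the function names, and the toy data are illustrative assumptions, not taken from the benchmark's code.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    # By convention here, F1 is 0.0 when the label never appears at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    labels: Sequence[str] = ("Yes", "No"),  # assumed label strings
) -> float:
    """Unweighted mean of per-class F1, so each class counts equally."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Toy data (hypothetical, not from HousingQA): three "Yes" and one "No".
gold = ["Yes", "Yes", "Yes", "No"]
pred = ["Yes", "Yes", "No", "No"]

acc = accuracy(gold, pred)
f1m = macro_f1(gold, pred)
print(f"accuracy={acc:.3f} f1_macro={f1m:.3f}")
```

Under these definitions the two metrics can differ when the answer classes are imbalanced, because macro F1 weights the "Yes" and "No" classes equally regardless of how often each occurs. That would be consistent with the two columns in the table ranking some models in a slightly different order.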