PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

Abstract

We introduce PlanQA, a diagnostic benchmark for evaluating geometric and spatial reasoning in large language models (LLMs). PlanQA is grounded in structured representations of indoor scenes, such as kitchens, living rooms, and bedrooms, encoded in a symbolic format (e.g., JSON or XML layouts). The benchmark includes diverse question types that test not only metric and topological reasoning (e.g., distance, visibility, shortest paths) but also interior design constraints such as affordance, clearance, balance, and usability. Our results across a variety of frontier open-source and commercial LLMs show that while models may succeed on shallow queries, they often fail to simulate physical constraints, preserve spatial coherence, or generalize under layout perturbation. PlanQA uncovers a clear blind spot in today’s LLMs: they do not consistently reason about real-world layouts. We hope that this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
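
To make the idea of a symbolic layout concrete, here is a minimal sketch of a distance query over a JSON-encoded room. The schema (the `room` and `objects` keys, the `x`/`y`/`width`/`depth` fields) and the helper names are illustrative assumptions, not PlanQA's actual format, and centroid-to-centroid distance is only one plausible reading of a "distance" question.

```python
import json
import math

# Hypothetical layout schema: field names are illustrative assumptions,
# not the format used by the PlanQA dataset. Units are assumed to be meters.
LAYOUT_JSON = """
{
  "room": {"type": "kitchen", "width": 4.0, "depth": 3.0},
  "objects": [
    {"id": "sink",   "x": 0.5, "y": 0.0, "width": 0.8, "depth": 0.6},
    {"id": "fridge", "x": 3.3, "y": 0.0, "width": 0.7, "depth": 0.7},
    {"id": "table",  "x": 1.5, "y": 1.8, "width": 1.2, "depth": 0.8}
  ]
}
"""


def centroid(obj):
    """Center of an axis-aligned footprint given its lower-left corner and size."""
    return (obj["x"] + obj["width"] / 2, obj["y"] + obj["depth"] / 2)


def centroid_distance(layout, id_a, id_b):
    """Euclidean distance between the centroids of two objects, looked up by id."""
    objects = {o["id"]: o for o in layout["objects"]}
    (ax, ay), (bx, by) = centroid(objects[id_a]), centroid(objects[id_b])
    return math.hypot(bx - ax, by - ay)


layout = json.loads(LAYOUT_JSON)
sink_to_fridge = centroid_distance(layout, "sink", "fridge")
print(f"sink to fridge: {sink_to_fridge:.2f}")
```

A ground-truth answer computed this way can be compared with a model's free-text answer within a tolerance; the tolerance and the answer-parsing step are left out of this sketch.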

Example Layouts

Representative layouts from the PlanQA dataset, spanning three room types (kitchen, living room, bedroom) and three geometric configurations (rectangular, L-shaped, open). Layouts are procedurally generated using explicit spatial constraints and validated for functional feasibility.
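
As a rough illustration of what validating a generated layout could involve, the sketch below checks two basic constraints on axis-aligned footprints: every object lies inside the room, and no two objects overlap. The `Rect` type and the `inside`, `overlaps`, and `violations` helpers are hypothetical, the room is assumed rectangular (L-shaped and open rooms are not handled), and the benchmark's own constraint set is likely richer.

```python
from itertools import combinations
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned footprint: lower-left corner (x, y), width w, height h."""
    name: str
    x: float
    y: float
    w: float
    h: float


def inside(room: Rect, r: Rect) -> bool:
    """True if r lies entirely within the rectangular room."""
    return (r.x >= room.x and r.y >= room.y
            and r.x + r.w <= room.x + room.w
            and r.y + r.h <= room.y + room.h)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the interiors of a and b intersect (touching edges do not count)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def violations(room: Rect, objects: List[Rect]) -> List[str]:
    """List human-readable constraint violations; an empty list means the layout passes."""
    problems = [f"{o.name} outside room" for o in objects if not inside(room, o)]
    problems += [f"{a.name} overlaps {b.name}"
                 for a, b in combinations(objects, 2) if overlaps(a, b)]
    return problems


room = Rect("room", 0.0, 0.0, 4.0, 3.0)
good = [Rect("bed", 0.0, 0.0, 2.0, 1.6), Rect("wardrobe", 3.4, 0.0, 0.6, 1.2)]
bad = good + [Rect("desk", 1.5, 1.0, 1.2, 0.6)]  # desk intrudes on the bed
print(violations(room, good))
print(violations(room, bad))
```

A generator could sample candidate placements and reject any layout for which `violations` is non-empty before applying functional checks such as clearance.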

Model Performance Analysis

**Question-level accuracy by category for each model across room types.**
**Models (left to right):** Qwen3-32B, DeepSeek-R1, DeepSeek-V3, LLaMA 3.3-70B, Gemma 2-27B, Phi-4, GPT-4.1, LLaMA 3.1-8B, Gemma 2-9B, Phi 3.5-mini, GPT-4o-mini.
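†*Note: High truncation rate due to token limits.*

| Room | Category | Qwen3-32B | DeepSeek-R1 | DeepSeek-V3 | LLaMA 3.3-70B | Gemma 2-27B | Phi-4 | GPT-4.1 | LLaMA 3.1-8B | Gemma 2-9B | Phi 3.5-mini | GPT-4o-mini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Group | Reasoning | Reasoning | Big | Big | Big | Big | Big | Small | Small | Small | Small |
| | Parameters (N) | 32.8B | 671B | 671B | 70B | 27.2B | 14B | N/A | 8B | 9.2B | 3.8B | N/A |
| | Temperature | 0.6 | 0.6 | 0.3 | 0.0 | 0.5 | 0.0 | 1.0 | 0.6 | 0.5 | 0.7 | 1.0 |
| Kitchen | Distance | 97.2 | 99.5 | 98.5 | 97.2 | 94.7 | 90.7 | 100 | 82.3 | 74.2 | 35.3 | 97.3 |
| | Area (counters) | 85.3 | 99.5 | 95.7 | 81.7 | 46.5 | 79.2 | 98.5 | 30.8 | 37.5 | 9.8 | 80.3 |
| | Free Space | 83.7 | 88.0 | 52.5 | 37.0 | 23.5 | 42.0 | 83.5 | 9.5 | 13.5 | 6.3 | 20.3 |
| | View Angle | 74.0 | 58.2 | 69.2 | 75.2 | 12.7 | 52.8 | 86.3 | 11.5 | 9.2 | 8.2 | 42.2 |
| | Repositioning | 91.5 | 96.3 | 69.8 | 48.7 | 14.8 | 22.7 | 85.8 | 3.7 | 6.2 | 11.5 | 40.7 |
| | Max Box | 33.3 | 30.7 | 9.2 | 8.8 | 3.0 | 3.0 | 57.2 | 1.7 | 1.2 | 1.7 | 3.7 |
| | Fit/Placement | 92.8 | 92.5 | 71.0 | 66.2 | 72.0 | 82.0 | 89.8 | 68.7 | 71.8 | 68.8 | 70.7 |
| | Path (Valid) | 13.2† | 6.3† | 39.3 | 30.3 | 23.2 | 26.8 | 73.0 | 10.5 | 17.5 | 14.8 | 35.7 |
| | Path (Fréchet) | 15.0 | 6.3 | 32.8 | 30.2 | 26.0 | 26.0 | 55.8 | 10.2 | 19.0 | 9.3 | 36.7 |
| | Missing Object | 87.3 | 88.7 | 44.3 | 56.2 | 52.3 | 52.2 | 79.8 | 27.5 | 39.7 | 14.3 | 58.0 |
| | Obstruction | 84.0 | 95.2 | 32.7 | 6.0 | 3.3 | 9.7 | 93.5 | 1.2 | 2.7 | 11.3 | 14.2 |
| Living room | Distance | 98.7 | 99.8 | 99.5 | 98.2 | 96.3 | 98.8 | 99.8 | 87.0 | 81.8 | 58.8 | 98.5 |
| | Area (sitting) | 96.7 | 99.5 | 84.5 | 98.0 | 83.8 | 88.7 | 99.5 | 32.5 | 41.3 | 12.8 | 85.7 |
| | Free Space | 0.3 | 4.3 | 1.0 | 0.3 | 3.7 | 0.3 | 5.0 | 0.7 | 3.2 | 1.5 | 1.7 |
| | View Angle | 81.5 | 86.0 | 70.0 | 76.8 | 14.3 | 50.3 | 96.0 | 14.8 | 8.3 | 10.5 | 42.2 |
| | Repositioning | 80.5 | 93.0 | 45.0 | 33.5 | 12.8 | 19.2 | 71.5 | 6.5 | 5.7 | 6.3 | 29.3 |
| | Max Box | 1.5† | 0.8† | 1.0 | 3.0 | 2.0 | 2.8 | 6.5 | 0.8 | 1.5 | 1.0 | 2.3 |
| | Fit/Placement | 90.7 | 91.2 | 71.5 | 80.0 | 83.7 | 87.7 | 91.8 | 75.0 | 75.5 | 72.8 | 72.7 |
| | Path (Valid) | 10.0† | 14.7† | 33.0 | 26.7 | 30.5 | 26.7 | 53.2 | 7.3 | 21.7 | 16.7 | 33.5 |
| | Path (Fréchet) | 13.2 | 14.2 | 23.7 | 17.8 | 19.2 | 14.7 | 48.0 | 2.5 | 14.2 | 6.8 | 25.2 |
| | Missing Object | 73.0 | 76.2 | 51.0 | 49.3 | 29.5 | 36.0 | 65.5 | 9.7 | 28.3 | 11.7 | 32.3 |
| | Obstruction | 80.7 | 96.5 | 24.3 | 7.3 | 3.8 | 9.3 | 84.7 | 2.5 | 5.2 | 4.7 | 11.7 |
| Bedroom | Distance | 98.7 | 99.8 | 99.5 | 98.2 | 96.3 | 98.8 | 99.8 | 87.0 | 81.8 | 58.8 | 98.5 |
| | Area (storage) | 98.7 | 99.8 | 94.0 | 97.0 | 86.3 | 88.3 | 99.3 | 30.3 | 66.3 | 47.7 | 88.2 |
| | Free Space | 1.7 | 5.8 | 1.2 | 0.3 | 2.5 | 1.2 | 2.8 | 1.8 | 1.0 | 1.0 | 1.2 |
| | View Angle | 76.0 | 79.8 | 70.0 | 78.3 | 10.8 | 57.0 | 94.2 | 15.3 | 10.2 | 7.7 | 43.0 |
| | Repositioning | 78.7 | 94.3 | 53.8 | 36.0 | 11.2 | 15.7 | 73.7 | 4.3 | 9.0 | 6.5 | 31.5 |
| | Max Box | 0.7† | 1.0† | 2.0 | 1.8 | 2.0 | 2.5 | 7.2 | 1.0 | 0.7 | 1.0 | 2.0 |
| | Fit/Placement | 86.3 | 86.3 | 65.8 | 66.7 | 66.5 | 73.0 | 82.0 | 66.0 | 63.8 | 64.7 | 66.2 |
| | Path (Valid) | 15.5† | 20.8† | 49.7 | 40.7 | 38.5 | 35.8 | 67.8 | 11.3 | 34.5 | 20.5 | 48.8 |
| | Path (Fréchet) | 15.3 | 21.5 | 30.3 | 30.3 | 27.0 | 19.0 | 49.5 | 3.7 | 19.7 | 8.5 | 36.3 |
| | Missing Object | 64.3 | 65.2 | 39.8 | 40.3 | 33.0 | 25.2 | 61.0 | 14.8 | 19.7 | 8.3 | 34.5 |
| | Obstruction | 87.2 | 95.7 | 32.0 | 4.3 | 1.3 | 5.8 | 89.5 | 2.2 | 5.3 | 9.3 | 12.2 |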
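
The Path (Fréchet) rows refer to the Fréchet distance, a standard measure of similarity between two curves that respects the order of points along each curve. Below is a minimal sketch of the discrete variant, computed by dynamic programming over two polylines. It assumes that a predicted path is compared with a reference path as lists of 2D waypoints; how the benchmark turns this distance into the reported percentages is not stated here, and the function name `discrete_frechet` is illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def discrete_frechet(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Discrete Fréchet distance between polylines p and q (lists of waypoints).

    ca[i][j] holds the smallest achievable "leash length" for a joint traversal
    that has reached waypoint i on p and waypoint j on q.
    """
    if not p or not q:
        raise ValueError("both paths must contain at least one waypoint")
    n, m = len(p), len(q)
    ca: List[List[float]] = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along p, along q, or along both; keep the best option.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Illustrative waypoints only: two parallel straight paths one unit apart.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(discrete_frechet(reference, predicted))
```

Unlike a pointwise average, this measure penalizes a path that visits the right places in the wrong order, which is why it suits route comparison.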

BibTeX

@article{rodionov2025planqa,
  author    = {Rodionov, Fedor and Eldesokey, Abdelrahman and Birsak, Michael and Femiani, John and Ghanem, Bernard and Wonka, Peter},
  title     = {PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations},
  journal   = {arXiv preprint},
  year      = {2025},
}