NavOne

One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

Dijia Zhan^*, Jinyi Li^*, Chenxi Zheng, Shaoyu Huang, Yong Li^†, Jie Tang^†, Xuemiao Xu^†

South China University of Technology

^* Equal contribution ^† Corresponding authors

NavOne reformulates Vision-Language Navigation as one-step global path planning over pre-built top-down RGB, occupancy, and semantic maps, directly predicting dense path and goal probabilities in a single forward pass.


Figure: NavOne process overview. Given a language instruction and multi-modal top-down maps, NavOne predicts complete path and goal probability maps in one forward pass.

47% Val Unseen SR

43% Val Unseen SPL

37 ms Planning time

8x / 80x Speedup over IPPD / ETPNav

6,196 Training episodes

Overview

Global planning instead of step-by-step action loops.

Existing VLN systems commonly act from egocentric observations one step at a time, which can accumulate errors and limit planning efficiency. NavOne introduces Top-Down VLN, in which an agent predicts a complete navigation path from a language instruction and a pre-built top-down map.

The model processes RGB, occupancy, and semantic map layers jointly, grounds the instruction with spatial features, and produces interpretable path and goal probability maps that are converted into an executable trajectory.

Method

NavOne predicts the whole route in one pass.

The framework combines map fusion, language-conditioned spatial reasoning, and symbolic path extraction into a single global planning pipeline.

Figure: NavOne architecture with map fuser, language encoder, path former, decoder, and path extractor. NavOne fuses top-down maps, attends to language instructions, predicts dense path and goal distributions, and extracts the final route with A* search.

Top-Down Map Fuser

Concatenates RGB, occupancy, and semantic map representations before patch embedding to form a joint spatial input.
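The fusion step described above can be sketched in plain Python. This is a minimal illustration, not the NavOne implementation: the function names `fuse_maps` and `patchify`, the one-hot semantic encoding, and the toy sizes (a 4x4 grid with 2 semantic classes instead of 41) are all assumptions, and the learned linear projection that would normally follow patch extraction is omitted.

```python
from typing import List

Grid = List[List[List[float]]]  # H x W x C, row-major (hypothetical layout)


def fuse_maps(rgb: Grid, occupancy: Grid, semantic: Grid) -> Grid:
    """Concatenate the per-cell channels of aligned maps along the channel axis."""
    height, width = len(rgb), len(rgb[0])
    for grid in (occupancy, semantic):
        if len(grid) != height or any(len(row) != width for row in grid):
            raise ValueError("all maps must share the same H x W grid")
    return [
        [rgb[y][x] + occupancy[y][x] + semantic[y][x] for x in range(width)]
        for y in range(height)
    ]


def patchify(fused: Grid, patch: int) -> List[List[float]]:
    """Split the fused grid into non-overlapping patch x patch tokens, flattened."""
    height, width = len(fused), len(fused[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token: List[float] = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(fused[y][x])
            tokens.append(token)
    return tokens


# Toy inputs: 3 RGB channels, 1 occupancy channel, 2 one-hot semantic channels.
H = W = 4
rgb = [[[0.5, 0.5, 0.5] for _ in range(W)] for _ in range(H)]
occ = [[[1.0] for _ in range(W)] for _ in range(H)]
sem = [[[0.0, 1.0] for _ in range(W)] for _ in range(H)]

fused = fuse_maps(rgb, occ, sem)
tokens = patchify(fused, patch=2)
print(len(tokens), "tokens of length", len(tokens[0]))
```

Concatenating before patch embedding means every token already carries appearance, free-space, and semantic evidence for the same map region, so a single embedding can mix the three modalities.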

Path Former

Integrates pose, language, and fused map tokens with cross-attention and spatial-aware Attention Residuals.
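As a rough illustration of the cross-attention ingredient only, the sketch below lets map tokens query language tokens with single-head scaled dot-product attention. It is an assumption-laden toy: the function names are hypothetical, there are no learned projections, pose tokens, or multiple heads, and the spatial-aware Attention Residuals are not reproduced here.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Each query attends over all keys and returns a weighted sum of values."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


# Toy tokens: two map tokens (queries) and two language tokens (keys/values).
map_tokens = [[1.0, 0.0], [0.0, 1.0]]
language_keys = [[4.0, 0.0], [0.0, 4.0]]
language_values = [[1.0, 0.0], [0.0, 1.0]]

attended = cross_attention(map_tokens, language_keys, language_values)
print(attended)
```

In this toy, each map token mostly picks up the value of the language token whose key points in the same direction, which is the basic mechanism by which instruction words can be grounded to map regions.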

Path Extractor

Converts predicted path and goal probability maps into an executable navigation trajectory through A* search.
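One way such an extraction step could work is sketched below, under stated assumptions: the goal is taken as the argmax of the goal probability map, the grid is 4-connected, and each step costs `1 + penalty * (1 - p)` where `p` is the path probability of the cell being entered. The cost function, the `penalty` value, and all names are hypothetical choices for illustration; the page only states that A* search is used.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def argmax_cell(prob: List[List[float]]) -> Cell:
    """Return the (row, col) of the highest-probability cell."""
    cells = [(r, c) for r in range(len(prob)) for c in range(len(prob[0]))]
    return max(cells, key=lambda rc: prob[rc[0]][rc[1]])


def extract_path(path_prob: List[List[float]], goal_prob: List[List[float]],
                 free: List[List[bool]], start: Cell,
                 penalty: float = 4.0) -> Optional[List[Cell]]:
    """A* from start to the most likely goal, preferring high path probability."""
    goal = argmax_cell(goal_prob)
    rows, cols = len(free), len(free[0])

    def heuristic(cell: Cell) -> float:
        # Manhattan distance is admissible because every step costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_cost: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(heuristic(start), 0.0, start)]
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cost > best_cost[cell]:
            continue  # stale queue entry
        if cell == goal:
            route = [cell]
            while cell in parent:
                cell = parent[cell]
                route.append(cell)
            return route[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or not free[nr][nc]:
                continue
            new_cost = cost + 1.0 + penalty * (1.0 - path_prob[nr][nc])
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable, e.g. predicted inside an obstacle


# Toy 3x3 map: high path probability along the top row and right column.
path_prob = [[0.9, 0.9, 0.9],
             [0.1, 0.1, 0.9],
             [0.1, 0.1, 0.9]]
goal_prob = [[0.0, 0.0, 0.0],
             [0.0, 0.0, 0.1],
             [0.0, 0.0, 0.9]]
free = [[True, True, True],
        [True, False, True],
        [True, True, True]]

route = extract_path(path_prob, goal_prob, free, start=(0, 0))
print(route)
```

Running a search over the predicted probabilities, rather than thresholding them, guarantees a connected, obstacle-free route even when the dense prediction has gaps.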

R2R-TopDown

A multi-modal top-down benchmark for language-guided path planning.

R2R-TopDown transfers single-floor R2R-CE trajectories to top-down map representations. Each episode includes a language instruction, a start pose, an RGB map, an occupancy map, a semantic map with 41 object categories, and a target trajectory.
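For readers who want a concrete picture of one episode, the fields listed above could be held in a record like the one below. This is only a sketch: the class name `TopDownEpisode`, the field names, the `(x, y, heading)` pose convention, and the per-cell integer label encoding are assumptions, not the released data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_SEMANTIC_CLASSES = 41  # category count stated in the benchmark description


@dataclass
class TopDownEpisode:
    instruction: str
    start_pose: Tuple[float, float, float]      # assumed (x, y, heading)
    rgb_map: List[List[Tuple[int, int, int]]]   # H x W RGB triples
    occupancy_map: List[List[int]]              # H x W, assumed 1 = occupied
    semantic_map: List[List[int]]               # H x W class ids, assumed 0..40
    trajectory: List[Tuple[float, float]]       # target path as map coordinates

    def validate(self) -> None:
        """Check that all map layers are aligned and labels are in range."""
        height, width = len(self.rgb_map), len(self.rgb_map[0])
        for name, grid in (("occupancy_map", self.occupancy_map),
                           ("semantic_map", self.semantic_map)):
            if len(grid) != height or any(len(row) != width for row in grid):
                raise ValueError(f"{name} is not aligned with rgb_map")
        for row in self.semantic_map:
            for label in row:
                if not 0 <= label < NUM_SEMANTIC_CLASSES:
                    raise ValueError(f"semantic label {label} out of range")
        if not self.trajectory:
            raise ValueError("trajectory must contain at least one point")


# Toy 2x2 episode with made-up values.
episode = TopDownEpisode(
    instruction="Walk past the sofa and stop at the kitchen door.",
    start_pose=(0.0, 0.0, 0.0),
    rgb_map=[[(255, 255, 255), (200, 200, 200)],
             [(120, 120, 120), (0, 0, 0)]],
    occupancy_map=[[0, 0], [0, 1]],
    semantic_map=[[0, 3], [3, 40]],
    trajectory=[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
)
episode.validate()
```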

Figure: R2R-TopDown examples. Multi-modal map inputs: RGB, occupancy, semantic labels, and reference trajectory.

Figure: Scene-level visualization of training episodes, with colored trajectories from starts to goals, showing coverage across navigable indoor spaces.

Split	Samples	Path Length	Instruction Length
Train	6,196	9.58 m	26.5
Val Seen	439	9.92 m	27.3
Val Unseen	1,003	9.83 m	26.8

Results

Top success on the R2R-TopDown single-floor evaluation setting.

Primary metrics: Val Unseen SR/SPL. On the filtered R2R-TopDown Val Unseen subset, NavOne reaches 0.47 SR and 0.43 SPL, compared with 0.37 and 0.31 for IPPD^* on the same subset, while reducing planning time to 37 ms per episode.

Method	Seen SR	Seen SPL	Unseen SR	Unseen SPL	Unseen TL	Unseen NE
WS-MGMap	0.47	0.43	0.39	0.34	10.00	6.28
MapNav	-	-	0.40	0.37	-	4.93
IPPD	0.57	0.54	0.45	0.42	-	-
IPPD^*	-	-	0.37	0.31	-	-
NavOne (AR-Full+SQ)	0.57	0.50	0.47	0.43	9.20	5.18

WS-MGMap, MapNav, and IPPD report on full R2R Val Unseen episodes. IPPD^* and NavOne are evaluated on the filtered R2R-TopDown Val Unseen subset of 1,003 single-floor episodes, so direct numerical comparisons with NavOne should be interpreted with caution.
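The SR, SPL, TL, and NE columns above can be computed along the following lines. The sketch assumes the standard VLN definitions, including the 3 m success threshold commonly used for R2R; this page does not state the threshold, and the record type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EpisodeResult:
    final_distance_m: float   # distance from stop position to goal (NE)
    path_length_m: float      # length of the executed trajectory (TL)
    shortest_path_m: float    # reference shortest-path length, assumed > 0


def is_success(result: EpisodeResult, threshold_m: float = 3.0) -> bool:
    """An episode succeeds if the agent stops within the threshold of the goal."""
    return result.final_distance_m < threshold_m


def success_rate(results: Sequence[EpisodeResult]) -> float:
    return sum(is_success(r) for r in results) / len(results)


def spl(results: Sequence[EpisodeResult]) -> float:
    """Success weighted by path length: mean of S * l / max(p, l)."""
    total = 0.0
    for r in results:
        if is_success(r):
            total += r.shortest_path_m / max(r.path_length_m, r.shortest_path_m)
    return total / len(results)


def mean_navigation_error(results: Sequence[EpisodeResult]) -> float:
    return sum(r.final_distance_m for r in results) / len(results)


def mean_trajectory_length(results: Sequence[EpisodeResult]) -> float:
    return sum(r.path_length_m for r in results) / len(results)


# Two made-up episodes: one success with a slightly long path, one failure.
results = [
    EpisodeResult(final_distance_m=1.0, path_length_m=12.0, shortest_path_m=10.0),
    EpisodeResult(final_distance_m=5.0, path_length_m=8.0, shortest_path_m=10.0),
]
print(success_rate(results), spl(results))
```

SPL can never exceed SR, because each successful episode contributes at most 1; the gap between the two reflects how much longer the executed paths are than the shortest ones.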

NavOne variant summary

The final AR-Full+SQ model is selected for stronger generalization on unseen maps. Standard attention performs best on Val Seen, while spatial-aware depth queries provide the highest Val Unseen SR/SPL.

Variant	Seen SR	Seen SPL	Unseen SR	Unseen SPL	Unseen TL	Unseen NE
NavOne (Std. Attn)	0.63	0.55	0.40	0.36	10.45	6.00
NavOne (AR-Full)	0.59	0.52	0.44	0.40	9.50	5.40
NavOne (AR-Full+SQ)	0.57	0.50	0.47	0.43	9.20	5.18

Figure: Qualitative NavOne success case. Predicted path in red, ground truth in green, with goal and path probability maps.

ETPNav 2970 ms

IPPD 300 ms

NavOne 37 ms

Planning times are measured end to end, from instruction input to executable path output, on the same NVIDIA RTX 4090D GPU. The ETPNav comparison reflects planning speed under the pre-built-map assumption.
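The headline speedup figures follow directly from the three timings listed above. The short check below only redoes that arithmetic; the dictionary and function names are illustrative.

```python
# Planning times per episode in milliseconds, as reported on this page.
planning_ms = {"ETPNav": 2970.0, "IPPD": 300.0, "NavOne": 37.0}


def speedup_over(baseline: str, reference: str = "NavOne") -> float:
    """How many times faster the reference is than the baseline."""
    return planning_ms[baseline] / planning_ms[reference]


for name in ("IPPD", "ETPNav"):
    print(f"NavOne is {speedup_over(name):.1f}x faster than {name}")
```

The ratios come out at roughly 8.1 and 80.3, which round to the 8x and 80x figures quoted in the summary.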

Qualitative Gallery

Diverse route predictions across indoor and real-robot settings.

Figure: Multi-room navigation.

Figure: Long-corridor navigation.

Figure: Kitchen navigation.

Figure: Outdoor pool area.

Figure: Real-robot corridor 1.

Figure: Real-robot corridor 2.

Citation

Cite NavOne

The paper is available as arXiv:2605.06317.

@article{zhan2026navone,
  title = {NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps},
  author = {Zhan, Dijia and Li, Jinyi and Zheng, Chenxi and Huang, Shaoyu and Li, Yong and Tang, Jie and Xu, Xuemiao},
  journal = {arXiv preprint arXiv:2605.06317},
  year = {2026}
}