arXiv:2605.06317

NavOne

One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li, Jie Tang, Xuemiao Xu

NavOne reformulates Vision-Language Navigation as one-step global path planning over pre-built top-down RGB, occupancy, and semantic maps, directly predicting dense path and goal probabilities in a single forward pass.

NavOne process overview from language instruction and top-down maps to predicted path and goal.
Given a language instruction and multi-modal top-down maps, NavOne predicts complete path and goal probability maps in one forward pass.
47% Val Unseen SR
43% Val Unseen SPL
37 ms Planning time
8x / 80x Speedup over IPPD / ETPNav
6,196 Training episodes

Overview

Global planning instead of step-by-step action loops.

Existing VLN systems commonly act from egocentric observations one step at a time, which can accumulate errors and limit planning efficiency. NavOne introduces Top-Down VLN, where an agent predicts a complete navigation path from a language instruction and a pre-built top-down map.

The model processes RGB, occupancy, and semantic map layers jointly, grounds the instruction with spatial features, and produces interpretable path and goal probability maps that are converted into an executable trajectory.

Method

NavOne predicts the whole route in one pass.

The framework combines map fusion, language-conditioned spatial reasoning, and symbolic path extraction into a single global planning pipeline.

NavOne architecture with map fuser, language encoder, path former, decoder, and path extractor.
NavOne fuses top-down maps, attends to language instructions, predicts dense path and goal distributions, and extracts the final route with A* search.

Top-Down Map Fuser

Concatenates RGB, occupancy, and semantic map representations before patch embedding to form a joint spatial input.
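
The fusion step described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes channels-last grids, a one-hot semantic layer with 41 channels, and a hypothetical patch size of 2. The learned linear projection that would turn each flattened patch into an embedding is omitted.

```python
from typing import List

Grid = List[List[List[float]]]  # H x W x C, channels last


def fuse_maps(rgb: Grid, occupancy: Grid, semantic: Grid) -> Grid:
    """Concatenate the three map layers along the channel axis."""
    height, width = len(rgb), len(rgb[0])
    for layer in (occupancy, semantic):
        if len(layer) != height or len(layer[0]) != width:
            raise ValueError("all map layers must share the same H x W grid")
    return [
        [rgb[y][x] + occupancy[y][x] + semantic[y][x] for x in range(width)]
        for y in range(height)
    ]


def patchify(fused: Grid, patch: int) -> List[List[float]]:
    """Cut the fused map into non-overlapping patch x patch tiles, flattened."""
    height, width = len(fused), len(fused[0])
    if height % patch or width % patch:
        raise ValueError("map size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token: List[float] = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(fused[y][x])
            tokens.append(token)
    return tokens


# Toy 4 x 4 map: 3 RGB channels, 1 occupancy channel, 41 semantic channels (assumed one-hot).
H = W = 4
rgb = [[[0.5, 0.5, 0.5] for _ in range(W)] for _ in range(H)]
occupancy = [[[1.0] for _ in range(W)] for _ in range(H)]
semantic = [[[1.0 if c == 0 else 0.0 for c in range(41)] for _ in range(W)] for _ in range(H)]

fused = fuse_maps(rgb, occupancy, semantic)  # 45 channels per cell
tokens = patchify(fused, patch=2)            # 4 tokens of length 2 * 2 * 45
```

Concatenating before patch embedding means every token already mixes appearance, free space, and semantics for the same map region, so later layers do not need to align three separate token streams.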

Path Former

Integrates pose, language, and fused map tokens with cross-attention and spatial-aware Attention Residuals.
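
As a rough illustration of the cross-attention part only, the sketch below lets map tokens query language tokens with single-head scaled dot-product attention and a plain residual connection. The query/key direction, the single head, and the absence of learned projections are all assumptions made for brevity; the spatial-aware Attention Residuals are not reproduced here.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Single-head scaled dot-product attention plus a residual connection."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        # Residual: add the attended language context back onto the map token.
        outputs.append([qi + ai for qi, ai in zip(q, attended)])
    return outputs


# Toy tokens (hypothetical values): 2 map tokens attend to 3 language tokens.
map_tokens = [[1.0, 0.0], [0.0, 1.0]]
language_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = cross_attention(map_tokens, language_tokens, language_tokens)
```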

Path Extractor

Converts predicted path and goal probability maps into an executable navigation trajectory through A* search.
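
One way to read this step is as a shortest-path search whose step cost drops where the predicted path probability is high. The sketch below is an assumption-laden illustration, not the paper's extractor: the goal is taken as the free cell with the highest goal probability, movement is 4-connected, and the cost `1 + penalty * (1 - p)` with a hypothetical `penalty` weight is one plausible choice. Because every step costs at least 1, the Manhattan distance remains an admissible heuristic.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def extract_path(
    path_prob: List[List[float]],
    goal_prob: List[List[float]],
    free: List[List[bool]],
    start: Cell,
    penalty: float = 4.0,
) -> Optional[List[Cell]]:
    """A* from start to the most probable goal cell, preferring high path probability."""
    rows, cols = len(free), len(free[0])
    goal = max(
        ((r, c) for r in range(rows) for c in range(cols) if free[r][c]),
        key=lambda cell: goal_prob[cell[0]][cell[1]],
    )

    def heuristic(cell: Cell) -> float:
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0.0, start)]
    best_cost: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry; a cheaper route was found already
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or not free[nr][nc]:
                continue
            new_cost = cost + 1.0 + penalty * (1.0 - path_prob[nr][nc])
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable from start


# Toy 3 x 3 map: centre cell blocked, high path probability along the top and right edges.
free = [[True, True, True], [True, False, True], [True, True, True]]
path_prob = [[0.9, 0.9, 0.9], [0.1, 0.0, 0.9], [0.1, 0.1, 0.9]]
goal_prob = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.8]]
route = extract_path(path_prob, goal_prob, free, start=(0, 0))
# route == [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
```

Both routes around the blocked cell have the same number of steps, so the probability-weighted cost is what steers the search along the predicted corridor.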

R2R-TopDown

A multi-modal top-down benchmark for language-guided path planning.

R2R-TopDown transfers single-floor R2R-CE trajectories to top-down map representations. Each episode includes a language instruction, start pose, RGB map, occupancy map, semantic map with 41 object categories, and trajectory target.
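
For readers who want a concrete picture of one episode, the record below mirrors the listed fields. It is a hypothetical container: the field names, the `(x, y, heading)` pose layout, the occupancy encoding, and the toy instruction are assumptions for illustration, not the benchmark's actual file format. Only the count of 41 semantic categories comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_SEMANTIC_CATEGORIES = 41  # category count stated in the benchmark description


@dataclass
class TopDownEpisode:
    """One R2R-TopDown episode; all field names here are hypothetical."""

    instruction: str
    start_pose: Tuple[float, float, float]     # assumed (x, y, heading) layout
    rgb_map: List[List[Tuple[int, int, int]]]  # H x W grid of RGB pixels
    occupancy_map: List[List[int]]             # H x W grid, assumed 1 = occupied
    semantic_map: List[List[int]]              # H x W grid of category ids
    trajectory: List[Tuple[float, float]]      # target path in map coordinates

    def validate(self) -> None:
        """Check that the map layers align and category ids are in range."""
        shape = (len(self.rgb_map), len(self.rgb_map[0]))
        for name in ("occupancy_map", "semantic_map"):
            layer = getattr(self, name)
            if (len(layer), len(layer[0])) != shape:
                raise ValueError(f"{name} does not match the RGB map size")
        for row in self.semantic_map:
            for category in row:
                if not 0 <= category < NUM_SEMANTIC_CATEGORIES:
                    raise ValueError(f"semantic id {category} out of range")


# Toy 2 x 2 episode with made-up values.
episode = TopDownEpisode(
    instruction="Walk past the sofa and stop at the kitchen door.",
    start_pose=(0.0, 0.0, 0.0),
    rgb_map=[[(128, 128, 128)] * 2 for _ in range(2)],
    occupancy_map=[[0, 0], [0, 1]],
    semantic_map=[[0, 3], [3, 40]],
    trajectory=[(0.0, 0.0), (1.0, 0.0)],
)
episode.validate()
```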

R2R-TopDown examples showing RGB map, occupancy map, semantic map, and ground-truth trajectory.
Multi-modal map inputs: RGB, occupancy, semantic labels, and reference trajectory.
Scene-level visualization of training episodes with colored trajectories from starts to goals.
Scene-level coverage of training episodes across navigable indoor spaces.
| Split | Samples | Path Length | Instruction Length |
| --- | --- | --- | --- |
| Train | 6,196 | 9.58 m | 26.5 |
| Val Seen | 439 | 9.92 m | 27.3 |
| Val Unseen | 1,003 | 9.83 m | 26.8 |

Results

Top success on the R2R-TopDown single-floor evaluation setting.

Primary metrics: Val Unseen SR/SPL. On the filtered R2R-TopDown Val Unseen subset, NavOne reaches 0.47 SR and 0.43 SPL, up from 0.37 and 0.31 for IPPD* on the same subset, while reducing planning time to 37 ms per episode.

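| Method | Seen SR | Seen SPL | Unseen SR | Unseen SPL | Unseen TL | Unseen NE |
| --- | --- | --- | --- | --- | --- | --- |
| WS-MGMap | 0.47 | 0.43 | 0.39 | 0.34 | 10.00 | 6.28 |
| MapNav | - | - | 0.40 | 0.37 | - | 4.93 |
| IPPD | 0.57 | 0.54 | 0.45 | 0.42 | - | - |
| IPPD* | - | - | 0.37 | 0.31 | - | - |
| NavOne (AR-Full+SQ) | 0.57 | 0.50 | 0.47 | 0.43 | 9.20 | 5.18 |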

WS-MGMap, MapNav, and IPPD report on full R2R Val Unseen episodes. IPPD* and NavOne are evaluated on the filtered R2R-TopDown Val Unseen subset of 1,003 single-floor episodes, so direct numerical comparisons with NavOne should be interpreted with caution.
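
For reference, SR and SPL in the VLN literature are usually computed as below: an episode succeeds when the agent stops within a distance threshold of the goal, and SPL weights each success by the ratio of shortest-path length to the length actually travelled. The 3 m threshold is the common R2R convention and is assumed here rather than taken from this page; the three example episodes are made-up numbers.

```python
from typing import List, Tuple

# (final distance to goal, agent path length, shortest path length), all in metres
EpisodeResult = Tuple[float, float, float]


def success_rate(results: List[EpisodeResult], threshold: float = 3.0) -> float:
    """Fraction of episodes that end within `threshold` metres of the goal."""
    return sum(dist < threshold for dist, _, _ in results) / len(results)


def spl(results: List[EpisodeResult], threshold: float = 3.0) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for dist, agent_len, shortest_len in results:
        if dist < threshold:
            total += shortest_len / max(agent_len, shortest_len)
    return total / len(results)


results = [
    (1.0, 10.0, 10.0),  # success along the shortest path
    (2.0, 20.0, 10.0),  # success, but the path is twice as long as needed
    (5.0, 8.0, 10.0),   # failure: stopped 5 m from the goal
]
sr = success_rate(results)  # 2 of 3 episodes succeed
score = spl(results)        # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

SPL can never exceed SR, which is why the SPL columns above sit a few points below the matching SR columns.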

NavOne variant summary

The final AR-Full+SQ model is selected for stronger generalization on unseen maps. Standard attention performs best on Val Seen, while spatial-aware depth queries provide the highest Val Unseen SR/SPL.

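| Variant | Seen SR | Seen SPL | Unseen SR | Unseen SPL | Unseen TL | Unseen NE |
| --- | --- | --- | --- | --- | --- | --- |
| NavOne (Std. Attn) | 0.63 | 0.55 | 0.40 | 0.36 | 10.45 | 6.00 |
| NavOne (AR-Full) | 0.59 | 0.52 | 0.44 | 0.40 | 9.50 | 5.40 |
| NavOne (AR-Full+SQ) | 0.57 | 0.50 | 0.47 | 0.43 | 9.20 | 5.18 |
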
Qualitative NavOne success case with predicted path, ground truth path, goal probability, and path probability.
Predicted path in red, ground truth in green, with goal and path probability maps.
Planning time per episode: ETPNav 2970 ms, IPPD 300 ms, NavOne 37 ms.

Planning times are measured end-to-end from instruction input to executable path output on the same NVIDIA 4090D GPU. The ETPNav comparison reflects planning speed under the pre-built-map assumption.
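
The headline 8x / 80x speedups follow directly from the reported planning times; the short check below only redoes that division and introduces no new measurements.

```python
# Planning times in milliseconds, as reported on this page.
planning_ms = {"ETPNav": 2970, "IPPD": 300, "NavOne": 37}

# Speedup of NavOne relative to each baseline.
speedup = {
    name: ms / planning_ms["NavOne"]
    for name, ms in planning_ms.items()
    if name != "NavOne"
}

for name, factor in speedup.items():
    print(f"NavOne is about {factor:.1f}x faster than {name}")
```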

Citation

Cite NavOne

The paper is available as arXiv:2605.06317.

@article{zhan2026navone,
  title = {NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps},
  author = {Zhan, Dijia and Li, Jinyi and Zheng, Chenxi and Huang, Shaoyu and Li, Yong and Tang, Jie and Xu, Xuemiao},
  journal = {arXiv preprint arXiv:2605.06317},
  year = {2026}
}