Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ achieves competitive 3D and 2D segmentation and reconstruction performance on both simulated and real-world datasets, and demonstrates a use case as a robot simulator.
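To make the cross-attention perspective concrete, below is a minimal sketch, assuming a per-point key field and one learnable 3D Gaussian (mean, inverse covariance, feature) per instance query: the attention map is a softmax over instances of feature similarity plus the Gaussian log-density, and a panoptic head gates each instance by the semantic probability of its class. All names (`gaussian_log_prior`, `instance_attention`, `panoptic_probs`) and the additive/multiplicative fusion rules are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gaussian_log_prior(x, mu, inv_cov):
    """Unnormalized log-density of each instance Gaussian at each point.
    x: (P, 3) sample points; mu: (N, 3) means; inv_cov: (N, 3, 3)."""
    d = x[None, :, :] - mu[:, None, :]                     # (N, P, 3) offsets
    maha = torch.einsum('npi,nij,npj->np', d, inv_cov, d)  # squared Mahalanobis
    return -0.5 * maha                                     # (N, P)

def instance_attention(x, keys, q_feat, mu, inv_cov):
    """Attention map between instance queries and field keys.
    keys: (P, C) per-point embeddings; q_feat: (N, C) query features."""
    sim = q_feat @ keys.T                                  # (N, P) feature affinity
    logits = sim + gaussian_log_prior(x, mu, inv_cov)      # inject 3D spatial prior
    return F.softmax(logits, dim=0)                        # per-point instance probs

def panoptic_probs(inst_probs, sem_probs, inst_to_class):
    """Fuse instance and semantic probabilities into one panoptic head.
    sem_probs: (K, P) class probs; inst_to_class: (N,) class of each query.
    Gating each instance by its class probability is an assumed fusion rule."""
    fused = inst_probs * sem_probs[inst_to_class]          # (N, P)
    return fused / fused.sum(dim=0, keepdim=True).clamp_min(1e-8)
```

Because the Gaussian prior enters the attention logits, nearby points are biased toward the same instance, which is how spatial proximity is preserved while the whole pipeline remains differentiable end to end.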
End-to-end open-vocabulary panoptic reconstruction from 2D foundation model outputs faces three challenges:
1) Misalignment: 2D instance IDs are not aligned across frames (see the matching sketch after this list).
2) Ambiguity: Due to the limited FoV, it is ambiguous whether two objects that never co-occur in any image are the same instance or different instances.
3) Inconsistency: The semantic and instance segmentations obtained from two independent networks are mutually inconsistent.
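As a concrete illustration of how the query formulation resolves challenge 1, the sketch below matches each frame's 2D foundation-model masks to masks rendered from the global 3D instance queries via optimal linear assignment. The IoU cost, SciPy's Hungarian solver, and the function name `align_frame_ids` are assumptions for illustration, not necessarily the paper's exact choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_frame_ids(rendered, observed):
    """Map this frame's 2D mask IDs to global 3D instance IDs.
    rendered: (N, H, W) {0,1} masks rendered from the instance queries.
    observed: (M, H, W) {0,1} masks from the 2D foundation model."""
    r = rendered.reshape(len(rendered), -1).astype(np.float32)
    o = observed.reshape(len(observed), -1).astype(np.float32)
    inter = r @ o.T                                        # (N, M) overlap counts
    union = r.sum(1)[:, None] + o.sum(1)[None, :] - inter
    iou = inter / np.clip(union, 1.0, None)
    glob, frame = linear_sum_assignment(-iou)              # maximize total IoU
    return {int(f): int(g) for g, f in zip(glob, frame)}   # frame ID -> global ID
```

Once each frame's IDs are mapped to global query IDs, the 2D instance masks can supervise the 3D instance field consistently across frames.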
PVLFF suffers from over-segmentation and weak semantic segmentation accuracy, whereas our method segments object-level instances and achieves high panoptic/semantic segmentation accuracy and reconstruction quality.
Panoptic Lifting relies only on 2D VLM mask observations without any 3D spatial prior and suffers from the FoV limitation, so tables and chairs in different rooms can end up with the same ID; our method solves this problem by introducing a 3D instance spatial prior.
---
A Jackal UGV navigates in Gazebo using the meshes generated by PanopticRecon++ trained on ScanNet++.
```bibtex
@misc{yu2025leveragecrossattentionendtoendopenvocabulary,
  title={Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction},
  author={Xuan Yu and Yuxuan Xie and Yili Liu and Haojian Lu and Rong Xiong and Yiyi Liao and Yue Wang},
  year={2025},
  eprint={2501.01119},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.01119},
}
```