Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Xuan Yu1, Yuxuan Xie1, Yili Liu1, Sitong Mao2, Shunbo Zhou2, Haojian Lu1, Rong Xiong1, Yiyi Liao1, Yue Wang1*
1Zhejiang University, 2Huawei

Abstract

Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction from a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene’s 3D embedding field (as keys) through their attention map. Unlike existing methods that optimize queries and keys separately or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while remaining end-to-end optimizable. Moreover, the query formulation facilitates aligning 2D open-vocabulary instance IDs across frames via optimal linear assignment between 2D masks and instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to the number of objects. PanopticRecon++ achieves competitive 3D and 2D segmentation and reconstruction quality on both simulated and real-world datasets, and demonstrates a use case as a robot simulator.
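To illustrate the cross-attention perspective described above, the sketch below computes a soft instance assignment for scene points: feature affinity between the scene's embedding field (keys) and learnable instance queries, plus a Gaussian spatial prior centered at each query. All names (`instance_attention`, the array shapes, the unnormalized log-density term) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def instance_attention(points, feats, q_means, q_covs, q_embeds):
    """Soft instance assignment for 3D points via cross-attention (illustrative).

    points:   (N, 3) 3D locations sampled in the scene.
    feats:    (N, D) per-point embeddings from the scene's 3D embedding field (keys).
    q_means:  (K, 3) centers of the learnable 3D Gaussian instance queries.
    q_covs:   (K, 3, 3) covariances encoding each query's spatial extent.
    q_embeds: (K, D) learnable per-instance query embeddings.
    Returns an (N, K) attention map: per-point instance probabilities.
    """
    # Feature affinity between keys and instance queries.
    logits = feats @ q_embeds.T                          # (N, K)
    # 3D spatial prior: log of an (unnormalized) Gaussian density per query,
    # so far-away queries are suppressed even with similar embeddings.
    for k in range(q_means.shape[0]):
        diff = points - q_means[k]                       # (N, 3)
        inv = np.linalg.inv(q_covs[k])
        logits[:, k] += -0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)
    # Softmax over instances yields the cross-attention map.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

With identical embeddings, each point attends most strongly to its nearest Gaussian query, which is exactly the spatial proximity the formulation is meant to preserve.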


Motivation

End-to-end open-vocabulary panoptic reconstruction with 2D foundation models faces three challenges:

1) Misalignment: 2D instance IDs are not aligned across frames.
2) Ambiguity: Due to the limited FoV, two objects that never co-occur in any image may belong either to the same instance or to different ones.
3) Inconsistency: Semantic and instance segmentations predicted by two independent networks can contradict each other.

We align 2D instance IDs across frames by linear assignment over instance tokens, resolve the ambiguity of 3D instances by incorporating a spatial prior, and output consistent semantic and instance masks through a parameter-free panoptic head, producing a geometric mesh with panoptic masks that supports multi-branch novel-view synthesis.
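As a minimal sketch of the ID-alignment step, the code below matches per-frame 2D instance masks from the foundation model to masks rendered from the 3D instance queries, using the Hungarian algorithm on an IoU cost. The function name and interface are assumptions for illustration; only the use of optimal linear assignment follows the method description.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_instance_ids(rendered_masks, vlm_masks):
    """Match a frame's 2D VLM instance masks to rendered instance-query masks.

    rendered_masks: (Q, H, W) boolean masks rendered from the 3D instance queries.
    vlm_masks:      (M, H, W) boolean masks from the 2D foundation model.
    Returns a dict mapping each VLM mask index to a frame-consistent query ID.
    """
    Q, M = len(rendered_masks), len(vlm_masks)
    iou = np.zeros((Q, M))
    for q in range(Q):
        for m in range(M):
            inter = np.logical_and(rendered_masks[q], vlm_masks[m]).sum()
            union = np.logical_or(rendered_masks[q], vlm_masks[m]).sum()
            iou[q, m] = inter / union if union > 0 else 0.0
    # Maximize total IoU = minimize negative IoU (optimal linear assignment).
    rows, cols = linear_sum_assignment(-iou)
    return {int(m): int(q) for q, m in zip(rows, cols)}
```

Because the query IDs are global to the scene, matching every frame's 2D masks against the same rendered queries gives consistent instance IDs across views without any pairwise frame tracking.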

Method

Comparison

Panoptic / Semantic Mesh

PVLFF suffers from over-segmentation and weak semantic segmentation accuracy, whereas our method segments object-level instances and achieves high panoptic/semantic segmentation accuracy and reconstruction quality.

Panoptic Lifting relies solely on 2D VLM mask observations without any 3D spatial prior, so it suffers from the FoV limitation: some tables and chairs in different rooms are assigned the same ID. Our method solves this problem by introducing a 3D instance spatial prior.

Panoptic / Semantic Rendering

---

Robotics Simulator

A Jackal UGV navigates in Gazebo using the meshes generated by PanopticRecon++ trained on ScanNet++.

More Results

BibTeX

@misc{yu2025leveragecrossattentionendtoendopenvocabulary,
      title={Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction}, 
      author={Xuan Yu and Yuxuan Xie and Yili Liu and Haojian Lu and Rong Xiong and Yiyi Liao and Yue Wang},
      year={2025},
      eprint={2501.01119},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.01119}, 
}