REALM: VLM-Powered Open-World 3D Reasoning Segmentation and Editing via Gaussian Splatting

Submitted to AAAI 2026
Anonymous Authors

👑 Abstract

3D open-vocabulary understanding is a fundamental task in computer vision, as humans naturally describe the 3D world using language. Although prior works have attempted to bridge the gap between natural language and 3D representations, the task remains highly challenging due to the implicit and ambiguous nature of human instructions. Recent advances in large vision-language models (LVLMs) have demonstrated impressive semantic reasoning over 2D images. In this paper, we introduce REALM, an LVLM-powered framework for open-world 3D reasoning segmentation and editing, focusing on one of the most critical challenges in 3D understanding: 3D reasoning segmentation. Built upon 3D Gaussian Splatting (3DGS), REALM lifts image-level reasoning into 3D space by combining an LVLM-based Visual Segmenter (LVSeg) with a Global-to-Local Grouping (GLGroup) strategy. LVSeg leverages the semantic priors of LVLMs to generate reasoning-aware 2D masks, which GLGroup then propagates and refines in 3D. This design enables REALM to reason about spatial relationships, resolve ambiguous descriptions, and understand contextual cues across multi-view inputs. Furthermore, REALM supports a range of 3D interaction tasks, including object removal, replacement, and style transfer. Extensive experiments show that REALM achieves state-of-the-art performance in interpreting implicit instructions on our manually annotated LERF and 3D-OVS benchmarks.
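To make the pipeline concrete, below is a minimal, hypothetical Python sketch of how reasoning-aware 2D masks could be lifted onto 3D Gaussians via multi-view voting followed by a simple local refinement. The function names, camera interface, thresholds, and the centroid-based refinement are illustrative assumptions, not the actual LVSeg/GLGroup implementation.

```python
# Hypothetical sketch: lifting multi-view 2D reasoning masks onto 3D Gaussians.
# All names (lift_masks_to_gaussians, the camera callables, thresholds) are
# illustrative assumptions; they do not reproduce REALM's LVSeg/GLGroup code.
import numpy as np

def lift_masks_to_gaussians(gaussian_xyz, cameras, masks, vote_thresh=0.6, radius=0.1):
    """Assign each Gaussian a binary label from per-view reasoning masks.

    gaussian_xyz: (N, 3) array of Gaussian centers.
    cameras:      list of callables mapping (N, 3) points to (N, 2) pixel coords.
    masks:        list of (H, W) boolean masks, one per view.
    """
    votes = np.zeros(len(gaussian_xyz))
    hits = np.zeros(len(gaussian_xyz))
    for cam, mask in zip(cameras, masks):
        uv = cam(gaussian_xyz)                      # project Gaussian centers into this view
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        valid = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
        hits[valid] += 1
        votes[valid] += mask[v[valid], u[valid]]    # global step: accumulate mask votes across views
    score = np.divide(votes, hits, out=np.zeros_like(votes), where=hits > 0)
    selected = score > vote_thresh
    if selected.any():
        # Local step: discard Gaussians far from the selected cluster's centroid.
        center = gaussian_xyz[selected].mean(axis=0)
        selected &= np.linalg.norm(gaussian_xyz - center, axis=1) < radius
    return selected
```

In this sketch, the global stage aggregates 2D mask evidence across views and the local stage prunes spatial outliers, mirroring the global-to-local intuition described above under these simplified assumptions.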

📺 Video


BibTeX