CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians with Dual Feature Fusion

We propose Compact and Swift Segmenting 3D Gaussians (CoSSegGaussians), a method for compact, 3D-consistent scene segmentation at fast rendering speed using only RGB images as input. Previous NeRF-based segmentation methods rely on time-consuming neural scene optimization. While recent 3D Gaussian Splatting has notably improved speed, existing Gaussian-based segmentation methods struggle to produce compact masks, especially in zero-shot segmentation. This issue probably stems from their straightforward assignment of learnable parameters to each Gaussian, which leaves them without robustness against cross-view inconsistent, machine-generated 2D labels. Our method addresses this problem by employing a Dual Feature Fusion Network as the Gaussians' segmentation field. Specifically, we first optimize 3D Gaussians under RGB supervision. After Gaussian Locating, DINO features extracted from the images are applied to the Gaussians through explicit unprojection and are then combined with spatial features from an efficient point cloud processing network. Feature aggregation fuses the two in a global-to-local strategy to produce compact segmentation features. Experimental results show that our model outperforms baselines on both semantic and panoptic zero-shot segmentation tasks, while consuming less than 10% of the inference time of NeRF-based methods.
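As a rough illustration of what explicit unprojection of 2D features onto 3D Gaussians can look like, the sketch below projects Gaussian centers into a posed view with a pinhole camera and copies the feature at the nearest pixel, then averages over views. It is not the authors' implementation: the function names `unproject_features` and `fuse_views`, the pinhole model, nearest-pixel sampling, and the absence of any occlusion test are all assumptions made for brevity.

```python
import numpy as np

def unproject_features(points, feat_map, K, w2c):
    """Sample a 2D feature map at the projections of 3D points.

    points:   (N, 3) world-space positions (e.g. Gaussian centers).
    feat_map: (H, W, C) per-pixel features for one view.
    K:        (3, 3) pinhole intrinsics at feature-map resolution (assumed).
    w2c:      (4, 4) world-to-camera extrinsics.
    Returns (N, C) features and an (N,) boolean mask of points that landed
    inside the image. Occlusion is NOT handled in this sketch.
    """
    H, W, C = feat_map.shape
    # World -> camera coordinates.
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (w2c @ homo.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6
    # Perspective divide; use a dummy depth for points behind the camera.
    z = np.where(in_front, cam[:, 2], 1.0)
    uv = (K @ cam.T).T[:, :2] / z[:, None]
    # Nearest-pixel lookup (bilinear sampling would be a natural refinement).
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    feats = np.zeros((len(points), C), dtype=np.float64)
    feats[visible] = feat_map[v[visible], u[visible]]
    return feats, visible

def fuse_views(points, views):
    """Average unprojected features over views; `views` holds (feat_map, K, w2c)."""
    total = np.zeros((len(points), views[0][0].shape[2]), dtype=np.float64)
    count = np.zeros(len(points), dtype=np.float64)
    for feat_map, K, w2c in views:
        feats, visible = unproject_features(points, feat_map, K, w2c)
        total += feats
        count += visible
    # Points never seen in any view keep an all-zero feature.
    return total / np.maximum(count, 1.0)[:, None]
```

In a real pipeline the feature maps would come from a DINO backbone and the intrinsics would need rescaling to the feature-map resolution; both are left out here.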

Given only posed RGB images of a 3D scene, our method aims to build an expressive representation that captures the geometry, appearance, and compact segmentation identity of the scene. Our proposed model, CoSSegGaussians, enables compact, 3D-consistent novel-view segmentation while consuming much less rendering time than NeRF-based methods. The figure below provides an overview of CoSSegGaussians' architecture.

(a) shows a qualitative comparison of our method with other methods on the Replica/ScanNet datasets. (b) and (c) present quantitative comparisons on semantic and panoptic segmentation, respectively, averaged over all scenes in each dataset.
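The per-dataset averaging can be sketched as follows. The page does not name the metric, so mean intersection-over-union (mIoU) for the semantic case is an assumption here, and the helpers `mean_iou` and `dataset_score` are hypothetical names used only for illustration.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over the classes that occur in pred or gt for one scene."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

def dataset_score(scenes, num_classes):
    """Average the per-scene score over all (pred, gt) pairs in a dataset."""
    return float(np.mean([mean_iou(p, g, num_classes) for p, g in scenes]))
```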

Use the slider to compare semantic and panoptic segmentation maps rendered by different methods from various viewpoints.

We've visualized the feature field obtained through DINO feature unprojection.
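A common way to display a high-dimensional feature field is to project the features onto their first three principal components and show them as RGB. Whether the page's visualization was produced this way is not stated, so treat the sketch below as one plausible recipe; `pca_to_rgb` is a hypothetical helper and assumes at least three points and at least three feature channels.

```python
import numpy as np

def pca_to_rgb(features):
    """Map (N, C) features to (N, 3) colors in [0, 1] via PCA."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T
    # Rescale each component independently to the [0, 1] range.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / np.maximum(hi - lo, 1e-8)
```

The resulting colors can be assigned to each Gaussian in place of its learned appearance before rendering.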

Language-guided segmentation results are provided as an application of our method, based on the 2D language-guided segmentation method Text2Seg.

Scene manipulation results are provided as another application.
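Once every Gaussian carries a segmentation identity, edits such as moving or deleting an object reduce to masking by that identity. The sketch below shows the idea on Gaussian centers only; the function names and the per-Gaussian `labels` array are assumptions for illustration, not the authors' code.

```python
import numpy as np

def translate_object(centers, labels, target_id, offset):
    """Return a copy of the centers with one object's Gaussians shifted."""
    moved = centers.copy()
    moved[labels == target_id] += np.asarray(offset, dtype=centers.dtype)
    return moved

def remove_object(centers, labels, target_id):
    """Drop every Gaussian whose label equals target_id."""
    keep = labels != target_id
    return centers[keep], labels[keep]
```

A full edit would apply the same mask to the remaining Gaussian attributes (covariance, opacity, color) so that they stay aligned with the centers.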

Related works on 3D scene segmentation include Semantic-NeRF, DM-NeRF, Panoptic-Lifting, and Gaussian Grouping.

Citation

@article{dou2024cosseggaussians,
  title={CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians with Dual Feature Fusion},
  author={Dou, Bin and Zhang, Tianyu and Ma, Yongjia and Wang, Zhaohui and Yuan, Zejian},
  journal={arXiv preprint arXiv:2401.05925},
  year={2024}
}


Semantic & Panoptic segmentation results are presented.

Page sections: Abstract; Method; Comparisons; Animation; Applications; Related Links; Citation.

Animation panels: Scene Segmentation; Segmented 3D Gaussians; DINO Feature Field.

Language-guided Segmentation text prompts: Blue Toy; Potted Plant.

Scene Manipulation panels: Interactive & Detailed Results; Multi-views' Results; Translating of Seat; Removal of Stool.