MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

Anonymous
Anonymous Institution
A quick look at MagicDrive3D

MagicDrive3D generates highly realistic 3D street scenes with diverse controls.

Abstract

While controllable generative models for images and videos have achieved remarkable success, high-quality and controllable generation for 3D scenes, particularly in unbounded scenarios like autonomous driving, remains underdeveloped due to high data acquisition requirements. In this paper, we introduce MagicDrive3D, a novel framework for controllable 3D street scene generation that integrates video-based view synthesis with 3D representation (3DGS) generation. This approach supports multi-condition control, including road maps, 3D objects, and text descriptions. Unlike previous methods that acquire 3D representation before training generative models, MagicDrive3D begins by training a multi-view video generation model to synthesize diverse street views. This innovative approach leverages routinely collected autonomous driving data (e.g., nuScenes), significantly reducing data acquisition challenges and enhancing the richness of 3D scene generation. In the 3DGS generation step, we introduce Fault-Tolerant Gaussian Splatting to address minor errors in the generated content. We also introduce monocular depth for better initialization prior, and appearance modeling to handle exposure discrepancies across different viewpoints. Experiments demonstrate that MagicDrive3D generates diverse, high-quality 3D driving scenes that support any-view rendering and enhance downstream tasks like BEV segmentation, showcasing its potential for autonomous driving simulation and beyond.

Left: bbox condition (one of our inputs). Right: 3D scene generated by MagicDrive3D.
All 3D scenes are fully generated by MagicDrive3D without any input camera views (NOT reconstruction).
If the videos on the left and right are out of sync, please refresh the page.

Method

Algorithm description of MagicDrive3D

For controllable 3D street scene generation, MagicDrive3D decomposes the task into two steps: ① conditional multi-view video generation, which handles the control signals and produces consistent view priors for the novel scene; and ② Gaussian Splatting generation with our enhanced GS pipeline, which supports rendering from various viewpoints (e.g., panorama).
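The two-step decomposition above can be sketched in code. This is a minimal illustrative sketch, not the paper's actual API: all class and function names (`SceneConditions`, `generate_multiview_video`, `fit_fault_tolerant_gs`) are hypothetical, and the model internals are replaced by stubs to show only the data flow from conditions to video priors to a 3DGS scene.

```python
from dataclasses import dataclass

# Hypothetical condition container; field names are illustrative,
# matching the control signals the paper lists (map, 3D boxes, text).
@dataclass
class SceneConditions:
    road_map: str   # BEV road map identifier
    boxes_3d: list  # 3D bounding boxes for objects
    text: str       # scene description

def generate_multiview_video(cond, num_cameras=6, num_frames=8):
    """Step 1 (stub): a conditional multi-view video model would
    synthesize consistent multi-camera frames of a static scene."""
    return [[f"frame(cam={c}, t={t})" for t in range(num_frames)]
            for c in range(num_cameras)]

def fit_fault_tolerant_gs(frames):
    """Step 2 (stub): fit a Gaussian Splatting representation to the
    generated frames, tolerating minor cross-view inconsistencies
    (monocular-depth initialization, per-view appearance modeling)."""
    num_views = sum(len(cam_frames) for cam_frames in frames)
    return {"representation": "3DGS", "views_used": num_views}

def magicdrive3d(cond):
    # Decomposed pipeline: video priors first, then 3DGS generation.
    frames = generate_multiview_video(cond)
    return fit_fault_tolerant_gs(frames)
```

For example, `magicdrive3d(SceneConditions("map_01", [], "sunny street"))` returns a scene descriptor built from 6 cameras × 8 frames = 48 generated views; in the real pipeline this scene would then support any-view rendering.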

Controllability

MagicDrive3D offers precise control over objects and some road semantics. In addition, text-based control is also supported!

Editing 1
Editing 2
Editing 3

Data Engine

Its controllable street scene generation ability makes MagicDrive3D a powerful data engine. We show how generated scenes can help improve the viewpoint robustness of CVT.

Downstream performance

Ablation Study (Bullet Time!)

The video generation method proposed in MagicDrive3D makes static scene generation possible (left: although the data is collected from dynamic scenes, we can generate static scene videos), which facilitates 3DGS scene generation (right: our improved FTGS pipeline performs much better than the original 3DGS). It is like creating novel bullet-time scenes from a driving dataset.

Videos restart from the beginning when switching.

Left/above: multi-view video of a static scene, generated by the first step of MagicDrive3D.
Right/below: final 3D scene generated by MagicDrive3D (click the button to switch between the Bbox and 3DGS ablations).