Skyeyes: Ground Roaming using Aerial View Images

* Equal Contribution

Abstract

Integrating aerial imagery-based scene generation into applications such as autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that generates photorealistic sequences of ground view images using only aerial view inputs, thereby creating a ground roaming experience. More specifically, we combine a 3D representation with a view consistent generation model, which ensures coherence between generated images. This method allows for the creation of geometrically consistent ground view images, even with large view gaps. The images maintain improved spatial-temporal coherence and realism, enhancing scene comprehension and visualization from aerial perspectives. To the best of our knowledge, there are no publicly available datasets that contain pairwise geo-aligned aerial and ground view imagery. Therefore, we build a large, synthetic, and geo-aligned dataset using Unreal Engine. Both qualitative and quantitative analyses on this synthetic dataset show superior results compared to other leading synthesis approaches.

Visual Results

[Figure: results on the MatrixCity and Carla datasets; each example pairs an Aerial view with the corresponding Ground view.]

Qualitative Comparison

[Figure: comparison on the MatrixCity and Carla datasets. Panels: Aerial, SuGaR, GVG, Ours.]

Quantitative Comparisons

MatrixCity Dataset

Method	FID ↓	PSNR ↑	SSIM ↑	LPIPS ↓	KVD ↓	FVD ↓
MVS	359.15	27.79	0.30	0.63	377.20	2846.69
NeRF	317.09	27.94	0.28	0.68	382.57	2390.31
3DGS	245.24	28.13	0.42	0.62	340.62	1926.74
SuGaR	260.51	28.13	0.38	0.60	204.20	1157.64
ControlNet	63.47	28.08	0.25	0.57	281.89	1205.81
Instruct-P2P	100.47	28.04	0.25	0.58	428.88	1742.12
GVG	29.62	28.29	0.33	0.47	141.33	715.97
Ours	54.73	32.22	0.45	0.47	117.93	528.65

Carla Dataset

Method	FID ↓	PSNR ↑	SSIM ↑	LPIPS ↓	KVD ↓	FVD ↓
MVS	388.37	27.82	0.40	0.53	562.21	3606.30
NeRF	248.16	27.98	0.51	0.68	618.43	2571.87
3DGS	228.92	28.32	0.59	0.48	573.05	2404.44
SuGaR	202.38	28.13	0.53	0.48	679.40	2498.16
ControlNet	75.26	27.97	0.58	0.50	277.89	1056.69
Instruct-P2P	202.12	27.80	0.38	0.65	707.08	3327.93
GVG	45.73	28.29	0.53	0.47	266.46	913.07
Ours	57.95	33.37	0.69	0.44	218.29	693.28
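
For readers unfamiliar with the PSNR column in the tables above, the following is a minimal sketch of how peak signal-to-noise ratio is commonly computed between a reference image and a generated one. It is not the evaluation code behind the reported numbers: the 8-bit value range (`max_value=255.0`), the function name `psnr`, and the random toy images are assumptions made for illustration.

```python
import numpy as np


def psnr(reference: np.ndarray, generated: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of identical shape.

    Assumes pixel values lie in [0, max_value]; 255 corresponds to 8-bit images.
    """
    if reference.shape != generated.shape:
        raise ValueError("images must have the same shape")
    # Cast to float first so that uint8 subtraction cannot wrap around.
    ref = reference.astype(np.float64)
    gen = generated.astype(np.float64)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy example: a random "ground truth" frame and a slightly perturbed copy.
rng = np.random.default_rng(0)
ground_truth = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
noise = rng.integers(-5, 6, size=ground_truth.shape)
rendered = np.clip(ground_truth.astype(np.int64) + noise, 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ground_truth, rendered):.2f} dB")
```

Higher values mean the generated frame is closer to the reference, which is why the column is marked with an upward arrow.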

Long Video Generation

[Figure: long video generation results, shown as paired Aerial and Ground views.]

Pipeline

(a) Overview of the Skyeyes pipeline: Our approach begins with SuGaR, which is trained on aerial images and their camera poses to produce ground view priors. We then train an appearance control module to generate photorealistic street images. (b) Spatial-temporal self-attention module: In the final stage, our view consistency module adds temporal modeling to ensure spatial and temporal coherence across views. This module, akin to a spatial-temporal self-attention mechanism, keeps the scene's depiction consistent and continuous across perspectives. At inference time, given a sequence of ground view priors rendered from SuGaR, the view consistency module generates a photorealistic and temporally consistent ground view sequence by denoising from pure Gaussian noise.
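
The caption describes the view consistency module only as akin to a spatial-temporal self-attention mechanism, without giving its architecture. As a rough illustration of that general idea, the sketch below flattens the frame and spatial axes into one token sequence so that every token can attend to tokens in every frame. The single-head formulation, the function name, the projection matrices `w_q`, `w_k`, `w_v`, and the tensor sizes are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def spatial_temporal_self_attention(features, w_q, w_k, w_v):
    """Single-head self-attention over all frames and spatial positions jointly.

    features: array of shape (frames, tokens, channels), where `tokens` is the
              number of spatial positions per frame (hypothetical layout).
    w_q, w_k, w_v: projection matrices of shape (channels, channels).
    Returns the attended features with shape (frames, tokens, channels) and
    the attention weights with shape (frames * tokens, frames * tokens).
    """
    frames, tokens, channels = features.shape
    # Merge the temporal and spatial axes into a single sequence so that
    # attention is computed across frames as well as within each frame.
    sequence = features.reshape(frames * tokens, channels)
    q = sequence @ w_q
    k = sequence @ w_k
    v = sequence @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    attended = weights @ v
    return attended.reshape(frames, tokens, -1), weights


# Toy example: 4 frames, 16 spatial tokens per frame, 8 channels.
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(4, 16, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, attention = spatial_temporal_self_attention(frame_features, w_q, w_k, w_v)
print(output.shape, attention.shape)
```

Because each row of the attention matrix spans all frames, information from one view can influence every other view, which is the property that encourages consistency across a sequence. In this toy setup the cost grows quadratically with `frames * tokens`; practical video models often factor or window the attention to stay tractable.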

BibTeX

@misc{gao2024skyeyesgroundroamingusing,
  title={Skyeyes: Ground Roaming using Aerial View Images},
  author={Zhiyuan Gao and Wenbin Teng and Gonglin Chen and Jinsen Wu and Ningli Xu and Rongjun Qin and Andrew Feng and Yajie Zhao},
  year={2024},
  eprint={2409.16685},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2409.16685},
}