Skyeyes Teaser
Skyeyes is a framework that transforms aerial imagery into photorealistic street view sequences using 3D Gaussian Splatting, diffusion models, and a constrained optimization strategy to improve cross-view synthesis quality.

Abstract

Integrating aerial imagery-based scene generation into applications like autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that can generate photorealistic sequences of ground view images using only aerial view inputs, thereby creating a ground roaming experience. More specifically, we combine a 3D representation with a view-consistent generation model; a view consistency module ensures coherence between the generated images. This method allows for the creation of geometrically consistent ground view images, even across large view gaps. The images maintain improved spatial-temporal coherence and realism, enhancing scene comprehension and visualization from aerial perspectives. To the best of our knowledge, there are no publicly available datasets that contain pairwise geo-aligned aerial and ground view imagery. Therefore, we build a large, synthetic, geo-aligned dataset using Unreal Engine. Both qualitative and quantitative analyses on this synthetic dataset show superior results compared to other leading synthesis approaches.
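As a rough sketch of the flow the abstract describes, the staging below mirrors the three stages in Python; every function is a hypothetical placeholder for illustration, not a released Skyeyes API.

# Minimal staging sketch of the Skyeyes flow described in the abstract.
# Every function body is a hypothetical placeholder, not the released
# Skyeyes API; only the three-stage ordering follows the paper's text.

def render_ground_prior(sugar_scene, pose):
    """Stand-in: render a coarse ground-view prior from the aerial-fitted SuGaR scene."""
    raise NotImplementedError

def apply_appearance_control(prior):
    """Stand-in: lift a coarse prior toward a photorealistic street image."""
    raise NotImplementedError

def denoise_with_view_consistency(priors):
    """Stand-in: diffusion denoising with spatial-temporal view consistency."""
    raise NotImplementedError

def generate_ground_sequence(sugar_scene, camera_path):
    """Aerial-fitted scene + ground camera path -> consistent ground view sequence."""
    priors = [render_ground_prior(sugar_scene, pose) for pose in camera_path]
    priors = [apply_appearance_control(p) for p in priors]
    return denoise_with_view_consistency(priors)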

Visual Results

Paired aerial and ground view results on the MatrixCity and Carla datasets: each panel shows an aerial input alongside the generated ground view sequence.

Qualitative Comparison

Qualitative comparison on the MatrixCity and Carla datasets: aerial input, followed by SuGaR, GVG, our results, and ground truth (GT).

Quantitative Comparisons

MatrixCity Dataset

Method FID ↓ PSNR ↑ SSIM ↑ LPIPS ↓ KVD ↓ FVD ↓
MVS 359.15 27.79 0.30 0.63 377.20 2846.69
NeRF 317.09 27.94 0.28 0.68 382.57 2390.31
3DGS 245.24 28.13 0.42 0.62 340.62 1926.74
SuGaR 260.51 28.13 0.38 0.60 204.20 1157.64
ControlNet 63.47 28.08 0.25 0.57 281.89 1205.81
Instruct-P2P 100.47 28.04 0.25 0.58 428.88 1742.12
GVG 29.62 28.29 0.33 0.47 141.33 715.97
Ours 54.73 32.22 0.45 0.47 117.93 528.65
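For reference, here is a minimal sketch of how the per-image metrics above (PSNR, SSIM, LPIPS) are commonly computed, using scikit-image and the lpips package; this is illustrative tooling under those assumptions, not the paper's evaluation code.

# Per-image metric sketch: PSNR/SSIM via scikit-image, LPIPS via the
# lpips package. Illustrative only; not the paper's evaluation code.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)

    # LPIPS expects float tensors in [-1, 1] with shape (N, 3, H, W).
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    loss_fn = lpips.LPIPS(net="alex")
    with torch.no_grad():
        lp = loss_fn(to_tensor(pred), to_tensor(gt)).item()

    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}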

Carla Dataset

Method FID ↓ PSNR ↑ SSIM ↑ LPIPS ↓ KVD ↓ FVD ↓
MVS 388.37 27.82 0.40 0.53 562.21 3606.30
NeRF 248.16 27.98 0.51 0.68 618.43 2571.87
3DGS 228.92 28.32 0.59 0.48 573.05 2404.44
SuGaR 202.38 28.13 0.53 0.48 679.40 2498.16
ControlNet 75.26 27.97 0.58 0.50 277.89 1056.69
Instruct-P2P 202.12 27.80 0.38 0.65 707.08 3327.93
GVG 45.73 28.29 0.53 0.47 266.46 913.07
Ours 57.95 33.37 0.69 0.44 218.29 693.28
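The distribution-level scores (FID, and the kernel/Fréchet video metrics KVD/FVD) compare feature statistics of generated and real sets rather than aligned image pairs. A minimal sketch of FID and KID over image batches using torchmetrics follows; the video metrics FVD/KVD are the analogues that swap Inception image features for I3D video features (not shown). This is illustrative tooling, not the paper's evaluation code.

# Distribution-level metric sketch: FID and KID over image sets using
# torchmetrics. Inputs are uint8 tensors of shape (N, 3, H, W).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)

# Random stand-ins for real and generated image batches.
real_batch = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

for imgs, is_real in [(real_batch, True), (fake_batch, False)]:
    fid.update(imgs, real=is_real)
    kid.update(imgs, real=is_real)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())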

Long Video Generation

Long ground view sequences generated from aerial inputs: each panel shows the aerial input alongside the generated long ground view video.

Pipeline

Skyeyes Pipeline
(a) Overview of the Skyeyes pipeline: Our approach begins with SuGaR, which is trained on aerial images and camera poses to render ground view priors. We then train an appearance control module that turns these priors into photorealistic street images. (b) Spatial-Temporal Self-Attention Module: In the final stage, our view consistency module adds temporal modeling to ensure spatial and temporal coherence across views. Acting as a spatial-temporal self-attention mechanism, it keeps the scene's depiction consistent and continuous across perspectives. At inference time, given a sequence of ground view priors rendered from SuGaR, the view consistency module generates a photorealistic, temporally consistent ground view sequence by denoising from pure Gaussian noise.
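As a rough illustration of the spatial-temporal self-attention in (b), here is a minimal PyTorch sketch; the tensor layout, dimensions, and module structure are assumptions for exposition, not the paper's implementation.

# Minimal spatial-temporal self-attention sketch in PyTorch.
# Layout and sizes are illustrative assumptions, not the paper's code:
# tokens from all frames of a sequence attend to each other jointly,
# so appearance stays consistent across both space and time.
import torch
import torch.nn as nn

class SpatialTemporalSelfAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: feature maps of shape (B, T, C, H, W) for a T-frame sequence."""
        b, t, c, h, w = x.shape
        # Flatten time and space into one token axis: (B, T*H*W, C), so
        # attention links every location to every location in every frame.
        tokens = x.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        # Pre-norm residual, then restore the (B, T, C, H, W) layout.
        out = tokens + attn_out
        return out.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)

# Usage: an 8-frame sequence of 32x32 feature maps with 64 channels.
feats = torch.randn(2, 8, 64, 32, 32)
module = SpatialTemporalSelfAttention(channels=64)
print(module(feats).shape)  # torch.Size([2, 8, 64, 32, 32])

Attending over all T*H*W tokens jointly is the most direct reading of "spatial-temporal self-attention"; factorized variants (spatial attention per frame followed by temporal attention per location) are a common cheaper alternative.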

BibTeX


@misc{gao2024skyeyesgroundroamingusing,
    title={Skyeyes: Ground Roaming using Aerial View Images},
    author={Zhiyuan Gao and Wenbin Teng and Gonglin Chen and Jinsen Wu and Ningli Xu and Rongjun Qin and Andrew Feng and Yajie Zhao},
    year={2024},
    eprint={2409.16685},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2409.16685},
}