Seeing the Wind from a Falling Leaf

Zhiyuan Gao^*,¶, Jiageng Mao^*,¶, Hong-Xing Yu^$, Haozhe Lou^¶, Emily Yue-ting Jia^¶, Jernej Barbič^¶, Jiajun Wu^$, Yue Wang^¶

^* Equal Contribution
^¶ University of Southern California ^$ Stanford University

NeurIPS 2025


Physics-Based Video Generation

[Videos: original clips shown alongside generated variants. Panel labels: Original, New Object Insertion, 2x Mass, 1.5x External Force, 2x External Force.]

Physics-Based Motion Editing

[Videos: two examples, each pairing the original video with a version in which the boundary condition is modified.]

Force Field Visualization

[Videos: seven examples, each pairing the original video with a visualization of its force field.]

Abstract

A longstanding goal in computer vision is to model motions from videos, yet the representations behind those motions, i.e., the invisible physical interactions that cause objects to deform and move, remain largely unexplored. In this paper, we study how to recover the invisible forces from visual observations, e.g., estimating the wind field by observing a leaf falling to the ground. Our key innovation is an end-to-end differentiable inverse graphics framework, which jointly models object geometry, physical properties, and interactions directly from videos. Through backpropagation, our approach enables the recovery of force representations from object motions. We validate our method on both synthetic and real-world scenarios, and the results demonstrate its ability to infer plausible force fields from videos. Furthermore, we show potential applications of our approach, including physics-based video generation and editing. We hope our approach sheds light on understanding and modeling the physical processes behind pixels, bridging the gap between vision and physics.
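To make the idea of recovering a force from observed motion concrete, here is a deliberately tiny sketch, not the method of the paper. It assumes a 1D point mass under gravity and one unknown constant force, and it replaces automatic differentiation with a hand-written forward sensitivity (the derivative of position with respect to the force, carried through the rollout). The names `simulate` and `recover_force`, the step size, and the learning rate are all illustrative assumptions.

```python
def simulate(force, mass=1.0, dt=0.01, steps=100, gravity=-9.8):
    """Roll out a 1D point mass with semi-implicit Euler.

    Returns the positions and, for each step, d(position)/d(force),
    which plays the role that backpropagation plays in a real
    differentiable simulator.
    """
    x, v = 0.0, 0.0
    dx_df, dv_df = 0.0, 0.0  # sensitivities of x and v to the force
    positions, sensitivities = [], []
    for _ in range(steps):
        v += (force / mass + gravity) * dt
        dv_df += dt / mass
        x += v * dt
        dx_df += dv_df * dt
        positions.append(x)
        sensitivities.append(dx_df)
    return positions, sensitivities


def recover_force(observed, lr=5.0, iters=200):
    """Fit the unknown force by gradient descent on a mean squared
    position error between simulated and observed trajectories."""
    force = 0.0  # initial guess (assumption)
    for _ in range(iters):
        positions, sens = simulate(force)
        grad = sum(
            2.0 * (p - o) * s for p, o, s in zip(positions, observed, sens)
        ) / len(observed)
        force -= lr * grad
    return force


if __name__ == "__main__":
    observed, _ = simulate(3.0)  # synthetic "video" of the motion
    print(recover_force(observed))
```

The loss is quadratic in the force for this toy system, so plain gradient descent converges quickly. A spatially varying field acting on deformable objects, as studied here, is a much harder version of the same loop.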

We propose a differentiable inverse graphics framework to recover invisible forces from videos by integrating object modeling, physics simulation, and optimization:

Objects are represented with 3D Gaussians and assigned physical properties via Vision-Language Models.
Forces are modeled as a causal tri-plane, capturing underlying dynamics.
Object motions are simulated using a differentiable physics simulator.
A sparse tracking objective and pulsed-force optimization enable differentiable force recovery from videos.
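As a rough illustration of what a tri-plane force representation can look like, the sketch below stores a 3-vector at every cell of three axis-aligned 2D grids (XY, XZ, YZ) and evaluates the force at a 3D point by bilinearly interpolating each plane and summing the results. This is a generic tri-plane lookup under simplifying assumptions: it does not model the causal (temporal) aspect mentioned above, real tri-planes usually store features that a small network decodes, and the names `bilinear`, `query_force`, and `constant_plane` are hypothetical.

```python
def bilinear(plane, u, v):
    """Bilinearly interpolate a square grid of 3-vectors at (u, v) in [0, 1]^2.

    `plane[i][j]` is a 3-element sequence; the grid needs at least 2 cells per side.
    """
    res = len(plane)
    fu, fv = u * (res - 1), v * (res - 1)
    # Clamp the base cell so that u == 1.0 or v == 1.0 stays in range.
    i0, j0 = min(int(fu), res - 2), min(int(fv), res - 2)
    a, b = fu - i0, fv - j0
    return [
        (1 - a) * (1 - b) * plane[i0][j0][c]
        + a * (1 - b) * plane[i0 + 1][j0][c]
        + (1 - a) * b * plane[i0][j0 + 1][c]
        + a * b * plane[i0 + 1][j0 + 1][c]
        for c in range(3)
    ]


def query_force(planes, x, y, z):
    """Sum the contributions of the XY, XZ and YZ planes at point (x, y, z)."""
    xy, xz, yz = planes
    parts = (bilinear(xy, x, y), bilinear(xz, x, z), bilinear(yz, y, z))
    return [sum(p[c] for p in parts) for c in range(3)]


def constant_plane(res, value):
    """Helper for the example: a res x res grid filled with one 3-vector."""
    return [[list(value) for _ in range(res)] for _ in range(res)]


if __name__ == "__main__":
    planes = (
        constant_plane(4, (1.0, 0.0, 0.0)),
        constant_plane(4, (0.0, 2.0, 0.0)),
        constant_plane(4, (0.0, 0.0, 3.0)),
    )
    print(query_force(planes, 0.3, 0.7, 1.0))
```

Because the lookup is a weighted sum of grid values, the queried force is linear in the plane entries, so gradients of a motion-based loss can flow back to the grid. That differentiability is the property the pipeline above relies on.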