We illustrate one simple way to construct a motion prompt above: we take the user's mouse drags and place a grid of tracks wherever the mouse is dragged. The result is similar to prior and concurrent work on sparse trajectory control [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11], but thanks to the flexibility of our motion representation, we can also drag multiple times, release the mouse, or pin the background still with static tracks.
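To make this concrete, below is a minimal sketch of the construction, assuming tracks are represented as an array of (x, y) point positions per frame; the function names, grid size, and spacing are illustrative choices, and the model's actual conditioning format is not shown.

```python
import numpy as np

def drag_to_track_grid(mouse_xy, grid_size=5, spacing=10):
    """Turn a recorded mouse drag into a grid of tracks.

    mouse_xy: (T, 2) array of cursor positions, one per frame.
    Returns tracks of shape (T, grid_size**2, 2): a grid of points
    centered on the first cursor position, each translated by the
    cursor's displacement at every frame.
    """
    mouse_xy = np.asarray(mouse_xy, dtype=np.float32)

    # Grid of offsets around the initial cursor position.
    r = (np.arange(grid_size) - grid_size // 2) * spacing
    gx, gy = np.meshgrid(r, r)
    offsets = np.stack([gx.ravel(), gy.ravel()], axis=-1)   # (N, 2)

    # Each grid point follows the cursor's per-frame displacement.
    displacement = mouse_xy - mouse_xy[0]                   # (T, 2)
    start = mouse_xy[0] + offsets                           # (N, 2)
    return start[None] + displacement[:, None]              # (T, N, 2)
```

A drag that ends with the mouse released simply yields tracks whose displacement stops changing, and pinning a region amounts to tracks that repeat their first-frame positions in every frame.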
We also find that some inputs result in emergent phenomena, such as with the smoke, pine tree, sand, hair, or cow below. This is particularly exciting because it shows the potential for video models to be general world models, and the potential for motion to be a way of interacting with and querying these world models. In addition, while the results below are neither real-time nor causal (see above), we believe that they show the promise of future video generation models as they become faster, more efficient, and more powerful.
Beyond single drags, we can also design motion prompts to achieve camera control. We do this by first running a monocular depth estimator [12] to get a point cloud, and then projecting its points onto a user-provided sequence of cameras that defines the desired camera trajectory.
By doing this, we can move the camera in arcs or circles under mouse control, or even produce dolly zooms by changing the camera focal length. Note that we never train our model on posed data; this camera control capability simply falls out of training a model conditioned on tracks.
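The projection step can be sketched as follows, assuming a metric depth map for the first frame (e.g., from a monocular estimator such as UniDepth), known intrinsics K, and a user-provided sequence of world-to-camera matrices; the names and the sampling stride are illustrative, not our exact pipeline.

```python
import numpy as np

def camera_control_tracks(depth, K, cam_to_world_0, world_to_cams, stride=16):
    """Build tracks that induce a desired camera trajectory.

    depth:          (H, W) metric depth for the first frame.
    K:              (3, 3) camera intrinsics.
    cam_to_world_0: (4, 4) pose of the first frame's camera.
    world_to_cams:  list of (4, 4) world-to-camera matrices, one per frame,
                    describing the desired trajectory.
    Returns tracks of shape (T, N, 2).
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H:stride, 0:W:stride]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)], axis=0)  # (3, N)

    # Unproject first-frame pixels into a world-space point cloud.
    rays = np.linalg.inv(K) @ pix                                       # (3, N)
    pts_cam = rays * depth[ys.ravel(), xs.ravel()]                      # (3, N)
    pts_world = cam_to_world_0 @ np.vstack([pts_cam, np.ones(pts_cam.shape[1])])

    # Project the static point cloud into every camera of the trajectory.
    tracks = []
    for w2c in world_to_cams:
        p = K @ (w2c @ pts_world)[:3]                                   # (3, N)
        tracks.append((p[:2] / p[2:]).T)                                # (N, 2)
    return np.stack(tracks)                                             # (T, N, 2)
```

For a dolly zoom, the same procedure can be run with a per-frame intrinsics matrix whose focal length changes along the trajectory.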
We can also reinterpret mouse motions as manipulating a geometric primitive, such as a sphere. By placing these tracks over an object that can be roughly approximated by the primitive, we can get a motion prompt with more fine-grained control over the object than with sparse mouse tracks alone. Again, the results below are neither real-time nor causal (see above).
In the bottom row we show a funny example of what might happen if you don't use a spherical primitive.
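One way such an interaction could be implemented is sketched below: the cursor's per-frame displacement is mapped to a small arcball-style rotation of points sampled on the primitive, and the rotated points are projected back to the image to form the tracks. The sampling scheme, rotation gain, and orthographic projection are all assumptions made for illustration.

```python
import numpy as np

def sphere_drag_tracks(center_xy, radius, mouse_xy, n_points=200, gain=0.01, seed=0):
    """Interpret a mouse drag as rotating a spherical primitive.

    Points are sampled on the camera-facing hemisphere of a sphere placed
    over the object; each frame's cursor displacement is mapped to a small
    rotation, and the rotated points are projected back to the image plane.
    Returns tracks of shape (T, N, 2).
    """
    rng = np.random.default_rng(seed)
    center = np.asarray(center_xy, dtype=np.float32)

    # Sample points on the visible (z < 0) hemisphere.
    v = rng.normal(size=(n_points, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    v[:, 2] = -np.abs(v[:, 2])
    pts = v * radius

    mouse_xy = np.asarray(mouse_xy, dtype=np.float32)
    dxy = np.diff(mouse_xy, axis=0, prepend=mouse_xy[:1])  # per-frame cursor motion

    tracks = []
    for dx, dy in dxy:
        ax, ay = gain * dy, gain * dx  # drag right -> yaw, drag down -> pitch
        Rx = np.array([[1, 0, 0],
                       [0, np.cos(ax), -np.sin(ax)],
                       [0, np.sin(ax),  np.cos(ax)]])
        Ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                       [0, 1, 0],
                       [-np.sin(ay), 0, np.cos(ay)]])
        pts = pts @ (Ry @ Rx).T                            # accumulate the rotation
        tracks.append(pts[:, :2] + center)                 # orthographic projection
    return np.stack(tracks)
```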
We can also create motion prompts that result in simultaneous object and camera motion. To do this, we compose a camera control motion prompt with an object control motion prompt by adding the two tracks together. Technically, this is an approximation but is good enough for camera trajectories that are not too extreme. Below we show examples of back and forth camera motion composed with head turns.
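A minimal sketch of this composition, assuming both prompts are expressed as (T, N, 2) tracks anchored at the same first-frame points:

```python
import numpy as np

def compose_motion_prompts(camera_tracks, object_tracks):
    """Compose camera and object motion by summing their displacements.

    Both inputs are (T, N, 2) tracks that start from the same first-frame
    points. The composition adds each prompt's per-frame displacement to
    the shared starting positions; this is only an approximation, but it
    holds up for camera trajectories that are not too extreme.
    """
    start = camera_tracks[:1]                      # (1, N, 2) shared first frame
    cam_disp = camera_tracks - camera_tracks[:1]   # (T, N, 2)
    obj_disp = object_tracks - object_tracks[:1]   # (T, N, 2)
    return start + cam_disp + obj_disp
```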
Some motions are hard to create. In these cases, we can transfer a desired motion from a source video to a first frame. For example, below we puppeteer a monkey and a skull with a moving face and transfer the spinning of the Earth to a cat and a dog.
Surprisingly, we find that our model can transfer motion even when applying extremely out-of-domain motions to images. For example, we can take the motion of a monkey chewing on a banana and apply it to a bird's-eye photo of trees or to a brick wall. To do this, we find that we need quite dense tracks: we use 1,500, but visualize only a subset below so that the source video remains visible.
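A sketch of the transfer setup, assuming an off-the-shelf point tracker wrapped as a callable; the tracker interface and the grid of query points are illustrative assumptions:

```python
import numpy as np

def transfer_motion(source_video, target_first_frame, tracker, n_tracks=1500):
    """Transfer motion from a source video to a new first frame.

    `tracker` stands in for any off-the-shelf point tracker, exposed here
    as a callable: tracker(video, query_points) -> (T, N, 2). We query a
    dense grid of roughly n_tracks points on the source video's first frame
    and reuse the resulting tracks, unchanged, as the motion prompt for the
    (possibly unrelated) target image.
    """
    T, H, W, _ = source_video.shape
    n_side = int(np.sqrt(n_tracks))
    ys = np.linspace(0, H - 1, n_side)
    xs = np.linspace(0, W - 1, n_side)
    gx, gy = np.meshgrid(xs, ys)
    queries = np.stack([gx.ravel(), gy.ravel()], axis=-1)   # (N, 2)

    tracks = tracker(source_video, queries)                 # (T, N, 2)
    # The motion prompt is simply the target first frame plus these tracks.
    return target_first_frame, tracks
```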
The resulting videos exhibit an interesting effect in which pausing the video on any frame removes the percept of the source video. That is, the monkey can only be perceived while the video is playing, when a Gestalt common-fate effect takes hold. This is related to classic work on motion perception [13], and a similar effect can be seen here.
There is a line of work that enables drag-based editing of images [14] [15] [16] [17] [18] [25]. We can also achieve a similar effect by conditioning our model on these drags. The setup is identical to the "Interacting with an Image" examples above. Similar to the insight from [18] [25], the benefit of using a video model is that we inherit a powerful video prior. Prior works also propose "edit-masks," which we can reproduce by adding static tracks to the conditioning, as shown on the far right.
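A sketch of the edit-mask trick, assuming the drags are already encoded as (T, N, 2) tracks and the mask is a boolean image; the sampling stride is an illustrative choice:

```python
import numpy as np

def add_edit_mask(drag_tracks, mask, stride=16):
    """Approximate an edit mask by pinning everything outside the mask.

    drag_tracks: (T, N, 2) tracks encoding the user's drags.
    mask:        (H, W) boolean array, True where edits are allowed.
    Points sampled outside the mask get static tracks (they stay at their
    first-frame positions), asking the model to leave those regions alone.
    """
    T = drag_tracks.shape[0]
    ys, xs = np.where(~mask[::stride, ::stride])
    frozen = np.stack([xs * stride, ys * stride], axis=-1).astype(np.float32)  # (M, 2)
    static = np.repeat(frozen[None], T, axis=0)                                # (T, M, 2)
    return np.concatenate([drag_tracks, static], axis=1)                       # (T, N+M, 2)
```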
Another application of our model is motion magnification [22] [23] [24]. This task involves taking a video with subtle motions and generating a new video in which those motions have been magnified, so that they are easier to see. We do this by running a tracking algorithm on an input video, smoothing and magnifying the resulting tracks, and then feeding the first frame of the input video and the magnified tracks to our model. Below, we show examples in which subtle breathing motion has been magnified.
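A minimal sketch of the magnification step, assuming (T, N, 2) tracks from a point tracker; the smoothing filter and magnification factor are illustrative choices:

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def magnify_tracks(tracks, magnification=8.0, smooth_window=5):
    """Smooth and magnify tracked motion for motion magnification.

    tracks: (T, N, 2) positions from a point tracker run on the input video.
    The tracks are smoothed over time to suppress tracking jitter, and each
    point's deviation from its first-frame position is then scaled up. The
    magnified tracks, together with the input video's first frame, form the
    motion prompt.
    """
    smoothed = uniform_filter1d(tracks, size=smooth_window, axis=0, mode="nearest")
    deviation = smoothed - smoothed[:1]      # motion relative to the first frame
    return smoothed[:1] + magnification * deviation
```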
Sometimes, motion prompts have surprising and undesired effects. This often happens when the constructed motion prompt underspecifies the desired behavior. For example, with the snow monkey we want to arc the camera so that it ends up above the scene. However, the model attributes part of that motion to the monkey itself, resulting in it looking face-down into the water. In other cases, failures arise when we do not construct the motion prompt carefully enough. In the cow example we inadvertently pin the horns of the cow to the background. And sometimes there is just inherent ambiguity in motion, as with the van Gogh example.
Failures can also illuminate limitations of the underlying video model. In the chess example, we see the model spontaneously conjure a pawn out of thin air, a clearly unphysical behavior. And in the lion example we see that the model apparently does not know how a lion should move. These examples suggest that motion prompts could be used to probe the capabilities and limitations of video models.
Throughout this webpage we use text prompts that describe the image but not the desired motion, so as to factor out the influence of text on the motion as much as possible. We also try to keep the text prompts simple. Here we show the effect of progressively more complex text prompts. Empirically, we find that for the most part the text prompt does not have a significant effect on the motion. Sometimes, however, prompting for a specific result has unintended consequences, as in the pine tree example on the far right.
Track density is a knob that we let users tune themselves. Here we show the effect of track density on video generation for a motion prompt intended to arc the camera above the scene. As can be seen, with low-density tracks the motion is underspecified. With higher-density tracks the motion is determined more precisely, but at the expense of over-constraining the model. Often there is a happy middle ground in which we achieve the desired motion while retaining emergent dynamics from the underlying video prior.
Here we show comparisons between our method, Image Conductor, and DragAnything, with the input motion visualized as a moving red dot. These are exactly the videos that participants in our human study were shown.
All videos shown so far have been cherry-picked from four samples. Below we show the uncurated, randomly sampled videos; this is exactly the set of videos from which we cherry-pick. Note, however, that there is also an implicit bias in how we chose the inputs. For example, we found that our video model performs relatively well on animals, and as a result we ended up focusing more on those kinds of inputs.
Aside: The sand video on the far right is a good example of how our model is not causal. The sand in the upper left corner begins to move before the mouse cursor even gets there. The model anticipates that the mouse will move the sand there because it receives the full motion trajectory while generating the entire video.
This project is built on prior work led by Adam Harley, who (re)introduced point tracking [19] [20], and also work on tracking led by Carl Doersch [21].
There is also a very large body of work on motion-conditioned video generation, a subset of which we cite below [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]. Very recent work by Xiao et al., Trajectory Attention, also proposes a way to condition video generation models on trajectories, focusing on the resulting fine-grained camera control and video-editing capabilities.
[1] Wang et al., "MotionCtrl: A Unified and Flexible Motion Controller for Video Generation", SIGGRAPH 2024.
[2] Yin et al., "DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory", arXiv, 2023.
[3] Chen et al., "Motion-Conditioned Diffusion Model for Controllable Video Synthesis", arXiv, 2023.
[4] Li et al., "Image Conductor: Precision Control for Interactive Video Synthesis", arXiv, 2024.
[5] Wu et al., "DragAnything: Motion Control for Anything using Entity Representation", ECCV 2024.
[6] Niu et al., "MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model", ECCV 2024.
[7] Wang et al., "VideoComposer: Compositional Video Synthesis with Motion Controllability", arXiv, 2023.
[8] Shi et al., "Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling", SIGGRAPH 2024.
[9] Zhang et al., "Tora: Trajectory-oriented Diffusion Transformer for Video Generation", arXiv, 2024.
[10] Zhou et al., "TrackGo: A Flexible and Efficient Method for Controllable Video Generation", arXiv, 2024.
[11] Lei et al., "AnimateAnything: Consistent and Controllable Animation for Video Generation", arXiv, 2024.
[12] Piccinelli et al., "UniDepth: Universal Monocular Metric Depth Estimation", CVPR 2024.
[13] Johansson, "Visual Perception of Biological Motion and a Model for its Analysis", Perception & Psychophysics, 1973.
[14] Pan et al., "Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold", SIGGRAPH 2023.
[15] Shi et al., "DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing", CVPR 2024.
[16] Mou et al., "DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models", ICLR 2024.
[17] Geng et al., "Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators", ICLR 2024.
[18] AlZayer et al., "Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos", arXiv, 2024.
[19] Sand and Teller, "Particle Video: Long-Range Motion Estimation Using Point Trajectories", IJCV 2008.
[20] Harley et al., "Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories", ECCV 2022.
[21] Doersch et al., "TAP-Vid: A Benchmark for Tracking Any Point in a Video", NeurIPS 2022.
[22] Liu et al., "Motion Magnification", SIGGRAPH 2005.
[23] Wu et al., "Eulerian Video Magnification for Revealing Subtle Changes in the World", SIGGRAPH 2012.
[24] Wadhwa et al., "Phase-based Video Motion Processing", SIGGRAPH 2013.
[25] Rotstein et al., "Pathways on the Image Manifold: Image Editing via Video Generation", arXiv, 2024.
@article{geng2024motionprompting,
author = {Geng, Daniel and Herrmann, Charles and Hur, Junhwa and Cole, Forrester and Zhang, Serena and Pfaff, Tobias and Lopez-Guevara, Tatiana and Doersch, Carl and Aytar, Yusuf and Rubinstein, Michael and Sun, Chen and Wang, Oliver and Owens, Andrew and Sun, Deqing},
title = {Motion Prompting: Controlling Video Generation with Motion Trajectories},
journal = {arXiv preprint arXiv:2412.02700},
year = {2024},
}