We illustrate one simple way to construct a motion prompt above: we take the user's mouse drags and place a grid of tracks wherever the mouse is dragged. The result is similar to prior and concurrent work on sparse trajectory control [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11], but thanks to the flexibility of our motion representation, we can also drag multiple times, release the mouse, or pin the background still with static tracks.
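To make this concrete, below is a minimal sketch of the construction, assuming tracks are represented as an array of (x, y) point positions per frame; the function names, grid size, and spacing are illustrative choices, and the model's actual conditioning format is not shown.

```python
import numpy as np

def drag_to_track_grid(mouse_xy, grid_size=5, spacing=10):
    """Turn a recorded mouse drag into a grid of tracks.

    mouse_xy: (T, 2) array of cursor positions, one per frame.
    Returns tracks of shape (T, grid_size**2, 2): a grid of points
    centered on the first cursor position, each translated by the
    cursor's displacement at every frame.
    """
    mouse_xy = np.asarray(mouse_xy, dtype=np.float32)

    # Grid of offsets around the initial cursor position.
    r = (np.arange(grid_size) - grid_size // 2) * spacing
    gx, gy = np.meshgrid(r, r)
    offsets = np.stack([gx.ravel(), gy.ravel()], axis=-1)   # (N, 2)

    # Each grid point follows the cursor's per-frame displacement.
    displacement = mouse_xy - mouse_xy[0]                   # (T, 2)
    start = mouse_xy[0] + offsets                           # (N, 2)
    return start[None] + displacement[:, None]              # (T, N, 2)
```

A drag that ends with the mouse released simply yields tracks whose displacement stops changing, and pinning a region amounts to tracks that repeat their first-frame positions in every frame.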
We also find that some inputs result in emergent phenomena, such as with the smoke, pine tree, sand, hair, or cow below. This is particularly exciting because it shows the potential for video models to be general world models, and the potential for motion to be a way of interacting with and querying these world models. In addition, while the results below are neither real-time nor causal (see above), we believe that they show the promise of future video generation models as they become faster, more efficient, and more powerful.
Beyond single drags, we can also design motion prompts to achieve camera control. We do this by first running a monocular depth estimator [12] to get a point cloud, and then projecting its points onto a user-provided sequence of cameras that defines the desired camera trajectory.
By doing this, we can move the camera in arcs or circles under mouse control, or even produce dolly zooms by changing the camera focal length. Note that we never train our model on posed data; this camera control capability simply falls out of training a model conditioned on tracks.
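The projection step can be sketched as follows, assuming a metric depth map for the first frame (e.g., from a monocular estimator such as UniDepth), known intrinsics K, and a user-provided sequence of world-to-camera matrices; the names and the sampling stride are illustrative, not our exact pipeline.

```python
import numpy as np

def camera_control_tracks(depth, K, cam_to_world_0, world_to_cams, stride=16):
    """Build tracks that induce a desired camera trajectory.

    depth:          (H, W) metric depth for the first frame.
    K:              (3, 3) camera intrinsics.
    cam_to_world_0: (4, 4) pose of the first frame's camera.
    world_to_cams:  list of (4, 4) world-to-camera matrices, one per frame,
                    describing the desired trajectory.
    Returns tracks of shape (T, N, 2).
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H:stride, 0:W:stride]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)], axis=0)  # (3, N)

    # Unproject first-frame pixels into a world-space point cloud.
    rays = np.linalg.inv(K) @ pix                                       # (3, N)
    pts_cam = rays * depth[ys.ravel(), xs.ravel()]                      # (3, N)
    pts_world = cam_to_world_0 @ np.vstack([pts_cam, np.ones(pts_cam.shape[1])])

    # Project the static point cloud into every camera of the trajectory.
    tracks = []
    for w2c in world_to_cams:
        p = K @ (w2c @ pts_world)[:3]                                   # (3, N)
        tracks.append((p[:2] / p[2:]).T)                                # (N, 2)
    return np.stack(tracks)                                             # (T, N, 2)
```

For a dolly zoom, the same procedure can be run with a per-frame intrinsics matrix whose focal length changes along the trajectory.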
We can also reinterpret mouse motions as manipulating a geometric primitive, such as a sphere. By placing these tracks over an object that can be roughly approximated by the primitive, we can get a motion prompt with more fine-grained control over the object than with sparse mouse tracks alone. Again, the results below are neither real-time nor causal (see above).
In the bottom row we show a funny example of what might happen if you don't use a spherical primitive.
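One way such an interaction could be implemented is sketched below: the cursor's per-frame displacement is mapped to a small arcball-style rotation of points sampled on the primitive, and the rotated points are projected back to the image to form the tracks. The sampling scheme, rotation gain, and orthographic projection are all assumptions made for illustration.

```python
import numpy as np

def sphere_drag_tracks(center_xy, radius, mouse_xy, n_points=200, gain=0.01, seed=0):
    """Interpret a mouse drag as rotating a spherical primitive.

    Points are sampled on the camera-facing hemisphere of a sphere placed
    over the object; each frame's cursor displacement is mapped to a small
    rotation, and the rotated points are projected back to the image plane.
    Returns tracks of shape (T, N, 2).
    """
    rng = np.random.default_rng(seed)
    center = np.asarray(center_xy, dtype=np.float32)

    # Sample points on the visible (z < 0) hemisphere.
    v = rng.normal(size=(n_points, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    v[:, 2] = -np.abs(v[:, 2])
    pts = v * radius

    mouse_xy = np.asarray(mouse_xy, dtype=np.float32)
    dxy = np.diff(mouse_xy, axis=0, prepend=mouse_xy[:1])  # per-frame cursor motion

    tracks = []
    for dx, dy in dxy:
        ax, ay = gain * dy, gain * dx  # drag right -> yaw, drag down -> pitch
        Rx = np.array([[1, 0, 0],
                       [0, np.cos(ax), -np.sin(ax)],
                       [0, np.sin(ax),  np.cos(ax)]])
        Ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                       [0, 1, 0],
                       [-np.sin(ay), 0, np.cos(ay)]])
        pts = pts @ (Ry @ Rx).T                            # accumulate the rotation
        tracks.append(pts[:, :2] + center)                 # orthographic projection
    return np.stack(tracks)
```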
We can also create motion prompts that result in simultaneous object and camera motion. To do this, we compose a camera control motion prompt with an object control motion prompt by adding the two tracks together. Technically, this is an approximation but is good enough for camera trajectories that are not too extreme. Below we show examples of back and forth camera motion composed with head turns.
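A minimal sketch of this composition, assuming both prompts are expressed as (T, N, 2) tracks anchored at the same first-frame points:

```python
import numpy as np

def compose_motion_prompts(camera_tracks, object_tracks):
    """Compose camera and object motion by summing their displacements.

    Both inputs are (T, N, 2) tracks that start from the same first-frame
    points. The composition adds each prompt's per-frame displacement to
    the shared starting positions; this is only an approximation, but it
    holds up for camera trajectories that are not too extreme.
    """
    start = camera_tracks[:1]                      # (1, N, 2) shared first frame
    cam_disp = camera_tracks - camera_tracks[:1]   # (T, N, 2)
    obj_disp = object_tracks - object_tracks[:1]   # (T, N, 2)
    return start + cam_disp + obj_disp
```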
Some motions are hard to create. In these cases, we can transfer a desired motion from a source video to a first frame. For example, below we puppeteer a monkey and a skull with a moving face and transfer the spinning of the Earth to a cat and a dog.
Surprisingly, we find that our model can transfer motion even when applying extremely out-of-domain motions to images. For example, we can take the motion of a monkey chewing on a banana and apply it to a bird's-eye photo of trees or to a brick wall. To do this, we find that we need quite dense tracks: we use 1,500, but visualize only a subset below so that the source video remains visible.
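A sketch of the transfer setup, assuming an off-the-shelf point tracker wrapped as a callable; the tracker interface and the grid of query points are illustrative assumptions:

```python
import numpy as np

def transfer_motion(source_video, target_first_frame, tracker, n_tracks=1500):
    """Transfer motion from a source video to a new first frame.

    `tracker` stands in for any off-the-shelf point tracker, exposed here
    as a callable: tracker(video, query_points) -> (T, N, 2). We query a
    dense grid of roughly n_tracks points on the source video's first frame
    and reuse the resulting tracks, unchanged, as the motion prompt for the
    (possibly unrelated) target image.
    """
    T, H, W, _ = source_video.shape
    n_side = int(np.sqrt(n_tracks))
    ys = np.linspace(0, H - 1, n_side)
    xs = np.linspace(0, W - 1, n_side)
    gx, gy = np.meshgrid(xs, ys)
    queries = np.stack([gx.ravel(), gy.ravel()], axis=-1)   # (N, 2)

    tracks = tracker(source_video, queries)                 # (T, N, 2)
    # The motion prompt is simply the target first frame plus these tracks.
    return target_first_frame, tracks
```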
The resulting videos exhibit an interesting effect in which pausing the video on any frame removes the percept of the source video. That is, the monkey can only be perceived while the video is playing, when a Gestalt common-fate effect takes hold. This is related to classic work on motion perception [13], and a similar effect can be seen here.
There is a line of work that enables drag-based editing of images [14] [15] [16] [17] [18] [25]. We can also achieve a similar effect by conditioning our model on these drags. The setup is identical to the "Interacting with an Image" examples above. Similar to the insight from [18] [25], the benefit of using a video model is that we inherit a powerful video prior. Prior works also propose "edit-masks," which we can reproduce by adding static tracks to the conditioning, as shown on the far right.
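A sketch of the edit-mask trick, assuming the drags are already encoded as (T, N, 2) tracks and the mask is a boolean image; the sampling stride is an illustrative choice:

```python
import numpy as np

def add_edit_mask(drag_tracks, mask, stride=16):
    """Approximate an edit mask by pinning everything outside the mask.

    drag_tracks: (T, N, 2) tracks encoding the user's drags.
    mask:        (H, W) boolean array, True where edits are allowed.
    Points sampled outside the mask get static tracks (they stay at their
    first-frame positions), asking the model to leave those regions alone.
    """
    T = drag_tracks.shape[0]
    ys, xs = np.where(~mask[::stride, ::stride])
    frozen = np.stack([xs * stride, ys * stride], axis=-1).astype(np.float32)  # (M, 2)
    static = np.repeat(frozen[None], T, axis=0)                                # (T, M, 2)
    return np.concatenate([drag_tracks, static], axis=1)                       # (T, N+M, 2)
```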
Another application of our model is motion magnification [22] [23] [24]. This task involves taking a video with subtle motions and generating a new video in which those motions have been magnified, so that they are easier to see. We do this by running a tracking algorithm on an input video, smoothing and magnifying the resulting tracks, and then feeding the first frame of the input video and the magnified tracks to our model. Below, we show examples in which subtle breathing motion has been magnified.
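A minimal sketch of the magnification step, assuming (T, N, 2) tracks from a point tracker; the smoothing filter and magnification factor are illustrative choices:

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def magnify_tracks(tracks, magnification=8.0, smooth_window=5):
    """Smooth and magnify tracked motion for motion magnification.

    tracks: (T, N, 2) positions from a point tracker run on the input video.
    The tracks are smoothed over time to suppress tracking jitter, and each
    point's deviation from its first-frame position is then scaled up. The
    magnified tracks, together with the input video's first frame, form the
    motion prompt.
    """
    smoothed = uniform_filter1d(tracks, size=smooth_window, axis=0, mode="nearest")
    deviation = smoothed - smoothed[:1]      # motion relative to the first frame
    return smoothed[:1] + magnification * deviation
```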
Sometimes, motion prompts have surprising and undesired effects. This often happens when the constructed motion prompt underspecifies the desired behavior. For example, with the snow monkey we want to arc the camera so that it ends up above the scene. However, the model attributes part of that motion to the monkey itself, resulting in it looking face-down into the water. In other cases, failures arise when we do not construct the motion prompt carefully enough. In the cow example we inadvertently pin the horns of the cow to the background. And sometimes there is just inherent ambiguity in motion, as with the van Gogh example.
Failures can also illuminate limitations of the underlying video model. In the chess example, we see the model spontaneously conjure a pawn out of thin air, a clearly unphysical behavior. And in the lion example we see that the model apparently does not know how a lion should move. These examples suggest that motion prompts could be used to probe the capabilities and limitations of video models.
Throughout this webpage we use text prompts that describe the image but not the desired motion, so as to factor out the influence of text on the motion as much as possible. We also try to keep the text prompts simple. Here we show the effect of progressively more complex text prompts. Empirically, we find that for the most part the text prompt does not have a significant effect on the motion. Sometimes, however, prompting for a specific result has unintended consequences, as in the pine tree example on the far right.
Track density is a knob that we let users tune themselves. Here we show the effect of track density on video generation for a motion prompt intended to arc the camera above the scene. As can be seen, with low-density tracks the motion is underspecified. With higher-density tracks the motion is determined more precisely, but at the expense of over-constraining the model. Often there is a happy middle ground in which we achieve the desired motion while retaining emergent dynamics from the underlying video prior.
Here we show comparisons between our method, Image Conductor, and DragAnything, with the input motion visualized as a moving red dot. These are exactly the videos that participants in our human study were shown.
All videos shown so far have been cherry-picked from four samples. Below we show the uncurated, randomly sampled videos; this is exactly the set of videos from which we cherry-pick. Note, however, that there is also an implicit bias in how we chose the inputs. For example, we found that our video model performs relatively well on animals, and as a result we ended up focusing more on those kinds of inputs.
Aside: The sand video on the far right is a good example of how our model is not causal. The sand in the upper left corner begins to move before the mouse cursor even gets there. The model anticipates that the mouse will move the sand there because it receives the full motion trajectory while generating the entire video.
This project is built on prior work led by Adam Harley, who (re)introduced point tracking [19] [20], and also work on tracking led by Carl Doersch [21].
There is also a very large body of work on motion-conditioned video generation, a subset of which we cite below [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]. Very recent work by Xiao et al., Trajectory Attention, also proposes a way to condition video generation models on trajectories, focusing on the resulting fine-grained camera control and video-editing capabilities.
[1] Wang et al., "MotionCtrl: A Unified and Flexible Motion Controller for Video Generation", SIGGRAPH 2024.
[2] Yin et al., "DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory", arXiv, 2023.
[3] Chen et al., "Motion-Conditioned Diffusion Model for Controllable Video Synthesis", arXiv, 2023.
[4] Li et al., "Image Conductor: Precision Control for Interactive Video Synthesis", arXiv, 2024.
[5] Wu et al., "DragAnything: Motion Control for Anything using Entity Representation", ECCV 2024.
[6] Niu et al., "MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model", ECCV 2024.
[7] Wang et al., "VideoComposer: Compositional Video Synthesis with Motion Controllability", arXiv, 2023.
[8] Shi et al., "Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling", SIGGRAPH 2024.
[9] Zhang et al., "Tora: Trajectory-oriented Diffusion Transformer for Video Generation", arXiv, 2024.
[10] Zhou et al., "TrackGo: A Flexible and Efficient Method for Controllable Video Generation", arXiv, 2024.
[11] Lei et al., "AnimateAnything: Consistent and Controllable Animation for Video Generation", arXiv, 2024.
[12] Piccinelli et al., "UniDepth: Universal Monocular Metric Depth Estimation", CVPR 2024.
[13] Johansson, "Visual Perception of Biological Motion and a Model for its Analysis", Perception & Psychophysics, 1973.
[14] Pan et al., "Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold", SIGGRAPH 2023.
[15] Shi et al., "DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing", CVPR 2024.
[16] Mou et al., "DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models", ICLR 2024.
[17] Geng et al., "Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators", ICLR 2024.
[18] AlZayer et al., "Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos", arXiv, 2024.
[19] Sand and Teller, "Particle Video: Long-Range Motion Estimation Using Point Trajectories", IJCV 2008.
[20] Harley et al., "Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories", ECCV 2022.
[21] Doersch et al., "TAP-Vid: A Benchmark for Tracking Any Point in a Video", NeurIPS 2022.
[22] Liu et al., "Motion Magnification", SIGGRAPH 2005.
[23] Wu et al., "Eulerian Video Magnification for Revealing Subtle Changes in the World", SIGGRAPH 2012.
[24] Wadhwa et al., "Phase-based Video Motion Processing", SIGGRAPH 2013.
[25] Rotstein et al., "Pathways on the Image Manifold: Image Editing via Video Generation", arXiv, 2024.
@article{geng2024motionprompting,
author = {Geng, Daniel and Herrmann, Charles and Hur, Junhwa and Cole, Forrester and Zhang, Serena and Pfaff, Tobias and Lopez-Guevara, Tatiana and Doersch, Carl and Aytar, Yusuf and Rubinstein, Michael and Sun, Chen and Wang, Oliver and Owens, Andrew and Sun, Deqing},
title = {Motion Prompting: Controlling Video Generation with Motion Trajectories},
journal = {arXiv preprint arXiv:2412.02700},
year = {2024},
}