Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation

1Zhejiang University of Technology, 2University of New South Wales, 3University of Adelaide, 4Zhejiang University
*Indicates Equal Contribution. Indicates Corresponding Author.

Reference Video (Zoom In)

"Railway for train."

Reference Video (Zoom Out)

"Man stands in his garden."

Reference Video (Orbit Shot)

"A island, on the ocean, sunny day."

Reference Video

"Explorer, walks on the desert."

Reference Video

"Leopard, slowly raises its head."

Reference Video

"A car is driving in a forest."

Abstract

Distilled video generation models offer fast and efficient synthesis but struggle with motion customization when guided by reference videos, especially under training-free settings. Existing training-free methods, originally designed for standard diffusion models, fail to generalize due to the accelerated generative process and large denoising steps in distilled models. To address this, we propose MotionEcho, a novel training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing. Our approach uses high-quality, slow teacher models to guide the inference of fast student models through endpoint prediction and interpolation. To maintain efficiency, we dynamically allocate computation across timesteps according to guidance needs. Extensive experiments across various distilled video generation models and benchmark datasets demonstrate that our method significantly improves motion fidelity and generation quality while preserving high efficiency.
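
As a rough illustration of the dynamic allocation mentioned above, teacher guidance can be gated per timestep by how far the student's current motion is from the reference. The cosine-distance score, the tensor names, and the threshold in this Python sketch are assumptions for illustration, not the paper's exact rule.

import torch
import torch.nn.functional as F

def needs_teacher_guidance(student_motion: torch.Tensor,
                           reference_motion: torch.Tensor,
                           threshold: float = 0.15) -> bool:
    # Gate the expensive teacher call per timestep: only request guidance when
    # the student's motion representation (e.g. pooled temporal attention maps)
    # deviates noticeably from the reference. The score and threshold are
    # illustrative assumptions.
    cos = F.cosine_similarity(student_motion.flatten(),
                              reference_motion.flatten(), dim=0)
    return float(1.0 - cos) > threshold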

Overview of MotionEcho guidance
(a) We visualize motion representations from key temporal attention maps of the denoising U-Net. Our method yields better alignment with the reference, capturing more coherent and consistent motion patterns. (b) Illustration of the test-time distillation process with teacher guidance. Compared with directly combining motion control with the distilled model (gray path), our method aligns generation with the reference motion more effectively (pink path). (c) The student and teacher models perform motion customization via motion loss gradients, with teacher guidance injected through prediction interpolation at sub-interval endpoints.
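
A minimal sketch of the endpoint interpolation described in (c), assuming both models predict noise (epsilon) under a standard diffusion schedule; the blending weight lam and the function name are illustrative rather than the paper's notation.

import torch

def guided_endpoint(x_t: torch.Tensor,
                    eps_student: torch.Tensor,
                    eps_teacher: torch.Tensor,
                    alpha_bar_t: float,
                    lam: float = 0.5) -> torch.Tensor:
    # Blend the student's and teacher's endpoint (clean-latent) predictions at a
    # sub-interval endpoint. x_t is the noisy latent at timestep t, eps_* are
    # noise predictions, and alpha_bar_t is the cumulative schedule coefficient.
    # lam = 0 keeps the fast student prediction; lam = 1 fully trusts the teacher.
    sqrt_ab = alpha_bar_t ** 0.5
    sqrt_one_minus_ab = (1.0 - alpha_bar_t) ** 0.5
    # Standard conversion from a noise prediction to an x0 (endpoint) prediction.
    x0_student = (x_t - sqrt_one_minus_ab * eps_student) / sqrt_ab
    x0_teacher = (x_t - sqrt_one_minus_ab * eps_teacher) / sqrt_ab
    return (1.0 - lam) * x0_student + lam * x0_teacher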

Pipeline

In the MotionEcho pipeline, given a reference video, motion priors are extracted to initialize the student model with a motion-preserving noisy latent. During inference, the teacher (top) and student (bottom) models perform motion customization using motion loss gradients. Teacher guidance is applied via prediction interpolation at sub-interval endpoints. The student then generates the final video in a few steps with high motion fidelity.
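
Putting the pieces together, a compact sketch of this inference loop could look as follows, reusing the two helpers above. The model and scheduler interfaces (student(x, t, prompt), scheduler.timesteps, scheduler.alpha_bar(t), scheduler.step_from_x0(x0, t)), the motion_loss_fn callable, and the initial latent are placeholder assumptions, not the released implementation.

import torch

def motionecho_sample(student, teacher, scheduler, prompt,
                      x_init, ref_motion, motion_loss_fn,
                      guidance_scale=1.0, lam=0.5):
    # Illustrative few-step sampling with test-time teacher guidance.
    x = x_init  # motion-preserving noisy latent inverted from the reference video
    for t in scheduler.timesteps:
        x = x.detach().requires_grad_(True)
        eps_s = student(x, t, prompt)
        # Motion customization: follow the gradient of a motion loss comparing the
        # student's motion representation with the reference motion prior.
        loss, student_motion = motion_loss_fn(x, eps_s, ref_motion)
        grad = torch.autograd.grad(loss, x)[0]
        eps_s = eps_s + guidance_scale * grad
        # Adaptive test-time distillation: call the slow teacher only when needed.
        if needs_teacher_guidance(student_motion, ref_motion):
            eps_t = teacher(x, t, prompt)
            x0 = guided_endpoint(x, eps_s, eps_t, scheduler.alpha_bar(t), lam)
        else:
            x0 = guided_endpoint(x, eps_s, eps_s, scheduler.alpha_bar(t), lam)
        # Jump to the next sub-interval endpoint from the guided x0 prediction.
        x = scheduler.step_from_x0(x0, t)
    return x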


Qualitative Evaluation

We directly apply existing zero-shot motion customization methods (e.g., MotionClone) to a distilled, fast T2V model (such as T2V-Turbo-V2), denoted MotionClone + T2V-Turbo-V2. The results reveal significant shortcomings: the generated videos exhibit temporal inconsistencies, motion artifacts, and misinterpretations of the reference motion. For instance, when transferring a dog's head movement to a fox, the fox's head does not follow the reference motion. In contrast, our proposed MotionEcho achieves unified and accurate transfer of both object and camera motion, delivering higher motion fidelity while maintaining low inference time.

Legend: Green prompt denotes MotionEcho; Blue prompt denotes MotionClone + T2V-Turbo-V2.

Reference Video (Orbit Shot)

"A island, on the ocean, sunny day."

"A island, on the ocean, sunny day."

Reference Video

"A fox sitting in a snowy mountain."

"A fox sitting in a snowy mountain."

Reference Video (Pan Left)

"desert is captured with a pan left camera."

"desert is captured with a pan left camera."

Reference Video

"Monkeys play with coconuts."

"Monkeys play with coconuts."

Reference Video (Tilt Up)

"snowy filed is captured with a tilt up camera."

"snowy filed is captured with a tilt up camera."

Reference Video

"Snowflakes falling in the wind."

"Snowflakes falling in the wind."

Quantitative Evaluation

We compare MotionEcho with baseline methods on two benchmarks corresponding to different base models. Our method (MC+TurboV2, 16 steps) achieves the best performance across text alignment, motion fidelity, and FID, while maintaining competitive temporal consistency within just 13 seconds. Even at 8 steps, it outperforms most baselines across all metrics in 9 seconds, and at 4 steps it maintains a solid FID of 347.91 and strong temporal coherence in only 6 seconds. In contrast, Control-A-Video and MotionDirector show similar or higher inference times but significantly lower scores on key quality metrics, and they require costly training. Additionally, applying our method to other distilled video models (e.g., AnimateDiff-Lightning (AD-L) in Table 2) further verifies its effectiveness, superiority, and flexibility. The bar chart below presents win-rate percentages over 36 samples from a user study, evaluated on four subjective criteria (Text Alignment, Temporal Consistency, Motion Fidelity, and Appearance Appeal), highlighting the perceptual advantage of our approach.

Evaluation Results Table
Human Evaluation Results

MotionEcho for TurboV2 16 steps, 8 steps, 4 steps


MotionEcho for AD-L 8 steps, 4 steps


BibTeX

@misc{rong2025trainingfreemotioncustomizationdistilled,
  title={Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation},
  author={Jintao Rong and Xin Xie and Xinyi Yu and Linlin Ou and Xinyu Zhang and Chunhua Shen and Dong Gong},
  year={2025},
  eprint={2506.19348},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.19348},
}