Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation

1Zhejiang University of Technology, 2University of New South Wales, 3University of Adelaide, 4Zhejiang University
*Indicates Equal Contribution. Indicates Corresponding Author.

Reference Video (Zoom In)

"Railway for train."

Reference Video (Zoom Out)

"Man stands in his garden."

Reference Video (Orbit Shot)

"A island, on the ocean, sunny day."

Reference Video

"Explorer, walks on the desert."

Reference Video

"Leopard, slowly raises its head."

Reference Video

"A car is driving in a forest."

Abstract

Distilled video generation models offer fast and efficient synthesis but struggle with motion customization when guided by reference videos, especially under training-free settings. Existing training-free methods, originally designed for standard diffusion models, fail to generalize due to the accelerated generative process and large denoising steps in distilled models. To address this, we propose MotionEcho, a novel training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing. Our approach uses high-quality, slow teacher models to guide the inference of fast student models through endpoint prediction and interpolation. To maintain efficiency, we dynamically allocate computation across timesteps according to guidance needs. Extensive experiments across various distilled video generation models and benchmark datasets demonstrate that our method significantly improves motion fidelity and generation quality while preserving high efficiency.
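
As a rough illustration of the dynamic allocation mentioned above, teacher guidance can be gated per timestep by how far the student's current motion is from the reference. The cosine-distance score, the tensor names, and the threshold in this Python sketch are assumptions for illustration, not the paper's exact rule.

import torch
import torch.nn.functional as F

def needs_teacher_guidance(student_motion: torch.Tensor,
                           reference_motion: torch.Tensor,
                           threshold: float = 0.15) -> bool:
    # Gate the expensive teacher call per timestep: only request guidance when
    # the student's motion representation (e.g. pooled temporal attention maps)
    # deviates noticeably from the reference. The score and threshold are
    # illustrative assumptions.
    cos = F.cosine_similarity(student_motion.flatten(),
                              reference_motion.flatten(), dim=0)
    return float(1.0 - cos) > threshold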

Overview of MotionEcho guidance
(a) We visualize motion representations from key temporal attention maps of the denoising U-Net. Our method yields better alignment with the reference, capturing more coherent and consistent motion patterns. (b) Illustration of the test-time distillation process with teacher guidance. Compared with directly combining motion control with the distilled model (gray path), our method aligns generation with the reference motion more effectively (pink path). (c) The student and teacher models perform motion customization via motion loss gradients, with teacher guidance injected through prediction interpolation at sub-interval endpoints.
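
A minimal sketch of the endpoint interpolation described in (c), assuming both models predict noise (epsilon) under a standard diffusion schedule; the blending weight lam and the function name are illustrative rather than the paper's notation.

import torch

def guided_endpoint(x_t: torch.Tensor,
                    eps_student: torch.Tensor,
                    eps_teacher: torch.Tensor,
                    alpha_bar_t: float,
                    lam: float = 0.5) -> torch.Tensor:
    # Blend the student's and teacher's endpoint (clean-latent) predictions at a
    # sub-interval endpoint. x_t is the noisy latent at timestep t, eps_* are
    # noise predictions, and alpha_bar_t is the cumulative schedule coefficient.
    # lam = 0 keeps the fast student prediction; lam = 1 fully trusts the teacher.
    sqrt_ab = alpha_bar_t ** 0.5
    sqrt_one_minus_ab = (1.0 - alpha_bar_t) ** 0.5
    # Standard conversion from a noise prediction to an x0 (endpoint) prediction.
    x0_student = (x_t - sqrt_one_minus_ab * eps_student) / sqrt_ab
    x0_teacher = (x_t - sqrt_one_minus_ab * eps_teacher) / sqrt_ab
    return (1.0 - lam) * x0_student + lam * x0_teacher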

Pipeline

In the MotionEcho pipeline, given a reference video, motion priors are extracted to initialize the student model with a motion-preserving noisy latent. During inference, the teacher (top) and student (bottom) models perform motion customization using motion loss gradients. Teacher guidance is applied via prediction interpolation at sub-interval endpoints. The student then generates the final video in a few steps with high motion fidelity.
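
Putting the pieces together, a compact sketch of this inference loop could look as follows, reusing the two helpers above. The model and scheduler interfaces (student(x, t, prompt), scheduler.timesteps, scheduler.alpha_bar(t), scheduler.step_from_x0(x0, t)), the motion_loss_fn callable, and the initial latent are placeholder assumptions, not the released implementation.

import torch

def motionecho_sample(student, teacher, scheduler, prompt,
                      x_init, ref_motion, motion_loss_fn,
                      guidance_scale=1.0, lam=0.5):
    # Illustrative few-step sampling with test-time teacher guidance.
    x = x_init  # motion-preserving noisy latent inverted from the reference video
    for t in scheduler.timesteps:
        x = x.detach().requires_grad_(True)
        eps_s = student(x, t, prompt)
        # Motion customization: follow the gradient of a motion loss comparing the
        # student's motion representation with the reference motion prior.
        loss, student_motion = motion_loss_fn(x, eps_s, ref_motion)
        grad = torch.autograd.grad(loss, x)[0]
        eps_s = eps_s + guidance_scale * grad
        # Adaptive test-time distillation: call the slow teacher only when needed.
        if needs_teacher_guidance(student_motion, ref_motion):
            eps_t = teacher(x, t, prompt)
            x0 = guided_endpoint(x, eps_s, eps_t, scheduler.alpha_bar(t), lam)
        else:
            x0 = guided_endpoint(x, eps_s, eps_s, scheduler.alpha_bar(t), lam)
        # Jump to the next sub-interval endpoint from the guided x0 prediction.
        x = scheduler.step_from_x0(x0, t)
    return x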


Qualitative Evaluation

We directly apply existing zero-shot motion customization methods (e.g., MotionClone) to a distilled, fast T2V model (such as T2V-Turbo-V2), denoted MotionClone + T2V-Turbo-V2. The results reveal significant shortcomings: the generated videos exhibit temporal inconsistencies, motion artifacts, and misinterpretations of the reference motion. For instance, when transferring a dog's head movement to a fox, the fox's head does not follow the reference motion. In contrast, our proposed MotionEcho achieves unified and accurate transfer of both object and camera motion, delivering higher motion fidelity while maintaining low inference time.

Legend: Green prompt denotes MotionEcho; Blue prompt denotes MotionClone + T2V-Turbo-V2.

Reference Video (Orbit Shot)

"A island, on the ocean, sunny day."

"A island, on the ocean, sunny day."

Reference Video

"A fox sitting in a snowy mountain."

"A fox sitting in a snowy mountain."

Reference Video (Pan Left)

"desert is captured with a pan left camera."

"desert is captured with a pan left camera."

Reference Video

"Monkeys play with coconuts."

"Monkeys play with coconuts."

Reference Video (Tilt Up)

"snowy filed is captured with a tilt up camera."

"snowy filed is captured with a tilt up camera."

Reference Video

"Snowflakes falling in the wind."

"Snowflakes falling in the wind."

Quantitative Evaluation

We compare MotionEcho with baseline methods on two benchmarks corresponding to different base models. Our method (MC+TurboV2, 16 steps) achieves the best performance across text alignment, motion fidelity, and FID, while maintaining competitive temporal consistency within just 13 seconds. Even at 8 steps, it outperforms most baselines across all metrics in 9 seconds, and at 4 steps it maintains a solid FID of 347.91 and strong temporal coherence in only 6 seconds. In contrast, Control-A-Video and MotionDirector show similar or higher inference times but significantly lower scores on key quality metrics, and they require costly training. Additionally, applying our method to other distilled video models (e.g., AnimateDiff-Lightning (AD-L) in Table 2) further verifies its effectiveness, superiority, and flexibility. The bar chart below presents win-rate percentages over 36 samples from a user study, evaluated on four subjective criteria (Text Alignment, Temporal Consistency, Motion Fidelity, and Appearance Appeal), highlighting the perceptual advantage of our approach.

Evaluation Results Table
Human Evaluation Results

MotionEcho for TurboV2 16 steps, 8 steps, 4 steps


MotionEcho for AD-L 8 steps, 4 steps


BibTeX

@misc{rong2025trainingfreemotioncustomizationdistilled,
  title={Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation},
  author={Jintao Rong and Xin Xie and Xinyi Yu and Linlin Ou and Xinyu Zhang and Chunhua Shen and Dong Gong},
  year={2025},
  eprint={2506.19348},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.19348},
}