DiscoForcing: A Unified Framework for Real-Time Audio-Driven
Character Control with Diffusion Forcing

ShanghaiTech University
*Equal contribution · Corresponding author

Figure 1. We introduce DiscoForcing, a real-time, audio-responsive character control system. Given online streaming audio input, DiscoForcing causally synthesizes continuous full-body motion in real time. The generated motion supports two deployment settings: (i) interactive avatar control for responsive animation and visualization, and (ii) a physics-based humanoid platform, where the predicted motion is converted into executable humanoid joint commands.

Abstract

We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming in which the model must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly (tempo shifts, drops, or user edits). Prior music-to-motion systems are largely optimized for offline generation with global context and degrade in streaming rollouts, where the conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder, which captures rhythmic structure and phase dynamics, with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler that explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented as an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints, while maintaining real-time throughput.
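As a concrete illustration of training under heterogeneous noise levels, the sketch below shows one diffusion-forcing training step in PyTorch: every frame in the temporal horizon receives its own independently sampled diffusion step, so the model learns to denoise heavily noised future frames while anchored to near-clean history. All names (model, diffusion_forcing_loss), the simplified cosine schedule, and the tensor shapes are illustrative assumptions, not the paper's released interface.

import torch
import torch.nn.functional as F

def diffusion_forcing_loss(model, motion, audio_feat, num_steps=1000):
    # motion: (B, T, D) clean motion frames; audio_feat: (B, T, C) causal
    # music features. Each frame gets its OWN noise level -- the core idea
    # of diffusion forcing, as opposed to one shared level per sequence.
    B, T, D = motion.shape
    k = torch.randint(0, num_steps, (B, T), device=motion.device)
    # Simplified cosine schedule mapping step k to signal level alpha_bar.
    alpha_bar = torch.cos(0.5 * torch.pi * k.float() / num_steps) ** 2
    noise = torch.randn_like(motion)
    noisy = (alpha_bar.sqrt().unsqueeze(-1) * motion
             + (1.0 - alpha_bar).sqrt().unsqueeze(-1) * noise)
    # Causal transformer denoiser conditioned on per-frame noise levels
    # and the streamed music features.
    pred_noise = model(noisy, k, audio_feat)
    return F.mse_loss(pred_noise, noise)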

Overview

Figure 2. An overview of DiscoForcing. DiscoForcing encodes live audio into causal music features (30 Hz) and generates continuous full-body motion via a diffusion-forcing transformer conditioned on the features and a history buffer (30 Hz). The resulting motion is delivered to (i) an online avatar platform for retargeting and interactive Unity visualization, and (ii) a physics-based humanoid platform that performs IK/interpolation and executes whole-body control with low-level PD tracking.
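To make the data flow in Figure 2 concrete, here is a minimal sketch of the 30 Hz streaming loop, assuming hypothetical encode_audio_causal and sample_chunk helpers and illustrative buffer sizes; the actual system interface may differ.

import collections
import torch

FPS = 30
HISTORY_LEN = 60   # assumed: ~2 s of generated motion kept as context
CHUNK = 8          # assumed: motion frames produced per sampler call

@torch.no_grad()
def stream_motion(model, audio_stream, encode_audio_causal, sample_chunk,
                  motion_dim=69):
    # Rolling buffer of recently generated frames (the "history buffer").
    history = collections.deque(maxlen=HISTORY_LEN)
    for audio_frames in audio_stream:              # CHUNK new audio frames
        feat = encode_audio_causal(audio_frames)   # causal music feature, 30 Hz
        ctx = (torch.stack(list(history)).unsqueeze(0) if history
               else torch.zeros(1, 0, motion_dim))
        # Denoise the next CHUNK motion frames given audio + motion history;
        # only past information is used, so the rollout is strictly causal.
        new_frames = sample_chunk(model, feat, ctx)  # (1, CHUNK, motion_dim)
        for frame in new_frames[0]:
            history.append(frame)
            yield frame  # to avatar retargeting or humanoid IK + PD tracking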

Visualization Results


Figure 3. Demonstration of Our Method. In a strictly causal, bounded-latency online streaming rollout, DiscoForcing keeps the character stationary during silent (muted) segments and immediately generates beat-synchronized full-body dance once the music resumes. As the input stream undergoes multiple music transitions, our model adapts in real time to the changing audio while maintaining long-horizon temporal coherence and smooth motion continuity.
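One way to realize the responsiveness/consistency trade-off seen in this rollout is a classifier-free-guidance-style form of history guidance: combine a history-conditioned noise prediction with a history-free one under a guidance weight. The sketch below illustrates a single denoising step under that assumption; the paper's exact history-guided sampler is not specified here and may differ.

import torch

def history_guided_eps(model, noisy, k, audio_feat, history, w=1.5):
    # noisy: (B, T, D) partially denoised future frames; k: (B, T) per-frame
    # noise levels; history: recently generated clean motion (or None).
    eps_hist = model(noisy, k, audio_feat, history=history)  # consistency branch
    eps_free = model(noisy, k, audio_feat, history=None)     # responsiveness branch
    # w > 1 pushes toward the history-consistent prediction; w near 0
    # lets the motion re-anchor quickly after an abrupt audio change.
    return eps_free + w * (eps_hist - eps_free)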