🎤 AnyTalker

Let your characters interact naturally

⚠️ Important Video Duration Limits

  • Fast Mode: Videos are limited to 4 seconds. Audio inputs longer than 4 seconds are automatically trimmed to 4 seconds.
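
The trimming rule above can be sketched in a few lines. This is only an illustration, assuming audio is held as a plain list of samples at a known sample rate; the function name and the constant are hypothetical and are not taken from the AnyTalker code.

```python
from typing import List

# The 4-second limit comes from the Fast Mode description above.
FAST_MODE_MAX_SECONDS = 4.0


def trim_for_fast_mode(samples: List[float], sample_rate: int) -> List[float]:
    """Return at most the first 4 seconds of audio (hypothetical helper)."""
    max_samples = int(FAST_MODE_MAX_SECONDS * sample_rate)
    # Slicing leaves clips shorter than the limit unchanged.
    return samples[:max_samples]


# A 6-second clip at 16 kHz is cut down to 4 seconds' worth of samples.
six_seconds = [0.0] * (6 * 16000)
trimmed = trim_for_fast_mode(six_seconds, 16000)
```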

Audio Mode Description:

  • pad: Select this if every audio input track has already been zero-padded to a common length.
  • concat: Select this if you want the script to chain each speaker's clips together and then zero-pad the non-speaker segments to reach a uniform length.
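
To make the difference between the two modes concrete, here is a minimal sketch under stated assumptions. It assumes that in concat mode the speakers talk one after another in input order, so each track is silent while the others speak; it also assumes pad mode right-pads to the longest track, which changes nothing for inputs that are already padded. Both function names are hypothetical, and the app's actual processing may differ.

```python
from typing import List

Track = List[float]


def pad_mode(tracks: List[Track]) -> List[Track]:
    """Right-pad each track with zeros up to the longest track (assumed behavior)."""
    target = max(len(t) for t in tracks)
    return [t + [0.0] * (target - len(t)) for t in tracks]


def concat_mode(clips: List[Track]) -> List[Track]:
    """Place the clips back to back; each track is zero where other speakers talk."""
    total = sum(len(c) for c in clips)
    tracks = []
    offset = 0  # where the current speaker's clip starts on the shared timeline
    for clip in clips:
        silence_before = [0.0] * offset
        silence_after = [0.0] * (total - offset - len(clip))
        tracks.append(silence_before + clip + silence_after)
        offset += len(clip)
    return tracks


# Two speakers: the first talks for 2 samples, the second for 1 sample.
print(concat_mode([[1.0, 1.0], [2.0]]))  # [[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
```

In both cases every output track has the same length, which is the property the two mode descriptions share.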

Audio Binding Order:

  • Audio inputs are bound to persons based on their positions in the input image, from left to right.
  • Person 1 corresponds to the leftmost person, Person 2 to the middle person (if any), and Person 3 to the rightmost person (if any).
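
The left-to-right binding can be illustrated as follows. The page does not say how persons are located in the image, so this sketch assumes each detected person is summarized by a horizontal center coordinate in pixels; the function name and inputs are hypothetical.

```python
from typing import Dict, List


def bind_audio_to_persons(person_centers_x: List[float],
                          audio_tracks: List[str]) -> Dict[int, str]:
    """Map each detected person (by index) to an audio track, left to right."""
    if len(person_centers_x) != len(audio_tracks):
        raise ValueError("need exactly one audio track per person")
    # Person indices ordered by horizontal position, leftmost first.
    left_to_right = sorted(range(len(person_centers_x)),
                           key=lambda i: person_centers_x[i])
    # The first audio input goes to the leftmost person, and so on.
    return {person: audio_tracks[rank]
            for rank, person in enumerate(left_to_right)}


# Detection order is arbitrary: person 1 is leftmost, then person 2, then person 0.
binding = bind_audio_to_persons([300.0, 50.0, 170.0],
                                ["audio_1.wav", "audio_2.wav", "audio_3.wav"])
print(binding)  # {1: 'audio_1.wav', 2: 'audio_2.wav', 0: 'audio_3.wav'}
```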

Generation Modes:

  • Fast Mode (120s GPU budget, suitable for all users): Fixed 8 denoising steps for quick generation. Maximum video duration: 4 seconds.
  • Quality Mode (Dynamic GPU budget): Custom denoising steps (adjustable via "Diffusion steps" slider, default: 25 steps). GPU duration is dynamically calculated as: 60s (preprocessing) + video_seconds × steps × 3 s.

Note: Fast mode has a fixed 120s GPU budget. Quality mode allocates GPU time dynamically, based on video length and the number of denoising steps. Multi-person videos generally need a longer GPU duration and more usage quota to reach good quality.
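
The GPU budget rules above reduce to simple arithmetic. The sketch below restates them as a function so the cost of a Quality Mode run can be estimated in advance; the function name and the mode strings are hypothetical, while the constants come from the descriptions above.

```python
def gpu_budget_seconds(mode: str, video_seconds: float, steps: int = 25) -> float:
    """Estimate the GPU budget in seconds for one generation (hypothetical helper)."""
    if mode == "fast":
        # Fast Mode: fixed budget, regardless of the inputs.
        return 120.0
    if mode == "quality":
        # Quality Mode: 60s preprocessing + video_seconds x steps x 3s.
        return 60.0 + video_seconds * steps * 3.0
    raise ValueError(f"unknown mode: {mode!r}")


# A 4-second video at the default 25 steps: 60 + 4 * 25 * 3 = 360 seconds.
print(gpu_budget_seconds("quality", 4))  # 360.0
```

Because the step count multiplies the video length, halving the diffusion steps roughly halves the variable part of the budget.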
