🎤 AnyTalker
Let your characters interact naturally
⚠️ Important Video Duration Limits
- Fast Mode: Maximum video duration is 4 seconds. Audio inputs longer than 4 seconds are automatically trimmed to 4 seconds.
Audio Mode Description:
- pad: Select this if every audio input track has already been zero-padded to a common length.
- concat: Select this if you want the script to chain each speaker's clips together and then zero-pad the non-speaker segments to reach a uniform length.
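The two modes can be illustrated with a minimal sketch. This is one plausible reading of the descriptions above, not the app's actual code: the function names `check_padded` and `concat_tracks` are hypothetical, audio is modelled as plain lists of samples, and speakers are assumed to take turns in input order.

```python
def check_padded(tracks):
    """'pad' mode assumption: every track already has the same length."""
    return len({len(t) for t in tracks}) <= 1


def concat_tracks(clips):
    """'concat' mode assumption: speakers talk one after another.

    Each output track holds that speaker's clip at its turn in the
    timeline and zeros (silence) everywhere else, so all tracks end
    up with the same total length.
    """
    total = sum(len(c) for c in clips)
    tracks = []
    offset = 0
    for clip in clips:
        silence_before = [0.0] * offset
        silence_after = [0.0] * (total - offset - len(clip))
        tracks.append(silence_before + list(clip) + silence_after)
        offset += len(clip)
    return tracks


# Hypothetical sample data: speaker 1 has two samples, speaker 2 has one.
result = concat_tracks([[0.5, 0.25], [0.75]])
```

Under these assumptions, the output of `concat_tracks` always passes `check_padded`, which is why already aligned inputs belong in pad mode and unaligned, sequential clips belong in concat mode.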
Audio Binding Order:
- Audio inputs are bound to persons based on their positions in the input image, from left to right.
- Person 1 corresponds to the leftmost person, Person 2 to the middle person (if any), and Person 3 to the rightmost person (if any).
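As a rough sketch of the left-to-right binding rule, the snippet below sorts detected persons by the horizontal center of their bounding boxes and pairs them with the audio inputs in order. The function name, the `(x_min, y_min, x_max, y_max)` box format, and the file names are assumptions made for illustration; how the app actually locates persons is not stated here.

```python
def bind_audio_to_persons(person_boxes, audio_inputs):
    """Map each person's box index to an audio input, leftmost person first.

    person_boxes: list of (x_min, y_min, x_max, y_max) tuples (assumed format).
    audio_inputs: audio for Person 1, Person 2, ... in that order.
    """
    def center_x(i):
        x_min, _, x_max, _ = person_boxes[i]
        return (x_min + x_max) / 2

    # Box indices ordered from leftmost to rightmost person.
    left_to_right = sorted(range(len(person_boxes)), key=center_x)
    return {box_index: audio
            for box_index, audio in zip(left_to_right, audio_inputs)}


# Hypothetical detection order: the first box is on the right of the image.
boxes = [(300, 0, 400, 100), (10, 0, 110, 100)]
binding = bind_audio_to_persons(boxes, ["person1.wav", "person2.wav"])
```

Here the box at index 1 is leftmost, so it receives the Person 1 audio regardless of the order in which the boxes were detected.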
Generation Modes:
- Fast Mode (120s GPU budget, suitable for all users): Fixed 8 denoising steps for quick generation. Maximum video duration: 4 seconds.
- Quality Mode (Dynamic GPU budget): Custom denoising steps (adjustable via "Diffusion steps" slider, default: 25 steps). GPU duration is dynamically calculated as: 60s (preprocessing) + video_seconds × steps × 3 s.
Note: Fast Mode has a fixed 120s GPU budget. Quality Mode allocates GPU time dynamically based on video length and denoising steps. Multi-person videos generally need more GPU time, and therefore more Usage Quota, to reach good quality.
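The budget rules above can be worked through with a small calculator. The constants come from the text; the function name `gpu_budget_seconds` and the mode strings are hypothetical and only serve the illustration.

```python
FAST_MODE_BUDGET_S = 120        # fixed budget stated for Fast Mode
PREPROCESSING_S = 60            # preprocessing term in the Quality Mode formula
SECONDS_PER_STEP_SECOND = 3     # 3 s per denoising step per second of video


def gpu_budget_seconds(mode, video_seconds, steps=25):
    """Estimate the GPU budget in seconds for a generation request."""
    if mode == "fast":
        return FAST_MODE_BUDGET_S
    if mode == "quality":
        return PREPROCESSING_S + video_seconds * steps * SECONDS_PER_STEP_SECOND
    raise ValueError(f"unknown mode: {mode!r}")


# A 4-second video at the default 25 steps: 60 + 4 * 25 * 3 = 360 seconds.
quality_budget = gpu_budget_seconds("quality", video_seconds=4)
fast_budget = gpu_budget_seconds("fast", video_seconds=4)
```

For the same 4-second clip, Quality Mode at the default step count asks for three times the Fast Mode budget, so lowering the "Diffusion steps" slider is the most direct way to reduce quota use.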
Examples
| Upload Input Image | Prompt | Number of Persons (determined by audio inputs) | Audio Processing Mode | Audio for Person 1 (Leftmost) | Audio for Person 2 (Middle) |
|---|---|---|---|---|---|