microsoft/VibeVoice-Realtime-0.5B

Wenhui Wang committed 1 day ago

Commit 3f2d420 · 1 Parent(s): 5212720

add multilingual support notes

Files changed (1)

README.md +3 -1

@@ -16,6 +16,8 @@ VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model support
 [▶️ Watch demo video](https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc) (Launch your own realtime demo via the websocket example in [Usage](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-realtime-0.5b.md#usage-1-launch-real-time-websocket-demo))
+Although the model is primarily built for English, we found that it still exhibits a certain level of multilingual capability—and even performs reasonably well in some languages. We provide nine additional languages (German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish) for users to explore and share feedback.
 The model uses an interleaved, windowed design: it incrementally encodes incoming text chunks while, in parallel, continuing diffusion-based acoustic latent generation from prior context. Unlike the full multi-speaker long-form variants, this streaming model removes the semantic tokenizer and relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate (7.5 Hz).
 Key features:
@@ -126,4 +128,4 @@ Users are responsible for sourcing their datasets legally. This may include secu
 ## Contact
 This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at [email protected].
-If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
+If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
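
The README context in this diff says the acoustic tokenizer operates at 7.5 Hz. As a rough illustration of what that frame rate implies, the sketch below computes how many acoustic latent frames cover a given duration of speech and how many audio samples one frame spans. The function names are hypothetical, and the 24 kHz sample rate in the usage lines is an assumption for illustration; neither comes from the commit.

```python
import math
from fractions import Fraction

# 7.5 Hz, the acoustic tokenizer frame rate stated in the README.
FRAME_RATE_HZ = Fraction(15, 2)


def latent_frames(duration_s: float) -> int:
    """Frames needed to cover duration_s seconds of audio, rounded up."""
    return math.ceil(Fraction(duration_s) * FRAME_RATE_HZ)


def samples_per_frame(sample_rate_hz: int) -> Fraction:
    """Audio samples spanned by one latent frame at the given sample rate."""
    return Fraction(sample_rate_hz) / FRAME_RATE_HZ


if __name__ == "__main__":
    # Ten seconds of speech at 7.5 frames per second.
    print(latent_frames(10))
    # Assumed sample rate of 24 kHz (not stated in this diff).
    print(samples_per_frame(24_000))
```

Using `Fraction` keeps the 7.5 Hz rate exact, so the round-up in `latent_frames` is not affected by floating-point error in the rate itself.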