Wenhui Wang
committed on
Commit
·
3f2d420
1
Parent(s):
5212720
add multilingual support notes
README.md
CHANGED
@@ -16,6 +16,8 @@ VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model support
 
 [▶️ Watch demo video](https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc) (Launch your own realtime demo via the websocket example in [Usage](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-realtime-0.5b.md#usage-1-launch-real-time-websocket-demo))
 
+Although the model is primarily built for English, we found that it still exhibits a certain level of multilingual capability—and even performs reasonably well in some languages. We provide nine additional languages (German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish) for users to explore and share feedback.
+
 The model uses an interleaved, windowed design: it incrementally encodes incoming text chunks while, in parallel, continuing diffusion-based acoustic latent generation from prior context. Unlike the full multi-speaker long-form variants, this streaming model removes the semantic tokenizer and relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate (7.5 Hz).
 
 Key features:
@@ -126,4 +128,4 @@ Users are responsible for sourcing their datasets legally. This may include secu
 
 ## Contact
 This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at [email protected].
-If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
+If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.