Can you also make one for the captioner?
https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner
I’d really appreciate it if you could make it.
Additionally, I hope you could also extract the vision transformer.
https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
Hi, have you found the Captioner encoder yet?
As far as I understand, the encoder of the Captioner model is the same as that of the Instruct model. Is there any difference between them?
Hey, I think there’s one thing you’re missing: the Captioner checkpoint went through a post-training full-parameter fine-tuning stage. Even though this fine-tuning was done jointly with the rest of the model rather than on the encoder alone, we can still reasonably treat the Captioner encoder as a stronger, more general-purpose audio representation model.
I’m working on a paper comparing different audio encoders. Would it be possible for you to provide a standalone encoder checkpoint for the Captioner model, or some guidance / code on how to extract it? It would be extremely helpful for my research and would save a lot of time. Many thanks in advance for your help!
Best regards,
mifanbushipeicai
Thank you very much, and looking forward to it!
Got it.
Please wait a moment while I get things ready.
Thanks a lot, really looking forward to it!
I've uploaded my collection here. The inference code is provisional, so some of it may not work.
https://huggingface.co/collections/Atotti/alm-audio-encoders
Thank you so much!!!