Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
Abstract
The Dr. SHAP-AV framework uses Shapley values to analyze modality contributions in audio-visual speech recognition, revealing how models balance acoustic and visual information under varying noise conditions.
Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these two modalities remains unclear. We present Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in AVSR. In experiments on six models across two benchmarks and varying SNR levels, we apply three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
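For readers unfamiliar with Shapley attribution, the sketch below illustrates the core computation under stated assumptions: treating the audio and visual streams as two "players", exact Shapley values reduce to an average of marginal contributions, where the value function scores the model with any absent modality replaced by a neutral input (e.g., silence or blank frames). The function names (`shapley_values`, `toy_value`) and the toy scores are illustrative assumptions, not the paper's released implementation.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence

def shapley_values(players: Sequence[str],
                   value_fn: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values for a small player set, e.g. {'audio', 'visual'}."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Standard Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                # Marginal contribution of player p to coalition S
                phi[p] += weight * (value_fn(s | {p}) - value_fn(s))
    return phi

# Hypothetical utility: a model score (e.g. negative WER or reference log-likelihood)
# obtained when only the modalities in the coalition are presented; missing
# modalities are replaced by neutral inputs (silence / blank frames).
def toy_value(coalition: FrozenSet[str]) -> float:
    scores = {frozenset(): 0.0,
              frozenset({"audio"}): 0.70,
              frozenset({"visual"}): 0.35,
              frozenset({"audio", "visual"}): 0.90}
    return scores[coalition]

if __name__ == "__main__":
    contrib = shapley_values(["audio", "visual"], toy_value)
    print(contrib)  # {'audio': 0.625, 'visual': 0.275}
```

With two players the loop reduces to the closed form φ_audio = ½[v({a}) − v(∅)] + ½[v({a,v}) − v({v})]; repeating this per decoding step or per input frame is one plausible way to obtain the generative and temporal variants described in the abstract.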
Community
A Shapley-based framework revealing how audio-visual speech recognition models balance what they hear and what they see across architectures, decoding stages, and acoustic conditions.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition (2026)
- Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement (2026)
- Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition (2026)
- Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition (2026)
- Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder (2026)
- OCR-Enhanced Multimodal ASR Can Read While Listening (2026)
- CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction (2026)