Redundancy grows with depth
Deep layers remain robust under strong token reduction, while shallow layers require higher retention to preserve acoustic detail.
Findings of ACL 2026
Unveiling and Exploiting Redundancy in Large Speech Language Models
Abstract
Large Speech Language Models (LSLMs) typically operate at high token rates to ensure acoustic fidelity, yet this yields sequence lengths far exceeding what the underlying semantic content requires. We revisit whether every speech token needs a distinct representation. Through layer-wise oracle interventions, we find a structured redundancy hierarchy: shallow layers encode essential acoustic details, while deep layers exhibit extreme redundancy and allow aggressive compression.
Motivated by this finding, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. Applying it at both input and deep layers compresses speech representations without compromising semantic information. Across ASR, speech QA, and speech translation, the method reduces prefilling FLOPs while maintaining competitive task performance and delivering practical memory and latency gains.
1 Introduction
LSLMs process audio at high token rates to preserve acoustic fidelity, but speech semantics are much sparser than the resulting token stream. This mismatch forces the language backbone to process many redundant tokens and increases inference cost.
The paper first analyzes how redundancy evolves across layers, then uses that analysis to design a training-free compression method grounded in the structure of speech representations.
3 Oracle Interventions
Under oracle compression, deep layers tolerate aggressive token reduction, whereas shallow layers need high retention to keep acoustic detail.
Intervention sensitivity peaks in transitional layers where representations reorganize from acoustic features toward lexical semantics.
Intrinsic feature similarity captures useful local structure better than rigid, signal-agnostic downsampling.
4 Similarity-Driven Interventions
Affinity Pooling aggregates audio tokens using cosine similarity. A token joins the active group if its similarity to any token in the preceding lookback window exceeds a threshold τ; otherwise, the current group is mean-pooled and a new group begins at that token.
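The grouping rule can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name `affinity_pool` and the defaults for `tau` and `lookback` are assumptions for the example.

```python
import numpy as np

def affinity_pool(tokens: np.ndarray, tau: float = 0.9, lookback: int = 2) -> np.ndarray:
    """Merge runs of similar audio tokens by mean pooling.

    tokens: (T, d) token embeddings. A token joins the active group when
    its cosine similarity to any of the preceding `lookback` tokens
    exceeds `tau`; otherwise the active group is mean-pooled and a new
    group starts at the current token.
    """
    # Normalize rows so a dot product equals cosine similarity.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    pooled, group = [], [0]  # the active group holds token indices
    for t in range(1, len(tokens)):
        window = range(max(0, t - lookback), t)
        if max(float(normed[t] @ normed[g]) for g in window) > tau:
            group.append(t)  # similar enough: extend the active group
        else:
            pooled.append(tokens[group].mean(axis=0))  # close the group
            group = [t]
    pooled.append(tokens[group].mean(axis=0))  # flush the last group
    return np.stack(pooled)
```

With a high τ, only near-duplicate neighboring tokens merge; lowering τ or widening the lookback window compresses more aggressively, matching the retention/fidelity trade-off the analysis describes.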
The layer-wise Affinity Pooling probe confirms the same redundancy structure: input and deep layers remain robust, while intermediate layers are sensitive and best left uncompressed.
Compare each token with nearby history inside a small lookback window.
Keep similar neighboring tokens in one group and mean-pool the group representation.
Apply Affinity Pooling to one layer at a time to measure where representations are compressible and where they are fragile.
5 Efficient LSLMs
Across ASR, speech QA, and speech translation, Dual Affinity Pooling remains comparable to the vanilla Qwen2-Audio baseline while carrying far fewer audio tokens through inference. Under the aggressive setting, it keeps only 14.91% of final audio tokens and cuts prefilling FLOPs by 27.48%, yet maintains near-baseline ASR and speech translation quality while slightly improving average QA accuracy.
The deployment results point in the same direction: Affinity Pooling reduces the resources needed for prompt processing, especially on longer utterances. On 40-60 s audio, DAP reaches a 1.70x dynamic-memory saving on an H200 GPU, without extra training or changes to the underlying model.
Citation
@misc{xiang2026needdistinctrepresentationsspeech,
  title         = {Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models},
  author        = {Bajian Xiang and Tingwei Guo and Xuan Chen and Yang Han},
  year          = {2026},
  eprint        = {2604.06871},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.06871}
}