Findings of ACL 2026

Do We Need Distinct Representations for Every Speech Token?

Unveiling and Exploiting Redundancy in Large Speech Language Models

Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han
Beike Inc., Beijing, China


Abstract

Speech Tokens Are Not Equally Necessary

Large Speech Language Models (LSLMs) typically operate at high token rates to ensure acoustic fidelity, yet this produces sequence lengths far beyond the underlying semantic content. We revisit whether every speech token needs a distinct representation. Through layer-wise oracle interventions, we find a structured redundancy hierarchy: shallow layers encode essential acoustic details, while deep layers exhibit extreme redundancy and allow aggressive compression.

Motivated by this finding, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. Applying it at both input and deep layers compresses speech representations without compromising semantic information. Across ASR, speech QA, and speech translation, the method reduces prefilling FLOPs while maintaining competitive task performance and delivering practical memory and latency gains.

1 Introduction

Why Speech Token Redundancy Matters

LSLMs process audio at high token rates to preserve acoustic fidelity, but speech semantics are much sparser than the resulting token stream. This mismatch forces the language backbone to process many redundant tokens and increases inference cost.

The paper first analyzes how redundancy evolves across layers, then uses that analysis to design a training-free compression method grounded in the structure of speech representations.

3 Oracle Interventions

Anatomy of Redundancy

Redundancy grows with depth

Deep layers remain robust under strong token reduction, while shallow layers require higher retention to preserve acoustic detail.

Middle layers are fragile

Intervention sensitivity peaks in transitional layers where representations reorganize from acoustic features toward lexical semantics.

Similarity beats fixed rates

Intrinsic feature similarity captures useful local structure better than rigid, signal-agnostic downsampling.

Figure: Oracle intervention framework, which aligns audio tokens to semantic units and compresses a single layer at a time.
Figure: Layer-wise oracle interventions reveal a depth-dependent redundancy hierarchy in both Qwen2-Audio and Kimi-Audio.
Figure: Layer-wise cosine similarity dynamics explain why deep layers can tolerate aggressive merging.
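The layer-wise similarity analysis above can be probed with a simple statistic: the mean cosine similarity between adjacent audio-token hidden states at each layer. The sketch below is our own illustration, not the authors' code; function names and the choice of "adjacent tokens" as the comparison unit are assumptions.

```python
# Sketch (not the paper's implementation): quantify per-layer redundancy as
# the mean cosine similarity between adjacent audio-token hidden states.
# High values suggest a layer's tokens can be merged with little loss.
import numpy as np

def adjacent_cosine_similarity(hidden: np.ndarray) -> float:
    """hidden: (seq_len, dim) activations of one layer over the audio span."""
    a, b = hidden[:-1], hidden[1:]
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cos = (a * b).sum(axis=-1) / np.maximum(norms, 1e-8)
    return float(cos.mean())

def layerwise_redundancy(all_layers) -> list[float]:
    """all_layers: one (seq_len, dim) array per transformer layer."""
    return [adjacent_cosine_similarity(h) for h in all_layers]
```

Plotting this statistic against layer depth would reproduce the qualitative picture described above: similarity rising toward the deep layers, where merging becomes cheap.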

4 Similarity-Driven Interventions

Affinity Pooling

Affinity Pooling aggregates audio tokens using cosine similarity. A token joins the active group if it matches any of the preceding lookback-window tokens above threshold τ; otherwise, the current group is mean-pooled and a new group begins.
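The grouping rule can be sketched in a few lines. This is a minimal illustration under our own assumptions (the lookback window is taken over the tail of the current group, and names and defaults are ours), not the paper's implementation:

```python
# Minimal sketch of the Affinity Pooling rule described above. A token
# extends the current group when its cosine similarity to any token in the
# lookback window exceeds tau; otherwise the group is mean-pooled and a
# new group starts with the current token.
import numpy as np

def affinity_pooling(tokens: np.ndarray, tau: float = 0.9, lookback: int = 3):
    """tokens: (seq_len, dim) audio-token representations -> (m, dim) merged."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    groups, current = [], [tokens[0]]
    for tok in tokens[1:]:
        # Compare against the most recent tokens of the active group.
        if any(cos(tok, w) > tau for w in current[-lookback:]):
            current.append(tok)
        else:
            groups.append(np.mean(current, axis=0))  # close the group
            current = [tok]
    groups.append(np.mean(current, axis=0))
    return np.stack(groups)
```

Because the rule is purely similarity-driven, a run of near-identical tokens collapses to a single vector, while dissimilar neighbors pass through unmerged; no training or learned parameters are involved.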

The layer-wise Affinity Pooling probe confirms the same redundancy structure: input and deep layers remain robust, while intermediate layers are sensitive and best left uncompressed.

Figure: Layer-wise Affinity Pooling dynamics across thresholds on Qwen2-Audio and Kimi-Audio.
1. Scan locally: compare each token with nearby history inside a small lookback window.

2. Merge by affinity: keep similar neighboring tokens in one group and mean-pool the group representation.

3. Probe layer by layer: apply Affinity Pooling to one layer at a time to measure where representations are compressible and where they are fragile.

Figure: Visualization of Affinity Pooling token groups across layers; merged groups expand from fragmented acoustic-level chunks to broad semantic abstractions in deeper layers.

5 Efficient LSLMs

Efficiency Without Sacrificing Semantics

- 27.48% prefilling FLOPs reduction with aggressive Dual Affinity Pooling.
- 14.91% final audio token retention ratio under the aggressive DAP setting.
- 1.70x dynamic memory saving on 40-60 s utterances in H200 deployment.
- Zero extra training required by the proposed similarity-based compression.

Across ASR, speech QA, and speech translation, Dual Affinity Pooling remains comparable to the vanilla Qwen2-Audio baseline while carrying far fewer audio tokens through inference. Under the aggressive setting, it keeps only 14.91% of final audio tokens and cuts prefilling FLOPs by 27.48%, yet maintains near-baseline ASR and speech translation quality while slightly improving average QA accuracy.

The deployment results point in the same direction: Affinity Pooling reduces the resources needed for prompt processing, especially on longer utterances. On 40-60 s audio, DAP reaches a 1.70x dynamic-memory saving on an H200 GPU, without extra training or changes to the underlying model.
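Why fewer audio tokens translate into prefill savings can be illustrated with a back-of-envelope model. The sketch below is our own simplification, not the paper's accounting: it approximates prefilling FLOPs as a term linear in sequence length (projections and FFN) plus a quadratic attention term, and the model dimensions are placeholder values.

```python
# Back-of-envelope estimate (our simplification, not the paper's accounting)
# of how shrinking the audio span reduces prefilling FLOPs.

def prefill_flops(n_tokens: int, d_model: int, n_layers: int) -> float:
    linear = 12 * n_layers * n_tokens * d_model ** 2    # QKV/out proj + FFN
    attention = 2 * n_layers * n_tokens ** 2 * d_model  # QK^T and AV matmuls
    return linear + attention

def flops_reduction(text_tokens: int, audio_tokens: int, retention: float,
                    d_model: int = 4096, n_layers: int = 32) -> float:
    """Fraction of prefill FLOPs saved when only `retention` of the audio
    tokens survive merging (text tokens are kept intact)."""
    before = prefill_flops(text_tokens + audio_tokens, d_model, n_layers)
    kept = int(round(audio_tokens * retention))
    after = prefill_flops(text_tokens + kept, d_model, n_layers)
    return 1.0 - after / before
```

This toy model compresses all audio tokens from the input onward; in Dual Affinity Pooling part of the merging happens only at a deep layer, so the true per-layer accounting differs and the reported 27.48% reduction should not be expected to fall out of this formula exactly.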

Citation

BibTeX

@misc{xiang2026needdistinctrepresentationsspeech,
  title = {Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models},
  author = {Bajian Xiang and Tingwei Guo and Xuan Chen and Yang Han},
  year = {2026},
  eprint = {2604.06871},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2604.06871}
}