Images are Worth Variable Length of Representations

1University of California, San Diego, 2University of California, Berkeley, 3University of Washington



Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a Dynamic Output Vision Encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. On several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods while using far fewer tokens, capturing more expressive semantic features than fixed-length encoding. We further extend DOVE with query-conditioned tokenization: by guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction.

Contributions

  1. We propose DOVE, a visual tokenizer that dynamically generates tokens based on image complexity. Unlike previous visual tokenizers, our model supports arbitrary control over the token sequence length in a single parallel forward pass.
  2. We propose a variant of DOVE that grounds token generation on a text query and its corresponding salient visual regions. This query-conditioned model achieves a higher token compression rate (averaging 68%) and demonstrates stronger semantic representation.
  3. We observe a phenomenon of emergent semantics by probing the latent representation. Compared to other autoencoder-based tokenization methods with fixed-length token representations, our model achieves significantly better performance on classification and vision-language QA, and exhibits emergent semantic segmentation properties.

Dynamic Vision Tokenizer

DOVE consists of a VQGAN encoder-decoder, a transformer-based dynamic token generator, and a non-autoregressive token decoder. The model first extracts visual features using the VQGAN encoder, combines them with timestep embeddings, and feeds them into the token generator to produce a variable-length sequence of continuous visual tokens. By predicting an EOS token, the model learns to truncate redundant tokens automatically. During training, we jointly optimize image reconstruction and EOS prediction, encouraging shorter token sequences while preserving reconstruction quality.
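
The PyTorch sketch below is a minimal, simplified rendering of this mechanism, not the released implementation: the module names, layer sizes, the 0.5 EOS threshold, and the treatment of timestep embeddings as simple learned per-position embeddings are all placeholder assumptions, and the VQGAN features are assumed to be pre-extracted.

    # Minimal sketch of dynamic token generation with EOS-based truncation.
    # All names and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn

    class DynamicTokenizerSketch(nn.Module):
        def __init__(self, feat_dim=256, max_tokens=256):
            super().__init__()
            # Timestep embeddings, modeled here simply as learned per-position embeddings.
            self.timestep_embed = nn.Embedding(max_tokens, feat_dim)
            # Stand-in for the transformer-based dynamic token generator.
            self.generator = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
                num_layers=2,
            )
            # Per-position EOS head: probability that the sequence should stop here.
            self.eos_head = nn.Linear(feat_dim, 1)

        def forward(self, vqgan_feats):
            # vqgan_feats: (B, T, feat_dim) features from the VQGAN encoder.
            B, T, _ = vqgan_feats.shape
            pos = torch.arange(T, device=vqgan_feats.device)
            x = vqgan_feats + self.timestep_embed(pos)        # add timestep embeddings
            tokens = self.generator(x)                        # (B, T, feat_dim) continuous tokens
            eos_logits = self.eos_head(tokens).squeeze(-1)    # (B, T)
            # At inference, keep tokens up to the first predicted EOS position
            # (or the full length if no EOS is predicted).
            stop = eos_logits.sigmoid() > 0.5
            keep_len = torch.where(
                stop.any(dim=1),
                stop.float().argmax(dim=1) + 1,
                torch.full((B,), T, device=stop.device),
            )
            return tokens, eos_logits, keep_len

During training, the decoder's reconstruction loss would be combined with a loss on the EOS logits that rewards early stopping, mirroring the joint objective described above.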

Q-DOVE: Query-conditioned Tokenization

We introduce Q-DOVE, an extension of DOVE tailored for text-conditioned vision-and-language tasks. The model reads the user's query and reconstructs the input by focusing on semantically relevant regions, thereby further reducing the length of the generated token sequence.
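
The snippet below sketches one way such query conditioning could be wired up; the text projection, the cross-attention fusion, and all dimensions are assumptions made for illustration, not the actual Q-DOVE architecture.

    # Illustrative sketch of query-conditioned token generation (not the released Q-DOVE code).
    import torch
    import torch.nn as nn

    class QueryConditionedGeneratorSketch(nn.Module):
        def __init__(self, feat_dim=256, text_dim=512, nhead=8):
            super().__init__()
            self.text_proj = nn.Linear(text_dim, feat_dim)  # map query embeddings into the visual space
            self.cross_attn = nn.MultiheadAttention(feat_dim, nhead, batch_first=True)
            self.generator = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead, batch_first=True),
                num_layers=2,
            )

        def forward(self, visual_tokens, query_embeds):
            # visual_tokens: (B, T, feat_dim); query_embeds: (B, L, text_dim),
            # e.g. from a frozen text encoder (assumed, not specified here).
            q = self.text_proj(query_embeds)
            # Let visual tokens attend to the query so query-relevant regions are emphasized.
            attended, _ = self.cross_attn(visual_tokens, q, q)
            return self.generator(visual_tokens + attended)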

Experimental Results

We report FID scores of the reconstructed images across varying token lengths. Our results show that as token length increases, the reconstruction quality of our model consistently improves. When using the full token length of 256, our method surpasses VQGAN on both the COCO and WIT datasets.
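
For reference, the snippet below sketches how reconstruction FID at a given token budget can be computed with torchmetrics; the tokenizer and decoder interfaces are placeholders, and this is not necessarily the exact evaluation pipeline used in the paper.

    # Hedged sketch: FID of reconstructions at a fixed token budget.
    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    def fid_at_budget(tokenizer, decoder, loader, num_tokens, device="cpu"):
        fid = FrechetInceptionDistance(feature=2048, normalize=True).to(device)
        for images, _ in loader:
            images = images.to(device)                    # (B, 3, H, W), float in [0, 1]
            with torch.no_grad():
                tokens, _, _ = tokenizer(images)          # assumed tokenizer interface
                recon = decoder(tokens[:, :num_tokens])   # truncate to the chosen budget
            fid.update(images, real=True)
            fid.update(recon.clamp(0, 1), real=False)
        return fid.compute().item()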

We also observe an emergent semantic phenomenon: compared to fixed-length autoencoder-based tokenization methods, our dynamic reconstruction approach captures richer semantic information. Q-DOVE further enhances this by incorporating query conditioning. On tasks such as linear probing, both DOVE and Q-DOVE significantly outperform other autoencoder-based tokenization methods.
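
The sketch below illustrates the linear-probing protocol: a single linear classifier is trained on frozen, mean-pooled tokens while the tokenizer is never updated. The dataset, dimensions, and the tokenizer's return signature are placeholder assumptions.

    # Hedged sketch of linear probing on frozen variable-length tokens.
    import torch
    import torch.nn as nn

    def linear_probe(tokenizer, loader, feat_dim=256, num_classes=1000, epochs=10, device="cpu"):
        probe = nn.Linear(feat_dim, num_classes).to(device)
        opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
        tokenizer.eval()
        for _ in range(epochs):
            for images, labels in loader:
                with torch.no_grad():
                    # Assumed interface: returns tokens and the kept length per image.
                    tokens, _, keep_len = tokenizer(images.to(device))
                    # Mean-pool only the kept (non-truncated) tokens.
                    mask = (torch.arange(tokens.size(1), device=device)[None, :]
                            < keep_len[:, None]).unsqueeze(-1)
                    feats = (tokens * mask).sum(1) / mask.sum(1).clamp(min=1)
                loss = nn.functional.cross_entropy(probe(feats), labels.to(device))
                opt.zero_grad()
                loss.backward()
                opt.step()
        return probe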

BibTeX


@misc{mao2025imagesworthvariablelength,
      title={Images are Worth Variable Length of Representations}, 
      author={Lingjun Mao and Rodolfo Corona and Xin Liang and Wenhao Yan and Zineng Tang},
      year={2025},
      eprint={2506.03643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.03643}, 
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA and LLaVA-Med teams for giving us access to their models, as well as open-source projects including BioMed-CLIP.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaVA, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.

License

The source code of this repository is released under the Apache License 2.0. The model license and dataset license are listed on their corresponding webpages.