DOVE consists of a VQGAN encoder-decoder, a transformer-based dynamic token generator, and a non-autoregressive token decoder. The model first extracts visual features using the VQGAN encoder, combines them with timestep embeddings, and feeds them into the token generator to produce a variable-length sequence of continuous visual tokens. By predicting an EOS token, the model learns to truncate redundant tokens automatically. During training, we jointly optimize image reconstruction quality and EOS prediction, encouraging shorter token sequences while maintaining reconstruction quality.
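To make the pipeline concrete, below is a minimal PyTorch sketch of the flow described above. All names (DOVESketch, eos_head, dove_loss), layer sizes, and the loss weight lam are illustrative assumptions rather than the released implementation; the small convolutional stacks stand in for the VQGAN encoder-decoder, and the sketch assumes 256x256 inputs that map to a 16x16 grid of 256 tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DOVESketch(nn.Module):
    """Sketch of DOVE: encoder features + timestep embedding -> transformer token
    generator with per-position EOS prediction -> non-autoregressive token decoder
    -> pixel reconstruction. Module names and sizes are assumptions."""
    def __init__(self, d_model=256, max_tokens=256, n_timesteps=1000):
        super().__init__()
        # Stand-ins for the VQGAN encoder/decoder (plain conv stacks for brevity).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, d_model, 4, stride=4), nn.GELU(),
            nn.Conv2d(d_model, d_model, 4, stride=4))
        self.pixel_decoder = nn.Sequential(
            nn.ConvTranspose2d(d_model, d_model, 4, stride=4), nn.GELU(),
            nn.ConvTranspose2d(d_model, 3, 4, stride=4))
        self.timestep_embed = nn.Embedding(n_timesteps, d_model)
        self.token_generator = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.token_decoder = nn.TransformerEncoder(  # non-autoregressive decoder
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.eos_head = nn.Linear(d_model, 1)  # per-position EOS (stop) logit
        self.max_tokens = max_tokens

    def forward(self, image, t):
        f = self.encoder(image)                              # (B, C, 16, 16) feature grid
        B, C, H, W = f.shape
        h = f.flatten(2).transpose(1, 2)                     # (B, H*W, C) token sequence
        h = h + self.timestep_embed(t).unsqueeze(1)          # inject timestep embedding
        tokens = self.token_generator(h)[:, : self.max_tokens]
        eos_logits = self.eos_head(tokens).squeeze(-1)       # (B, T)
        # Keep tokens up to the first predicted EOS; everything after is masked out.
        keep = (eos_logits.sigmoid() < 0.5).float().cumprod(dim=1)
        tokens = tokens * keep.unsqueeze(-1)
        dec = self.token_decoder(tokens)                     # (B, T, C)
        grid = dec.transpose(1, 2).reshape(B, C, H, W)       # back to a spatial grid
        return self.pixel_decoder(grid), eos_logits

def dove_loss(recon, image, eos_logits, eos_target, lam=0.1):
    # Joint objective from the text: reconstruction quality plus EOS prediction,
    # encouraging shorter sequences while keeping reconstructions faithful.
    rec = F.mse_loss(recon, image)
    eos = F.binary_cross_entropy_with_logits(eos_logits, eos_target.float())
    return rec + lam * eos
```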
We introduce Q-DOVE, an extension of DOVE tailored for text-conditioned vision-and-language tasks. The model reads the user's query and reconstructs the input by focusing on semantically relevant regions, thereby further reducing the length of the generated token sequence.
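One plausible way to realize this query conditioning, continuing the sketch above, is to embed the query tokens and prepend them to the visual features before the token generator, so that EOS prediction can stop earlier on query-irrelevant content. Whether Q-DOVE conditions in exactly this way is an assumption here, and QDOVESketch, query_embed, and vocab_size are hypothetical names.

```python
class QDOVESketch(DOVESketch):
    """Hypothetical query-conditioned variant of the DOVE sketch above."""
    def __init__(self, d_model=256, vocab_size=32000, **kw):
        super().__init__(d_model=d_model, **kw)
        self.query_embed = nn.Embedding(vocab_size, d_model)  # stand-in text encoder

    def forward(self, image, t, query_ids):
        f = self.encoder(image)                               # (B, C, 16, 16)
        B, C, H, W = f.shape
        h = f.flatten(2).transpose(1, 2) + self.timestep_embed(t).unsqueeze(1)
        q = self.query_embed(query_ids)                        # (B, L, C) query tokens
        h = torch.cat([q, h], dim=1)                           # condition on the query
        out = self.token_generator(h)[:, q.size(1):][:, : self.max_tokens]
        eos_logits = self.eos_head(out).squeeze(-1)
        keep = (eos_logits.sigmoid() < 0.5).float().cumprod(dim=1)
        out = out * keep.unsqueeze(-1)                          # drop query-irrelevant tokens
        dec = self.token_decoder(out)
        grid = dec.transpose(1, 2).reshape(B, C, H, W)
        return self.pixel_decoder(grid), eos_logits
```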
We report FID scores of the reconstructed images across varying token lengths. Our results show that as token length increases, the reconstruction quality of our model consistently improves. When using the full token length of 256, our method surpasses VQGAN on both the COCO and WIT datasets.
We also observe an emergent semantic property: compared with fixed-length autoencoder-based tokenization methods, our dynamic reconstruction approach captures richer semantic information, and Q-DOVE enhances this further through query conditioning. On tasks such as linear probing, both DOVE and Q-DOVE significantly outperform other autoencoder-based tokenization methods.
@misc{mao2025imagesworthvariablelength,
  title={Images are Worth Variable Length of Representations},
  author={Lingjun Mao and Rodolfo Corona and Xin Liang and Wenhao Yan and Zineng Tang},
  year={2025},
  eprint={2506.03643},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.03643},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA and LLaVA-Med teams for giving us access to their models, as well as open-source projects including BioMed-CLIP.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are additionally restricted to uses that comply with the license agreements of CLIP, LLaVA, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should not be used outside of research purposes.
The source code of this repository is released under the Apache License 2.0. The model license and dataset license are listed on their corresponding webpages.