ChannelTok: Efficient Flexible-Length Vision Tokenization

Sukriti Paul¹  Arpit Bansal¹  Tom Goldstein¹

1University of Maryland, College Park

2.92 rFID (SOTA)
8.6× faster decoding
159M parameters
2.1× smaller than FlexTok

Abstract

TL;DR We treat each latent channel as a visual token — a simple shift that yields a lightweight, fast flexible-length tokenizer achieving SOTA perceptual quality (rFID 2.92) with 8.6× faster decoding and 2.1× fewer parameters, while naturally enabling variable-length autoregressive generation.

Leading flexible vision tokenizers achieve state-of-the-art quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN–Transformer hybrid backbone. A stochastic tail-dropping scheme during training naturally forces channels to organize by semantic importance.

This allows for flexible compression at inference by simply retaining the first k channels, and naturally enables variable-length autoregressive image generation. We validate on ImageNet, demonstrating consistent quality across diverse token budgets. Our model achieves state-of-the-art perceptual quality (rFID 2.92) while being 8.6× faster in decoding and 2.1× smaller (159M params) than the next-best alternative — establishing channel-wise tokenization as a powerful and practical paradigm for efficient visual representation.

Approach

ChannelTok architecture overview

The encoder compresses an input image into a latent z ∈ ℝC×h×w. During training, we stochastically retain only the first k channels (teal), stop gradients through inactive channels (gray), and independently quantize each active channel with Binary Spherical Quantization (BSQ). At inference, varying k gives flexible compression without retraining: 32 channels for coarse structure, 512 for full fidelity. This yields a quality–bitrate knob at inference with no additional cost.
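The tail-dropping and per-channel quantization steps above can be sketched in a few lines. This is a minimal numpy illustration under our own assumptions, not the actual implementation: `bsq_quantize` follows the standard BSQ recipe (L2-normalize, then snap each coordinate to ±1/√d), and `tail_drop` simply zeroes the dropped channels rather than detaching gradients as a real training loop would.

```python
import numpy as np

rng = np.random.default_rng(0)

def bsq_quantize(v):
    """Binary Spherical Quantization of one channel vector:
    project onto the unit sphere, then snap each coordinate
    to the nearest binary codeword ±1/sqrt(d)."""
    d = v.size
    u = v / (np.linalg.norm(v) + 1e-8)
    return np.sign(u) / np.sqrt(d)

def tail_drop(z, k):
    """Keep and quantize the first k channels of z (C, h, w);
    zero the tail. In training, gradients through the dropped
    tail would also be stopped (e.g. via detach); here we mask."""
    out = np.zeros_like(z)
    for c in range(k):
        out[c] = bsq_quantize(z[c].ravel()).reshape(z[c].shape)
    return out

C, h, w = 8, 4, 4                    # toy sizes; the paper uses up to C = 512
z = rng.standard_normal((C, h, w))   # stand-in for the encoder output
k = rng.choice([2, 4, 8])            # stochastic budget sampled per step
zq = tail_drop(z, k)
```

Because every quantized channel lands on the unit sphere with coordinates ±1/√(h·w), the active prefix carries a fixed number of bits per channel, and the retained-k distribution during training is what induces the coarse-to-fine ordering.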

Results

Quality-efficiency Pareto comparison
  • Best rFID at the smallest size. 3.7 rFID at 256 tokens with only 159M params, dropping to 2.92 rFID at 512 tokens.
  • 8.6× faster than FlexTok. Lightweight CNN–Transformer backbone with no multi-step generative decoder.
  • 7.9× speedup at 32 tokens. LlamaGen retrained on ChannelTok tokens generates coherent images at any token budget, enabling fast autoregressive image generation.
  • Coarse-to-fine by design. Induced semantic ordering.

Autoregressive Image Generation

Autoregressive image generation across token budgets
Coarse-to-fine generation at 32 to 256 tokens across ImageNet-100 categories. Each row shows the same image at increasing token budgets: global structure emerges early, fine detail fills in as more channels are generated.

Because channels are ordered by importance, image generation is naturally progressive, one channel at a time, coarse to fine. Stop early for a quick sketch, or run to completion for full detail. This opens the door to interactive, anytime-stop autoregressive image generation. At just 32 tokens, generation runs 7.9× faster than at the full budget, completing a full image in 0.48 s.
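The anytime-stop loop is simple enough to sketch. The snippet below is a toy illustration, not our training or sampling code: `sample_channel` and `decode` are hypothetical stand-ins for the autoregressive model and the tokenizer decoder, and the point is only that the same loop serves any budget, with the decode happening whenever you choose to stop.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_anytime(sample_channel, decode, budget, max_channels=256):
    """Sample one channel token at a time, conditioned on the prefix,
    and decode at whatever budget the caller picks. Because early
    channels carry global structure, a small budget already yields
    a coherent (if coarse) image."""
    prefix = []
    for _ in range(min(budget, max_channels)):
        prefix.append(sample_channel(prefix))
    return decode(prefix)

# Toy stand-ins: each "channel" is a 4x4 map; decoding just sums the prefix.
sample_channel = lambda prefix: rng.standard_normal((4, 4))
decode = lambda prefix: np.sum(prefix, axis=0)

sketch = generate_anytime(sample_channel, decode, budget=32)   # quick sketch
full   = generate_anytime(sample_channel, decode, budget=256)  # full detail
```

Since the model is trained on randomly truncated prefixes, no special handling is needed at the stopping point; the prefix at any k is a valid latent by construction.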

* AR generation uses a tokenizer trained on ImageNet-100k with a maximum token budget of 256, a separate setting from the main reconstruction results.

Induced Ordering

Prefix masking during training induces a coarse-to-fine ordering across channels, capturing global structure in early channels followed by progressively finer-grained detail. Without this masking, coherent reconstructions only appear at high token budgets.

Semantic Transfer

Swapping channels between two images at inference reveals the semantic hierarchy encoded by ChannelTok. Up to 64 tokens, global structure is preserved with only subtle stylistic transfer. Beyond 64 tokens, the foreground object itself begins to metamorphose, reflecting the finer semantic content carried by later channels.
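The swap itself is just prefix grafting on the latents. Below is a minimal numpy sketch under our assumptions (the function name and toy shapes are illustrative): replace the first k channels of one image's latent with another's and decode the result.

```python
import numpy as np

rng = np.random.default_rng(2)

def swap_prefix(z_a, z_b, k):
    """Graft the first k channels of image A's latent onto image B's.
    Small k transfers global structure with subtle style changes;
    larger k carries over finer, object-level content."""
    mixed = z_b.copy()
    mixed[:k] = z_a[:k]
    return mixed

C, h, w = 128, 16, 16
z_a = rng.standard_normal((C, h, w))   # stand-in latents for two images
z_b = rng.standard_normal((C, h, w))
mixed = swap_prefix(z_a, z_b, k=64)    # k = 64: the crossover point above
```

Varying k in this experiment is what exposes the semantic hierarchy: the decoded `mixed` latent inherits A's coarse content up to channel k and B's detail beyond it.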

BibTeX

If you find our work useful, please cite:

@misc{paul2026channeltok,
  title     = {ChannelTok: Efficient Flexible-Length Vision Tokenization},
  author    = {Paul, Sukriti and Bansal, Arpit and Goldstein, Tom},
  howpublished = {arXiv preprint},
  year      = {2026},
}