Leading flexible vision tokenizers achieve state-of-the-art quality at extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN–Transformer hybrid backbone. A stochastic tail-dropping scheme during training naturally forces channels to organize by semantic importance.
This allows flexible compression at inference by simply retaining the first k channels, and naturally enables variable-length autoregressive image generation. We validate our approach on ImageNet, demonstrating consistent quality across diverse token budgets. Our model achieves state-of-the-art perceptual quality (rFID 2.92) while being 8.6× faster in decoding and 2.1× smaller (159M params) than the next-best alternative, establishing channel-wise tokenization as a powerful and practical paradigm for efficient visual representation.
The encoder compresses an input image into a latent z ∈ ℝC×h×w. During training, we stochastically retain only the first k channels (teal), stop gradients through inactive channels (gray), and independently quantize each active channel with Binary Spherical Quantization (BSQ). At inference, varying k gives flexible compression without retraining: 32 channels for coarse structure, 512 for full fidelity. This gives us a quality–bitrate knob at inference with no additional cost.
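The training-time mechanism above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: `tail_drop` and `bsq_quantize` are hypothetical names, the latent is a toy array, and the straight-through gradient trick used in practice is omitted.

```python
import numpy as np

def bsq_quantize(v, eps=1e-8):
    """Binary Spherical Quantization of one latent channel (flattened h*w):
    project onto the unit sphere, then snap each coordinate to +-1/sqrt(d)."""
    u = v / (np.linalg.norm(v) + eps)   # spherical projection
    d = v.size
    return np.sign(u) / np.sqrt(d)      # nearest binary codeword on the sphere

def tail_drop(z, rng):
    """Stochastic tail-dropping: keep the first k of C channels, zero the rest.
    (During training, gradients through the dropped tail are also stopped.)"""
    C = z.shape[0]
    k = int(rng.integers(1, C + 1))                 # sample an active prefix length
    mask = np.arange(C)[:, None, None] < k          # 1 for active channels, 0 for tail
    return z * mask, k

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 4, 4))                  # toy latent: C=8 channels, 4x4 spatial
z_masked, k = tail_drop(z, rng)
zq = np.stack([bsq_quantize(c) for c in z_masked[:k]])  # quantize active channels only
```

Because every prefix length k is seen during training, the decoder learns to reconstruct from any truncation, which is what makes the inference-time quality–bitrate knob free.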
Because channels are ordered by importance, image generation is naturally progressive, one channel at a time, coarse to fine. Stop early for a quick sketch, or run to completion for full detail. This opens the door to interactive, anytime-stop autoregressive image generation. At just 32 tokens, generation runs 7.9× faster than at the full budget, completing an image in 0.48s.
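The anytime-stop loop is simple to state. Below is a toy sketch, not the actual model: `predict_next` and `decode` are hypothetical stand-ins for the autoregressive prior and the tokenizer's decoder.

```python
import numpy as np

def generate_anytime(predict_next, decode, max_tokens=512, stop_at=None):
    """Channel-wise AR generation with anytime stopping. Channels arrive
    coarse-to-fine, so any prefix decodes to a coherent (if coarse) image."""
    tokens = []
    for t in range(max_tokens):
        tokens.append(predict_next(tokens))   # sample channel t given channels < t
        if stop_at is not None and len(tokens) == stop_at:
            break                             # early stop: quick coarse preview
    return decode(np.stack(tokens))

# Toy stand-ins: a "prior" emitting random 4x4 channel tokens, identity decoder.
rng = np.random.default_rng(1)
preview = generate_anytime(lambda ts: rng.standard_normal((4, 4)),
                           lambda toks: toks, max_tokens=512, stop_at=32)
```

The speedup comes directly from the prefix: a 32-token stop issues 32 AR steps instead of the full budget, with no retraining or re-decoding.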
* AR generation uses a tokenizer trained on ImageNet-100k with a maximum token budget of 256, a separate setting from the main reconstruction results.
Prefix masking during training induces a coarse-to-fine ordering across channels, capturing global structure in early channels followed by progressively finer-grained detail. Without this masking, coherent reconstructions only appear at high token budgets.
Swapping channels between two images at inference reveals the semantic hierarchy encoded by ChannelTok. Up to 64 tokens, global structure is preserved with only subtle stylistic transfer. Beyond 64 tokens, the foreground object itself begins to metamorphose, reflecting the finer semantic content carried by later channels.
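The swap probe itself is just a prefix exchange in latent space. A minimal sketch (hypothetical `swap_prefix` helper; the real probe decodes the swapped latents back to images):

```python
import numpy as np

def swap_prefix(z_a, z_b, k):
    """Exchange the first k (coarse, globally structural) channels between
    two latents; later channels keep each image's finer semantic content."""
    out_a, out_b = z_a.copy(), z_b.copy()
    out_a[:k], out_b[:k] = z_b[:k], z_a[:k]
    return out_a, out_b
```

Sweeping k from small to large then traces the hierarchy: small k transfers only style and layout cues, while k beyond 64 begins to carry over object identity.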
If you find our work useful, please cite:
@misc{paul2026channeltok,
  title        = {ChannelTok: Efficient Flexible-Length Vision Tokenization},
  author       = {Paul, Sukriti and Bansal, Arpit and Goldstein, Tom},
  howpublished = {arXiv preprint},
  year         = {2026},
}