CAFT: Aligning Forest and Trees in Images and Long Captions

**Figure 2:** Hierarchical vision-language learning. A correct understanding of the whole is established by understanding its constituent parts. We place part-level alignment (*trees*) beneath whole-level alignment (*forest*), so that whole semantics are built from localized part semantics.

Abstract

Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose a hierarchical vision-language learning principle for understanding scenes as part-to-whole compositions: before forming a whole-scene representation, a model should uncover what semantic parts appear where in the image.

To this end, we propose CAFT (Cross-domain Alignment of Forests and Trees), a vision-language model that jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation. Exploiting the organization of long captions, where local descriptions often correspond to scene parts, CAFT employs a fine-to-coarse image encoder and a part-whole text encoder to discover localized part semantics and progressively compose them into a global image-text representation.

Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that CAFT learns fine-grained representations that localize textual semantics in image regions without explicit region-level supervision.

**Figure 1:** Fine-grained image-text retrieval requires composing local evidence into global understanding. (a) Among visually similar bus images, only the correct image contains all entities described in the long caption, including the hatchback car and the stone building. Correct retrieval therefore requires identifying what appears where in the image and using this localized evidence to infer the full scene composition. (b) Under this setting, our model retrieves the correct image while the baseline fails.

Method

CAFT jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation. The vision branch performs fine-to-coarse scene parsing starting from superpixel tokens, while the language branch encodes long captions hierarchically via a Part-text Transformer and a Whole-text Transformer. A hierarchical cross-domain alignment objective ties the two branches together at matched granularities.
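The text above describes the objective only at a high level. As a minimal sketch, assuming a symmetric InfoNCE loss at both levels and a max-over-regions score for the local term (neither choice is confirmed by the description above), the two alignment terms might be combined as follows. The names `info_nce`, `local_similarity`, and `hierarchical_loss`, the batch layout, the temperature, and `local_weight` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] scores image i against caption j; matched pairs lie on the diagonal.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]  # caption-to-image direction
    return 0.5 * (one_direction(sim) + one_direction(cols))

def local_similarity(region_tokens, part_texts):
    """Assumed local score: each part-text picks its best-matching region, then average."""
    best = [max(cosine(r, p) for r in region_tokens) for p in part_texts]
    return sum(best) / len(best)

def hierarchical_loss(batch, local_weight=1.0):
    """batch: list of dicts with keys 'regions', 'image', 'parts', 'whole' (hypothetical layout).

    'regions' are intermediate visual tokens, 'image' is the final visual embedding,
    'parts' are part-text embeddings, and 'whole' is the whole-text embedding.
    """
    global_sim = [[cosine(x["image"], y["whole"]) for y in batch] for x in batch]
    local_sim = [[local_similarity(x["regions"], y["parts"]) for y in batch] for x in batch]
    return info_nce(global_sim) + local_weight * info_nce(local_sim)
```

In this sketch the local term needs no region labels: each part-text selects its own best region, which is one plausible way to obtain localization without explicit region-level supervision.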

**Figure 3:** Overview of CAFT. The model constructs hierarchical representations for both vision and language, aligning them at matched granularities. *Vision Branch:* Starting from superpixel tokens, the model performs fine-to-coarse scene parsing via progressive token grouping interleaved with ViT blocks. *Language Branch:* A Part-text Transformer encodes each part-text independently, followed by a Whole-text Transformer that aggregates them into a holistic embedding. Intermediate visual features align with part-text embeddings (*aligning trees*), while final visual features align with whole-text embeddings to capture global scene semantics (*aligning forest*).
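The grouping mechanism is not specified in the caption. As an illustrative sketch of a single fine-to-coarse step, assuming hard assignment of each fine token to its most similar group centroid followed by mean pooling (a stand-in, not necessarily the model's actual grouping rule), one step might look like this. `group_tokens` and the `centroids` argument are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def group_tokens(tokens, centroids):
    """Assign each fine token to its most similar centroid, then mean-pool each group.

    Returns (coarse_tokens, assignment). A group that receives no token keeps its centroid.
    """
    assignment = [
        max(range(len(centroids)), key=lambda k: cosine(t, centroids[k]))
        for t in tokens
    ]
    coarse = []
    for k, c in enumerate(centroids):
        members = [t for t, a in zip(tokens, assignment) if a == k]
        if members:
            # Column-wise mean over the tokens assigned to group k.
            coarse.append([sum(col) / len(members) for col in zip(*members)])
        else:
            coarse.append(list(c))
    return coarse, assignment

# Four fine tokens (e.g. superpixel features) merged into two coarser tokens.
fine = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
coarse, assignment = group_tokens(fine, centroids=[[1.0, 0.0], [0.0, 1.0]])
# assignment -> [0, 0, 1, 1]
```

Applying such a step repeatedly, with transformer blocks between steps, would yield progressively fewer and coarser tokens. The intermediate tokens are the ones a local alignment term would compare against part-text embeddings.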

Results

Long-Text Image–Text Retrieval

CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks, improving R@1 by up to 5.4% over prior methods, and exhibits strong scaling behavior.

Benchmarks: DCI, DOCCI, ShareGPT4V-1k, ShareGPT4V-10k, Urban-1k, IIW.
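For readers unfamiliar with the metric, R@1 is the fraction of queries whose correct match is ranked first. A minimal sketch of Recall@K follows, assuming one ground-truth candidate per query stored at the same index (individual benchmarks may use a different protocol). The function name `recall_at_k` and the toy scores are illustrative only.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose ground-truth match (same index) ranks in the top k.

    sim[i][j] scores query i against candidate j; higher means more similar.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += (i in ranked[:k])
    return hits / len(sim)

# Toy text-to-image scores: 3 captions (rows) against 3 images (columns).
scores = [
    [0.9, 0.1, 0.0],  # caption 0 ranks image 0 first: hit at k=1
    [0.2, 0.3, 0.8],  # caption 1 ranks image 2 first: miss at k=1, hit at k=2
    [0.1, 0.2, 0.7],  # caption 2 ranks image 2 first: hit at k=1
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

Text-to-image and image-to-text recall are typically reported separately; the image-to-text direction can be obtained by transposing the score matrix.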

Spatially Precise Grounding

BibTeX

@article{woo2026aligning,
  title   = {Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding},
  author  = {Woo, Byeongju and Wang, Zilin and Pak, Byeonghyun and Mo, Sangwoo and Yu, Stella X},
  journal = {arXiv preprint arXiv:2602.02977},
  year    = {2026}
}