Vision transformer Hook for dense predictions

Mei S, Thies M, Xia Y, Sun Y, Wu F, Fan F, Gu M, Ye C, Huang Y, Christlein V, Maier A (2026)

Publication Type: Journal article

Publication year: 2026

Journal: Pattern Recognition

Volume: 179

Article Number: 113818

DOI: 10.1016/j.patcog.2026.113818

Abstract

Pre-trained vision transformers (ViTs) have demonstrated remarkable capability in learning semantically rich image representations. However, their plain architectures yield low-resolution feature maps that lack the fine-grained spatial detail required for dense prediction tasks. To better transfer the learned visual features, we present ViT-Hook, a novel hybrid backbone compatible with plain ViTs that effectively bridges the gap between global semantic understanding and local spatial encodings. Specifically, our method broadens the scope and impact of ViTs from the following perspectives: (1) We propose a simple transformer-decoder-inspired hook module that receives hierarchical CNN features as spatial queries and interacts with expressive ViT features from large-scale pre-training, thereby instantiating general-purpose representations into task-suited ones. (2) ViT-Hook is a plug-and-play solution for powerful vision foundation models such as DINOv2 and RADIO. In this setting, we find that fine-tuning only a few intermediate ViT layers can outperform previous full fine-tuning methods while substantially reducing compute and memory burdens, since most parameters remain frozen. (3) We evaluate ViT-Hook with various pre-trained sources on multiple dense prediction tasks, including semantic segmentation, instance segmentation, and object detection. Notably, tested on the unified UperNet and Mask R-CNN frameworks, ViT-Hook surpasses the state of the art by a large margin, achieving 59.7 (+4.7) mIoU on ADE20K val, 55.0 (+3.6) box AP, and 48.5 (+3.3) mask AP on COCO val2017. Code is available at https://github.com/siyuan-mei/ViT-Hook.
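For orientation, the following is a minimal PyTorch sketch of the mechanism point (1) describes: flattened high-resolution CNN features act as spatial queries that cross-attend to low-resolution ViT tokens in a transformer-decoder-style block. The class name HookBlock, the layer layout, and all dimensions are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
import torch
import torch.nn as nn


class HookBlock(nn.Module):
    """Illustrative hook block: CNN tokens query (frozen) ViT tokens via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, cnn_tokens: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # cnn_tokens: (B, N_cnn, C) flattened high-resolution CNN features (queries)
        # vit_tokens: (B, N_vit, C) low-resolution ViT patch tokens (keys/values)
        q = self.norm_q(cnn_tokens)
        kv = self.norm_kv(vit_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = cnn_tokens + attn_out  # residual over the spatial queries
        return x + self.mlp(x)


if __name__ == "__main__":
    block = HookBlock(dim=256)
    cnn_tokens = torch.randn(2, 64 * 64, 256)  # e.g. a stride-8 CNN feature map, flattened
    vit_tokens = torch.randn(2, 16 * 16, 256)  # e.g. stride-16 ViT patch tokens
    print(block(cnn_tokens, vit_tokens).shape)  # torch.Size([2, 4096, 256])
```

The partial fine-tuning in point (2) reduces, in this sketch, to a standard requires_grad pattern over a timm ViT; the unfrozen block indices below are placeholders, not the paper's selection:

```python
import timm

vit = timm.create_model("vit_base_patch16_224", pretrained=False)
for p in vit.parameters():
    p.requires_grad = False  # freeze the entire backbone
for idx in (4, 8):  # hypothetical intermediate layers to unfreeze
    for p in vit.blocks[idx].parameters():
        p.requires_grad = True
```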

How to cite

APA:

Mei, S., Thies, M., Xia, Y., Sun, Y., Wu, F., Fan, F., ... Maier, A. (2026). Vision transformer Hook for dense predictions. Pattern Recognition, 179, Article 113818. https://doi.org/10.1016/j.patcog.2026.113818

MLA:

Mei, Siyuan, et al. "Vision transformer Hook for dense predictions." Pattern Recognition 179 (2026).
