Mei S, Thies M, Xia Y, Sun Y, Wu F, Fan F, Gu M, Ye C, Huang Y, Christlein V, Maier A (2026)
Publication Type: Journal article
Publication year: 2026
Book Volume: 179
Article Number: 113818
DOI: 10.1016/j.patcog.2026.113818
Pre-trained vision transformers (ViTs) have demonstrated remarkable capability in learning semantically rich image representations. However, their plain architectures yield low-resolution feature maps that lack the fine-grained spatial details required for dense prediction tasks. To better transfer the learned visual features, we present ViT-Hook, a novel hybrid backbone compatible with plain ViTs that effectively bridges the gap between global semantic understanding and local spatial encodings. Specifically, our method aims to broaden the scope and impact of ViTs from the following perspectives: (1) We propose a simple transformer-decoder-inspired hook module that receives hierarchical CNN features as spatial queries and interacts with expressive ViT features from large-scale pre-training, thereby instantiating general-purpose representations into task-suited ones. (2) ViT-Hook is a plug-and-play solution for powerful vision foundation models, such as DINOv2 and RADIO. In this setting, we find that fine-tuning only a few intermediate ViT layers can outperform previous full fine-tuning methods while substantially reducing compute and memory costs, since most parameters remain frozen. (3) We evaluate ViT-Hook with various pre-trained sources on multiple dense prediction tasks, including semantic segmentation, instance segmentation, and object detection. Notably, tested on the unified UperNet and Mask R-CNN frameworks, ViT-Hook surpasses the previous state of the art by a large margin, achieving 59.7 (+4.7) mIoU on ADE20K val, and 55.0 (+3.6) box AP and 48.5 (+3.3) mask AP on COCO val2017. Code is available at https://github.com/siyuan-mei/ViT-Hook.
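The abstract does not spell out the internals of the hook module, so the following is only a minimal sketch of the general idea it describes: high-resolution CNN features act as queries and cross-attend to low-resolution ViT tokens, as in a transformer decoder. It is not the authors' implementation. The function names, the single attention head, the residual connection, the toy dimensions, and the random weights are all assumptions made for illustration, and plain Python lists stand in for tensors so the block has no dependencies.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weight):
    """Apply a (d_out x d_in) weight matrix to a vector x of length d_in."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def cross_attention(queries, tokens, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection (hypothetical).

    queries: spatial query vectors, assumed to come from a CNN feature map
    tokens:  ViT patch tokens, used as keys and values
    """
    q_proj = [linear(q, w_q) for q in queries]
    k_proj = [linear(t, w_k) for t in tokens]
    v_proj = [linear(t, w_v) for t in tokens]
    scale = 1.0 / math.sqrt(len(k_proj[0]))

    outputs = []
    for q_in, q in zip(queries, q_proj):
        # Similarity of this spatial query to every ViT token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_proj]
        weights = softmax(scores)
        # Weighted sum of the projected ViT tokens.
        attended = [
            sum(w * v[d] for w, v in zip(weights, v_proj))
            for d in range(len(v_proj[0]))
        ]
        # Residual: keep the CNN spatial detail, add the attended ViT features.
        outputs.append([a + b for a, b in zip(q_in, attended)])
    return outputs


def random_matrix(rows, cols, rng):
    """Toy stand-in for learned weights or extracted features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
cnn_dim, vit_dim = 4, 6                      # toy channel widths (assumed)
queries = random_matrix(16, cnn_dim, rng)    # e.g. a 4x4 CNN grid, flattened
tokens = random_matrix(4, vit_dim, rng)      # e.g. a 2x2 ViT grid, flattened
w_q = random_matrix(cnn_dim, cnn_dim, rng)
w_k = random_matrix(cnn_dim, vit_dim, rng)
w_v = random_matrix(cnn_dim, vit_dim, rng)

hooked = cross_attention(queries, tokens, w_q, w_k, w_v)
print(len(hooked), len(hooked[0]))
```

The output keeps the spatial resolution and channel width of the CNN queries, which is the property a dense prediction head would rely on. A practical version would presumably add multiple heads, normalization, and a feed-forward layer, and would take its keys and values from a real pre-trained ViT rather than random vectors.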
APA:
Mei, S., Thies, M., Xia, Y., Sun, Y., Wu, F., Fan, F., ... Maier, A. (2026). Vision transformer Hook for dense predictions. Pattern Recognition, 179. https://doi.org/10.1016/j.patcog.2026.113818
MLA:
Mei, Siyuan, et al. "Vision transformer Hook for dense predictions." Pattern Recognition 179 (2026).