Vision transformer Hook for dense predictions

Mei S, Thies M, Xia Y, Sun Y, Wu F, Fan F, Gu M, Ye C, Huang Y, Christlein V, Maier A (2026)

Publication Type: Journal article

Publication year: 2026

Journal: Pattern Recognition

Volume: 179

Article Number: 113818

DOI: 10.1016/j.patcog.2026.113818

Abstract

Pre-trained vision transformers (ViTs) have demonstrated remarkable capability in learning semantically rich image representations. However, their plain architectures yield low-resolution feature maps that lack the fine-grained spatial detail required for dense prediction tasks. To better transfer the learned visual features, we present ViT-Hook, a novel hybrid backbone compatible with plain ViTs that effectively bridges the gap between global semantic understanding and local spatial encodings. Specifically, our method broadens the scope and impact of ViTs from the following perspectives: (1) We propose a simple transformer-decoder-inspired hook module that receives hierarchical CNN features as spatial queries and interacts with expressive ViT features from large-scale pre-training, thereby instantiating general-purpose representations into task-suited ones. (2) ViT-Hook is a plug-and-play solution for powerful vision foundation models such as DINOv2 and RADIO. In this setting, we find that fine-tuning only a few intermediate ViT layers can outperform previous full fine-tuning methods while substantially reducing compute and memory burdens, since most parameters remain frozen. (3) We evaluate ViT-Hook with various pre-trained sources on multiple dense prediction tasks, including semantic segmentation, instance segmentation, and object detection. Notably, tested on the unified UperNet and Mask R-CNN frameworks, ViT-Hook surpasses the state of the art by a large margin, achieving 59.7 (+4.7) mIoU on ADE20K val, 55.0 (+3.6) box AP, and 48.5 (+3.3) mask AP on COCO val2017. Code is available at https://github.com/siyuan-mei/ViT-Hook.
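For orientation, the following is a minimal PyTorch sketch of the mechanism point (1) describes: flattened high-resolution CNN features act as spatial queries that cross-attend to low-resolution ViT tokens in a transformer-decoder-style block. The class name HookBlock, the layer layout, and all dimensions are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
import torch
import torch.nn as nn


class HookBlock(nn.Module):
    """Illustrative hook block: CNN tokens query (frozen) ViT tokens via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, cnn_tokens: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # cnn_tokens: (B, N_cnn, C) flattened high-resolution CNN features (queries)
        # vit_tokens: (B, N_vit, C) low-resolution ViT patch tokens (keys/values)
        q = self.norm_q(cnn_tokens)
        kv = self.norm_kv(vit_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = cnn_tokens + attn_out  # residual over the spatial queries
        return x + self.mlp(x)


if __name__ == "__main__":
    block = HookBlock(dim=256)
    cnn_tokens = torch.randn(2, 64 * 64, 256)  # e.g. a stride-8 CNN feature map, flattened
    vit_tokens = torch.randn(2, 16 * 16, 256)  # e.g. stride-16 ViT patch tokens
    print(block(cnn_tokens, vit_tokens).shape)  # torch.Size([2, 4096, 256])
```

The partial fine-tuning in point (2) reduces, in this sketch, to a standard requires_grad pattern over a timm ViT; the unfrozen block indices below are placeholders, not the paper's selection:

```python
import timm

vit = timm.create_model("vit_base_patch16_224", pretrained=False)
for p in vit.parameters():
    p.requires_grad = False  # freeze the entire backbone
for idx in (4, 8):  # hypothetical intermediate layers to unfreeze
    for p in vit.blocks[idx].parameters():
        p.requires_grad = True
```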

How to cite

APA:

Mei, S., Thies, M., Xia, Y., Sun, Y., Wu, F., Fan, F., ... Maier, A. (2026). Vision transformer Hook for dense predictions. Pattern Recognition, 179, Article 113818. https://doi.org/10.1016/j.patcog.2026.113818

MLA:

Mei, Siyuan, et al. "Vision transformer Hook for dense predictions." Pattern Recognition 179 (2026).
