Simulation-driven training of vision transformers enables metal artifact reduction of highly truncated CBCT scans

Fan F, Ritschl L, Beister M, Biniazan R, Wagner F, Kreher BW, Gottschalk T, Kappler S, Maier A (2023)

Publication Language: English

Publication Status: Published

Publication Type: Journal article, Original article

Publication year: 2023


Publisher: John Wiley and Sons Ltd

DOI: 10.1002/mp.16919


Background: Due to the high attenuation of metals, severe artifacts occur in cone beam computed tomography (CBCT). Metal segmentation in the CBCT projections usually serves as a prerequisite for metal artifact reduction algorithms.

Purpose: Truncation caused by the limited detector size prevents threshold-based methods from acquiring complete metal masks in the CBCT volume. Therefore, this work pursues segmenting metal directly in the CBCT projections.

Methods: Since the generation of high-quality clinical training data is a constant challenge, this study proposes to generate simulated digital radiographs (data I) based on real CT data combined with self-designed computer-aided design (CAD) implants. In addition to the simulated projections generated from 3D volumes, 2D X-ray images combined with projections of implants serve as a complementary data set (data II) to improve the network performance. In this work, SwinConvUNet, consisting of shifted-window (Swin) vision transformers (ViTs) with patch merging as the encoder, is proposed for metal segmentation.
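The first simulation pathway (data I) forward-projects CT volumes with inserted CAD implants to obtain radiographs together with pixel-accurate metal labels. As a rough illustration only, the sketch below uses a parallel-beam line-integral projection with NumPy; the function name, the assigned metal attenuation value, and the parallel-beam geometry are simplifying assumptions, not the paper's actual simulation pipeline.

```python
import numpy as np

def simulate_drr_with_mask(ct_volume, implant_mask, mu_metal=5.0, axis=0):
    """Simulate a digital radiograph and its metal mask label by
    parallel-beam forward projection along one volume axis.

    ct_volume:    3D array of attenuation coefficients (per-voxel units)
    implant_mask: 3D boolean array marking the voxels of the CAD implant
    mu_metal:     attenuation assigned to implant voxels (assumed value)
    """
    # Insert the implant by overriding attenuation inside its mask
    volume = np.where(implant_mask, mu_metal, ct_volume)
    # Line integrals along the ray direction (Beer-Lambert exponent)
    line_integrals = volume.sum(axis=axis)
    projection = np.exp(-line_integrals)  # simulated intensity image
    # 2D label: a pixel is "metal" if any implant voxel lies on its ray
    metal_label = implant_mask.any(axis=axis)
    return projection, metal_label
```

The 2D label produced alongside each projection is what makes such simulated data directly usable as segmentation ground truth, avoiding manual annotation of clinical projections.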

Results: The model's performance is evaluated on accurately labeled test data sets obtained from cadaver scans as well as on unlabeled clinical projections. When trained on data I only, the convolutional neural network (CNN) encoder-based networks UNet and TransUNet achieve only limited performance on the cadaver test data, with average Dice scores of 0.821 and 0.850, respectively. After using both data I and data II during training, the average Dice scores for the two models increase to 0.906 and 0.919, respectively. By replacing the CNN encoder with a Swin transformer, the proposed SwinConvUNet reaches an average Dice score of 0.933 on the cadaver projections when trained on data I only. Furthermore, SwinConvUNet achieves the highest average Dice score of 0.953 on the cadaver projections when trained on the combined data set.
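The numbers above are Dice scores, the standard overlap metric for binary segmentation masks: twice the intersection divided by the sum of the two mask sizes. A minimal NumPy reference implementation (the function name and epsilon smoothing term are illustrative choices, not taken from the paper) might look like:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks:
    2 * |A ∩ B| / (|A| + |B|), in [0, 1]."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)
```

A score of 1.0 indicates perfect overlap with the labeled ground truth, so the reported jump from 0.821/0.850 to 0.953 corresponds to a substantial reduction in mis-segmented metal pixels.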

Conclusions: Our experiments quantitatively demonstrate the effectiveness of combining projections simulated via the two pathways for network training. Moreover, the proposed SwinConvUNet, trained on the simulated projections, performs state-of-the-art, robust metal segmentation, as demonstrated in experiments on cadaver and clinical data sets. With the accurate segmentations from the proposed model, metal artifact reduction can be conducted even for highly truncated CBCT scans.

How to cite


Fan, F., Ritschl, L., Beister, M., Biniazan, R., Wagner, F., Kreher, B.W.,... Maier, A. (2023). Simulation-driven training of vision transformers enables metal artifact reduction of highly truncated CBCT scans. Medical Physics.


Fan, Fuxin, et al. "Simulation-driven training of vision transformers enables metal artifact reduction of highly truncated CBCT scans." Medical Physics (2023).
