Preface
Datasets in the medical field are characterized by small numbers of labeled samples and non-natural images. Transformers have proven successful in the natural image field, but can they be applied to non-natural image domains with few labeled samples, such as medical imaging?
This paper compares the performance of CNNs and ViTs on medical imaging tasks under three different initialization strategies, studies the influence of self-supervised pre-training in the medical image field, and draws three conclusions.
This article is from the CV Technical Guide paper-sharing series.
Code: github.com/ChrisMats/m…
Background
Many approaches have been proposed to adapt Transformers to visual tasks. On natural images, Transformers have proven superior to CNNs in standard visual tasks such as ImageNet classification, as well as object detection and semantic segmentation. The Transformer's central attention mechanism offers several key advantages over convolution: (1) it captures long-range relationships, (2) it has adaptive modeling capacity, with dynamically computed self-attention weights that capture the relationships between tokens, and (3) it provides built-in visibility into what the model is focusing on.
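The first two advantages can be seen in a minimal sketch of single-head self-attention (a toy NumPy illustration, not the implementation used in the paper): the attention weights are computed from the input itself, and every token attends directly to every other token, however far apart they are.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention over a token sequence x of shape (n_tokens, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: input-dependent weights
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 4, 8                                          # 4 tokens of dim 8 (toy sizes)
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
# attn[i, j] is how much token i attends to token j; each row is a distribution,
# so even the first and last tokens are connected in a single step.
```

Inspecting `attn` row by row is also what gives the "built-in visibility" mentioned above: the weights directly show which tokens the model focused on.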
However, there is evidence that Vision Transformers require very large datasets to outperform CNNs; the benefits of ViTs only become apparent when Google's 300-million-image private dataset, JFT-300M, is used for pre-training. Their reliance on data at this scale is a barrier to the widespread use of Transformers. The problem is particularly acute in medical imaging, where datasets are small and often accompanied by less reliable labels.
CNNs, like ViTs, perform poorly when data is scarce. The standard solution is transfer learning: typically, the model is pre-trained on a large dataset (such as ImageNet) and then fine-tuned for a particular task using a smaller, specialized dataset. In the medical field, CNNs pre-trained on ImageNet generally outperform those trained from scratch, both in final performance and in reduced training time.
Self-supervision is a method for learning from unlabeled data that has recently gained wide attention. Research shows that self-supervised pre-training of CNNs in the target domain before fine-tuning can improve performance, and that ImageNet initialization helps self-supervised CNNs converge faster and generally yields better predictive performance.
These techniques for dealing with the lack of data in medical imaging have proven effective for CNNs, but it is unclear whether Vision Transformers enjoy similar benefits. Some studies have shown that the gains from ImageNet pre-training of CNNs for medical image analysis do not come from feature reuse (contrary to conventional wisdom), but rather from better initialization and weight scaling. That raises the question of whether Transformers will benefit from these techniques. If they do, there is little to stop ViTs from becoming the dominant architecture for medical imaging.
In this work, the paper explores whether ViTs can easily replace CNNs for medical imaging tasks, and whether there are advantages to doing so. The paper considers the use case of a typical practitioner, equipped with a limited computing budget and access to conventional medical datasets, with an eye toward a "plug-and-play" solution. To this end, experiments are carried out on three mainstream public datasets.
Through these experiments, the following conclusions are drawn:
- ViTs pretrained on ImageNet show performance comparable to CNNs when data is limited.
- Transfer learning benefits ViTs when standard training procedures and settings are applied.
- When self-supervised pre-training is followed by supervised fine-tuning, ViTs perform better than CNNs.
These findings suggest that medical image analysis can seamlessly transition from CNNs to ViTs while achieving better interpretability.
Methods
The main question investigated in this paper is whether ViTs can be used as a plug-and-play alternative to CNNs for medical diagnostic tasks. To this end, a series of experiments compares ViTs and CNNs under similar conditions while keeping hyperparameter tuning to a minimum. To ensure a fair and interpretable comparison, the representative ResNet50 was selected as the CNN and DeiT-S with 16×16 patch tokens as the ViT. These models were chosen because they are comparable in number of parameters, memory requirements, and computation.
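The parameter comparability can be checked with a back-of-envelope count. Using the standard DeiT-S hyperparameters (embed dim 384, 12 blocks, 16×16 patches, 224×224 input, 1000-class head; these values come from the DeiT paper, not from this article), the total lands around 22M, in the same range as ResNet50's roughly 25M:

```python
# Back-of-envelope parameter count for DeiT-S.
d, layers, patch, n_patches, n_classes = 384, 12, 16, (224 // 16) ** 2, 1000

qkv   = 3 * (d * d + d)                           # fused query/key/value projection
proj  = d * d + d                                 # attention output projection
mlp   = (d * 4 * d + 4 * d) + (4 * d * d + d)     # 2-layer MLP, expansion ratio 4
norms = 2 * 2 * d                                 # two LayerNorms (scale + shift)
block = qkv + proj + mlp + norms

patch_embed = patch * patch * 3 * d + d           # conv patchifier over RGB patches
pos_cls     = (n_patches + 1) * d + d             # position embeddings + CLS token
head        = 2 * d + (d * n_classes + n_classes) # final norm + classifier

total = layers * block + patch_embed + pos_cls + head
print(f"DeiT-S ≈ {total / 1e6:.1f}M parameters")  # ~22M, close to ResNet50's ~25M
```

This ignores the distillation token DeiT can add, but the order of magnitude is what matters for a fair comparison.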
As mentioned above, CNNs rely on initialization strategies to improve performance when data is not abundant, which is the case for medical images. The standard approach is transfer learning: initializing the model with weights pre-trained on ImageNet and fine-tuning on the target domain.
Therefore, the paper considers three initialization strategies: (1) randomly initialized weights, (2) transfer learning using supervised ImageNet pre-trained weights, and (3) self-supervised pre-training on the target dataset, starting from the initialization in (2). These strategies are applied to three standard medical imaging datasets covering different target domains:
APTOS 2019 - In this dataset, the task is to classify images of diabetic retinopathy into 5 categories of disease severity. APTOS 2019 contains 3,662 high-resolution retinal images.
ISIC 2019 - The task is to classify 25,333 dermoscopic images among nine different skin lesion diagnostic categories.
CBIS-DDSM - This dataset contains 10,239 mammograms; the task is to detect the presence of masses.
Experiments
Comparison of CNN and ViTs under different initialization strategies
1. Are randomly initialized Transformers useful?
First, DeiT-S is compared with ResNet50 using randomly initialized weights (Kaiming initialization). The results in the table above show that under this setting, CNNs far exceed ViTs across the board. These results are consistent with previous observations on natural images, where ViTs trained on limited data underperform CNNs of similar size, a trend that has been attributed to the ViT's lack of inductive bias. Given the moderate size of most medical imaging datasets, the usefulness of randomly initialized ViTs appears limited.
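For reference, Kaiming (He) initialization draws each weight from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in), scaled for ReLU networks. A minimal stdlib sketch (not the paper's code):

```python
import math
import random

def kaiming_normal(fan_in, fan_out, seed=0):
    """Kaiming (He) normal init for a ReLU layer: std = sqrt(2 / fan_in)."""
    random.seed(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[random.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

w = kaiming_normal(fan_in=512, fan_out=256)
flat = [v for row in w for v in row]
emp_std = math.sqrt(sum(v * v for v in flat) / len(flat))
# The empirical std should be close to sqrt(2 / 512) ≈ 0.0625, which keeps
# activation variance roughly constant across ReLU layers at initialization.
```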
2. Do Transformers pre-trained on ImageNet work in medical imaging?
Random initialization is rarely used in practice on medical image datasets. The standard procedure is to initialize the network with pre-trained weights from ImageNet and then fine-tune on data from the target domain. Here, the paper investigates whether this approach can be effectively applied to ViTs. To test this, the paper initializes all models with weights pre-trained on ImageNet in a fully supervised manner, then fine-tunes using the procedure above.
The results in the table above show that both CNNs and ViTs benefit significantly from ImageNet initialization. In fact, ViTs seem to benefit more, as they now perform on par with CNNs. This shows that, with ImageNet initialization and medium-sized training data, CNNs can be replaced by ViTs without affecting performance on medical imaging tasks.
3. Is self-supervised pre-training of Transformers beneficial in the medical image field?
Recent self-supervised learning schemes, such as DINO and BYOL, have approached the performance of supervised learning. Moreover, when they are used for pre-training followed by supervised fine-tuning, they can achieve new SOTA results. Although this has been demonstrated for both CNNs and ViTs in large-data regimes, it is unclear whether self-supervised pre-training of ViTs helps on medical imaging tasks, especially with medium- and small-sized datasets.
To verify this, the paper uses DINO's self-supervised learning scheme, which can be applied to both CNNs and ViTs. DINO uses self-distillation to encourage the student and teacher networks to produce similar representations for differently augmented inputs. Self-supervised pre-training starts from ImageNet initialization and then applies self-supervised learning on the target medical domain data, following the default settings recommended by the authors of the original paper, except for three small changes: (1) the base learning rate is set to 0.0001, (2) the initial weight decay is set to 10⁻⁵ and increased to 10⁻⁴ following a cosine schedule, and (3) the teacher's EMA momentum is set to 0.99. CNNs and ViTs use the same setup: both are pre-trained for 300 epochs with a batch size of 256 and then fine-tuned.
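The two schedule details above can be sketched in a few lines (an illustrative stdlib sketch of the general DINO recipe, not the authors' implementation): a cosine ramp for the weight decay, and an exponential-moving-average update that slowly drifts the teacher toward the student.

```python
import math

def cosine_schedule(start, end, step, total_steps):
    """Cosine ramp from `start` to `end`, e.g. weight decay 1e-5 -> 1e-4."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return end + (start - end) * cos

def ema_update(teacher, student, momentum=0.99):
    """DINO-style teacher update: EMA of student weights with momentum 0.99."""
    return [momentum * t + (1 - momentum) * s for t, s in zip(teacher, student)]

epochs = 300
wd_first = cosine_schedule(1e-5, 1e-4, 0, epochs)       # starts at 1e-5
wd_last  = cosine_schedule(1e-5, 1e-4, epochs, epochs)  # ends at 1e-4

# Toy weights: after one update the teacher has moved 1% of the way
# toward the student, so its targets change slowly and stay stable.
teacher = ema_update([0.0, 0.0], [1.0, 1.0])
```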
The results reported in the table above show that both ViTs and CNNs perform better with self-supervised pre-training. In this setting, ViTs appear to outperform CNNs, albeit by a small margin. Studies on natural images suggest that the gap between ViTs and CNNs would widen with more data.
Conclusion
This paper compares the performance of CNNs and ViTs on medical imaging tasks under three different initialization strategies and studies the effect of self-supervised pre-training in the medical image field.
The results show that the improvement from self-supervised pre-training is small but consistent for both ViTs and CNNs. Although self-supervised ViTs achieved the best overall performance, it is interesting that in this low-data regime we do not see the strong advantage of ViTs previously reported on natural images with more data. In medical imaging, large labeled datasets are rare because of expert labeling costs, yet it is often possible to collect large numbers of unlabeled images. This suggests an attractive opportunity: applying self-supervision to large medical image datasets of which only a small fraction is labeled.
For the medical image field, it is found that:
- As might be expected, ViTs perform worse than CNNs in the low-data regime when simply trained from scratch.
- Transfer learning bridges the performance gap, giving CNNs and ViTs similar performance.
- With self-supervised pre-training plus fine-tuning, which yields the best overall performance, ViTs have a slight advantage over CNNs.