Let us look at a recent advancement in Computer Vision using Self-Supervised Learning: “Emerging Properties in Self-Supervised Vision Transformers”, where the authors employ a variant of a previous SSL method and discover that self-supervised Vision Transformers naturally exhibit features useful for semantic segmentation, and also find that they outperform most existing methods on image retrieval tasks.
The approach, “Self-Distillation with No Labels” aka DINO, interprets the existing “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning” (BYOL) method from a knowledge distillation perspective, with some modifications of their own, to produce SOTA Vision Transformers that naturally segment the foreground from the background, along with other useful features that help in retrieval problems.
Before looking into DINO in depth, let us understand what Self-Supervised Learning is, how BYOL proposes an elegant way of extracting useful representations from unlabelled data, and how DINO improves on top of BYOL.
Until now, most of the advancements in Deep Learning have focused on Supervised Learning, where the assumption is that loads of labeled data are available. While there is an active stream of research on generating synthetic data, it is not always feasible to generate seamless real-world data.
Even if loads of data are available, labeling them is expensive in terms of time and effort. Therefore, there will always be scenarios where labeled data is limited. So how can we ensure that neural networks achieve high accuracy under such conditions?
Q. What if there is a way to learn useful representations out of the abundant unlabelled data?
This paradigm of learning representations from unlabelled data, usually (though not always) to improve a downstream network’s performance, is called Self-Supervised Learning. The different approaches to learning these representations are called pretext tasks.
Now that we know what Self-Supervised Learning is, let us look at the various pretext tasks available.
Q. We now know that there are a variety of methods available to learn representations from unlabelled data. Great! But what next? How do we make use of these representations?
In general, a network trained using SSL methods is used as the initialization for training on the limited labeled data.
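To make this concrete, here is a minimal PyTorch sketch of that reuse pattern; the checkpoint path, the ResNet-50 backbone, and the number of downstream classes are illustrative placeholders, not taken from any particular paper:

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical checkpoint produced by an SSL pretext task (path is a placeholder).
ssl_state_dict = torch.load("ssl_pretrained_backbone.pth")

# Start from the SSL weights instead of a random initialization.
backbone = models.resnet50()
backbone.load_state_dict(ssl_state_dict, strict=False)

# Replace the classification head and fine-tune on the small labeled dataset.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)  # e.g. 10 downstream classes
optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-3, momentum=0.9)
```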
Note: We won’t be looking in detail into all of the pretext tasks. The scope of this article is limited to Contrastive Learning.
Let us try to understand how neural networks learn representations using Contrastive Learning.
Contrastive learning is one of many paradigms that fall under Deep Distance Metric Learning, where the objective is to learn a distance in a low-dimensional space that is consistent with the notion of semantic similarity. In simple terms (considering the image domain), it means learning similarity among images, where the distance is small for similar images and large for dissimilar images. Siamese/twin networks are used: a positive or negative image pair is fed in, and similarity is learned using the contrastive loss.
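As a rough illustration, here is a minimal PyTorch sketch of the classic pairwise contrastive loss used with Siamese networks; the margin value and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Pairwise contrastive loss: pull similar pairs together,
    push dissimilar pairs at least `margin` apart."""
    distance = F.pairwise_distance(emb_a, emb_b)                 # Euclidean distance per pair
    positive_term = is_similar * distance.pow(2)                 # similar pairs: shrink distance
    negative_term = (1 - is_similar) * F.relu(margin - distance).pow(2)  # dissimilar: push apart
    return (positive_term + negative_term).mean()

# emb_a, emb_b: embeddings from the two branches of a Siamese network.
# is_similar: 1.0 for positive pairs, 0.0 for negative pairs.
```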
Q. Okay, so we can train a contrastive model only if we know the positives and negatives for a given dataset. But the whole point of SSL is to train without labels right?
Yes, hence the following solution has been proposed: create two augmented views of the same image to form a positive pair, and treat every other image in the batch as a negative.
Q. The above approach is what forms the crux of SimCLR and SwAV methods. This is good to know, but why are we looking at SimCLR or SwAV when DINO is based on BYOL?
The important thing to note here is that when we treat every other sample in the batch as a negative, there is a chance that a similar image falls among the negatives, and this, in turn, produces not-so-ideal representations.
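To make the in-batch-negatives idea concrete, here is a simplified PyTorch sketch of a SimCLR-style loss; the projection head and augmentation pipeline are omitted, and the temperature is an illustrative value:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """The two augmented views of the same image form the positive pair;
    every other view in the batch is treated as a negative."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D) all views
    sim = z @ z.t() / temperature                        # cosine similarities
    sim.fill_diagonal_(float('-inf'))                    # a view is never its own negative
    n = z1.size(0)
    # For view i, the positive sits n positions away in the concatenated batch.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```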
To get rid of negatives, as well as to avoid the possibility of learning dissimilar representations for similar images, another contrastive-learning-based SSL approach called BYOL was introduced, in which a neural network is trained with only positive pairs.
Despite using no negative pairs during training, BYOL learns very good representations.
To solve the problem of collapsed representations (the trivial solution where every input maps to the same output), BYOL uses two networks: a trainable online network that learns the representations, and a target network that receives no gradients and whose weights are an exponential moving average of the online network’s weights.
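A minimal PyTorch sketch of this online/target setup, assuming `online_net` is any `torch.nn.Module`; the placeholder encoder and the momentum value are illustrative:

```python
import copy
import torch

online_net = torch.nn.Sequential(torch.nn.Linear(128, 64))  # placeholder encoder
target_net = copy.deepcopy(online_net)                       # target starts as a copy
for p in target_net.parameters():
    p.requires_grad = False                                  # target receives no gradients

@torch.no_grad()
def update_target(online, target, momentum=0.996):
    # Target weights are an exponential moving average of the online weights.
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data = momentum * p_t.data + (1.0 - momentum) * p_o.data
```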
Great!
We now know what Self-Supervised Learning is, where Contrastive Learning fits into SSL and how BYOL learns rich representations from just positive pairs.
Let us now look at how DINO makes use of the approach used in BYOL.
Self-Distillation with No Labels (DINO) is a recent SSL technique that follows a very similar approach to BYOL, but differs in the way the inputs are fed to the network and also makes use of Vision Transformers as the feature extractor.
Although the method seems similar to BYOL, there is one key difference in the way the inputs are fed in DINO: multi-crop training.
The teacher network is fed only large “global” views of the image, while the student network is fed both the global views and several smaller random crops of the same image. An example pair could be something like <a small random crop of the image, a global view of the image>. The fact that the teacher sees only large views while the student also sees small local crops forces the student to learn local-to-global correspondences and attend to the salient foreground objects in the image.
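A rough sketch of such a multi-crop transform using torchvision; the crop sizes, scales, and number of local crops are illustrative, not the exact DINO recipe:

```python
from torchvision import transforms

# Large "global" views (fed to teacher and student) and small "local" crops (student only).
global_crop = transforms.RandomResizedCrop(224, scale=(0.4, 1.0))
local_crop = transforms.RandomResizedCrop(96, scale=(0.05, 0.4))

def multi_crop(image, num_local=6):
    global_views = [global_crop(image) for _ in range(2)]
    local_views = [local_crop(image) for _ in range(num_local)]
    return global_views, local_views
```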
Now that we understand how DINO works, let us see why there is this new terminology of student and teacher networks here, which was not the case with BYOL.
Knowledge Distillation is a technique in which a smaller student network is trained to match the output of a relatively larger teacher network. It is commonly used for model compression, where the objective is to train a smaller network (the student) that matches the accuracy of a larger network (the teacher).
When the student and the teacher have the same architecture, it is called Self-Distillation.
One can now draw comparisons between BYOL and self-distillation: the target network in BYOL can be seen as the teacher network, and the online network in BYOL can be seen as the student network.
And since this whole process is self-supervised, i.e. it uses no labels, the approach is called Self-Distillation with No Labels, aka DINO.
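Putting the pieces together, here is a simplified PyTorch sketch of the self-distillation objective: the student’s output distribution is trained to match the teacher’s centered and sharpened output distribution. The temperatures and the centering term are illustrative values:

```python
import torch
import torch.nn.functional as F

def dino_style_loss(student_out, teacher_out, center,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's (centered, sharpened) softmax output
    and the student's softmax output. The teacher receives no gradients."""
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# `center` is a running mean of the teacher outputs (shape: output dimension),
# used together with sharpening to avoid collapse.
```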
Even though the approach followed in DINO seems similar to BYOL, it beats BYOL along with other SSL approaches such as SimCLR and SwAV.
Also, DINO when combined with Vision Transformers (ViT) produces SOTA results on ImageNet - even better than a supervised baseline.
Since multi-crop training is the main change relative to BYOL, this suggests that it plays a significant role in DINO’s performance.
The authors have observed that SSL, when combined with ViT specifically, exhibits strong semantic segmentation properties as well as useful features for image retrieval tasks, compared to supervised ViTs or ConvNets.
DINO can be used for a variety of applications, including object segmentation, image classification with limited labeled data, and image retrieval.
Q. How can DINO be exploited for image classification tasks?
Given a set of unlabelled images, we can train a network on that data using the DINO method and then either use its frozen features directly for k-nearest-neighbour classification or use its weights as the initialization for the downstream task (which is usually trained on a limited labeled set).
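For example, here is a minimal PyTorch sketch of k-NN classification on frozen features; `dino_backbone`, the image tensors, and `k` are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_predict(dino_backbone, query_images, bank_images, bank_labels, k=20):
    q = F.normalize(dino_backbone(query_images), dim=1)   # features of the unlabeled queries
    b = F.normalize(dino_backbone(bank_images), dim=1)     # features of the labeled bank
    sims = q @ b.t()                                        # cosine similarities
    topk = sims.topk(k, dim=1).indices                      # k nearest labeled neighbours
    neighbour_labels = bank_labels[topk]                    # (num_queries, k)
    return neighbour_labels.mode(dim=1).values              # majority vote per query
```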
We hope you folks got the intuition behind Self-Supervised Learning and DINO, and the motivation behind using it.
We will be back with more interesting papers in the future.
Cheers!
We are rapidly expanding our AI engineering team to build cutting-edge Computer Vision techniques for the radically different challenges we face when applying AI at a global scale.
Our AI today automatically indexes thousands of SKUs every day to positively impact the manufacturing, distribution, and placement of essential products. If you want to reach out to us, please leave us a message on the form below.