A nice video explanation of the paper.
The proposed learning paradigm:
- Self-supervised pretraining
- Supervised finetuning
- Distillation: train a student network to match the teacher's predicted (soft) labels rather than the ground-truth labels (see the sketch after this list).
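
A minimal sketch of the soft-label distillation loss described above, assuming a PyTorch setup; the function name and the temperature parameter `T` are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=1.0):
    # Cross-entropy between the teacher's softened output distribution
    # and the student's, i.e. the student learns the teacher's predictions
    # instead of the true labels.
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Hypothetical usage: a batch of 4 examples with 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits.detach(), T=2.0)
loss.backward()
```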
The conclusion seems rather counterintuitive: labeled data do not always help, or, put differently, using too much labeled data during training does not necessarily improve results.