Background

Vision transformers have become the standard backbone for a wide range of vision tasks.

Training data-efficient image transformers & distillation through attention

DeiT III: Revenge of the ViT

PVT v2: Improved Baselines with Pyramid Vision Transformer
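
The DeiT paper above ("Training data-efficient image transformers & distillation through attention") introduces a learnable distillation token that sits next to the class token and is supervised by the teacher's predictions. The following is a minimal sketch of that idea, assuming a PyTorch-style transformer encoder; module and function names here are illustrative, not the official DeiT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledViTHead(nn.Module):
    """Sketch of DeiT-style distillation through attention: a class token and a
    distillation token are prepended to the patch embeddings, and each gets its
    own classification head."""

    def __init__(self, embed_dim: int, num_classes: int, encoder: nn.Module):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.encoder = encoder                     # any (B, N, D) -> (B, N, D) transformer
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.dist_head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_embeddings: torch.Tensor):
        b = patch_embeddings.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        x = torch.cat([cls, dist, patch_embeddings], dim=1)
        x = self.encoder(x)                        # self-attention lets both tokens attend to patches
        return self.cls_head(x[:, 0]), self.dist_head(x[:, 1])

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    # Class token learns from the ground-truth labels; distillation token learns
    # from the teacher's hard predictions (the hard-label variant described in DeiT).
    ce_label = F.cross_entropy(cls_logits, labels)
    ce_teacher = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))
    return 0.5 * (ce_label + ce_teacher)
```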

Knowledge Distillation

General KD

Distilling the Knowledge in a Neural Network

FitNets: Hints for Thin Deep Nets
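
The two papers above define the loss terms that most later KD work builds on: the soft-target objective of Hinton et al. and the intermediate-feature "hint" of FitNets. A minimal sketch of both, assuming PyTorch logit and feature tensors; hyperparameter and function names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Classic soft-target distillation (Hinton et al.): blend cross-entropy on
    ground-truth labels with a KL divergence between temperature-softened
    teacher and student distributions; the T**2 factor keeps gradient scales
    comparable across temperatures."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft

def hint_loss(student_feat, teacher_feat, regressor: nn.Module):
    """FitNets-style hint: match an intermediate student feature to the
    teacher's, through a learned regressor that aligns the dimensions."""
    return F.mse_loss(regressor(student_feat), teacher_feat)
```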

KD on Language Models

TinyBERT: Distilling BERT for Natural Language Understanding

TernaryBERT: Distillation-aware Ultra-low Bit BERT

EBERT: Efficient BERT Inference with Dynamic Structured Pruning
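
TinyBERT, listed above, extends distillation to intermediate layers: besides the prediction layer, it matches attention matrices and hidden states between mapped teacher and student layers. A rough sketch of those per-layer terms, assuming PyTorch tensors; argument names are assumptions, not the released TinyBERT code.

```python
import torch.nn as nn
import torch.nn.functional as F

def tinybert_layer_loss(student_attn, teacher_attn,
                        student_hidden, teacher_hidden,
                        hidden_proj: nn.Module):
    """Per-layer TinyBERT-style terms: MSE on attention matrices plus MSE on
    hidden states, with a linear projection bridging the (usually smaller)
    student hidden size to the teacher's."""
    attn_term = F.mse_loss(student_attn, teacher_attn)                      # (B, heads, L, L)
    hidden_term = F.mse_loss(hidden_proj(student_hidden), teacher_hidden)   # (B, L, d_teacher)
    return attn_term + hidden_term
```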

Styles of KD

Kernel Based Progressive Distillation for Adder Neural Networks