Background

Vision transformers have become the standard backbone for a wide range of vision tasks.

Training data-efficient image transformers & distillation through attention

DeiT III: Revenge of the ViT

PVT v2: Improved Baselines with Pyramid Vision Transformer
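
The DeiT paper above ("Training data-efficient image transformers & distillation through attention") introduces a learnable distillation token that sits next to the class token and is supervised by the teacher's predictions. The following is a minimal sketch of that idea, assuming a PyTorch-style transformer encoder; module and function names here are illustrative, not the official DeiT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledViTHead(nn.Module):
    """Sketch of DeiT-style distillation through attention: a class token and a
    distillation token are prepended to the patch embeddings, and each gets its
    own classification head."""

    def __init__(self, embed_dim: int, num_classes: int, encoder: nn.Module):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.encoder = encoder                     # any (B, N, D) -> (B, N, D) transformer
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.dist_head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_embeddings: torch.Tensor):
        b = patch_embeddings.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        x = torch.cat([cls, dist, patch_embeddings], dim=1)
        x = self.encoder(x)                        # self-attention lets both tokens attend to patches
        return self.cls_head(x[:, 0]), self.dist_head(x[:, 1])

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    # Class token learns from the ground-truth labels; distillation token learns
    # from the teacher's hard predictions (the hard-label variant described in DeiT).
    ce_label = F.cross_entropy(cls_logits, labels)
    ce_teacher = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))
    return 0.5 * (ce_label + ce_teacher)
```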

Knowledge Distillation

General KD

Distilling the Knowledge in a Neural Network

FitNets: Hints for Thin Deep Nets
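
The two papers above define the loss terms that most later KD work builds on: the soft-target objective of Hinton et al. and the intermediate-feature "hint" of FitNets. A minimal sketch of both, assuming PyTorch logit and feature tensors; hyperparameter and function names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Classic soft-target distillation (Hinton et al.): blend cross-entropy on
    ground-truth labels with a KL divergence between temperature-softened
    teacher and student distributions; the T**2 factor keeps gradient scales
    comparable across temperatures."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft

def hint_loss(student_feat, teacher_feat, regressor: nn.Module):
    """FitNets-style hint: match an intermediate student feature to the
    teacher's, through a learned regressor that aligns the dimensions."""
    return F.mse_loss(regressor(student_feat), teacher_feat)
```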

KD on Language Models

TinyBERT: Distilling BERT for Natural Language Understanding

TernaryBERT: Distillation-aware Ultra-low Bit BERT

EBERT: Efficient BERT Inference with Dynamic Structured Pruning
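
TinyBERT, listed above, extends distillation to intermediate layers: besides the prediction layer, it matches attention matrices and hidden states between mapped teacher and student layers. A rough sketch of those per-layer terms, assuming PyTorch tensors; argument names are assumptions, not the released TinyBERT code.

```python
import torch.nn as nn
import torch.nn.functional as F

def tinybert_layer_loss(student_attn, teacher_attn,
                        student_hidden, teacher_hidden,
                        hidden_proj: nn.Module):
    """Per-layer TinyBERT-style terms: MSE on attention matrices plus MSE on
    hidden states, with a linear projection bridging the (usually smaller)
    student hidden size to the teacher's."""
    attn_term = F.mse_loss(student_attn, teacher_attn)                      # (B, heads, L, L)
    hidden_term = F.mse_loss(hidden_proj(student_hidden), teacher_hidden)   # (B, L, d_teacher)
    return attn_term + hidden_term
```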

Styles of KD

Kernel Based Progressive Distillation for Adder Neural Networks