Attention Is All You Need

The transformer computes its output as a weighted sum of the value vectors $V$, where the weights are post-softmax attention probabilities. Naturally, we should be able to apply distillation to these post-softmax values before they are multiplied with $V$.

The idea is to use knowledge distillation to explore how low-bitwidth number representations can help build a more efficient inference pipeline for transformer models.
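As an illustration of what a low-bitwidth representation could look like, here is a minimal sketch (in PyTorch) that simulates fixed-point quantization of a tensor. The function name and the choice of 8 fractional bits are illustrative assumptions, not part of the proposal.

```python
import torch

def to_fixed_point(x, frac_bits=8):
    """Simulate rounding a tensor onto a fixed-point grid with `frac_bits`
    fractional bits (values are kept in float for convenience).
    Post-softmax attention probabilities lie in [0, 1], so very few
    integer bits are needed; frac_bits=8 is an illustrative choice."""
    scale = 2 ** frac_bits
    return torch.round(x * scale) / scale
```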

Transformer models are now widely used in many language processing tasks, yet their run-time cost prevents them from being deployed on a wide range of devices. The major computation block in a transformer is the multi-head self-attention module. This module takes three inputs Q, K, V and calculates the output as Y = softmax(QK^T / √d_k)V [1]. This gives a clean formulation of the output as a weighted sum of the value vectors V, so we should be able to apply distillation to the post-softmax probabilities before they are multiplied with V.
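To make the target of the distillation concrete, here is a minimal sketch in PyTorch of scaled dot-product attention that exposes the post-softmax probability matrix A alongside the output Y. The function name and tensor shapes are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention that also returns the post-softmax
    probability matrix A, so A can be inspected or distilled separately."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_q, seq_k)
    A = F.softmax(scores, dim=-1)                  # post-softmax attention weights
    Y = A @ V                                      # weighted sum of the value vectors
    return Y, A

# Toy usage: batch of 1, sequence length 4, head dimension 8
Q, K, V = (torch.randn(1, 4, 8) for _ in range(3))
Y, A = scaled_dot_product_attention(Q, K, V)
print(Y.shape, A.shape)  # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```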

The project will explore how existing knowledge distillation techniques [2] can help us utilise these post-softmax probability vectors. The objective is to distill a smaller model without sacrificing too much performance. The smaller model could use fewer attention heads, a smaller hidden dimension (i.e. smaller matrix multiplications), or more efficient number representations such as fixed-point quantization. One possible distillation loss over attention maps is sketched below.
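A minimal sketch, assuming the soft-target formulation of [2] is applied per attention head to the pre-softmax attention scores. The function name, the temperature T, and the assumption that teacher and student share the same number of heads and sequence length are illustrative choices, not fixed parts of the project.

```python
import torch
import torch.nn.functional as F

def attention_kd_loss(student_scores, teacher_scores, T=2.0):
    """Soft-target distillation loss [2] on attention maps.

    Both inputs are pre-softmax attention scores of shape
    (batch, heads, seq_q, seq_k); if the student uses fewer heads,
    the teacher maps would first need to be aligned (e.g. averaged
    over groups of heads) -- that mapping is left as a design choice.
    """
    p_teacher = F.softmax(teacher_scores / T, dim=-1)          # softened teacher probabilities
    log_p_student = F.log_softmax(student_scores / T, dim=-1)  # softened student log-probabilities
    # The usual T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Toy usage with random scores: batch 2, 4 heads, sequence length 16
s = torch.randn(2, 4, 16, 16)
t = torch.randn(2, 4, 16, 16)
print(attention_kd_loss(s, t).item())
```

In practice this term would be combined with the standard soft-label loss on the model's output logits and the usual task loss; how the three are weighted is left open here.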

[1] A. Vaswani et al., "Attention Is All You Need". https://arxiv.org/abs/1706.03762

[2] G. Hinton, O. Vinyals, J. Dean, "Distilling the Knowledge in a Neural Network". https://arxiv.org/abs/1503.02531

Meeting notes (Nov 24th)