Transformers and their variants have proven to be an effective model architecture for many downstream tasks in NLP.

A great number of papers have since investigated the effect of quantization on these models, but the majority focus on the inference pass [1]. Prior to the rise of transformers, many papers discussed how to train models at low precision [2, 3, 4].

It is interesting to understand how transformer training would perform under these existing low-precision training methods. It is also important to understand the training dynamics of the different components in the network, and possibly to design a novel quantization strategy or a new arithmetic for low-precision transformer training.
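As a concrete starting point, one common way to emulate low-precision training is quantization-aware training with a straight-through estimator: weights and activations are rounded to a low-bit grid in the forward pass, while gradients flow through unchanged in the backward pass. The sketch below only illustrates that idea in PyTorch; the FakeQuant/QuantLinear names, the symmetric per-tensor scaling, and the 8-bit default are my own assumptions, not a method taken from the cited papers.

```python
import torch


class FakeQuant(torch.autograd.Function):
    """Symmetric uniform fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, num_bits):
        # Quantize to signed integers in [-2^(b-1), 2^(b-1) - 1], then dequantize.
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
        return q * scale  # "fake" quantization: result stays in float

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients through the rounding unchanged.
        return grad_output, None


class QuantLinear(torch.nn.Linear):
    """Linear layer whose weights and activations are fake-quantized in the forward pass."""

    def __init__(self, in_features, out_features, num_bits=8, **kwargs):
        super().__init__(in_features, out_features, **kwargs)
        self.num_bits = num_bits

    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, self.num_bits)
        x_q = FakeQuant.apply(x, self.num_bits)
        return torch.nn.functional.linear(x_q, w_q, self.bias)


if __name__ == "__main__":
    layer = QuantLinear(16, 4, num_bits=8)
    out = layer(torch.randn(2, 16))
    out.sum().backward()  # gradients reach the full-precision weights via the STE
    print(out.shape, layer.weight.grad.shape)
```

A layer like this could be swapped into a transformer block (e.g. the attention and feed-forward projections) to study how each component tolerates reduced precision during training.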

[1] Q8BERT: Quantized 8Bit BERT

[2] Training Deep Neural Networks with 8-bit Floating Point Numbers

[3] FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training

[4] https://scholar.google.com/citations?view_op=view_citation&hl=zh-CN&user=lOOmgEgAAAAJ&sortby=pubdate&citation_for_view=lOOmgEgAAAAJ:Zph67rFs4hoC

[5] https://github.com/facebookresearch/fairseq

[6] A White Paper on Neural Network Quantization (https://arxiv.org/pdf/2106.08295.pdf); https://zhuanlan.zhihu.com/p/462976274

[7] https://arxiv.org/abs/2008.05000

Step 1: Get familiar with FairSeq

https://github.com/facebookresearch/fairseq/tree/main/examples/layerdrop

https://github.com/facebookresearch/fairseq/tree/main/examples/roberta

https://github.com/facebookresearch/fairseq/tree/main/examples/scaling_nmt
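Before touching the training code, a quick sanity check is to load a pretrained checkpoint through fairseq's hub interface and extract features, roughly as shown in the roberta example README linked above. This is a minimal sketch; the choice of the roberta.base checkpoint and the example sentence are arbitrary, and the first call downloads the model.

```python
import torch

# Load a pretrained RoBERTa model via fairseq's torch.hub integration
# (requires fairseq to be installed; downloads the checkpoint on first use).
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()  # disable dropout for deterministic feature extraction

# BPE-encode a sentence and extract features from the last transformer layer.
tokens = roberta.encode('Hello world!')
features = roberta.extract_features(tokens)
print(tokens.tolist())
print(features.shape)  # (batch, sequence_length, hidden_size)
```

Once this works, the layerdrop and scaling_nmt examples are natural next steps for running an actual training job and identifying where a quantized layer could be inserted.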