Model compression remains an effective approach for accelerating neural network inference. Classic compression methods include pruning, network architecture search, and quantization. In this project, we will focus on channel pruning techniques and low-precision quantization.
Models can be sensitive, to varying degrees, to different forms of low-precision quantization (a sketch of two common forms follows the list below):
Log-based quantization
Binary or ternary quantization
Bounding-box binary or ternary quantization
Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point
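As a concrete illustration, here is a minimal NumPy sketch of two of these forms: log-based (power-of-two) quantization and ternary quantization. The 4-bit exponent window and the 0.7 threshold ratio are illustrative assumptions (the latter follows the common Ternary Weight Networks heuristic), not values prescribed by the papers above.

```python
import numpy as np

def log_quantize(w, bits=4):
    # Quantize magnitudes to the nearest power of two, keeping the sign.
    sign = np.sign(w)
    mag = np.where(w == 0, np.finfo(np.float32).tiny, np.abs(w))
    exp = np.round(np.log2(mag))
    # Clip exponents to a window representable with the given bit budget.
    max_exp = exp.max()
    exp = np.clip(exp, max_exp - 2 ** (bits - 1) + 1, max_exp)
    return sign * (2.0 ** exp)

def ternary_quantize(w, threshold_ratio=0.7):
    # Map weights to {-alpha, 0, +alpha}; threshold and scale use a common
    # heuristic (an assumption here, not taken from the papers above).
    delta = threshold_ratio * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.random.randn(256, 256).astype(np.float32)
print("log-quant MAE:    ", np.abs(w - log_quantize(w)).mean())
print("ternary-quant MAE:", np.abs(w - ternary_quantize(w)).mean())
```

The mean absolute error printed at the end gives a rough feel for how much each format distorts the original weights, which is one way a model's sensitivity to a quantization scheme can be probed before fine-tuning.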
Channel pruning aims to remove channels with low saliency values; dropping an output channel also shrinks the input of the layer that follows (see the sketch after the reference below).
Channel Pruning for Accelerating Very Deep Neural Networks
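For intuition, below is a simple magnitude-based (L1-norm) channel-pruning sketch. Note that the paper above selects channels with LASSO regression and least-squares reconstruction, so this saliency criterion, the function name, and the keep_ratio parameter are illustrative assumptions, not the paper's method.

```python
import numpy as np

def prune_channels(conv_w, next_w, keep_ratio=0.5):
    """Rank output channels of a conv layer by L1 norm and drop the rest.

    conv_w: weights of the pruned layer, shape (out_c, in_c, kH, kW).
    next_w: weights of the following layer, shape (out2, out_c, kH, kW),
            whose input channels must shrink to match.
    """
    saliency = np.abs(conv_w).reshape(conv_w.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(keep_ratio * conv_w.shape[0]))
    keep = np.sort(np.argsort(saliency)[::-1][:n_keep])  # keep top channels
    return conv_w[keep], next_w[:, keep]

conv_w = np.random.randn(64, 32, 3, 3)
next_w = np.random.randn(128, 64, 3, 3)
pruned_w, pruned_next = prune_channels(conv_w, next_w, keep_ratio=0.5)
print(pruned_w.shape, pruned_next.shape)  # (32, 32, 3, 3) (128, 32, 3, 3)
```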
Layer skipping, as its name suggests, skips (or reduces) the number of layers a neural network executes so that inference runs more efficiently.
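A minimal PyTorch sketch of static layer skipping is shown below; the SkipLayers wrapper and the keep-every-other-layer policy are illustrative assumptions rather than the method of any particular paper listed here (dynamic methods instead decide per input which layers or heads to execute).

```python
import torch
import torch.nn as nn

class SkipLayers(nn.Module):
    # Run only a fixed subset of the wrapped layers at inference time.
    def __init__(self, layers, keep_every=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.keep_every = keep_every

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i % self.keep_every == 0:  # layers in between are skipped
                x = layer(x)
        return x

# Example: keep every second block of a 12-block stack.
blocks = [nn.Sequential(nn.Linear(128, 128), nn.ReLU()) for _ in range(12)]
model = SkipLayers(blocks, keep_every=2)
y = model(torch.randn(4, 128))  # only 6 of the 12 blocks are executed
```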
TinyBERT: Distilling BERT for Natural Language Understanding
TernaryBERT: Distillation-aware Ultra-low Bit BERT
EBERT: Efficient BERT Inference with Dynamic Structured Pruning