Model Compression

Model compression remains an effective approach for accelerating neural network inference. Classic compression methods include a variety of pruning techniques, neural architecture search, and quantization. In this project, we will focus on channel pruning techniques and low-precision quantization.

Quantization

Models can be sensitive to different forms of low-precision quantization, so the bit width and quantization scheme must be chosen with accuracy in mind.
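
As a concrete illustration, below is a minimal sketch of symmetric per-tensor int8 quantization in PyTorch. The helpers `quantize_int8` and `dequantize` are illustrative names rather than part of any library API; real workflows would typically use per-channel scales or quantization-aware training.

```python
# Minimal sketch of symmetric per-tensor int8 quantization (illustrative only).
import torch

def quantize_int8(w: torch.Tensor):
    """Map a float tensor to int8 with a single symmetric scale."""
    scale = w.abs().max() / 127.0                     # largest magnitude maps to +/-127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor; the gap is the quantization error."""
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
err = (w - dequantize(q, scale)).abs().max().item()
print(f"max abs quantization error: {err:.4f}")
```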

Channel Pruning

Channel pruning aims to remove channels with low saliency scores, which shrinks both the parameter count and the per-layer computation.

Channel Pruning for Accelerating Very Deep Neural Networks
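
Below is a minimal sketch of channel pruning that scores each output channel by the L1 norm of its filter and keeps only the highest-scoring ones. This simple criterion stands in for the LASSO-based channel selection and feature-map reconstruction used in the paper above; `prune_conv_pair` is an illustrative helper, not a library function.

```python
# Minimal sketch of saliency-based channel pruning on a pair of Conv2d layers,
# using the L1 norm of each output filter as the saliency score.
import torch
import torch.nn as nn

def prune_conv_pair(conv: nn.Conv2d, next_conv: nn.Conv2d, keep_ratio: float = 0.5):
    """Keep the output channels of `conv` with the largest L1 norms and
    remove the matching input channels of `next_conv`."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    saliency = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per output channel
    keep = torch.argsort(saliency, descending=True)[:n_keep]

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()

    pruned_next = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                            stride=next_conv.stride, padding=next_conv.padding,
                            bias=next_conv.bias is not None)
    pruned_next.weight.data = next_conv.weight.data[:, keep].clone()
    if next_conv.bias is not None:
        pruned_next.bias.data = next_conv.bias.data.clone()
    return pruned, pruned_next

conv1, conv2 = nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(16, 32, 3, padding=1)
p1, p2 = prune_conv_pair(conv1, conv2, keep_ratio=0.5)
print(p2(p1(torch.randn(1, 3, 8, 8))).shape)   # torch.Size([1, 32, 8, 8])
```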

Layer skipping

Layer skipping, as its name suggests, skips (or reduces) the number of layers a neural network executes so that inference runs more efficiently.
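
A minimal sketch of static layer skipping over a stack of Transformer encoder layers is shown below. The `SkippableEncoder` module and the choice of which layers to keep are illustrative assumptions; dynamic, input-dependent skipping is also possible but not shown here.

```python
# Minimal sketch of layer skipping: at inference, only a chosen subset of
# encoder layers is executed. The kept indices here are a static choice.
import torch
import torch.nn as nn

class SkippableEncoder(nn.Module):
    def __init__(self, num_layers: int = 12, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor, keep=None) -> torch.Tensor:
        # `keep` lists the indices of layers to run; None means run all of them.
        for i, layer in enumerate(self.layers):
            if keep is None or i in keep:
                x = layer(x)
        return x

model = SkippableEncoder().eval()
x = torch.randn(2, 16, 256)                    # (batch, sequence, hidden)
with torch.no_grad():
    full = model(x)                            # run all 12 layers
    fast = model(x, keep=range(0, 12, 2))      # run every other layer only
```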

Knowledge Distillation on Transformers using Quantization and Pruning

TinyBERT: Distilling BERT for Natural Language Understanding

TernaryBERT: Distillation-aware Ultra-low Bit BERT


EBERT: Efficient BERT Inference with Dynamic Structured Pruning
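
These papers combine distillation with quantization or pruning; the common ingredient is a soft-target distillation loss that pulls the student's predictions toward the teacher's. Below is a minimal sketch of that logit-level loss only (TinyBERT and TernaryBERT additionally distill hidden states and attention maps); `distillation_loss` and the hyperparameters `T` and `alpha` are illustrative assumptions.

```python
# Minimal sketch of a soft-target distillation loss for classification logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    # KL divergence between the temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy against the hard labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student_logits = torch.randn(8, 2, requires_grad=True)   # e.g. a 2-class task
teacher_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```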