Model compression remains an effective approach for accelerating neural network inference. Classic compression methods include pruning, network architecture search, and quantization. In this project, we will focus on channel pruning techniques and low-precision quantization.
Models can be sensitive, to varying degrees, to different forms of low-precision quantization (a sketch of two common forms follows the list below):
Log-based quantization
Binary or ternary quantization
Bounding-box binary or ternary quantization
Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point
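As a concrete illustration, here is a minimal NumPy sketch of two of these forms: log-based (power-of-two) quantization and ternary quantization. The 4-bit exponent window and the 0.7 threshold ratio are illustrative assumptions (the latter follows the common Ternary Weight Networks heuristic), not values prescribed by the papers above.

```python
import numpy as np

def log_quantize(w, bits=4):
    # Quantize magnitudes to the nearest power of two, keeping the sign.
    sign = np.sign(w)
    mag = np.where(w == 0, np.finfo(np.float32).tiny, np.abs(w))
    exp = np.round(np.log2(mag))
    # Clip exponents to a window representable with the given bit budget.
    max_exp = exp.max()
    exp = np.clip(exp, max_exp - 2 ** (bits - 1) + 1, max_exp)
    return sign * (2.0 ** exp)

def ternary_quantize(w, threshold_ratio=0.7):
    # Map weights to {-alpha, 0, +alpha}; threshold and scale use a common
    # heuristic (an assumption here, not taken from the papers above).
    delta = threshold_ratio * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.random.randn(256, 256).astype(np.float32)
print("log-quant MAE:    ", np.abs(w - log_quantize(w)).mean())
print("ternary-quant MAE:", np.abs(w - ternary_quantize(w)).mean())
```

The mean absolute error printed at the end gives a rough feel for how much each format distorts the original weights, which is one way a model's sensitivity to a quantization scheme can be probed before fine-tuning.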
Channel pruning aims to remove channels with low saliency values; dropping an output channel also shrinks the input of the layer that follows (see the sketch after the reference below).
Channel Pruning for Accelerating Very Deep Neural Networks
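For intuition, below is a simple magnitude-based (L1-norm) channel-pruning sketch. Note that the paper above selects channels with LASSO regression and least-squares reconstruction, so this saliency criterion, the function name, and the keep_ratio parameter are illustrative assumptions, not the paper's method.

```python
import numpy as np

def prune_channels(conv_w, next_w, keep_ratio=0.5):
    """Rank output channels of a conv layer by L1 norm and drop the rest.

    conv_w: weights of the pruned layer, shape (out_c, in_c, kH, kW).
    next_w: weights of the following layer, shape (out2, out_c, kH, kW),
            whose input channels must shrink to match.
    """
    saliency = np.abs(conv_w).reshape(conv_w.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(keep_ratio * conv_w.shape[0]))
    keep = np.sort(np.argsort(saliency)[::-1][:n_keep])  # keep top channels
    return conv_w[keep], next_w[:, keep]

conv_w = np.random.randn(64, 32, 3, 3)
next_w = np.random.randn(128, 64, 3, 3)
pruned_w, pruned_next = prune_channels(conv_w, next_w, keep_ratio=0.5)
print(pruned_w.shape, pruned_next.shape)  # (32, 32, 3, 3) (128, 32, 3, 3)
```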
Layer skipping, as its name suggests, skips (or reduces) the number of layers a neural network executes so that inference runs more efficiently.
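A minimal PyTorch sketch of static layer skipping is shown below; the SkipLayers wrapper and the keep-every-other-layer policy are illustrative assumptions rather than the method of any particular paper listed here (dynamic methods instead decide per input which layers or heads to execute).

```python
import torch
import torch.nn as nn

class SkipLayers(nn.Module):
    # Run only a fixed subset of the wrapped layers at inference time.
    def __init__(self, layers, keep_every=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.keep_every = keep_every

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i % self.keep_every == 0:  # layers in between are skipped
                x = layer(x)
        return x

# Example: keep every second block of a 12-block stack.
blocks = [nn.Sequential(nn.Linear(128, 128), nn.ReLU()) for _ in range(12)]
model = SkipLayers(blocks, keep_every=2)
y = model(torch.randn(4, 128))  # only 6 of the 12 blocks are executed
```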
TinyBERT: Distilling BERT for Natural Language Understanding
TernaryBERT: Distillation-aware Ultra-low Bit BERT
EBERT: Efficient BERT Inference with Dynamic Structured Pruning