Background

Previous research on accelerating computer vision networks has mainly focused on Convolutional Neural Networks, although many of the existing optimization techniques could be ‘transferred’ to Vision Transformers (ViTs). A few papers have looked at how Transformers can be quantized to low precision [1], but more exploration is needed of how various forms of low-bitwidth quantization can be applied to ViTs.

Project aim

The aim of the project is to test how different components in a vision transformer might be sensitive to different forms of low-precision quantization.
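To make the notion of per-component sensitivity concrete, the sketch below (a rough illustration, not a prescribed method) applies a simple symmetric uniform fake-quantizer to the weights of selected linear layers in a PyTorch ViT. The `name_filter` convention ('attn' vs. 'mlp') assumes timm-style module names.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric uniform fake-quantization: quantize then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

@torch.no_grad()
def quantize_component(model: torch.nn.Module, name_filter: str, bits: int = 4) -> None:
    """Fake-quantize only the Linear weights whose module name contains
    `name_filter`, e.g. 'attn' for attention blocks or 'mlp' for MLP blocks,
    leaving the rest of the network in full precision."""
    for name, module in model.named_modules():
        if name_filter in name and isinstance(module, torch.nn.Linear):
            module.weight.copy_(fake_quantize(module.weight, bits))
```

Sensitivity could then be probed by quantizing one component type at a time and measuring the accuracy drop on a held-out validation set.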

The student would also have to consider integrating quantization with knowledge distillation for accuracy improvements [2].
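One plausible way to combine the two, sketched here under the assumption of a full-precision teacher and a quantized student (the temperature and loss weighting are illustrative, not taken from [2]):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy on the hard labels plus KL divergence between the
    quantized student's and the full-precision teacher's softened outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```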

I expect the student to run state-of-the-art vision transformers (e.g. DeiT, Swin-T, and Pyramid Vision Transformers).
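All three families have pretrained ImageNet checkpoints in the timm library; a minimal loading sketch is below (the model-name strings are assumptions and should be checked against the installed timm version).

```python
import timm

# Registered timm model names (verify against your timm version).
model_names = [
    "deit_tiny_patch16_224",          # DeiT
    "swin_tiny_patch4_window7_224",   # Swin-T
    "pvt_v2_b0",                      # Pyramid Vision Transformer v2
]

models = {name: timm.create_model(name, pretrained=True).eval()
          for name in model_names}
```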

The network-dataset pairs should be evaluated on popular vision benchmarks (e.g. CIFAR, ImageNet).
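A bare-bones evaluation loop might look like the following. It assumes the model has already been fine-tuned on the target dataset (here CIFAR-10, with the classification head resized to 10 classes), so only the evaluation mechanics are shown.

```python
import timm
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

transform = T.Compose([
    T.Resize(224),                     # the ViTs here expect 224x224 inputs
    T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
testset = torchvision.datasets.CIFAR10(root="./data", train=False,
                                       download=True, transform=transform)
loader = DataLoader(testset, batch_size=128, shuffle=False)

# Assumes this model has been fine-tuned on CIFAR-10 beforehand;
# `num_classes=10` replaces the ImageNet head with a 10-way classifier.
model = timm.create_model("deit_tiny_patch16_224", pretrained=True,
                          num_classes=10).eval()

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Top-1 accuracy: {correct / total:.3f}")
```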

Skill requirements

References

[1] https://arxiv.org/abs/2210.06707
[2] https://arxiv.org/abs/2209.02432