Background

Previous research on accelerating computer vision networks has mainly focused on Convolutional Neural Networks, although many of the existing optimization techniques could be ‘transferred’ to Vision Transformers (ViTs). A few papers have looked at how Transformers can be quantized to low precision [1], but more exploration is needed of how various forms of low-bitwidth quantization can be applied to ViTs.

Project aim

The aim of the project is to test how different components in a vision transformer might be sensitive to different forms of low-precision quantization.
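To make the notion of per-component sensitivity concrete, the sketch below (a rough illustration, not a prescribed method) applies a simple symmetric uniform fake-quantizer to the weights of selected linear layers in a PyTorch ViT. The `name_filter` convention ('attn' vs. 'mlp') assumes timm-style module names.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric uniform fake-quantization: quantize then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

@torch.no_grad()
def quantize_component(model: torch.nn.Module, name_filter: str, bits: int = 4) -> None:
    """Fake-quantize only the Linear weights whose module name contains
    `name_filter`, e.g. 'attn' for attention blocks or 'mlp' for MLP blocks,
    leaving the rest of the network in full precision."""
    for name, module in model.named_modules():
        if name_filter in name and isinstance(module, torch.nn.Linear):
            module.weight.copy_(fake_quantize(module.weight, bits))
```

Sensitivity could then be probed by quantizing one component type at a time and measuring the accuracy drop on a held-out validation set.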

The student would also have to consider integrating quantization with knowledge distillation for accuracy improvements [2].
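One plausible way to combine the two, sketched here under the assumption of a full-precision teacher and a quantized student (the temperature and loss weighting are illustrative, not taken from [2]):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy on the hard labels plus KL divergence between the
    quantized student's and the full-precision teacher's softened outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```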

I expect the student to run state-of-the-art vision transformers (e.g. DeiT, Swin-T, and Pyramid Vision Transformers).
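All three families have pretrained ImageNet checkpoints in the timm library; a minimal loading sketch is below (the model-name strings are assumptions and should be checked against the installed timm version).

```python
import timm

# Registered timm model names (verify against your timm version).
model_names = [
    "deit_tiny_patch16_224",          # DeiT
    "swin_tiny_patch4_window7_224",   # Swin-T
    "pvt_v2_b0",                      # Pyramid Vision Transformer v2
]

models = {name: timm.create_model(name, pretrained=True).eval()
          for name in model_names}
```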

The network-dataset pairs should be evaluated on popular vision benchmarks (e.g. CIFAR, ImageNet).
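A bare-bones evaluation loop might look like the following. It assumes the model has already been fine-tuned on the target dataset (here CIFAR-10, with the classification head resized to 10 classes), so only the evaluation mechanics are shown.

```python
import timm
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

transform = T.Compose([
    T.Resize(224),                     # the ViTs here expect 224x224 inputs
    T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
testset = torchvision.datasets.CIFAR10(root="./data", train=False,
                                       download=True, transform=transform)
loader = DataLoader(testset, batch_size=128, shuffle=False)

# Assumes this model has been fine-tuned on CIFAR-10 beforehand;
# `num_classes=10` replaces the ImageNet head with a 10-way classifier.
model = timm.create_model("deit_tiny_patch16_224", pretrained=True,
                          num_classes=10).eval()

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Top-1 accuracy: {correct / total:.3f}")
```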

Skill requirements

References

[1] https://arxiv.org/abs/2210.06707
[2] https://arxiv.org/abs/2209.02432