A few papers have studied how Transformers can be quantized to low precision [1], but more exploration is needed to understand how various forms of low-bitwidth quantization can be applied to Transformers. Our team has looked at how to emulate the effect of various quantization schemes on Transformers in PyTorch. This summer internship aims to extend our system to more models, number systems, datasets and learning tasks.

The following quantization methods would be implemented in this project:

Log-based quantization
Binary or ternary quantization
Bounding-box binary or ternary quantization
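
As a rough illustration of what the first two methods do to individual values, here is a minimal sketch in plain Python. It is not the team's implementation: the function names, the exponent range, and the 0.7 threshold ratio (a commonly used heuristic for ternary weights) are all assumptions made for this example. In PyTorch the same element-wise logic would be written with tensor operations. The bounding-box variant is not sketched because the project description does not define it.

```python
import math

def log2_quantize(x, min_exp=-8, max_exp=0):
    """Snap x to a signed power of two, rounding in the log domain.

    min_exp and max_exp are hypothetical bounds on the stored exponent.
    """
    if x == 0.0:
        return 0.0
    exp = round(math.log2(abs(x)))          # nearest integer exponent
    exp = max(min_exp, min(max_exp, exp))   # clamp to the representable range
    return math.copysign(2.0 ** exp, x)

def ternary_quantize(weights, threshold_ratio=0.7):
    """Map each weight to one of {-alpha, 0, +alpha}.

    Weights whose magnitude is at most delta become zero; alpha is the mean
    magnitude of the remaining weights. threshold_ratio is an assumed heuristic.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    kept = [abs(w) for w in weights if abs(w) > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return [math.copysign(alpha, w) if abs(w) > delta else 0.0 for w in weights]

weights = [0.9, -0.3, 0.05, -1.2]
log_q = [log2_quantize(w) for w in weights]
tern_q = ternary_quantize(weights)
print(log_q)
print(tern_q)
```

Binary quantization follows the same pattern as the ternary case without the zero level, keeping only the sign of each weight and a shared scale.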

The student would also have to consider integrating quantization with CUDA functions for run-time performance improvements [2].

Skill requirements

Past experience with PyTorch or similar ML frameworks.
Past experience with CUDA is preferred but not necessary.
Past experience with high-level ML frameworks, such as Hugging Face and PyTorch Lightning, is preferred but not necessary.