Vision Transformers are now widely used as the backbone for a range of vision tasks.
Training data-efficient image transformers & distillation through attention
PVT v2: Improved Baselines with Pyramid Vision Transformer
However, it remains an open question how to find an efficient Vision Transformer architecture guided by general design principles. For instance, researchers have already studied how to scale ResNets:
Revisiting ResNets: Improved Training and Scaling Strategies
There is also work on training-free Neural Architecture Search (NAS) for Vision Transformers:
Auto-scaling Vision Transformers without Training
We have also studied a similar problem for Transformers on language tasks:
Wide Attention Is The Way Forward For Transformers?
This project will look at how to design a scaling strategy for Vision Transformers. We will focus on a specific design dimension: building wide Vision Transformers with a mix of different patching strategies, kernel sizes, and attention mechanisms.
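As a rough illustration of this design dimension, a "wide" block could mix several parallel branches (attention heads of different widths alongside a convolutional branch) and merge their outputs. The sketch below is a hypothetical PyTorch example under assumed names and dimensions, not the project's actual design:

```python
import torch
import torch.nn as nn


class WideBranchBlock(nn.Module):
    """Hypothetical wide transformer block: parallel self-attention branches
    with different head counts plus a depthwise-convolution branch, with the
    branch outputs averaged into a single residual update. All module names,
    branch choices, and sizes here are illustrative assumptions."""

    def __init__(self, dim=64, heads=(2, 4), conv_kernel=3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Attention branches of different widths (head counts)
        self.attn_branches = nn.ModuleList(
            nn.MultiheadAttention(dim, h, batch_first=True) for h in heads
        )
        # Depthwise conv branch mixes neighbouring tokens along the sequence
        self.conv = nn.Conv1d(dim, dim, conv_kernel,
                              padding=conv_kernel // 2, groups=dim)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm(x)
        outs = [attn(h, h, h, need_weights=False)[0]
                for attn in self.attn_branches]
        # Conv1d expects (batch, channels, tokens), so transpose around it
        outs.append(self.conv(h.transpose(1, 2)).transpose(1, 2))
        x = x + sum(outs) / len(outs)  # residual over the averaged branches
        return x + self.mlp(x)


tokens = torch.randn(2, 16, 64)   # batch of 2, 16 patch tokens, width 64
out = WideBranchBlock()(tokens)
print(out.shape)                  # torch.Size([2, 16, 64])
```

Averaging branch outputs keeps the block's width fixed regardless of how many branches are added; concatenation followed by a projection would be an equally plausible merge strategy to explore.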
The candidate should be experienced in Object-Oriented Programming in Python. Ideally, the candidate should have experience with, or at least be willing to learn, various Machine Learning frameworks in Python (such as PyTorch and PyTorch Lightning).