Background

Vision transformers are now widely used as the backbone for a variety of vision tasks:

Training data-efficient image transformers & distillation through attention

DeiT III: Revenge of the ViT

PVT v2: Improved Baselines with Pyramid Vision Transformer

However, it remains interesting to ask how to find an efficient vision transformer architecture guided by general design principles. For instance, researchers have already studied how to scale ResNets:

Revisiting ResNets: Improved Training and Scaling Strategies

There is also work on training-free neural architecture search (NAS) for Vision Transformers:

Auto-scaling Vision Transformers without Training

We have also studied a similar problem for transformers on language tasks:

Wide Attention Is The Way Forward For Transformers?

Project aim

This project will investigate how to design a scaling strategy for Vision Transformers. We will focus on a specific design dimension: building wide vision transformers that mix different patching strategies, kernel sizes, and attention mechanisms.
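To make the "mixed patching" idea concrete, here is a minimal NumPy sketch of a wide, multi-branch tokenizer: the same image is patchified at several patch sizes in parallel, each branch is linearly embedded to a shared dimension, and the token sets are concatenated. All function names, sizes, and the random embedding weights are illustrative assumptions, not part of the project specification.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector of length p*p*C."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image must divide evenly into patches"
    img = img.reshape(H // p, p, W // p, p, C)
    img = img.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return img.reshape(-1, p * p * C)         # (num_patches, p*p*C)

def wide_tokenize(img, patch_sizes, embed_dim, rng):
    """Embed each branch's patches to embed_dim, then concatenate all
    tokens, giving the transformer a mixed-granularity token set."""
    tokens = []
    for p in patch_sizes:
        patches = patchify(img, p)
        # Branch-specific random projection stands in for a learned embedding.
        W = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
        tokens.append(patches @ W)
    return np.concatenate(tokens, axis=0)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
toks = wide_tokenize(img, patch_sizes=[8, 16], embed_dim=64, rng=rng)
print(toks.shape)  # (20, 64): 16 tokens from 8x8 patches + 4 from 16x16
```

In a full model each branch could also use a different convolutional kernel size or attention mechanism before the token sets are merged; the sketch only shows the patching dimension of that design space.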

Skill requirements

The candidate should be experienced in object-oriented programming in Python. Ideally, the candidate should have experience with, or at least be willing to learn, machine learning frameworks in Python such as PyTorch and PyTorch Lightning.