There is now a growing number of attention mechanisms aimed at reducing the $N^2$ computational overhead of standard self-attention. A dizzying number of X-former models have been proposed (Reformer, Linformer, Performer, Longformer).
The idea is to search over these different transformer attention mechanisms using Neural Architecture Search (NAS) methods, to find an efficient attention mechanism suited to the given task and data.
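As a rough illustration of what such a search could look like, here is a minimal random-search sketch over a hypothetical space of per-layer attention variants. All names (`ATTENTION_VARIANTS`, `sample_candidate`, the `proj_dim` hyperparameter, the `evaluate` callback) are illustrative assumptions, not from any existing codebase.

```python
import random

# Hypothetical search space: one attention variant per layer, plus an
# efficiency-related hyperparameter (e.g. a Linformer-style projection size).
ATTENTION_VARIANTS = ["softmax", "linformer", "performer", "reformer", "longformer"]

def sample_candidate(num_layers, rng):
    """Sample one architecture: an attention variant for every layer."""
    return {
        "attention_per_layer": [rng.choice(ATTENTION_VARIANTS) for _ in range(num_layers)],
        "proj_dim": rng.choice([32, 64, 128]),
    }

def random_search(evaluate, num_layers=4, trials=10, seed=0):
    """Plain random search: sample candidates, keep the best-scoring one.
    `evaluate` is assumed to return a scalar score (e.g. dev accuracy
    under a fixed compute budget)."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        cand = sample_candidate(num_layers, rng)
        score = evaluate(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

Random search is only the simplest baseline; the same candidate encoding could feed an evolutionary or weight-sharing NAS method instead.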
A previous attempt (in the searchformer codebase) was to combine additive attention with the standard attention:
Fastformer: Additive Attention Can Be All You Need
But it seems that this Fastformer does not achieve good performance on a standard translation task (which made me mad).
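For context, the core trick in Fastformer-style additive attention is to replace pairwise token-token interactions with a per-token scalar score and a softmax-weighted pooling over the sequence, dropping the cost from $N^2$ to linear in $N$. A simplified sketch of that pooling step (not the full Fastformer architecture; `w` is a learned scoring vector here):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention_pool(X, w):
    """Additive-attention pooling: O(N*d) instead of O(N^2 * d).
    X: (N, d) token representations; w: (d,) learned scoring vector.
    Each token gets one scalar score; softmax over tokens gives weights
    used to pool the sequence into a single global vector."""
    scores = X @ w / np.sqrt(X.shape[-1])  # (N,) one scalar per token
    alpha = softmax(scores)                # attention weights over tokens
    return alpha @ X                       # (d,) weighted sum of tokens
```

Because every token is compressed into one global vector before interacting with the others, it is plausible that this loses the fine-grained alignments that translation depends on, which would be consistent with the weak translation results above.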
I have now come across a new codebase:
GitHub - google-research/long-range-arena: Long Range Arena for Benchmarking Efficient Transformers
Long Range Arena: A Benchmark for Efficient Transformers
An obvious next step is to implement a NAS algorithm on top of this benchmark!
###################