There is now a growing number of attention mechanisms aimed at reducing the $O(N^2)$ computation overhead of standard self-attention. A dizzying number of X-former models have been proposed (Reformer, Linformer, Performer, Longformer).
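To make the $O(N^2)$ bottleneck concrete, here is a minimal NumPy sketch contrasting standard attention with a Linformer-style low-rank variant. The projection matrices `E` and `F` follow the Linformer idea of projecting keys/values down to $k \ll N$ positions; the code is illustrative only, not taken from any of these codebases.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard attention: the N x N score matrix is the O(N^2) bottleneck.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (N, N)
    return softmax(scores) @ V           # (N, d)

def linformer_attention(Q, K, V, E, F):
    # Linformer-style: project keys/values down to k << N positions,
    # so the score matrix is only N x k -- O(N * k) instead of O(N^2).
    d = Q.shape[-1]
    K_proj = E @ K                       # (k, d)
    V_proj = F @ V                       # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)   # (N, k)
    return softmax(scores) @ V_proj      # (N, d)

N, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
E, F = (rng.standard_normal((k, N)) / np.sqrt(N) for _ in range(2))
print(full_attention(Q, K, V).shape, linformer_attention(Q, K, V, E, F).shape)
```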


The idea: can we search over these different transformer attention mechanisms using Neural Architecture Search (NAS) methods, to find an efficient attention mechanism that suits the given task and data?
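One way to frame this is as a discrete search space with one attention choice per layer. The sketch below is a hypothetical encoding of that space (the names are just labels, and `sample_architecture` is a made-up helper), showing the simplest possible sampler a NAS method would build on.

```python
import numpy as np

# Hypothetical discrete search space: one attention mechanism per layer.
# A real implementation would map each name to an attention module.
ATTENTION_CHOICES = ["full", "linformer", "performer", "longformer"]

def sample_architecture(rng, num_layers):
    # Simplest NAS baseline: uniform random sampling of one choice per layer.
    return [ATTENTION_CHOICES[rng.integers(len(ATTENTION_CHOICES))]
            for _ in range(num_layers)]

rng = np.random.default_rng(0)
print(sample_architecture(rng, num_layers=6))
```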

A previous attempt (in the searchformer codebase) combined additive attention with standard attention.

Fastformer: Additive Attention Can Be All You Need
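For reference, here is a simplified single-head sketch of Fastformer-style additive attention: the sequence is summarized into global query/key vectors with learned per-position weights, so the cost is $O(N \cdot d)$ rather than $O(N^2)$. The output projection and residual from the paper are omitted, and `w_q`, `w_k` are the learned scoring vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(Q, K, V, w_q, w_k):
    # Fastformer-style additive attention (single head, simplified):
    # no N x N pairwise scores, only global summary vectors -- O(N * d).
    d = Q.shape[-1]
    alpha = softmax(Q @ w_q / np.sqrt(d))  # (N,) weights over positions
    q_global = alpha @ Q                   # (d,) global query vector
    P = K * q_global                       # (N, d) query-modulated keys
    beta = softmax(P @ w_k / np.sqrt(d))   # (N,)
    k_global = beta @ P                    # (d,) global key vector
    return V * k_global                    # (N, d) modulated values

N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
w_q, w_k = rng.standard_normal(d), rng.standard_normal(d)
print(additive_attention(Q, K, V, w_q, w_k).shape)
```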

But Fastformer does not seem to achieve good performance on a standard translation task (which made me mad).

Now I have come across a new code base:

GitHub - google-research/long-range-arena: Long Range Arena for Benchmarking Efficient Transformers

Long Range Arena: A Benchmark for Efficient Transformers

An obvious next step is to implement a NAS algorithm in this code base!
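As a starting point, the loop could be as simple as random search over the attention choices, scored on an LRA task. This is a hypothetical sketch: `evaluate_on_lra_task` is a placeholder I made up, not part of the long-range-arena API, and the returned scores here are fake.

```python
import numpy as np

ATTENTION_CHOICES = ["full", "linformer", "performer", "longformer"]

def evaluate_on_lra_task(architecture):
    # Placeholder: a real version would build a transformer with the chosen
    # attention per layer, train it on an LRA task (e.g. ListOps), and
    # return validation accuracy. The score below is fake.
    return np.random.default_rng(hash(tuple(architecture)) % 2**32).random()

def random_search(num_trials=20, num_layers=6, seed=0):
    rng = np.random.default_rng(seed)
    best_arch, best_score = None, -np.inf
    for _ in range(num_trials):
        arch = [ATTENTION_CHOICES[rng.integers(len(ATTENTION_CHOICES))]
                for _ in range(num_layers)]
        score = evaluate_on_lra_task(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

print(random_search())
```

Random search is only a baseline; a weight-sharing or evolutionary method would slot into the same loop by replacing the sampling step.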
