👀 Problem and Motivation

Multi-head attention (MHA) is the core mechanism of the Transformer: multiple attention heads run in parallel so that the model can jointly attend to information from different representation subspaces at different positions.

Autoregressive decoder inference, however, becomes a severe bottleneck for Transformer models due to the memory-bandwidth overhead of loading all attention keys and values at every decoding step. Multi-query attention (MQA) was proposed to sharply reduce this memory bandwidth by using multiple query heads but only a single key and value head, eliminating the need to repeatedly load a full set of keys and values.
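To make the bandwidth argument concrete, the sketch below estimates the KV-cache traffic per decoded token as a function of the number of key-value heads. The model configuration (layers, heads, head dimension, fp16 storage) is hypothetical and chosen purely for the arithmetic.

```python
# Hypothetical model configuration, chosen only to illustrate the arithmetic.
num_layers = 32
num_heads = 32          # query heads
head_dim = 128
bytes_per_value = 2     # fp16 storage

def kv_cache_bytes_per_token(num_kv_heads: int) -> int:
    # Keys and values (factor of 2) are cached for every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

print(kv_cache_bytes_per_token(num_heads))  # MHA: 32 KV heads -> 524,288 bytes/token
print(kv_cache_bytes_per_token(1))          # MQA:  1 KV head  ->  16,384 bytes/token
```

With 32 query heads, MQA cuts the keys and values loaded per decoding step by a factor of 32 in this example.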

Yet MQA can lead to quality degradation and training instability. Grouped-query attention (GQA) emerged as a generalization of MQA that uses an intermediate number of key-value heads (more than one, but fewer than the number of query heads). GQA has been shown to achieve quality close to multi-head attention at a speed comparable to MQA.
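As a functional reference for what the hardware would have to implement, here is a minimal software sketch of grouped-query attention in plain PyTorch (this is not MASE code, and all module and shape names are our own). Setting num_kv_heads equal to num_heads recovers MHA, while num_kv_heads = 1 recovers MQA.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal grouped-query attention: num_kv_heads key/value heads are shared
    among num_heads query heads (num_kv_heads == num_heads gives MHA,
    num_kv_heads == 1 gives MQA)."""

    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each key-value head serves num_heads // num_kv_heads query heads.
        group = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        # Causal masking for autoregressive decoding.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
```

The repeat_interleave step makes the sharing explicit: each key-value head is broadcast to a contiguous group of query heads.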

In this project, we are interested in mapping both MQA and GQA onto FPGAs, leveraging MASE, an in-house framework that we have built.

References:
- Grouped self-attention mechanism for a memory-efficient Transformer
- Fast Transformer Decoding: One Write-Head is All You Need
- Fast Prototyping Next-Generation Accelerators for New ML Models...


💭 Proposal

The student would rely on our existing hardware stack in MASE to implement GQA and MQA. The student would also need to benchmark the hardware performance of these methods.
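One possible software-side reference point for the benchmarking (a rough sketch only, not MASE tooling; the timing harness and tensor sizes are hypothetical) is to time the GroupedQueryAttention sketch above at different numbers of key-value heads and compare against the FPGA measurements.

```python
import time
import torch

# Reuses the GroupedQueryAttention sketch defined above.

def time_forward(module, x, iters: int = 50) -> float:
    # Average wall-clock time of a forward pass, as a rough software baseline.
    with torch.no_grad():
        module(x)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
        return (time.perf_counter() - start) / iters

x = torch.randn(1, 512, 512)  # (batch, sequence length, d_model); hypothetical sizes
for num_kv_heads in (8, 2, 1):  # MHA, GQA, MQA
    attn = GroupedQueryAttention(d_model=512, num_heads=8, num_kv_heads=num_kv_heads)
    print(num_kv_heads, time_forward(attn, x))
```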


🛫 Plan