👀 Problem

Training Large Language Models (LLMs) is now a challenging task because of the amount of compute resources they require. However, there are several scenarios where fine-tuning an LLM demonstrates better performance than its zero-shot performance (zero-shot meaning the LLM is not trained further and is used directly on the downstream task). Parameter-Efficient Fine-Tuning (PEFT) has received a lot of attention and is generally seen as the next-generation technique for fine-tuning LLMs with limited resources. In this project, we are interested in investigating how PEFT can be designed with FPGAs as an accelerator so that it achieves the best possible on-device performance.

AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning

QLoRA: Efficient Finetuning of Quantized LLMs


💭 Proposal

Base elements

The student would have to investigate adapter-based PEFT methods; one important method to consider is the LoRA framework (Low-Rank Adaptation). The idea is to approximate the following computation

$$ y=Wx $$

but to keep $W$ frozen and train only a low-rank product $AB$:

$$ y = (W+AB)x = Wx + ABx $$

Both $A$ and $B$ are low-rank factors in this case: if $A$ has shape $d \times r$ and $B$ has shape $r \times k$, the rank $r$ is chosen to be much smaller than the dimensions of $W$. Since the trainable components are much smaller than $W$, they require far fewer compute and memory resources to train.
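As a concrete illustration, below is a minimal PyTorch-style sketch of a LoRA-augmented linear layer in which $W$ is frozen and only $A$ and $B$ are trained; the class name, default rank, and initialisation are illustrative assumptions rather than a reference to any particular implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer y = Wx + ABx with a frozen W and trainable low-rank factors A, B."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        # Pretrained weight W (out_features x in_features), frozen during fine-tuning.
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Low-rank factors: A is (out_features x r), B is (r x in_features),
        # initialised so that the update AB is zero at the start of fine-tuning.
        self.A = nn.Parameter(torch.randn(out_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Never materialise the dense product AB: compute Bx first, then A(Bx).
        return x @ self.weight.T + (x @ self.B.T) @ self.A.T


# Only A and B receive gradients, so the optimiser state also stays small.
layer = LoRALinear(in_features=4096, out_features=4096, rank=8)
opt = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-4)
```

For a 4096 × 4096 layer with rank 8, this reduces the trainable parameters from roughly 16.8 million to about 65 thousand per layer, which is the gap a hardware accelerator can exploit.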

We are interested in mapping the training of $A$ and $B$ onto hardware, utilising a systolic-array-based hardware accelerator, as sketched below. This also opens up a new design dimension: the arithmetic system design of the systolic-array core.
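As a starting point for thinking about that mapping, here is a minimal, software-only sketch of an output-stationary systolic array computing a matrix product such as $Bx$ or $A(Bx)$ cycle by cycle; the function name, array organisation, and cycle model are illustrative assumptions, not a proposed hardware design.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-level model of an M x N output-stationary systolic array computing A @ B."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))      # each PE (i, j) accumulates one output C[i, j]
    a_reg = np.zeros((M, N))  # operand of A currently held in each PE
    b_reg = np.zeros((M, N))  # operand of B currently held in each PE
    for t in range(M + N + K - 2):        # cycles until the skewed wavefront drains
        # Operands move one PE per cycle: A to the right, B downwards.
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # Inject skewed inputs at the array boundary (row i and column j are
        # delayed by i and j cycles respectively so matching operands meet).
        for i in range(M):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        # Every PE performs one multiply-accumulate per cycle.
        C += a_reg * b_reg
    return C

# Sanity check against a dense matrix product.
lhs, rhs = np.random.randn(6, 5), np.random.randn(5, 4)
assert np.allclose(systolic_matmul(lhs, rhs), lhs @ rhs)
```

Mapping only the low-rank factors this way keeps the trainable GEMMs small (their inner dimension is the rank $r$), and the per-PE multiply-accumulate is exactly where the arithmetic design of the systolic-array core would be explored on the FPGA.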

Extensions


🛫 Plan