Prompting refers to the idea of using a customized function to help large language models become effective on downstream tasks.

Consider a pre-trained language model working on a prediction task that produces a class-wise prediction likelihood $p(y|x_i)$. We can apply a prompting function $f_{prompt}$ so that the prediction task becomes $p(y|f_{prompt}(x_i))$.
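For concreteness, here is a minimal sketch of what a plain-text prompting function might look like. The template, label words, and example input below are all hypothetical; `verbalizer` stands in for whatever mapping from classes to label words the downstream task defines.

```python
# A minimal sketch of a plain-text prompting function f_prompt.
# The template and verbalizer below are hypothetical examples.

def f_prompt(x_i: str) -> str:
    """Wrap the raw input x_i into a cloze-style template.

    The model fills the [MASK] slot, and we read off
    p(y | f_prompt(x_i)) from the likelihood of each label word.
    """
    return f"Review: {x_i} Overall, it was [MASK]."

# Verbalizer: map each class y to a label word the model can predict.
verbalizer = {"positive": "great", "negative": "terrible"}

x_i = "The movie had stunning visuals but a hollow plot."
prompted_input = f_prompt(x_i)
# p(y | f_prompt(x_i)) would be obtained by comparing the model's
# likelihood of "great" vs. "terrible" in the [MASK] position.
```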

A few great papers from 2020 and 2021 describe in detail how prompting helps boost the performance of large language models such as GPT-3:

There is a first paper that looked at how to perform adversarial attacks and backdoor attacks on prompting functions:

https://ui.adsabs.harvard.edu/abs/2022arXiv220310714Y/abstract

These attacks, however, are not ideal:

  1. The backdoor triggers are extremely obvious.
  2. They mostly focus on plain-text prompting functions and do not explore differentiable prompts (see the sketch after this list).
  3. They mainly focus on designing an inherently malicious prompt by directly backdooring the pre-trained language model. This is very different from our idea of backdooring the prompting function.
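To make the distinction in point 2 concrete, here is a minimal sketch of a differentiable (soft) prompt, assuming a PyTorch-style model that consumes input embeddings directly. All names and dimensions are illustrative, not from any of the papers above.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """A differentiable prompt: learnable vectors prepended to the input.

    Unlike a plain-text prompt, these prompt vectors live in embedding
    space and are trained by gradient descent, so f_prompt itself has
    parameters that an attacker could backdoor. Sizes are illustrative.
    """

    def __init__(self, prompt_length: int = 10, embed_dim: int = 768):
        super().__init__()
        self.prompt_embeddings = nn.Parameter(
            torch.randn(prompt_length, embed_dim) * 0.02
        )

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch_size = input_embeds.size(0)
        prompt = self.prompt_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
        # f_prompt(x_i): concatenate learned prompt vectors with the input.
        return torch.cat([prompt, input_embeds], dim=1)

# Usage: a frozen language model would consume the prompted embeddings,
# and only the prompt parameters are updated during training.
soft_prompt = SoftPrompt()
x = torch.randn(2, 32, 768)   # a batch of input embeddings
prompted = soft_prompt(x)     # shape: (2, 42, 768)
```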