Adversarial robustness benchmark

PromptBench can evaluate the robustness of LLMs to adversarial prompts. More information can be found in the paper "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts".

Please contact us if you would like the results of your models to be shown on this leaderboard.
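
Each table entry has the form mean±standard deviation. This section does not define the metric itself. The sketch below assumes the entries are performance drop rates (PDR) as described in the PromptBench paper, that is, the relative loss in task performance when a clean prompt is replaced by its adversarial version. Under that assumption a value near 0 means the attack had little effect, and a negative value means the attacked prompt scored slightly higher than the clean one. The function names, the example scores, and the choice of population standard deviation are illustrative assumptions, not part of PromptBench's API.

```python
from statistics import mean, pstdev


def performance_drop_rate(clean_score: float, attacked_score: float) -> float:
    """Relative performance drop: 1 - attacked / clean (assumed PDR definition)."""
    if clean_score == 0:
        raise ValueError("clean_score must be non-zero to compute a relative drop")
    return 1.0 - attacked_score / clean_score


def summarize(pdrs: list[float]) -> str:
    """Format a list of PDR values as 'mean±std', matching the table layout.

    Population standard deviation is an assumption; the source does not say
    whether population or sample standard deviation is reported.
    """
    return f"{mean(pdrs):.2f}±{pstdev(pdrs):.2f}"


if __name__ == "__main__":
    # Hypothetical (clean, attacked) accuracy pairs for one model on one dataset.
    score_pairs = [(0.90, 0.45), (0.80, 0.80), (0.50, 0.51)]
    pdrs = [performance_drop_rate(clean, attacked) for clean, attacked in score_pairs]
    print(summarize(pdrs))
```

Averaging over prompts, attacks, or datasets in this way would explain why the standard deviations in the tables are often as large as the means: individual prompts can differ widely in how much an attack degrades them.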

Model	SST-2	CoLA	QQP	MRPC	MNLI	QNLI	RTE	WNLI	MMLU	SQuAD v2	IWSLT	UN Multi	Math	Avg
T5-Large	0.04±0.11	0.16±0.19	0.09±0.15	0.17±0.26	0.08±0.13	0.33±0.25	0.08±0.13	0.13±0.14	0.11±0.18	0.05±0.12	0.14±0.17	0.13±0.14	0.24±0.21	0.13±0.19
Vicuna-13b-v1.3	0.83±0.26	0.81±0.22	0.51±0.41	0.52±0.40	0.67±0.38	0.87±0.19	0.78±0.23	0.78±0.27	0.41±0.24	-	-	-	-	0.69±0.34
Llama2-13b-chat	0.24±0.33	0.38±0.32	0.59±0.33	0.84±0.27	0.32±0.32	0.51±0.39	0.68±0.39	0.73±0.37	0.28±0.24	-	-	-	-	0.51±0.39
UL2	0.03±0.12	0.13±0.20	0.02±0.04	0.06±0.10	0.06±0.12	0.05±0.11	0.02±0.04	0.04±0.03	0.05±0.11	0.10±0.18	0.15±0.11	0.05±0.05	0.21±0.21	0.08±0.14
ChatGPT	0.17±0.29	0.21±0.31	0.16±0.30	0.22±0.29	0.13±0.18	0.25±0.31	0.09±0.13	0.14±0.12	0.14±0.18	0.22±0.28	0.17±0.26	0.12±0.18	0.33±0.31	0.18±0.26
GPT-4	0.24±0.38	0.13±0.23	0.16±0.38	0.04±0.06	-0.03±0.02	0.05±0.23	0.03±0.05	0.04±0.04	0.04±0.04	0.27±0.31	0.07±0.14	-0.02±0.01	0.02±0.18	0.08±0.21

Prompt Type	SST-2	CoLA	QQP	MRPC	MNLI	QNLI	RTE	WNLI	MMLU	SQuAD v2	IWSLT	UN Multi	Math	Avg
ZS-task	0.31±0.39	0.43±0.35	0.43±0.42	0.44±0.44	0.29±0.35	0.46±0.39	0.33±0.39	0.36±0.36	0.25±0.23	0.16±0.26	0.18±0.22	0.17±0.18	0.33±0.26	0.33±0.36
ZS-role	0.28±0.35	0.43±0.38	0.34±0.43	0.51±0.43	0.26±0.33	0.51±0.40	0.35±0.40	0.39±0.39	0.22±0.26	0.20±0.28	0.24±0.25	0.15±0.16	0.39±0.30	0.34±0.37
FS-task	0.22±0.38	0.24±0.28	0.16±0.21	0.24±0.32	0.19±0.29	0.30±0.34	0.31±0.39	0.37±0.41	0.18±0.23	0.06±0.11	0.08±0.09	0.04±0.07	0.16±0.18	0.21±0.31
FS-role	0.24±0.39	0.25±0.36	0.14±0.20	0.23±0.30	0.21±0.33	0.32±0.36	0.27±0.38	0.33±0.38	0.14±0.20	0.07±0.12	0.11±0.10	0.04±0.07	0.17±0.17	0.21±0.31

Model	TextBugger	DeepWordBug	TextFooler	BertAttack	CheckList	StressTest	Semantic
T5-Large	0.09±0.10	0.13±0.18	0.20±0.24	0.21±0.24	0.04±0.08	0.18±0.24	0.10±0.09
Vicuna-13b-v1.3	0.81±0.25	0.69±0.30	0.80±0.26	0.84±0.23	0.64±0.27	0.29±0.40	0.74±0.25
Llama2-13b-chat	0.67±0.36	0.41±0.34	0.68±0.36	0.74±0.33	0.34±0.33	0.20±0.30	0.66±0.35
UL2	0.04±0.06	0.03±0.04	0.14±0.20	0.16±0.22	0.04±0.07	0.06±0.09	0.06±0.08
ChatGPT	0.14±0.20	0.08±0.13	0.32±0.35	0.34±0.34	0.07±0.13	0.06±0.12	0.26±0.22
GPT-4	0.03±0.10	0.02±0.08	0.18±0.19	0.27±0.40	-0.02±0.09	0.03±0.15	0.03±0.16
Avg	0.21±0.30	0.16±0.26	0.31±0.33	0.33±0.34	0.12±0.23	0.11±0.23	0.22±0.26

Dataset	TextBugger	DeepWordBug	TextFooler	BertAttack	CheckList	StressTest	Semantic
SST-2	0.25±0.39	0.18±0.33	0.35±0.41	0.34±0.44	0.22±0.36	0.15±0.31	0.28±0.35
CoLA	0.39±0.40	0.27±0.32	0.43±0.35	0.45±0.38	0.23±0.30	0.18±0.25	0.34±0.37
QQP	0.30±0.38	0.22±0.31	0.31±0.36	0.33±0.38	0.18±0.30	0.06±0.25	0.40±0.39
MRPC	0.37±0.42	0.34±0.41	0.37±0.41	0.42±0.38	0.24±0.37	0.25±0.33	0.39±0.39
MNLI	0.32±0.40	0.18±0.29	0.32±0.39	0.34±0.36	0.14±0.24	0.10±0.25	0.22±0.24
QNLI	0.38±0.39	0.40±0.35	0.50±0.39	0.52±0.38	0.25±0.39	0.23±0.33	0.40±0.35
RTE	0.33±0.41	0.25±0.35	0.37±0.44	0.40±0.42	0.18±0.32	0.17±0.24	0.42±0.40
WNLI	0.39±0.42	0.31±0.37	0.41±0.43	0.41±0.40	0.24±0.32	0.20±0.27	0.49±0.39
MMLU	0.21±0.24	0.12±0.16	0.21±0.20	0.40±0.30	0.13±0.18	0.03±0.15	0.20±0.19
SQuAD v2	0.09±0.17	0.05±0.08	0.25±0.29	0.31±0.32	0.02±0.03	0.02±0.04	0.08±0.09
IWSLT	0.08±0.14	0.10±0.12	0.27±0.30	0.12±0.18	0.10±0.10	0.17±0.19	0.18±0.14
UN Multi	0.06±0.08	0.08±0.12	0.15±0.19	0.10±0.16	0.06±0.07	0.09±0.11	0.15±0.18
Math	0.18±0.17	0.14±0.13	0.49±0.36	0.42±0.32	0.15±0.11	0.13±0.08	0.23±0.13
Avg	0.21±0.30	0.17±0.26	0.31±0.33	0.33±0.34	0.12±0.23	0.11±0.23	0.22±0.26