Dynamic evaluation benchmark

DyVal is a dynamic evaluation protocol for large language models (LLMs). For details, see the paper "DyVal: Graph-informed Dynamic Evaluation of Large Language Models".

Please contact us if you would like your model's results shown on this leaderboard.

All results

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	0.79	-	50.76	37.04	21.10	21.58	-
LLaMA2-13B Chat	8.33	-	16.15	35.72	7.73	28.05	-
ChatGPT	84.50	26.63	97.34	66.56	52.49	56.09	13.63
GPT4	89.88	45.03	99.33	93.92	66.33	79.02	23.36

View by Complexity

Complexity 1

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	1.89	-	81.33	25.73	44.51	21.60	-
LLaMA2-13B Chat	25.07	-	19.20	50.27	1.82	27.62	-
ChatGPT	95.27	36.22	99.09	81.96	41.78	62.27	28.14
GPT4	99.00	57.05	100.00	94.45	89.29	87.22	31.56

Complexity 2

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	0.73	-	55.11	43.87	22.42	21.84	-
LLaMA2-13B Chat	4.44	-	14.51	40.38	16.56	29.25	-
ChatGPT	91.60	29.39	98.33	64.75	56.62	54.84	12.95
GPT4	95.11	42.61	99.78	96.06	63.61	86.33	30.45

Complexity 3

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	0.47	-	37.02	42.77	17.47	21.69	-
LLaMA2-13B Chat	2.20	-	17.18	28.78	9.42	27.38	-
ChatGPT	77.62	24.31	96.84	62.80	58.27	53.64	7.47
GPT4	85.95	43.78	99.00	94.78	57.67	71.17	18.33

Complexity 4

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	0.09	-	29.58	36.29	0.00	21.18	-
LLaMA2-13B Chat	1.60	-	13.71	23.47	3.13	27.96	-
ChatGPT	71.51	16.60	95.11	56.73	53.29	53.62	5.98
GPT4	79.44	36.67	98.56	90.39	54.78	71.33	13.11
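
How the "All results" table relates to the per-complexity tables is not stated on this page. For the Linear Equation column, the overall figures are very close to the unweighted mean of the four complexity levels. The sketch below checks that assumption. The numbers are copied from the tables above, while the helper name mean_score and the choice of an unweighted mean are illustrative assumptions, not the authors' documented aggregation method. Some other cells deviate from this mean, so treat it as an approximation.

```python
# Per-complexity Linear Equation scores copied from the tables above
# (complexity 1 through 4, in order).
linear_equation = {
    "ChatGPT": [36.22, 29.39, 24.31, 16.60],
    "GPT4": [57.05, 42.61, 43.78, 36.67],
}

# Values reported in the "All results" table for the same column.
reported_overall = {"ChatGPT": 26.63, "GPT4": 45.03}


def mean_score(scores):
    """Unweighted mean across complexity levels (an assumed aggregation)."""
    return sum(scores) / len(scores)


for model, scores in linear_equation.items():
    avg = mean_score(scores)
    gap = abs(avg - reported_overall[model])
    # Print the recomputed mean next to the reported overall score.
    print(f"{model}: mean={avg:.2f} reported={reported_overall[model]:.2f} gap={gap:.2f}")
```

The same check can be repeated for any other column by swapping in its per-complexity values.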