Dynamic evaluation benchmark

DyVal is a new dynamic evaluation protocol for LLMs. For details, see the paper "DyVal: Graph-informed Dynamic Evaluation of Large Language Models".
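To illustrate what "dynamic" evaluation can mean in practice, here is a minimal sketch in which each test sample is generated on the fly instead of being read from a fixed dataset. It assumes, purely for illustration, that a sample is a random tree of arithmetic operations whose depth sets the complexity. The names `make_sample` and `OPS` are hypothetical; this is not DyVal's actual generation code, which is described in the paper.

```python
import operator
import random

# Hypothetical operator set for the illustration.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_sample(depth, rng):
    """Return (expression_text, value) for a random arithmetic tree.

    depth == 0 yields a single digit; each extra level adds one more
    layer of binary operations, so a larger depth means a harder sample.
    """
    if depth == 0:
        n = rng.randint(1, 9)
        return str(n), n
    symbol = rng.choice(sorted(OPS))
    left_text, left_value = make_sample(depth - 1, rng)
    right_text, right_value = make_sample(depth - 1, rng)
    text = f"({left_text} {symbol} {right_text})"
    return text, OPS[symbol](left_value, right_value)


# A fresh question with a known ground-truth answer on every call.
rng = random.Random(0)
question, answer = make_sample(depth=2, rng=rng)
print(f"What is {question}?  expected answer: {answer}")
```

Because the ground truth is computed while the sample is built, new questions can be produced at any complexity level without storing a static test set.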

Please contact us if you would like your model's results shown on this leaderboard.

All results

Model           | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path
Vicuna-13B v1.3 | 0.79       | -               | 50.76         | 37.04           | 21.10           | 21.58        | -
LLaMA2-13B Chat | 8.33       | -               | 16.15         | 35.72           | 7.73            | 28.05        | -
ChatGPT         | 84.50      | 26.63           | 97.34         | 66.56           | 52.49           | 56.09        | 13.63
GPT4            | 89.88      | 45.03           | 99.33         | 93.92           | 66.33           | 79.02        | 23.36

View by Complexity

Complexity 1

Model           | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path
Vicuna-13B v1.3 | 1.89       | -               | 81.33         | 25.73           | 44.51           | 21.60        | -
LLaMA2-13B Chat | 25.07      | -               | 19.20         | 50.27           | 1.82            | 27.62        | -
ChatGPT         | 95.27      | 36.22           | 99.09         | 81.96           | 41.78           | 62.27        | 28.14
GPT4            | 99.00      | 57.05           | 100.00        | 94.45           | 89.29           | 87.22        | 31.56

Complexity 2

Model           | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path
Vicuna-13B v1.3 | 0.73       | -               | 55.11         | 43.87           | 22.42           | 21.84        | -
LLaMA2-13B Chat | 4.44       | -               | 14.51         | 40.38           | 16.56           | 29.25        | -
ChatGPT         | 91.60      | 29.39           | 98.33         | 64.75           | 56.62           | 54.84        | 12.95
GPT4            | 95.11      | 42.61           | 99.78         | 96.06           | 63.61           | 86.33        | 30.45

Complexity 3

Model           | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path
Vicuna-13B v1.3 | 0.47       | -               | 37.02         | 42.77           | 17.47           | 21.69        | -
LLaMA2-13B Chat | 2.20       | -               | 17.18         | 28.78           | 9.42            | 27.38        | -
ChatGPT         | 77.62      | 24.31           | 96.84         | 62.80           | 58.27           | 53.64        | 7.47
GPT4            | 85.95      | 43.78           | 99.00         | 94.78           | 57.67           | 71.17        | 18.33

Complexity 4

Model           | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path
Vicuna-13B v1.3 | 0.09       | -               | 29.58         | 36.29           | 0.00            | 21.18        | -
LLaMA2-13B Chat | 1.60       | -               | 13.71         | 23.47           | 3.13            | 27.96        | -
ChatGPT         | 71.51      | 16.60           | 95.11         | 56.73           | 53.29           | 53.62        | 5.98
GPT4            | 79.44      | 36.67           | 98.56         | 90.39           | 54.78           | 71.33        | 13.11
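
The per-complexity tables above can be read as a trend: how much does each model's score change as complexity increases? The following is a minimal sketch of that comparison for the Arithmetic column only. The numbers are copied from the four tables above; the helper names `drop` and `is_non_increasing` are hypothetical and not part of any DyVal tooling.

```python
# Arithmetic scores at Complexity 1, 2, 3, 4, copied from the tables above.
arithmetic = {
    "Vicuna-13B v1.3": [1.89, 0.73, 0.47, 0.09],
    "LLaMA2-13B Chat": [25.07, 4.44, 2.20, 1.60],
    "ChatGPT": [95.27, 91.60, 77.62, 71.51],
    "GPT4": [99.00, 95.11, 85.95, 79.44],
}


def drop(scores):
    """Score change from the first to the last complexity level."""
    return round(scores[-1] - scores[0], 2)


def is_non_increasing(scores):
    """True if the score never rises from one level to the next."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


for model, scores in arithmetic.items():
    print(f"{model}: change {drop(scores):+.2f}, "
          f"non-increasing: {is_non_increasing(scores)}")
```

The same two helpers can be applied to any other column by swapping in that column's four values; columns marked "-" have no values to compare.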