Dynamic evaluation benchmark
DyVal is a dynamic evaluation protocol for LLMs. More information can be found in the paper *DyVal: Graph-informed Dynamic Evaluation of Large Language Models*.
Please contact us if you would like your model's results shown on this leaderboard.
All results
| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 0.79 | - | 50.76 | 37.04 | 21.10 | 21.58 | - |
| LLaMA2-13B Chat | 8.33 | - | 16.15 | 35.72 | 7.73 | 28.05 | - |
| ChatGPT | 84.50 | 26.63 | 97.34 | 66.56 | 52.49 | 56.09 | 13.63 |
| GPT-4 | 89.88 | 45.03 | 99.33 | 93.92 | 66.33 | 79.02 | 23.36 |
View by Complexity
Complexity 1
| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 1.89 | - | 81.33 | 25.73 | 44.51 | 21.60 | - |
| LLaMA2-13B Chat | 25.07 | - | 19.20 | 50.27 | 1.82 | 27.62 | - |
| ChatGPT | 95.27 | 36.22 | 99.09 | 81.96 | 41.78 | 62.27 | 28.14 |
| GPT-4 | 99.00 | 57.05 | 100.00 | 94.45 | 89.29 | 87.22 | 31.56 |
Complexity 2
| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 0.73 | - | 55.11 | 43.87 | 22.42 | 21.84 | - |
| LLaMA2-13B Chat | 4.44 | - | 14.51 | 40.38 | 16.56 | 29.25 | - |
| ChatGPT | 91.60 | 29.39 | 98.33 | 64.75 | 56.62 | 54.84 | 12.95 |
| GPT-4 | 95.11 | 42.61 | 99.78 | 96.06 | 63.61 | 86.33 | 30.45 |
Complexity 3
| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 0.47 | - | 37.02 | 42.77 | 17.47 | 21.69 | - |
| LLaMA2-13B Chat | 2.20 | - | 17.18 | 28.78 | 9.42 | 27.38 | - |
| ChatGPT | 77.62 | 24.31 | 96.84 | 62.80 | 58.27 | 53.64 | 7.47 |
| GPT-4 | 85.95 | 43.78 | 99.00 | 94.78 | 57.67 | 71.17 | 18.33 |
Complexity 4
| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 0.09 | - | 29.58 | 36.29 | 0.00 | 21.18 | - |
| LLaMA2-13B Chat | 1.60 | - | 13.71 | 23.47 | 3.13 | 27.96 | - |
| ChatGPT | 71.51 | 16.60 | 95.11 | 56.73 | 53.29 | 53.62 | 5.98 |
| GPT-4 | 79.44 | 36.67 | 98.56 | 90.39 | 54.78 | 71.33 | 13.11 |