Prompt engineering
The Prompt Engineering Module collects a variety of prompting methods and evaluates their performance across multiple datasets. It currently supports models including GPT-3.5-Turbo and GPT-4-1106.
Please contact us if you would like your model's results to appear on this leaderboard.
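For context, here is a minimal sketch of what such an evaluation loop can look like, using the OpenAI Python client. The `evaluate` helper, the `build_prompt` parameter, and the `{"question", "answer"}` dataset format are illustrative assumptions, not this module's actual API:

```python
# Minimal sketch of an evaluation loop (hypothetical, not this module's API).
# Assumes a dataset of {"question": ..., "answer": ...} records with string
# answers; `build_prompt` stands in for whichever prompting method is tested.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def evaluate(build_prompt, dataset, model="gpt-3.5-turbo"):
    """Return accuracy (%) of `model` on `dataset` for one prompting method."""
    correct = 0
    for example in dataset:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": build_prompt(example["question"])}],
            temperature=0,
        )
        prediction = response.choices[0].message.content or ""
        # Real harnesses use per-benchmark answer extraction;
        # a substring match is a simplification here.
        correct += int(str(example["answer"]) in prediction)
    return 100 * correct / len(dataset)
```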
All Results

All numbers are accuracy (%).
| Model | Benchmark | Baseline | CoT | CoT (zero-shot) | Expert Prompting | Emotion Prompt | Least-to-Most |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-Turbo | gsm8k | 47.15 | 40.33 | 18.5 | 21.15 | 57.24 | |
| GPT-3.5-Turbo | bigbench_date | 57.99 | 49.32 | 80.49 | 61.79 | 66.12 | |
| GPT-3.5-Turbo | bigbench_object_tracking | 39.2 | 63.2 | 66 | 56.53 | 29.87 | |
| GPT-3.5-Turbo | csqa | 72.48 | 67.81 | 65.85 | 74.45 | 70.68 | |
| GPT-3.5-Turbo | last-letter-concat | 7.2 | 79.8 | | | | |
| GPT-4-1106 | gsm8k | 92.19 | 85.89 | 87.34 | 88.7 | 90.83 | |
| GPT-4-1106 | bigbench_date | 87.8 | 92.14 | 87.53 | 87.26 | 87.8 | |
| GPT-4-1106 | bigbench_object_tracking | 96.27 | 90.26 | 99.07 | 98.93 | 95.73 | |
| GPT-4-1106 | csqa | 79.69 | 85.59 | 79.85 | 79.85 | 80.34 | |
| GPT-4-1106 | last-letter-concat | 25.2 | 96.2 | | | | |
Note: the emotion prompt used in these results is the stimulus sentence “This is very important to my career.”
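To make the methods concrete, below are sketches of prompt templates for three of the columns, following each paper's canonical phrasing; the module's exact templates may differ. Any of these can serve as the `build_prompt` argument in the evaluation sketch above:

```python
# Illustrative prompt builders (the module's actual templates may differ).
def baseline(question: str) -> str:
    # Baseline: the question is sent as-is.
    return question


def zero_shot_cot(question: str) -> str:
    # Zero-shot CoT appends Kojima et al.'s trigger phrase.
    return f"{question}\nLet's think step by step."


def emotion_prompt(question: str) -> str:
    # EmotionPrompt appends the emotional stimulus quoted above
    # (appending is the standard mechanism from the EmotionPrompt paper).
    return f"{question} This is very important to my career."
```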