Prompt engineering

The Prompt Engineering Module collects a variety of prompting methods, including Chain-of-Thought (CoT), zero-shot CoT, expert prompting, EmotionPrompt, and least-to-most prompting, and evaluates their performance across multiple datasets. The module currently supports models including GPT-3.5-Turbo and GPT-4-1106.
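
To make the distinction between methods concrete, the sketch below contrasts a plain (baseline) prompt with zero-shot CoT, which appends the trigger phrase "Let's think step by step." This is a minimal illustration using the OpenAI chat completions client, not the module's own evaluation harness; the model name and sample question are placeholders.

```python
# Minimal sketch: baseline vs. zero-shot CoT prompting.
# Assumes the `openai` package (>= 1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def ask(question: str, zero_shot_cot: bool = False) -> str:
    # Zero-shot CoT differs from the baseline only by an appended trigger phrase.
    prompt = question + ("\nLet's think step by step." if zero_shot_cot else "")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # or "gpt-4-1106-preview"
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total?"
print(ask(question))                      # baseline
print(ask(question, zero_shot_cot=True))  # zero-shot CoT
```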

Please contact us if you would like your model's results to appear on this leaderboard.

All Results (accuracy, %)

| Model | Benchmark | Baseline | CoT | CoT (zero-shot) | Expert Prompting | EmotionPrompt | Least-to-Most |
|-------|-----------|----------|-----|-----------------|------------------|---------------|---------------|
| GPT-3.5-Turbo | gsm8k | 47.15 | 40.33 | 18.5 | 21.15 | 57.24 | |
| GPT-3.5-Turbo | bigbench_date | 57.99 | 49.32 | 80.49 | 61.79 | 66.12 | |
| GPT-3.5-Turbo | bigbench_object_tracking | 39.2 | 63.2 | 66 | 56.53 | 29.87 | |
| GPT-3.5-Turbo | csqa | 72.48 | 67.81 | 65.85 | 74.45 | 70.68 | |
| GPT-3.5-Turbo | last-letter-concat | 7.2 | | | | | 79.8 |
| GPT-4-1106 | gsm8k | 92.19 | 85.89 | 87.34 | 88.7 | 90.83 | |
| GPT-4-1106 | bigbench_date | 87.8 | 92.14 | 87.53 | 87.26 | 87.8 | |
| GPT-4-1106 | bigbench_object_tracking | 96.27 | 90.26 | 99.07 | 98.93 | 95.73 | |
| GPT-4-1106 | csqa | 79.69 | 85.59 | 79.85 | 79.85 | 80.34 | |
| GPT-4-1106 | last-letter-concat | 25.2 | | | | | 96.2 |

Note: EmotionPrompt uses the emotional stimulus “This is very important to my career.”
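
EmotionPrompt modifies only the prompt text, appending an emotional stimulus such as the phrase above before the prompt is sent to the model. A minimal sketch (the helper name is ours, not part of the module):

```python
# Minimal sketch of EmotionPrompt: append an emotional stimulus to the task.
# The stimulus below is the one quoted above; the rest is illustrative.
EMOTION_STIMULUS = "This is very important to my career."

def emotion_prompt(question: str) -> str:
    # The resulting string is sent to the model in place of the plain question.
    return f"{question}\n{EMOTION_STIMULUS}"

print(emotion_prompt("What is 17 * 24?"))
```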