Orca 2: Enhancing Reasoning in Smaller Language Models - Prompts used in Evaluation

29 May 2024

Authors:

(1) Arindam Mitra;

(2) Luciano Del Corro, work done while at Microsoft;

(3) Shweti Mahajan, work done while at Microsoft;

(4) Andres Codas, equal contribution;

(5) Clarisse Simoes, equal contribution;

(6) Sahaj Agarwal;

(7) Xuxi Chen, work done while at Microsoft;

(8) Anastasia Razdaibiedina, work done while at Microsoft;

(9) Erik Jones, work done while at Microsoft;

(10) Kriti Aggarwal, work done while at Microsoft;

(11) Hamid Palangi;

(12) Guoqing Zheng;

(13) Corby Rosset;

(14) Hamed Khanpour;

(15) Ahmed Awadallah.

Table of Links

Abstract and Introduction

Teaching Orca 2 to be a Cautious Reasoner

Technical Details

Experimental Setup

Evaluation Results

Conclusions and References

A. AGIEval Subtask Metrics

B. BigBench-Hard Subtask Metrics

C. Evaluation of Grounding in Abstractive Summarization

D. Evaluation of Safety

E. Prompts used in Evaluation

F. Illustrative Example from Evaluation Benchmarks and Corresponding Model Output

E Prompts used in Evaluation

We provide a list of prompts used for evaluation below:

Table 15: Prompts used for evaluating all models with empty. The prompts are simple and aim only to give the models hints about the answer format, which improves the parsing of model responses. For tasks where the question was already formatted as a prompt, the input is used as is. Examples from all datasets are shown in Appendix F.
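To make the role of these prompts concrete, here is a minimal sketch of how a format hint might be appended to a question and how the hinted format makes the response easy to parse. Everything in it is an assumption for illustration: the hint text `FORMAT_HINT`, the helper names `build_prompt` and `parse_answer`, and the multiple-choice answer format are hypothetical and are not the prompts listed in Table 15.

```python
import re
from typing import Optional

# Hypothetical format hint (assumption); not one of the actual Table 15 prompts.
FORMAT_HINT = "At the end, output 'Final Answer: <option letter>'."

# Matches the hinted answer line, with or without parentheses around the letter.
_ANSWER_RE = re.compile(r"Final Answer:\s*\(?([A-E])\b")


def build_prompt(question: str, already_prompt: bool = False) -> str:
    """Return the model input for one evaluation example.

    If the dataset already formats the question as a prompt, it is used as is;
    otherwise the (hypothetical) format hint is appended.
    """
    if already_prompt:
        return question
    return f"{question}\n\n{FORMAT_HINT}"


def parse_answer(response: str) -> Optional[str]:
    """Extract the last 'Final Answer: X' option letter, or None if absent."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice in this sketch: a model that reasons step by step may revise an earlier answer, and the final stated answer is the one a parser would usually want.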

This paper is available on arXiv under a CC 4.0 license.
