Phi-3-Vision's Triumphant Performance on Key Multimodal Benchmarks

8 Jul 2025

Table of Links

Abstract and 1 Introduction

2 Technical Specifications

3 Academic benchmarks

4 Safety

5 Weakness

6 Phi-3-Vision

6.1 Technical Specifications

6.2 Academic benchmarks

6.3 Safety

6.4 Weakness

References

A Example prompt for benchmarks

B Authors (alphabetical)

C Acknowledgements

6.2 Academic benchmarks

Table 2 reports the evaluation results of Phi-3-Vision on nine open-source academic benchmarks. These benchmarks evaluate reasoning and perceptual capabilities on visual and text inputs and fall into three categories: Science, Charts, and Generic knowledge. We compare Phi-3-Vision with the following baselines: MM1-3B-Chat [MGF+24], MM1-7B-Chat [MGF+24], Llava-1.6 Vicuna 7B [LLLL23], Llava-1.6 Llama3-8B [LLL+24], Qwen-VL-Chat [BBY+23], Claude 3 Haiku [Ant24], Gemini 1.0 Pro V [TAB+23], and GPT-4V-Turbo. To ensure a fair comparison, we ran all baselines through the same evaluation pipeline, with the exception of MM1-3B-Chat: because that model is not publicly available, we report its published numbers instead.

Our evaluation setup aimed to mimic scenarios where regular users interact with a multi-modal model, i.e., users who are not experts in prompt engineering and do not know special techniques that can improve performance. For this reason, we adopted the evaluation setting used in Llava-1.5 [LLLL23]. In this setup, the prompts include instructions to select a single letter corresponding to an answer from a list of given options, or to answer with a single word or phrase. In our prompts, we did not use specific tokens for multiple-choice questions. Moreover, we did not scale or pre-process any image in our benchmarking system. We placed the images as the first item in the prompts, except on the MMMU dataset, where the prompts interleave the images with the text of the question or the answers. Lastly, our evaluation setup only considered a 0-shot format. Because of these evaluation choices, our reported numbers can differ from the published numbers of the considered baselines.
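As a rough illustration of the prompt layout described above, the following Python sketch assembles a 0-shot prompt with the image placed first, followed by the question and either a letter-selection instruction or a short-answer instruction. It is not the authors' evaluation code: the function name `build_prompt`, the `<image>` placeholder, and the exact instruction wording (modeled on the Llava-1.5 style) are assumptions made for this example, and the interleaved MMMU layout is not covered.

```python
from string import ascii_uppercase

# Hypothetical marker for where the image goes; real models use their own format.
IMAGE_PLACEHOLDER = "<image>"

# Assumed instruction wording in the Llava-1.5 style; the paper does not quote it here.
MC_INSTRUCTION = "Answer with the option's letter from the given choices directly."
SHORT_INSTRUCTION = "Answer the question using a single word or phrase."


def build_prompt(question, options=None):
    """Build a 0-shot prompt: image first, then question, options, instruction."""
    lines = [IMAGE_PLACEHOLDER, question]
    if options:
        # Label the options A, B, C, ... as plain text, with no special tokens.
        lines += [f"{letter}. {text}" for letter, text in zip(ascii_uppercase, options)]
        lines.append(MC_INSTRUCTION)
    else:
        lines.append(SHORT_INSTRUCTION)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Which axis shows time?", ["x-axis", "y-axis"]))
    print()
    print(build_prompt("What color is the bar for 2020?"))
```

No in-context examples are prepended, which matches the 0-shot format, and the image is passed through untouched rather than being resized or otherwise pre-processed.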

Authors:

(1) Marah Abdin;

(2) Sam Ade Jacobs;

(3) Ammar Ahmad Awan;

(4) Jyoti Aneja;

(5) Ahmed Awadallah;

(6) Hany Awadalla;

(7) Nguyen Bach;

(8) Amit Bahree;

(9) Arash Bakhtiari;

(10) Jianmin Bao;

(11) Harkirat Behl;

(12) Alon Benhaim;

(13) Misha Bilenko;

(14) Johan Bjorck;

(15) Sébastien Bubeck;

(16) Qin Cai;

(17) Martin Cai;

(18) Caio César Teodoro Mendes;

(19) Weizhu Chen;

(20) Vishrav Chaudhary;

(21) Dong Chen;

(22) Dongdong Chen;

(23) Yen-Chun Chen;

(24) Yi-Ling Chen;

(25) Parul Chopra;

(26) Xiyang Dai;

(27) Allie Del Giorno;

(28) Gustavo de Rosa;

(29) Matthew Dixon;

(30) Ronen Eldan;

(31) Victor Fragoso;

(32) Dan Iter;

(33) Mei Gao;

(34) Min Gao;

(35) Jianfeng Gao;

(36) Amit Garg;

(37) Abhishek Goswami;

(38) Suriya Gunasekar;

(39) Emman Haider;

(40) Junheng Hao;

(41) Russell J. Hewett;

(42) Jamie Huynh;

(43) Mojan Javaheripi;

(44) Xin Jin;

(45) Piero Kauffmann;

(46) Nikos Karampatziakis;

(47) Dongwoo Kim;

(48) Mahoud Khademi;

(49) Lev Kurilenko;

(50) James R. Lee;

(51) Yin Tat Lee;

(52) Yuanzhi Li;

(53) Yunsheng Li;

(54) Chen Liang;

(55) Lars Liden;

(56) Ce Liu;

(57) Mengchen Liu;

(58) Weishung Liu;

(59) Eric Lin;

(60) Zeqi Lin;

(61) Chong Luo;

(62) Piyush Madan;

(63) Matt Mazzola;

(64) Arindam Mitra;

(65) Hardik Modi;

(66) Anh Nguyen;

(67) Brandon Norick;

(68) Barun Patra;

(69) Daniel Perez-Becker;

(70) Thomas Portet;

(71) Reid Pryzant;

(72) Heyang Qin;

(73) Marko Radmilac;

(74) Corby Rosset;

(75) Sambudha Roy;

(76) Olatunji Ruwase;

(77) Olli Saarikivi;

(78) Amin Saied;

(79) Adil Salim;

(80) Michael Santacroce;

(81) Shital Shah;

(82) Ning Shang;

(83) Hiteshi Sharma;

(84) Swadheen Shukla;

(85) Xia Song;

(86) Masahiro Tanaka;

(87) Andrea Tupini;

(88) Xin Wang;

(89) Lijuan Wang;

(90) Chunyu Wang;

(91) Yu Wang;

(92) Rachel Ward;

(93) Guanhua Wang;

(94) Philipp Witte;

(95) Haiping Wu;

(96) Michael Wyatt;

(97) Bin Xiao;

(98) Can Xu;

(99) Jiahang Xu;

(100) Weijian Xu;

(101) Sonali Yadav;

(102) Fan Yang;

(103) Jianwei Yang;

(104) Ziyi Yang;

(105) Yifan Yang;

(106) Donghan Yu;

(107) Lu Yuan;

(108) Chengruidong Zhang;

(109) Cyril Zhang;

(110) Jianwen Zhang;

(111) Li Lyna Zhang;

(112) Yi Zhang;

(113) Yue Zhang;

(114) Yunan Zhang;

(115) Xiren Zhou.

This paper is available on arXiv under the CC BY 4.0 DEED license.
