Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the societal level, for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer useful insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey

Yupeng Chang

Xu Wang

Jindong Wang

Yuan Wu

Linyi Yang

Kaijie Zhu

Hao Chen

Xiaoyuan Yi

Cunxiang Wang

Yidong Wang

Wei Ye

Yue Zhang

Yi Chang

Philip S. Yu

Qiang Yang

Xing Xie

ACM Transactions on Intelligent Systems and Technology

ACM Transactions on Intelligent Systems and Technology (ACM TIST) is an academic journal that publishes high-quality papers on intelligent systems, applicable algorithms, and technologies from a multidisciplinary perspective. Intelligent systems use artificial intelligence (AI) technologies to provide important services, for example as components of larger systems, so that integrated systems can perceive, reason, learn, and act intelligently in the real world. ACM TIST publishes six issues per year. Each issue contains 8 to 11 regular papers, and each paper runs to approximately 20 published journal pages or 10,000 words. Additional references, proofs, figures, or detailed experimental results may be submitted as separate appendices; excessively long papers are automatically rejected. Authors may include online-only appendices as supplementary content to their published papers and are encouraged to share their code and/or data with other readers.

ACM Trans. Intell. Syst. Technol.

A Survey on Evaluation of Large Language Models

This paper presents a comprehensive review of evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate, and offers insights to researchers working on LLM evaluation.

Category	Quartile
Computer Science	4
Computer Science, Artificial Intelligence	4
Computer Science, Information Systems	4

Journal Info

Category	Quartile
COMPUTER SCIENCE, INFORMATION SYSTEMS	1