RESIDENT: The Future of AI Assistants: Evaluating the Performance of GPT-4 and GPT-3.5 on Residency-Level Psychiatry Questions

Presenting Author(s): Abhinav Pillai

Co-Author(s): Tan Li, Sharon Parappally-Joseph, Martin Vetter

Date and time: 23 Mar 2024 from 14:25 to 14:40

Location: Hawthorn C

Learning Objectives

1. Understanding the current capabilities and limitations of LLMs in answering psychiatric medical questions;
2. Exploring the ethical and regulatory considerations of using LLMs in psychiatric education and practice; and
3. Exploring future directions for LLMs and other AI in psychiatry.

Literature References

1. Singhal, K., Azizi, S., Tu, T., et al. (2023). Large language models encode clinical knowledge. Nature, 620, 172–180. https://doi.org/10.1038/s41586-023-06291-2
2. Kung, T. H., et al. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2), e0000198. https://doi.org/10.1371/journal.pdig.0000198
3. Reese, J. T., et al. (2023). On the limitations of large language models in clinical diagnosis. medRxiv. https://doi.org/10.1101/2023.07.13.23292613
4. Chen, S., et al. (2023). LLM-empowered chatbots for psychiatrist and patient simulation: Application and evaluation. arXiv. https://doi.org/10.48550/arXiv.2305.13614

Abstract

Background:

Large language models (LLMs) such as GPT-4 and GPT-3.5 have shown remarkable capabilities in natural language understanding and generation, but their performance on complex medical questions that require psychiatric knowledge has not been thoroughly evaluated. Here, we aim to assess how well these LLMs respond to standard psychiatry questions used in medical education, which has implications for both medical training and clinical practice.


Methods:

We selected 59 physician-generated, multiple-choice psychiatry questions used for resident academic half days, covering three categories: diagnosis, management, and miscellaneous topics. We submitted these questions to GPT-4 and GPT-3.5 via the ChatGPT interface and recorded each model's selected answer and accompanying justification. We evaluated the accuracy and relevance of the responses and compared the models' performance using a paired t-test.
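
The scoring and statistical comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' actual pipeline: the answer lists below are hypothetical placeholders standing in for the 59 questions, and per-question correctness is scored as 0/1 before applying SciPy's paired t-test (scipy.stats.ttest_rel).

    from scipy.stats import ttest_rel

    # Hypothetical answer key and model responses (the study used 59 questions;
    # three are shown here purely for illustration).
    answer_key    = ["B", "D", "A"]
    gpt4_answers  = ["B", "D", "C"]
    gpt35_answers = ["B", "A", "C"]

    # Score each question 1 if the model matched the key, 0 otherwise.
    gpt4_scores  = [int(m == k) for m, k in zip(gpt4_answers, answer_key)]
    gpt35_scores = [int(m == k) for m, k in zip(gpt35_answers, answer_key)]

    accuracy_gpt4  = sum(gpt4_scores) / len(gpt4_scores)
    accuracy_gpt35 = sum(gpt35_scores) / len(gpt35_scores)

    # Paired t-test on the per-question 0/1 scores; questions are the pairing unit.
    t_stat, p_value = ttest_rel(gpt4_scores, gpt35_scores)
    print(f"GPT-4 accuracy: {accuracy_gpt4:.1%}, GPT-3.5 accuracy: {accuracy_gpt35:.1%}")
    print(f"Paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")

Because each question is answered by both models, the scores are paired by question, which is what motivates a paired rather than independent-samples test.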

Results:

GPT-4 achieved an accuracy of 69.5% and GPT-3.5 an accuracy of 59.3%; the difference in performance was statistically significant (p < 0.05). GPT-4 outperformed GPT-3.5 in the diagnosis and management categories. Justifications provided by both models were generally coherent and informative but sometimes contained factual errors or irrelevant details.

Conclusion:

Our results suggest that LLMs can provide reasonable answers to psychiatry questions that involve clinical reasoning, but they are not reliable or consistent enough to replace human experts. The use of LLMs in psychiatric education and practice will require oversight to ensure their safety and usefulness. Future research should explore the potential of domain-specific LLMs or custom-tailoring of general-purpose LLMs for better handling of psychiatry-related questions.
