
Can ChatGPT pass a radiology board-style examination?


In a prospective exploratory analysis published in the journal Radiology, researchers evaluated how well ChatGPT, an artificial intelligence (AI) chatbot, performs on radiology board-style examination questions.



Background

ChatGPT is a language model based on GPT-3.5, a deep neural network trained on a vast amount of text. Although it has not been trained specifically on medical data, ChatGPT has shown significant potential in medical writing and education. Consequently, physicians have begun using ChatGPT alongside search engines to retrieve medical information.


Researchers are currently investigating the potential applications of ChatGPT in simplifying radiology reports, assisting in clinical decision-making, educating radiology students, facilitating differential and computer-aided diagnoses, and aiding in disease classification.


ChatGPT uses its extensive training data to recognize relationships and patterns between words, which allows it to generate human-like responses. However, it may occasionally generate factually incorrect answers. Nevertheless, ChatGPT has shown exceptional performance on various professional examinations, such as the U.S. Medical Licensing Examination, even without specific training in the medical domain.


Although ChatGPT holds promise for diagnostic radiology applications, including image analysis, its performance specifically in the field of radiology remains unknown. It is crucial for radiologists to understand the strengths and limitations of ChatGPT to utilize it effectively and confidently.


About the study

In this study, the researchers selected 150 multiple-choice questions that closely resembled the content, style, and difficulty level of prominent radiology board examinations, including the Canadian Royal College examination in diagnostic radiology and the American Board of Radiology Core and Certifying examinations.


To ensure the quality of the questions, two board-certified radiologists independently reviewed and verified them. Criteria such as the absence of images, plausible wrong answers, and similar answer lengths were considered during the review process. To ensure comprehensive coverage of radiology concepts, at least 10% of the questions were selected from specific topics identified by the Canadian Royal College.


Two other board-certified radiologists classified the 150 questions according to the principles of Bloom's taxonomy, distinguishing lower-order from higher-order thinking questions.


To simulate real-world usage, all questions and answer choices were entered into ChatGPT, and the researchers recorded the responses generated by the model. The passing score threshold set by the Royal College served as the benchmark for evaluating ChatGPT's overall performance.


Two board-certified radiologists subjectively assessed the language of each ChatGPT response, rating its confidence level on a Likert scale ranging from one to four. A score of four indicated high confidence, while a score of one indicated no confidence.


The researchers also made qualitative observations of ChatGPT's behavior when prompted with the correct answer.


The study involved several analyses. First, the overall performance of ChatGPT was computed. The researchers then compared its performance across different question types and topics using the Fisher exact test, including distinctions related to physics or clinical aspects.
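As a rough illustration of the Fisher exact test mentioned above, the sketch below computes a two-sided p-value for a 2x2 table of correct and incorrect answers using only the Python standard library. The function name `fisher_exact_two_sided` is hypothetical, and the counts are an assumption: the article reports only percentages, so the table was chosen to be consistent with 150 questions and the 84% and 60% accuracy figures reported in the findings. This is not the authors' code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probs = [prob(x) for x in range(lowest, highest + 1)]
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= p_observed * (1 + 1e-7))


# ASSUMED counts (not stated in the article), chosen to match 84% and 60%:
# lower-order:  51 correct, 10 incorrect (61 questions)
# higher-order: 53 correct, 36 incorrect (89 questions)
p_value = fisher_exact_two_sided(51, 10, 53, 36)
print(f"Two-sided Fisher exact p-value: {p_value:.4f}")
```

A p-value below 0.05 from such a table would be read as a significant difference in accuracy between the two question types.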


Subgroup analysis was performed on higher-order thinking questions, which were further categorized into four groups based on imaging description, clinical management, application of concepts, and disease associations.


Lastly, the Mann-Whitney U test was utilized to compare the confidence levels of ChatGPT responses between correct and incorrect answers. A p-value below 0.05 indicated a significant difference.
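The Mann-Whitney U comparison described above can be sketched in plain Python as follows. This is a minimal illustration, not the study's analysis: the function name `mann_whitney_u` is hypothetical, the p-value uses a normal approximation with tie correction and no continuity correction, and the Likert ratings are made-up placeholder values, since the article does not give the individual ratings.

```python
from collections import Counter
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value) via normal approximation."""
    combined = sorted(list(x) + list(y))
    # Assign each distinct value the average of the ranks it occupies.
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    n1, n2 = len(x), len(y)
    n = n1 + n2
    u1 = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2

    # Variance of U with the standard correction for tied values.
    tie_term = sum(t**3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0  # all ratings identical: no evidence of a difference

    z = (u1 - n1 * n2 / 2) / sqrt(variance)
    return u1, erfc(abs(z) / sqrt(2))


# HYPOTHETICAL Likert ratings (1 to 4), for illustration only.
ratings_correct = [4, 4, 4, 3, 4, 4]
ratings_incorrect = [4, 4, 3, 4, 4]
u_stat, p_val = mann_whitney_u(ratings_correct, ratings_incorrect)
print(f"U = {u_stat}, p = {p_val:.3f}")
```

Because Likert ratings take only a few distinct values, ties are common, which is why the sketch uses average ranks and a tie-corrected variance.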


Study findings

ChatGPT achieved a near-passing score of 69% on radiology board-style examination questions that did not include images.


The model performed notably better on questions requiring lower-order thinking (knowledge recall and basic understanding) than on questions requiring higher-order thinking: its accuracy was 84% on the former and 60% on the latter.


However, ChatGPT performed strongly on higher-order questions related to clinical management, scoring 89%. This result may be attributable to the abundance of disease-specific, patient-facing data available on the Internet.


On the other hand, ChatGPT struggled with higher-order questions that involved describing imaging results, performing calculations and classifications, and applying complex concepts.


Additionally, the model performed worse on physics questions than on clinical questions, with an accuracy of 40% on the former and 73% on the latter. ChatGPT also used confident language consistently, even when its answer was wrong: the language of 100% of its incorrect responses was rated as confident.


The tendency of ChatGPT to generate incorrect yet human-like responses with confidence raises concerns, especially if it is relied upon as the sole source of information. This behavior currently limits the applicability of ChatGPT in the field of medical education.


Conclusions

In the absence of radiology-specific pretraining, ChatGPT demonstrated proficiency in answering questions that evaluated fundamental knowledge and comprehension of radiology. Remarkably, it achieved a near-passing score of 69% on a radiology board-style examination that did not include images.


However, it is crucial for radiologists to exercise caution and be mindful of the limitations associated with ChatGPT. One significant limitation is its tendency to provide incorrect responses while expressing unwavering confidence, as observed in the study. Consequently, relying solely on ChatGPT for practice or educational purposes is not supported by the study findings.


Nevertheless, as large language models (LLMs) continue to advance, and as applications with radiology-specific pretraining become more prevalent, LLM-based models like ChatGPT hold promise for the field of radiology. The study's overall results are encouraging and point towards future opportunities for using LLMs in radiology-related tasks.



