ChatGPT passes radiology board exam, but still has limitations

The latest version of the artificial intelligence chatbot ChatGPT has passed a radiology board-style examination. Researchers put ChatGPT to the test using 150 multiple-choice questions modelled on the Canadian Royal College and American Board of Radiology exams. The result underscores the potential of AI in medical fields, yet it also reveals limitations that affect its dependability, two studies said.

ChatGPT, a deep-learning model developed by OpenAI, is known for generating humanlike responses based on the input it receives. Its pattern recognition abilities allow it to interpret and respond to vast amounts of data, but it sometimes produces factually incorrect responses because of the absence of a source of truth in its training data.

“The use of large language models like ChatGPT is rapidly expanding and will only continue to grow,” said Dr Rajesh Bhayana, an abdominal radiologist and technology lead at University Medical Imaging Toronto, Toronto General Hospital. “Our research offers valuable insight into how ChatGPT performs in a radiology setting, emphasising its immense potential while shedding light on current reliability issues.”

ChatGPT's usage and influence have been growing significantly. It was recently named the fastest-growing consumer application in history. It is also being integrated into popular search engines like Google and Bing, which both physicians and patients use for medical inquiries.

In the first study, the chatbot, then based on GPT-3.5, correctly answered 69 per cent of the questions, just short of the passing grade of 70 per cent. It showed a noticeable gap in performance between lower-order thinking questions (84 per cent) and higher-order thinking questions (60 per cent), struggling in particular with descriptions of imaging findings, calculations and classifications, and the application of concepts. Given that the AI has not received any radiology-specific training, these struggles were not unexpected.

A newer version, GPT-4, was released in March with improved advanced reasoning capabilities. In a follow-up study, GPT-4 answered 81 per cent of the same questions correctly, exceeding the passing threshold and outperforming its predecessor, GPT-3.5. Despite these improvements, GPT-4 did not show any progress on lower-order thinking questions and answered 12 questions incorrectly that GPT-3.5 had answered correctly. This inconsistency raises questions about the AI's reliability in information gathering.

“ChatGPT gave accurate and confident answers to some challenging radiology questions, but then made some very illogical and inaccurate assertions,” said Dr Bhayana. “Given how these models function, the inaccurate responses should not be surprising.”

The studies noted a tendency of ChatGPT to produce inaccurate responses, termed hallucinations. Although less frequent in GPT-4, this tendency still limits the chatbot's current usability in medical education and practice.

Despite the limitations, the researchers see potential in using ChatGPT to spark ideas and aid in the medical writing process and data summarisation, as long as the information is fact-checked.

“To me, this is its biggest limitation. At present, ChatGPT is best used to spark ideas, help start the medical writing process and in data summarisation. If used for quick information recall, it always needs to be fact-checked,” Dr Bhayana said.

AI shows promise in medical field despite current limitations, demonstrates notable improvement between iterations