Assessing the Usability of ChatGPT Responses Compared to Other Online Information in Hand Surgery
Ophelie Z Lavoie-Gagne, MD; Oscar Shen, BA; Neal C. Chen, MD; Abhiram Bhashyam, MD
Massachusetts General Hospital, Boston, MA
Background: Patients seek medical information online, often prior to evaluation by a specialist. ChatGPT is a natural language processing tool with the potential to increase the accessibility of health information; however, its reliability remains unknown. The purpose of this study was therefore to (1) define the readability of online medical information, (2) evaluate how the information source influences response quality, and (3) evaluate whether medical consensus alters the utility of online resources.
Methods: Three phrases pertaining to hand surgery were formulated with varying degrees of medical consensus: "What is the cause of carpal tunnel syndrome" (high), "What is the cause of tennis elbow" (moderate), and "Platelet-rich plasma for thumb arthritis" (low). Each query was posed twenty times to Google, ChatGPT3.5, and ChatGPT4. Readability was assessed by grade level. Reliability was quantified by the coverage and accuracy of responses, which were scored on a Likert scale against predetermined rubrics. Unsubstantiated statements were penalized to account for the negative impact of misinformation. Scores were compared with Mann-Whitney U tests with alpha set to 0.05.
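As a rough illustration of the scoring pipeline described in the Methods, the minimal Python sketch below computes a grade-level readability score and compares reliability scores between two sources with a Mann-Whitney U test. The Flesch-Kincaid metric, the textstat library, and all sample data are illustrative assumptions; the abstract specifies only grade-level readability, rubric-based Likert scoring, and Mann-Whitney U tests with alpha = 0.05.

# Minimal sketch of the readability/reliability comparison (assumed tooling:
# textstat for a Flesch-Kincaid grade level, scipy for the Mann-Whitney U test).
# All responses and scores below are placeholders, not study data.
import textstat
from scipy.stats import mannwhitneyu

ALPHA = 0.05

# Hypothetical responses collected from each source (the study posed each query twenty times).
google_responses = ["Carpal tunnel syndrome is caused by pressure on the median nerve at the wrist."]
chatgpt_responses = ["Carpal tunnel syndrome arises from compression of the median nerve within the carpal tunnel."]

# Readability: approximate U.S. grade level of each response.
google_grades = [textstat.flesch_kincaid_grade(r) for r in google_responses]
chatgpt_grades = [textstat.flesch_kincaid_grade(r) for r in chatgpt_responses]

# Reliability: rubric-based Likert scores assigned by reviewers (placeholder values).
google_scores = [4, 5, 4, 3, 5]
chatgpt_scores = [3, 3, 4, 2, 3]

# Compare sources with a Mann-Whitney U test at alpha = 0.05.
stat, p_value = mannwhitneyu(google_scores, chatgpt_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}, significant = {p_value < ALPHA}")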
Results: For high-moderate consensus topics, Google had superior readability at an 8th-grade reading level compared to a college-sophomore level for ChatGPT responses (p<0.0001) [Figure 1A]. For moderate consensus topics, Google had superior reliability compared to ChatGPT (p<0.05). For high consensus topics, ChatGPT4 had similar reliability to Google (p=0.421), while ChatGPT3.5 was inferior to both Google and ChatGPT4 (p<0.05). For low consensus topics, readability was poor throughout; Google slightly outperformed ChatGPT at a 12th-grade versus college-freshman level, respectively (p<0.05) [Figure 1B]. Google trended towards superior reliability compared to ChatGPT4 (p=0.177) and had superior reliability to ChatGPT3.5 (p=0.01). For high-moderate consensus topics, Google had superior coverage of disease etiology (p<0.05). For low consensus topics, Google had inferior coverage of the procedure (p<0.05) but superior coverage of treatment efficacy and alternatives compared to ChatGPT (p<0.05).
Conclusions: Medical consensus influenced the utility of each information source. For topics with high-moderate consensus, Google remains a reliable and accessible source of information, while ChatGPT is reliable but less accessible due to poor readability. Both Google and ChatGPT should be used with caution for low consensus topics: balanced, nuanced discussion of these topics is not available from public sources, but rather from subscription-based sources that ChatGPT cannot access to incorporate into its responses. There remains a critical role for providers to guide patients towards reputable information sources during shared decision-making, especially for conditions and treatments in hand surgery with low consensus.