American Association for Hand Surgery

Carpal Tunnel Surgery Information: Comparison of AI-Generated Information with Google Search for Common Patient Questions
Brian M Phelps, MD; Sree Vemu, MD; Chia H Wu, MD; Shari R Liberman, MD
Houston Methodist Hospital, Houston, TX

Background: The emergence of artificial intelligence (AI) tools and large language models (LLMs) such as ChatGPT has the potential to enhance patient knowledge. This study examines the role of such AI tools in providing accurate and reliable information to patients exploring carpal tunnel surgery, comparing the efficacy of ChatGPT and Google Bard against Google Web Search, the predominant search engine in the U.S., in delivering relevant and accurate information.

Methods: The top 10 questions and answers from Google's "People also ask" section for the search term "carpal tunnel release" were collected. The same questions were then posed to ChatGPT 3.5, ChatGPT 4 (with the WebChatGPT and KeyMate AI plugins, tested separately), and Google Bard (Figure 1). Repetitive or irrelevant questions were excluded. Google Search and both ChatGPT 4 configurations provided sources, while ChatGPT 3.5 and Google Bard did not. All responses were examined for accuracy by two board-certified orthopedic hand surgeons blinded to the source of the answers and graded on the following scale: 1 (incorrect), 2 (mixed correct and incorrect), 3 (correct but not comprehensive), 4 (comprehensive and correct). For analysis, grades were dichotomized into two groups: grades 1 and 2 as inaccurate, and grades 3 and 4 as accurate. Cohen's kappa coefficient was used to determine interobserver reliability of the answer assessments.
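
For context, below is a minimal sketch of how the dichotomization and Cohen's kappa calculation described above could be implemented in Python. The rater grades shown are hypothetical, not study data, and the abstract does not specify whether kappa was computed on the raw 1-4 grades or the dichotomized categories; this sketch assumes the latter.

    from collections import Counter

    def cohen_kappa(rater1, rater2):
        # kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
        # agreement and p_e is the agreement expected by chance
        n = len(rater1)
        p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
        # Chance agreement from each rater's marginal label frequencies
        c1, c2 = Counter(rater1), Counter(rater2)
        p_e = sum(c1[k] * c2[k] for k in set(rater1) | set(rater2)) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical 1-4 grades from two blinded raters for ten answers
    grades_a = [4, 3, 2, 3, 1, 4, 3, 2, 3, 4]
    grades_b = [3, 3, 2, 4, 2, 4, 2, 1, 3, 4]

    # Dichotomize as in the abstract: grades 1-2 -> inaccurate, 3-4 -> accurate
    acc_a = ["accurate" if g >= 3 else "inaccurate" for g in grades_a]
    acc_b = ["accurate" if g >= 3 else "inaccurate" for g in grades_b]

    print(f"Cohen's kappa = {cohen_kappa(acc_a, acc_b):.2f}")  # ~0.78 here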

Results: The AI tools delivered substantial knowledge of carpal tunnel release and outperformed the traditional Google search (Table 1). Only one Google search answer (10%) was deemed accurate, whereas ChatGPT 3.5, ChatGPT 4 with WebChatGPT, ChatGPT 4 with KeyMate AI, and Google Bard were 70%, 70%, 100%, and 90% accurate, respectively. Regarding comprehensiveness, no Google search or Google Bard answers met that standard (grade 4), whereas 20% of ChatGPT 3.5 answers, 20% of ChatGPT 4 with WebChatGPT answers, and 50% of ChatGPT 4 with KeyMate AI answers were deemed comprehensive by both surgeons. The most common sources for the Google search and both ChatGPT 4 configurations were academic and non-profit websites. Interobserver reliability between the two evaluators yielded a Cohen's kappa coefficient of 0.69, indicating substantial agreement.

Conclusion: LLMs show promise as an information source for patients, providing more accurate answers than a traditional Google search. However, the reliability and comprehensiveness of their responses require further validation. These tools' capacity to provide trustworthy information must align with the objectives of both physicians and patients before widespread adoption.

