American Association for Hand Surgery



Can ChatGPT Help Hand Surgeons Answer Patient Questions, Streamline Billing, and Conduct Diagnostic Coding?
Sohail R Daulat, B.S.1, Romir P Parmar, B.S.2, Tyler Brady, B.S.2, Jae Won Song, B.S.3, Lloyd Champagne, M.D.4; Cameron Roth, M.D.5
(1)University of Arizona College of Medicine - Tucson, Tucson, AZ, (2)University of Arizona College of Medicine - Phoenix, Phoenix, AZ, (3)Rocky Vista University College of Osteopathic Medicine, Englewood, CO, (4)Arizona Center for Hand to Shoulder Surgery, Phoenix, AZ, (5)OrthoArizona, Glendale, AZ

Introduction: Large language models (LLMs) like ChatGPT show promise in healthcare, but the performance of newer models, such as GPT-4, GPT-4o, and GPT-o1, has not been specifically evaluated in hand surgery applications. Their utility in supporting patient education, billing, and diagnostic coding remains underexplored.
Methods: LLMs were tested using standardized prompts for five common hand conditions: distal radius fractures, trigger finger, carpal tunnel syndrome, cubital tunnel syndrome, and ganglion cysts. Three tasks were evaluated: (1) patient education (15 common questions), (2) CPT code prediction from de-identified operative notes, and (3) ICD-10 code prediction from clinical vignettes (HPI, physical exam, and imaging). Readability was assessed using the SMOG Index, Flesch-Kincaid Grade Level, and Flesch Reading Ease. Accuracy for billing and diagnostic tasks was measured using F1 scores. Statistical analyses included ANOVA, Kruskal-Wallis, and pairwise comparisons (significance threshold P = 0.05).
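The metrics named above follow standard definitions. As an illustrative sketch (the helper names are our own, assuming the published SMOG grade and Flesch-Kincaid grade formulas and the usual F1 definition from counts of true positives, false positives, and false negatives):

```python
import math

def smog_grade(polysyllable_count: int, sentence_count: int) -> float:
    """SMOG grade (McLaughlin, 1969), based on words of 3+ syllables
    in a sample of sentences."""
    return 1.0430 * math.sqrt(polysyllable_count * (30 / sentence_count)) + 3.1291

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from word, sentence, and syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = harmonic mean of precision and recall over predicted codes."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

For example, a model that returns every correct CPT code for a note and no extras (no false positives or false negatives) scores F1 = 1.00, matching the near-perfect CPT detection reported below.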
Results: GPT-4o produced the most readable responses (median SMOG 14.9), while GPT-o1 generated more complex text (median SMOG 17.1, P<0.01). Responses to surgical questions were harder to read than diagnostic ones (P<0.01). All models achieved near-perfect accuracy in detecting CPT codes (median F1 = 1.00). GPT-o1 performed best in predictive billing, though differences were not significant. For diagnostic coding, all models were more accurate in identifying general diagnoses than specific ICD-10 codes. Accuracy improved with the inclusion of exam and imaging data, but no model significantly outperformed the others.
Discussion: LLMs reliably detected CPT codes and showed improved diagnostic performance with more clinical data, but struggled with precise ICD-10 coding. All models produced educational content above the recommended reading level, potentially limiting accessibility. These findings underscore both the promise and limitations of LLMs in hand surgery applications.