Evaluating ChatGPT as a Tool for Surgical Coding in Hand Trauma
Ermina Lee, BS (1); Tiffany Shi, PhD (1); Victoria Lee, BS (1); Kelly Spiller, MD (2); Maleeh Effendi, MD (2); Ryan Gobble, MD (2); Ann R. Schwentker, MD (2,3)

(1) University of Cincinnati College of Medicine, Cincinnati, OH; (2) University of Cincinnati Medical Center, Cincinnati, OH; (3) Plastic Surgery, Cincinnati Children's Hospital Medical Center, Cincinnati, OH
Background: Medical billing in hand trauma is inherently complex, requiring anatomical understanding and procedural clarity to correctly assign unbundled Current Procedural Terminology (CPT) codes. While machine learning has shown promise in surgical billing, the role of publicly available artificial intelligence (AI) tools in hand trauma is unknown. This study evaluates the performance of the Chat Generative Pre-Trained Transformer (ChatGPT) Medical Coding AI in assigning CPT codes and associated Relative Value Units (RVUs) for hand trauma.
Methods: Fifty hand trauma operations of varying complexity performed between 2018 and 2023 were selected by retrospective review. Operative reports were entered into the ChatGPT-4.0 Medical Coding AI using standard and augmented prompts, the latter including the Eaton Hand Coding Manual as a reference (Figure 1). Coding accuracy was assessed based on documentation support. Operative report quality was assessed with a hand surgeon's adaptation of the Structured Assessment Format for Evaluating Operative Reports (SAFE-OR) questionnaire. Additionally, 302 CPT codes across 100 operations were queried for RVU assignment by three users, using the Physician Fee Schedule as the benchmark.
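To make the prompting workflow concrete, the sketch below shows one way the standard versus augmented conditions could be implemented with the OpenAI Python SDK. The prompt wording, the gpt-4o model name, and the eaton_reference parameter are illustrative assumptions; the study itself used a ChatGPT "Medical Coding AI" interface rather than this API call.

```python
# Minimal sketch of the two prompting conditions (standard vs. augmented
# with the Eaton Hand Coding Manual). All prompt text and the model name
# are illustrative assumptions, not the study's exact materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def code_operative_report(report_text: str, eaton_reference: str | None = None) -> str:
    """Request unbundled CPT codes for an operative report; optionally
    augment the prompt with a coding-reference excerpt."""
    prompt = (
        "Assign all documentation-supported, unbundled CPT codes for this "
        "hand trauma operative report, listing each code with its "
        "justification.\n\n"
        f"Operative report:\n{report_text}"
    )
    if eaton_reference:
        # Augmented condition: append the reference text to the prompt.
        prompt += f"\n\nUse this coding reference:\n{eaton_reference}"
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the ChatGPT-4.0 Medical Coding AI
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```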
Results: The 50 operative notes described 126 repairs and 174 procedures, with high-quality documentation (mean SAFE-OR score: 96%) and no significant difference between author types (attending vs. resident; p=1.00). However, nearly a quarter (24.6%) of billed CPT codes lacked documentation support, and 74 codes were omitted despite procedural description. ChatGPT initially had low accuracy (30.5%), which improved to 54% with final prompting and to 76% with the Eaton reference. Overcoding by ChatGPT was observed in 31 operative notes (62%), while undercoding by billers occurred in 14 notes (28%). Common ChatGPT overcoding involved add-on codes such as CPT 15101 (skin grafts), CPT 15272 (skin substitutes), and CPT 64910 (nerve conduit repair). For RVU assignment, ChatGPT underreported total RVUs by 14.1%, with a mean per-operation error of 2.58 RVUs. Bland-Altman analysis demonstrated no systematic bias between users (mean difference = 0) but revealed high variability and low precision, with a standard deviation of 12.78 and limits of agreement from -25.05 to 25.05.
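For readers checking the Bland-Altman figures, the reported 95% limits of agreement follow directly from the summary statistics: with a mean difference of 0 and a standard deviation of 12.78, the limits are 0 ± 1.96 × 12.78 ≈ ±25.05. A minimal sketch, with placeholder arrays standing in for the per-operation RVU totals (not study data):

```python
import numpy as np

# Placeholder per-operation RVU totals for two users (illustrative only,
# not the study data).
user_a = np.array([10.5, 22.0, 7.8, 15.3, 31.2])
user_b = np.array([12.0, 20.5, 8.1, 14.9, 29.8])

diff = user_a - user_b
bias = diff.mean()       # systematic bias between users
sd = diff.std(ddof=1)    # SD of the differences (12.78 in the study)

# 95% limits of agreement: bias +/- 1.96 * SD.
# With bias = 0 and SD = 12.78, this reproduces the reported
# limits of -25.05 to 25.05.
loa = (bias - 1.96 * sd, bias + 1.96 * sd)
print(f"bias={bias:.2f}, limits of agreement=({loa[0]:.2f}, {loa[1]:.2f})")
```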
Conclusions: ChatGPT shows potential as an aid to hand trauma surgical coding but remains error-prone and inconsistent. Performance depended heavily on prompt structure, and the model exhibited bias toward leading prompts. Since AI is already marketed for medical billing and used by insurers to review claims, model refinement for unbundled procedures is crucial.