American Association for Hand Surgery


Back to 2026 ePosters


Machine Fabrication and Hallucination in Evaluating the Performance and Rationale of GPT-4o and Copilot in Hand Surgery Assessment Questions
Mohammad S Kahrizi, M.D.1; Benjamin Nieves, B.S.2; Peter Murray, M.D.3; Keith T Aziz, M.D.1
(1)Mayo Clinic Florida, Jacksonville, FL, (2)University of Puerto Rico, San Juan, PR, (3)Department of Orthopedic Surgery/Division of Hand Surgery, The Mayo Clinic, Jacksonville, FL

INTRODUCTION: Prior literature has shown that AI-generated medical content may include fabricated or inaccurate references despite providing answers that appear credible and accurate; some reports describe error and fabrication rates exceeding 50%. This study evaluates the performance of GPT-4o and Copilot in answering hand surgery questions from the Hand Surgery Self-Assessment Exam, as well as the validity and accuracy of the references provided to justify each answer. We hypothesized that there would be significant fabrications and machine hallucinations.
MATERIALS AND METHODS: This study, conducted in November 2024, compared GPT-4o and Copilot in answering 200 multiple-choice questions from the 2024 ASSH Self-Assessment Exam. AI accuracy, reference fabrication, and citation integrity were analyzed. Statistical tests, including chi-square, Cohen's kappa, and regression analyses, assessed performance differences, agreement, and fabrication rates. SPSS was used for all analyses.
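For readers unfamiliar with the first of these tests, a Pearson chi-square statistic for a 2×2 model-by-correctness table can be computed in a few lines of pure Python. The counts below are hypothetical placeholders, not the study's data; the published analysis was run in SPSS and may apply a continuity correction, so this is an illustrative sketch only.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]], e.g. rows = models,
    columns = correct/incorrect answer counts."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical example: model A answers 150/200 correctly,
# model B answers 120/200 correctly.
stat = chi_square_2x2(150, 50, 120, 80)
print(round(stat, 3))  # -> 10.256
```

With 1 degree of freedom, a statistic this large corresponds to p < 0.01, which is how a difference in accuracy between two models on the same question set would be flagged as significant.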
RESULTS: A total of 200 multiple-choice questions from the 2024 ASSH Self-Assessment Exam were analyzed. GPT-4o correctly answered 77.5% of questions, while Copilot achieved 59.5% (χ² = 37.595, p < 0.001). Fabricated references were prevalent: GPT-4o fabricated references in 97.5% of questions and Copilot in 90.61%. In total, 1,309 of 2,051 references (63.82%) for GPT-4o and 1,192 of 1,827 (65.24%) for Copilot were fabricated, with no significant difference between models (p = 0.631). GPT-4o altered both title and author in 82.05% of fabricated references, whereas Copilot did so in 73.78%. Agreement with ASSH-suggested references was low, at 17.0% for GPT-4o and 12.15% for Copilot (p = 0.011, Cohen's d = 0.138). Regression analysis showed a weak correlation between the total number of provided references and fabrication for GPT-4o (R² = 4.1%, p = 0.004) but a stronger one for Copilot (R² = 30.4%, p < 0.001). Fabrication rates did not significantly differ based on response correctness for either model.
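The agreement analyses above rely on Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. A minimal pure-Python sketch is below; the counts are hypothetical placeholders (e.g., treating GPT-4o and Copilot as two "raters" marked correct/incorrect per question), not the study's data.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for two binary raters, from the 2x2 agreement
    table [[a, b], [c, d]], where a and d count items on which the
    raters agree and b and c count disagreements."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from the raters' marginal proportions.
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: both correct on 100 questions, both wrong
# on 25, and they disagree on the remaining 75 of 200.
print(round(cohens_kappa(100, 55, 20, 25), 3))  # -> 0.157
```

A kappa near 0 indicates agreement barely above chance, while 1.0 indicates perfect agreement; this is why kappa can be low even when raw percent agreement looks moderate.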
CONCLUSION: Current AI models frequently fabricated references, altering both author names and study titles. Rates of agreement with ASSH-suggested references were low, underscoring concerns about AI-generated citations. These findings emphasize the need for rigorous verification in future models to prevent misinformation in medical education and research.