Readability, Accuracy, and Lexical Diversity of New ChatGPT Models for Common Carpal Tunnel Syndrome Questions
Romir P Parmar, B.S.¹; Sohail R Daulat, B.S.²; Rakshit Shah, Ph.D.²; Tyler Brady, B.S.¹; Jae Won Song, MBA³; Michael Montague, M.D.⁴; Cameron Roth, M.D.⁴
(1) University of Arizona College of Medicine - Phoenix, Phoenix, AZ; (2) University of Arizona College of Medicine - Tucson, Tucson, AZ; (3) Rocky Vista University College of Osteopathic Medicine, Englewood, CO; (4) OrthoArizona, Glendale, AZ
Introduction: Carpal tunnel syndrome (CTS) is a prevalent neuropathy that significantly affects quality of life and is commonly researched online by patients before seeking medical care. Large language models (LLMs) like ChatGPT are increasingly used for health information, yet concerns remain regarding the accuracy, readability, and complexity of their responses. Previous studies have assessed older ChatGPT models but have not comprehensively compared newer versions or evaluated lexical diversity. This study aimed to compare the accuracy, readability, and lexical diversity of responses from ChatGPT-4, ChatGPT-4o, and ChatGPT-o1 to common CTS-related patient questions.
Methods: Six frequently asked CTS questions were submitted to each model, and responses were independently graded by two board-certified hand surgeons using evidence-based guidelines. Accuracy was graded at the word level as "correct," "partially correct," or "incorrect." Lexical diversity was assessed using the Measure of Textual Lexical Diversity (MTLD), and readability was evaluated using the Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease Score (FRES), and Simple Measure of Gobbledygook (SMOG). Statistical comparisons among models were conducted using ANOVA or Kruskal-Wallis tests, with P < 0.05 considered significant.
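For readers who want to reproduce this kind of scoring, the sketch below implements the published FKGL, FRES, SMOG, and MTLD formulas and the ANOVA/Kruskal-Wallis comparison in Python. The syllable counter is a rough vowel-group heuristic, and the per-question score arrays are hypothetical placeholders; the study's actual response texts and tooling are not specified here.

```python
# Minimal sketch of the scoring pipeline: standard published formulas for
# FKGL, FRES, SMOG, and MTLD, plus the ANOVA / Kruskal-Wallis comparison.
# The syllable counter is an approximation; dedicated readability tools
# may produce slightly different counts.
import re
from statistics import mean
from scipy.stats import f_oneway, kruskal

def syllables(word: str) -> int:
    # Approximate syllables as runs of vowels, minus a trailing silent 'e'.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def text_stats(text: str):
    # Sentence, word, syllable, and polysyllable (>= 3 syllables) counts.
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syl = [syllables(w) for w in words]
    return sentences, len(words), sum(syl), sum(1 for s in syl if s >= 3)

def fkgl(text: str) -> float:
    s, w, syl, _ = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59

def fres(text: str) -> float:
    s, w, syl, _ = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)

def smog(text: str) -> float:
    s, _, _, poly = text_stats(text)
    return 1.043 * (poly * 30 / s) ** 0.5 + 3.1291

def mtld(text: str, ttr_threshold: float = 0.72) -> float:
    # MTLD: mean length of sequential word runs whose type-token ratio (TTR)
    # stays above the threshold; the full measure averages a forward and a
    # reverse pass over the token stream.
    def one_pass(tokens):
        factors, types, count = 0.0, set(), 0
        for tok in tokens:
            count += 1
            types.add(tok)
            if len(types) / count < ttr_threshold:
                factors += 1          # a full factor is complete; reset
                types, count = set(), 0
        if count:                     # credit the leftover partial factor
            factors += (1 - len(types) / count) / (1 - ttr_threshold)
        return len(tokens) / factors if factors else float(len(tokens))
    tokens = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return mean([one_pass(tokens), one_pass(tokens[::-1])])

# Hypothetical per-question FKGL scores for the three models; the real
# values would come from the six graded responses per model.
gpt4  = [9.8, 10.2, 9.5, 10.0, 9.7, 9.9]
gpt4o = [9.5, 9.2, 9.6, 9.4, 9.3, 9.7]
gpto1 = [11.2, 11.8, 11.5, 11.9, 11.4, 11.6]
print(f_oneway(gpt4, gpt4o, gpto1))   # parametric comparison across models
print(kruskal(gpt4, gpt4o, gpto1))    # non-parametric alternative
```

In practice, ANOVA would be used where the score distributions look roughly normal and the Kruskal-Wallis test otherwise, which matches the either/or phrasing in the Methods.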
Results: All three ChatGPT models demonstrated comparable overall accuracy, with no significant differences among them, although accuracy did differ significantly between Questions 3 and 5 (P = 0.035). ChatGPT-4o produced the most lexically diverse response for Question 1 (P = 0.031), while ChatGPT-o1 had the highest lexical diversity for Question 5 (P = 0.019). Readability varied significantly, with ChatGPT-4o generating the most readable responses (FRES: 48.5; FKGL: 9.5; SMOG: 11.8). In contrast, ChatGPT-o1 produced the most complex content, reflected by a lower FRES and higher FKGL and SMOG scores. These readability differences were statistically significant across all metrics (P < 0.05).
Conclusions: While all ChatGPT models provided similarly accurate responses to common CTS questions, notable differences emerged in readability and lexical diversity. ChatGPT-4o offered the most accessible and patient-friendly content, whereas ChatGPT-o1 delivered more complex but lexically rich responses, highlighting important trade-offs for AI-assisted patient education.