Evaluation of the performance of generative artificial intelligence in generating radiology reports

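Objective To evaluate the performance of two generative artificial intelligence (AI) models in generating abdominal radiology reports and to compare them with human physicians. Methods The radiology reports of 300 patients who underwent abdominal CT and MRI at the Third Affiliated Hospital of Sun Yat-sen University between June 2023 and May 2024 were studied retrospectively. The generative AI models ERNIE 4.0 and Claude 3.5 Sonnet were used to regenerate radiology reports from the imaging findings of the 300 patients. Five radiologists rated comprehensiveness, accuracy, expressiveness, hallucinations, and acceptance without revision on a five-point Likert scale (1 = strongly disagree, 5 = strongly agree). The Friedman and Nemenyi tests were used for statistical analysis, and the performance of the generative AI models was compared with that of human physicians. Results Radiology reports from 300 patients were included. For comprehensiveness, Claude 3.5 Sonnet was comparable to human physicians, and both outperformed ERNIE 4.0 [(4.86±0.37) vs. (4.76±0.46) vs. (4.40±0.64) points; P=0.200 for the first two compared with each other, and P<0.01 for each of the first two compared with the third]. For accuracy, human physicians outperformed both AI models [(4.96±0.22) vs. (4.66±0.57) vs. (4.69±0.57) points; P<0.01 for the first compared with each of the latter two]. For acceptance without revision, Claude 3.5 Sonnet was comparable to human physicians, and both outperformed ERNIE 4.0 [(4.64±0.53) vs. (4.69±0.54) vs. (4.30±0.59) points; P=0.595 for the first two compared with each other, and P<0.01 for each of the first two compared with the third]. For expressiveness and hallucinations, the differences among the three were not statistically significant (all P>0.05). Conclusions The radiology reports generated by Claude 3.5 Sonnet were comparable to those of human physicians. This suggests that advanced generative AI has the potential to assist human physicians, helping to improve efficiency and reduce cognitive burden.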
Evaluation of the performance of generative artificial intelligence in generating radiology reports
Objective To evaluate the performance of two generative artificial intelligence (AI) models in generating abdominal radiology reports, and to compare it with the performance of radiologists. Methods The radiology reports of 300 patients who underwent abdominal CT and MRI in the Third Affiliated Hospital of Sun Yat-sen University from June 2023 to May 2024 were retrospectively studied. The generative AI models ERNIE 4.0 and Claude 3.5 Sonnet were used to regenerate the radiology reports of the 300 patients. Five radiologists evaluated the comprehensiveness, accuracy, expressiveness, hallucinations, and acceptance without revision of the impressions using a five-point Likert scale. The Friedman test and the Nemenyi test were used to compare the performance of the two models and the radiologists. Results CT and MRI reports from 300 patients were evaluated. For comprehensiveness, Claude 3.5 Sonnet was on a par with human physicians, and both were superior to ERNIE 4.0 (scores of 4.86±0.37 vs. 4.76±0.46 vs. 4.40±0.64; comparison between the first two, P=0.200; comparison between the first two and the third, both P<0.01). For accuracy, radiologists outperformed both ERNIE 4.0 and Claude 3.5 Sonnet (scores of 4.96±0.22 vs. 4.66±0.57 vs. 4.69±0.57; comparison between the first and the latter two, both P<0.01). For acceptance without revision, Claude 3.5 Sonnet was on a par with human physicians, and both were superior to ERNIE 4.0 (scores of 4.64±0.53 vs. 4.69±0.54 vs. 4.30±0.59; comparison between the first two, P=0.595; comparison between the first two and the third, both P<0.01). The expressiveness and hallucination scores showed no statistically significant differences among the three (all P>0.05). Conclusions Claude 3.5 Sonnet yields performance comparable to that of radiologists in generating radiology reports, indicating that advanced generative AI has the potential to assist radiologists, improve work efficiency, and reduce cognitive burden.
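
The abstract names the Friedman test followed by the Nemenyi test for comparing three report sources rated on the same cases. As a rough illustration only, the sketch below implements a tie-corrected Friedman statistic for exactly three related samples and the critical-difference form of the Nemenyi comparison at the 0.05 level. The function names, the restriction to three groups, and the toy ratings are assumptions made for this example; the abstract does not say how the five radiologists' scores were aggregated or which software was used, so this is not the authors' procedure.

```python
import math

# Critical value q_0.05 for the Nemenyi test with k = 3 groups
# (studentized range quantile divided by sqrt(2)).
Q_ALPHA_05_K3 = 2.343


def average_ranks(row):
    """Rank the values of one block from 1..k; tied values share their mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_test(blocks):
    """Tie-corrected Friedman test for exactly three related samples.

    blocks: one tuple of three ratings per case (one rating per report source).
    Returns (statistic, p_value, mean_ranks).
    """
    n, k = len(blocks), len(blocks[0])
    if k != 3:
        raise ValueError("this sketch only handles exactly three related samples")
    rank_rows = [average_ranks(block) for block in blocks]
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(k)]
    stat = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    # Tie correction: sum of (t^3 - t) over each group of t tied values in a block.
    tie_term = sum(block.count(v) ** 3 - block.count(v)
                   for block in blocks for v in set(block))
    correction = 1.0 - tie_term / (n * (k ** 3 - k))
    if correction == 0:
        raise ValueError("all ratings are tied within every block")
    stat /= correction
    # With k = 3 the reference chi-square distribution has 2 degrees of freedom,
    # whose upper-tail probability is exactly exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    mean_ranks = [s / n for s in rank_sums]
    return stat, p_value, mean_ranks


def nemenyi_critical_difference(n, k=3, q_alpha=Q_ALPHA_05_K3):
    """Two groups differ at the 0.05 level if their mean ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


if __name__ == "__main__":
    # Hypothetical toy ratings (source A, source B, source C); NOT data from the study.
    toy = [(5, 5, 4), (5, 4, 4), (5, 5, 3), (4, 5, 4), (5, 5, 4), (5, 4, 3)]
    stat, p, mean_ranks = friedman_test(toy)
    cd = nemenyi_critical_difference(len(toy))
    print(f"Friedman statistic={stat:.3f}, p={p:.4f}, mean ranks={mean_ranks}, CD={cd:.3f}")
```

Because Likert ratings produce many ties, the tie correction matters here. In practice a statistics library would normally be used rather than hand-written code, and it would also report exact pairwise P values instead of a single critical difference.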

Generative artificial intelligence; Natural language processing; Radiology report; Abdomen

黎超、陈优美、段亚妮、陈耀萍、陈秀珍、覃杰

Department of Radiology, the Third Affiliated Hospital of Sun Yat-sen University, Guangzhou 510630, Guangdong, China

Generative artificial intelligence; Natural language processing; Radiology report; Abdomen

2024

Journal of New Medicine (新医学)
Sun Yat-sen University

CSTPCD
Impact factor: 0.8
ISSN:0253-9802
Year, volume (issue): 2024, 55(11)