Large language model-enabled automation of diagnostic impression generation in breast MRI reports

  • Abstract:
    Objective We explored key factors influencing the performance of large language models (LLMs) in generating breast MRI reports to establish a cost-effective deployment strategy for large-scale diagnostic applications.
    Methods This retrospective study integrated Chinese radiology reports from three medical institutions to construct training, validation, and testing datasets. Several LLMs (including ChatGPT, Llama3, Qwen2.5, and DeepSeek-R1-Distill-Llama3_8B) were fine-tuned to evaluate the impact of model architecture, parameter size, and pre-training data on performance. Model-generated impressions were evaluated with BLEU, ROUGE, Cosine Similarity, and BERTScore, and diagnostic reasoning was assessed via BI-RADS classification accuracy (illustrative fine-tuning and scoring sketches follow the abstract).
    Results Llama3_8B's performance improved markedly with increased training data: BLEU rose from 1.69×10⁻³ to 0.78, ROUGE-L from 0.05 to 0.90, BERTScore from 0.52 to 0.94, and Cosine Similarity from 0.04 to 0.88. Models with different architectures, reasoning abilities, and parameter sizes showed minimal differences in diagnostic accuracy (66%–67%). Performance on external validation sets declined markedly compared with the internal set (e.g., Llama3_8B's BERTScore dropped from 0.94 to 0.71 and accuracy from 66% to 22%). In a reader study, LLM-generated reports outperformed human-written reports in completeness (4.56 vs. 4.46) and correctness (4.33 vs. 4.15).
    Conclusions Fine-tuned LLMs demonstrate strong performance in generating breast MRI diagnostic impressions. This study provides insights into the fine-tuning process, offering hospitals a cost-effective strategy to adapt models to institution-specific needs, thereby enhancing diagnostic efficiency and reducing radiologists' workload.
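
The Methods fine-tune several open-weight LLMs on institutional findings–impression pairs, but the abstract does not disclose the training recipe. Below is a minimal sketch of one plausible low-cost setup, assuming parameter-efficient LoRA fine-tuning with the Hugging Face transformers and peft libraries; the base checkpoint, prompt template, hyperparameters, and placeholder records are illustrative assumptions rather than the authors' configuration.

```python
# pip install transformers peft datasets accelerate torch
#
# Minimal LoRA fine-tuning sketch (an assumed recipe, not the study's published one):
# each training record pairs a report's findings with the radiologist's impression,
# and only the small adapter matrices are trained while the base weights stay frozen.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# LoRA adapters on the attention projections; rank and target modules are illustrative.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

# Hypothetical placeholder records; real training would use the institutional
# findings/impression pairs described in the Methods.
records = [{"findings": "双侧乳腺见多发强化结节。", "impression": "BI-RADS 4A,建议活检。"}]

def to_features(rec):
    # Simple instruction-style prompt; the actual template used is not reported.
    text = (f"报告所见:\n{rec['findings']}\n\n诊断印象:\n{rec['impression']}"
            f"{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=1024)

train_set = Dataset.from_list(records).map(
    to_features, remove_columns=["findings", "impression"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama3_8b-breast-mri-lora",
        per_device_train_batch_size=1, gradient_accumulation_steps=8,
        num_train_epochs=3, learning_rate=2e-4, logging_steps=10,
    ),
    train_dataset=train_set,
    # mlm=False -> causal-LM labels are the input ids themselves (shifted internally).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because only the adapter matrices are trained, an institution can adapt the same base model to its own reporting style at modest compute cost, which matches the low-cost, customizable deployment scenario described in the Conclusions.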
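
The text-quality metrics named in the Methods (BLEU, ROUGE-L, BERTScore, Cosine Similarity) can be computed with common open-source tooling. The sketch below scores one generated impression against its reference using nltk for BLEU, bert-score for BERTScore, a hand-rolled LCS-based ROUGE-L, and TF-IDF cosine similarity from scikit-learn; character-level tokenization for Chinese and these particular implementations are assumptions, not necessarily the study's configuration.

```python
# pip install nltk bert-score scikit-learn
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from bert_score import score as bert_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def _lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def evaluate_impression(generated: str, reference: str) -> dict:
    """Score a generated impression against the radiologist-written reference."""
    gen, ref = list(generated), list(reference)  # character-level tokens (assumption)

    bleu = sentence_bleu([ref], gen, smoothing_function=SmoothingFunction().method1)

    # ROUGE-L F-measure from the longest common subsequence.
    lcs = _lcs_len(gen, ref)
    precision, recall = lcs / len(gen), lcs / len(ref)
    rouge_l = 2 * precision * recall / (precision + recall) if lcs else 0.0

    # bert-score selects a default Chinese model for lang="zh" (downloads on first use).
    _, _, f1 = bert_score([generated], [reference], lang="zh")

    # Cosine similarity over character n-gram TF-IDF vectors (one possible choice).
    tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 2)).fit_transform(
        [generated, reference])
    cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

    return {"BLEU": bleu, "ROUGE-L": rouge_l,
            "BERTScore": float(f1[0]), "CosineSim": float(cos)}


if __name__ == "__main__":
    # Hypothetical example pair; not taken from the study's dataset.
    print(evaluate_impression(
        "双乳多发强化灶,BI-RADS 4A,建议活检。",
        "双侧乳腺多发强化结节,BI-RADS 4A类,建议穿刺活检。",
    ))
```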

     
