Abstract:
Objective: We explored key factors influencing the performance of large language models (LLMs) in generating breast MRI reports to establish a cost-effective deployment strategy for large-scale diagnostic applications.
Methods: This retrospective study integrated Chinese radiology reports from three medical institutions to construct training, validation, and testing datasets. Different LLMs (including ChatGPT, Llama3, Qwen2.5, and DeepSeek-R1-Distill-Llama3_8B) were fine-tuned to evaluate the impact of model architecture, parameter size, and pre-training data on performance. Impressions generated by the models were evaluated using BLEU, ROUGE, Cosine Similarity, and BERTScore. Diagnostic reasoning was evaluated via BI-RADS classification accuracy.
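As a rough illustration of the automatic evaluation described above, the sketch below scores one generated impression against a reference impression with BLEU, ROUGE-L, BERTScore, and embedding cosine similarity. The specific libraries and models used here (sacrebleu, rouge-score, bert-score, sentence-transformers, and the paraphrase-multilingual-MiniLM-L12-v2 encoder) are assumptions for illustration only, not the study's actual implementation.

```python
# Minimal sketch of impression-level scoring, assuming generated and reference
# impressions are available as plain strings. Chinese reports would typically
# need word segmentation before BLEU/ROUGE; English text is used here for brevity.
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

generated = "Irregular enhancing mass in the left breast, BI-RADS 4; biopsy recommended."
reference = "BI-RADS 4: irregular enhancing mass of the left breast, biopsy is recommended."

# BLEU (corpus-level API, called here on a single sentence pair; rescaled to 0-1)
bleu = sacrebleu.corpus_bleu([generated], [[reference]]).score / 100

# ROUGE-L F-measure
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"].fmeasure

# BERTScore F1 (lang="zh" would select a Chinese BERT model for Chinese reports)
_, _, f1 = bert_score([generated], [reference], lang="en")

# Cosine similarity between sentence embeddings
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
cosine = util.cos_sim(embedder.encode(generated), embedder.encode(reference)).item()

print(f"BLEU={bleu:.3f} ROUGE-L={rouge_l:.3f} "
      f"BERTScore-F1={f1.mean().item():.3f} Cosine={cosine:.3f}")
```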
Results: Llama3_8B model performance improved significantly with increased training data, with gains across all key metrics: BLEU (from 1.69×10⁻³ to 0.78), ROUGE-L (from 0.05 to 0.90), BERTScore (from 0.52 to 0.94), and Cosine Similarity (from 0.04 to 0.88). Models with different architectures, reasoning abilities, and parameter sizes showed minimal differences in inferential accuracy (66%–67%). Performance declined on the external validation sets relative to the internal set (e.g., Llama3_8B's BERTScore dropped from 0.94 to 0.71, and its accuracy from 66% to 22%). In a reader study, LLM-generated reports outperformed human-written reports in completeness (4.56 vs. 4.46) and correctness (4.33 vs. 4.15).
Conclusions: Fine-tuned LLMs demonstrate strong performance in generating breast MRI diagnostic impressions. This study provides insights into the fine-tuning process, offering hospitals a cost-effective strategy to adapt models to institution-specific needs, thereby enhancing diagnostic efficiency and reducing radiologists' workload.