
Are Medium-Sized Transformers Models still Relevant for Medical Records Processing?

Boammani Aser Lompo, Thanh-Dung Le

2024 · arXiv: 2404.10171
Citations: 3

TLDR

This study classifies numerical values extracted from medical records into seven distinct physiological categories using CamemBERT-bio, incorporating keyword embeddings to refine the model's attention mechanism and adopting a number-agnostic strategy that removes numerical values from the text to encourage context-driven learning.

Abstract

As large language models (LLMs) become the standard in many NLP applications, we explore the potential of medium-sized pretrained transformer models as a viable alternative for medical record processing. Medical records generated by healthcare professionals during patient admissions remain underutilized due to challenges such as complex medical terminology, the limited ability of pretrained models to interpret numerical data, and the scarcity of annotated training datasets. Objective: This study aims to classify numerical values extracted from medical records into seven distinct physiological categories using CamemBERT-bio. Previous research has suggested that transformer-based models may underperform compared to traditional NLP approaches in this context. Methods: To enhance the performance of CamemBERT-bio, we propose two key innovations: (1) incorporating keyword embeddings to refine the model's attention mechanism and (2) adopting a number-agnostic strategy that removes numerical values from the text to encourage context-driven learning. Additionally, we assess the criticality of extracted numerical data by verifying whether values fall within established standard ranges. Results: Our findings demonstrate significant performance improvements, with CamemBERT-bio achieving an F1 score of 0.89, an increase of over 20% compared to the 0.73 F1 score of traditional methods and only 0.06 points lower than GPT-4. These results were obtained despite the use of small and imbalanced training datasets.
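The two preprocessing ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the placeholder token, the regular expression, and the example reference ranges below are all illustrative assumptions.

```python
import re

# Matches integers and decimals written with either "." or "," (French
# clinical notes commonly use a comma as the decimal separator).
NUM_PATTERN = re.compile(r"\d+(?:[.,]\d+)?")

def make_number_agnostic(text: str, placeholder: str = "<num>") -> str:
    """Number-agnostic strategy (sketch): replace every numerical value
    with a placeholder so the classifier must rely on the surrounding
    context rather than the number itself."""
    return NUM_PATTERN.sub(placeholder, text)

# Criticality check (sketch): a value is flagged when it falls outside
# an established standard range for its category. Ranges shown here are
# illustrative examples only.
STANDARD_RANGES = {
    "heart_rate": (60.0, 100.0),   # beats per minute
    "temperature": (36.1, 37.8),   # degrees Celsius
}

def is_critical(category: str, value: float) -> bool:
    low, high = STANDARD_RANGES[category]
    return not (low <= value <= high)

print(make_number_agnostic("FC 112 bpm, température 38,5 °C"))
# -> FC <num> bpm, température <num> °C
print(is_critical("heart_rate", 112.0))  # -> True
```

The masked text would then be fed to the CamemBERT-bio classifier, while the original value is retained separately for the range-based criticality check.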
