Use of n-grams and K-means clustering to classify data from free text bone marrow reports

Natural language processing (NLP) has been used to extract information from and summarize medical reports. Currently, the most advanced NLP models require large training datasets of accurately labeled medical text. An approach to creating these large datasets is to use low resource intensive classic...

Bibliographic Details
Main Author:	Richard F. Xiang (Author)
Format:	Book
Published:	Elsevier, 2024-12-01.
Subjects:	article
Online Access:	https://doaj.org/article/c86d09fefc36442d9aaafa5b84a339c4

MARC


LEADER	00000 am a22000003u 4500
001	doaj_c86d09fefc36442d9aaafa5b84a339c4
042			\|a dc
100	1	0	\|a Richard F. Xiang \|e author
245	0	0	\|a Use of n-grams and K-means clustering to classify data from free text bone marrow reports
260			\|b Elsevier, \|c 2024-12-01T00:00:00Z.
500			\|a 2153-3539
500			\|a 10.1016/j.jpi.2023.100358
520			\|a Natural language processing (NLP) has been used to extract information from and summarize medical reports. Currently, the most advanced NLP models require large training datasets of accurately labeled medical text. An approach to creating these large datasets is to use less resource-intensive classical NLP algorithms. In this manuscript, we examined how an automated classical NLP algorithm was able to classify portions of bone marrow report text into their appropriate sections. A total of 1480 bone marrow reports were extracted from the laboratory information system of a tertiary healthcare network. The free text of these bone marrow reports was preprocessed by separating the reports into text blocks and then removing the section headers. A natural language processing algorithm involving n-grams and K-means clustering was used to classify the text blocks into their appropriate bone marrow sections. The impact of token replacement of numerical values, accession numbers, and clusters of differentiation, varying the number of centroids (1-19) and n-grams (1-5), and utilizing an ensemble algorithm were assessed. The optimal NLP model was found to employ an ensemble algorithm that incorporated token replacement, used 1-grams (bag of words), and used 10 centroids for K-means clustering. This optimal model was able to classify text blocks with an accuracy of 89%, suggesting that classical NLP models can accurately classify portions of marrow report text.
546			\|a EN
690			\|a Hematologic pathology
690			\|a Bone marrow
690			\|a K-means clustering
690			\|a n-grams
690			\|a Machine learning
690			\|a Natural language processing
690			\|a Computer applications to medicine. Medical informatics
690			\|a R858-859.7
690			\|a Pathology
690			\|a RB1-214
655	7		\|a article \|2 local
786	0		\|n Journal of Pathology Informatics, Vol 15, Iss , Pp 100358- (2024)
787	0		\|n http://www.sciencedirect.com/science/article/pii/S2153353923001724
787	0		\|n https://doaj.org/toc/2153-3539
856	4	1	\|u https://doaj.org/article/c86d09fefc36442d9aaafa5b84a339c4 \|z Connect to this object online.
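
The pipeline summarized in the abstract (token replacement, n-gram counting, K-means clustering of text blocks) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the placeholder tokens (`<num>`, `<acc>`, `<cd>`), the accession-number pattern, the farthest-first centroid initialization, and all function names are assumptions chosen for illustration, and the ensemble step described in the article is not reproduced.

```python
import re
from collections import Counter


def replace_tokens(text):
    """Replace CD markers, accession numbers and numbers with placeholders.

    The placeholder names and the accession format are assumptions.
    """
    text = text.lower()
    text = re.sub(r"\bcd\d+[a-z]?\b", "<cd>", text)           # e.g. CD34
    text = re.sub(r"\b[a-z]{1,3}\d{2}-\d+\b", "<acc>", text)  # e.g. BM23-1042 (assumed format)
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "<num>", text)        # remaining numerical values
    return text


def tokenize(text):
    """Split into placeholder tokens and lowercase alphabetic words."""
    return re.findall(r"<\w+>|[a-z]+", text)


def ngrams(tokens, n):
    """Return the list of n-grams (n=1 gives a plain bag of words)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(blocks, n=1):
    """Turn text blocks into n-gram count vectors over a shared vocabulary."""
    counts = [Counter(ngrams(tokenize(replace_tokens(b)), n)) for b in blocks]
    vocab = sorted(set().union(*counts))
    return [[c[term] for term in vocab] for c in counts], vocab


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50):
    """Lloyd's K-means with deterministic farthest-first initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable, so stop early
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented example text blocks, not taken from the study data.
blocks = [
    "Flow cytometry shows CD34 positive blasts at 5%",
    "Flow cytometry shows CD117 positive blasts at 12%",
    "Aspirate smear is cellular with trilineage hematopoiesis",
    "Aspirate smear is hypocellular with trilineage hematopoiesis",
]
vectors, vocab = vectorize(blocks, n=1)
labels, _ = kmeans(vectors, k=2)
print(labels)
```

Token replacement makes the two flow cytometry blocks identical after preprocessing, which is one way such normalization could help clustering. In a setup like the one the abstract describes, `n` would range over 1-5 and `k` over 1-19, and each cluster would still need to be mapped to a report section before accuracy could be measured.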
