Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada

Background: The use of social media data provides an opportunity to complement traditional influenza and COVID-19 surveillance methods for the detection and control of outbreaks and informing public health interventions. Objective: The first aim of this study is to investigate the degree to which Twitter...

Bibliographic Details
Main Authors:	Yuan Tian (Author), Wenjing Zhang (Author), Lujie Duan (Author), Wade McDonald (Author), Nathaniel Osgood (Author)
Format:	Book
Published:	Frontiers Media S.A., 2023-06-01T00:00:00Z.
Subjects:	article
Online Access:	Connect to this object online.

MARC


LEADER	00000 am a22000003u 4500
001	doaj_6e40fd3b0b0d40f0b25fb3fbc3faedc5
042			\|a dc
100	1	0	\|a Yuan Tian \|e author
700	1	0	\|a Wenjing Zhang \|e author
700	1	0	\|a Lujie Duan \|e author
700	1	0	\|a Wade McDonald \|e author
700	1	0	\|a Nathaniel Osgood \|e author
245	0	0	\|a Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada
260			\|b Frontiers Media S.A., \|c 2023-06-01T00:00:00Z.
500			\|a 2673-253X
500			\|a 10.3389/fdgth.2023.1203874
520			\|a Background: The use of social media data provides an opportunity to complement traditional influenza and COVID-19 surveillance methods for the detection and control of outbreaks and informing public health interventions. Objective: The first aim of this study is to investigate the degree to which Twitter users disclose health experiences related to influenza and COVID-19 that could be indicative of recent plausible influenza cases or symptomatic COVID-19 infections. Second, we seek to use the Twitter datasets to train and evaluate the classification performance of Bidirectional Encoder Representations from Transformers (BERT) and variant language models in the context of influenza and COVID-19 infection detection. Methods: We constructed two Twitter datasets using a keyword-based filtering approach on English-language tweets collected from December 2016 to December 2022 in Saskatchewan, Canada. The influenza-related dataset comprised tweets filtered with influenza-related keywords from December 13, 2016, to March 17, 2018, while the COVID-19 dataset comprised tweets filtered with COVID-19 symptom-related keywords from January 1, 2020, to June 22, 2021. The Twitter datasets were cleaned, and each tweet was annotated by at least two annotators as to whether it suggested recent plausible influenza cases or symptomatic COVID-19 cases. We then assessed the classification performance of pre-trained transformer-based language models, including BERT-base, BERT-large, RoBERTa-base, RoBERTa-large, BERTweet-base, BERTweet-covid-base, BERTweet-large, and COVID-Twitter-BERT (CT-BERT) models, on each dataset. To address the notable class imbalance, we experimented with both oversampling and undersampling methods. Results: The influenza dataset had 1129 out of 6444 (17.5%) tweets annotated as suggesting recent plausible influenza cases. The COVID-19 dataset had 924 out of 11939 (7.7%) tweets annotated as suggesting recent plausible COVID-19 cases. When compared against other language models on the COVID-19 dataset, CT-BERT performed the best, achieving the highest scores for recall (94.8%), F1 (94.4%), and accuracy (94.6%). For the influenza dataset, BERTweet models exhibited better performance. Our results also showed that applying data balancing techniques such as oversampling or undersampling did not lead to improved model performance. Conclusions: Utilizing domain-specific language models for monitoring users' health experiences related to influenza and COVID-19 on social media shows improved classification performance and has the potential to supplement real-time disease surveillance.
546			\|a EN
690			\|a influenza
690			\|a COVID-19
690			\|a social media
690			\|a transformer-based language models
690			\|a digital surveillance
690			\|a Medicine
690			\|a R
690			\|a Public aspects of medicine
690			\|a RA1-1270
690			\|a Electronic computers. Computer science
690			\|a QA75.5-76.95
655	7		\|a article \|2 local
786	0		\|n Frontiers in Digital Health, Vol 5 (2023)
787	0		\|n https://www.frontiersin.org/articles/10.3389/fdgth.2023.1203874/full
787	0		\|n https://doaj.org/toc/2673-253X
856	4	1	\|u https://doaj.org/article/6e40fd3b0b0d40f0b25fb3fbc3faedc5 \|z Connect to this object online.
