Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

Background: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from...

Bibliographic Details
Main Authors: El Emam, Khaled (Author), Mosquera, Lucy (Author), Bass, Jason (Author)
Format: Article
Published: JMIR Publications, 2020-11-01T00:00:00Z.

MARC

LEADER 00000 am a22000003u 4500
001 doaj_97309c05f355472e9f9c84cfc7f11fa9
042 |a dc 
100 1 0 |a El Emam, Khaled  |e author 
700 1 0 |a Mosquera, Lucy  |e author 
700 1 0 |a Bass, Jason  |e author 
245 0 0 |a Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation 
260 |b JMIR Publications,   |c 2020-11-01T00:00:00Z. 
500 |a 1438-8871 
500 |a 10.2196/23139 
520 |a Background: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. Objective: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. Methods: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. Results: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. Conclusions: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data. 
546 |a EN 
690 |a Computer applications to medicine. Medical informatics 
690 |a R858-859.7 
690 |a Public aspects of medicine 
690 |a RA1-1270 
655 7 |a article  |2 local 
786 0 |n Journal of Medical Internet Research, Vol 22, Iss 11, p e23139 (2020) 
787 0 |n http://www.jmir.org/2020/11/e23139/ 
787 0 |n https://doaj.org/toc/1438-8871 
856 4 1 |u https://doaj.org/article/97309c05f355472e9f9c84cfc7f11fa9  |z Connect to this object online.