Is Anonymized Data Truly Safe From Re-Identification? Maybe not.

Published on JD Supra on August 5, 2019

Across all industries, data collection is ubiquitous. One recent study estimates that over 2.5 quintillion bytes of data are created every day, and that over 90% of the data in the world was generated in the last two years. Not surprisingly, the proliferation of data collection has been an impetus for increased regulatory scrutiny of the collection and use of personal data.

Companies rely on data anonymization both to maximize the utility and value of the personal data they collect and to comply with privacy regulations. Although data protection regulations vary, data that meets the de-identification or anonymization requirements of the applicable regulation is not considered personal data and is thus exempt from privacy regulations such as the California Consumer Privacy Act (CCPA) and the European General Data Protection Regulation (GDPR). For example, the CCPA does not restrict a business’s ability to collect, use, retain, sell, or disclose consumer information that is de-identified or aggregated, and GDPR does not apply to personal data that has been anonymized. Conversely, a company that fails to adequately de-identify or anonymize data remains subject to these regulations and may violate the CCPA or GDPR with respect to its use of personal or consumer data.

Recent research on the re-identification of anonymized data sets indicates that current anonymization techniques are often ineffective at protecting individuals against re-identification. A recent study published in Nature Communications (“Estimating the success of re-identifications in incomplete datasets using generative models”) found that 99.98% of Americans could be re-identified from any anonymized data set that uses only 15 demographic attributes. In addition, the researchers found that even if an anonymized data set is “heavily incomplete,” they could still estimate the likelihood of correctly re-identifying an individual with high accuracy. On that basis, they rejected the argument that the incompleteness of most data sets reduces the risk of re-identification.

The study posits that many anonymized data sets may not meet the requirements of GDPR or the CCPA, and calls into question whether the current release-and-forget model of anonymization is adequate. For example, Recital 26 of GDPR defines anonymous information as “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” Recital 26 further provides that in order to determine whether “means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.” Consequently, if data subjects can be identified from a small number of attributes, and available technology removes significant hurdles to re-identification, companies should consider whether their current anonymization practices fall short of the standard that GDPR (and other privacy regulations) prescribe for anonymization or de-identification.

The researchers also made an online tool available that allows individuals to see the likelihood of being re-identified from anonymized data by plugging in a few common demographic characteristics. (On average, individuals have an 83% chance of being re-identified if gender, birth date and ZIP code are known.) The tool also allows individuals to include additional basic demographic characteristics to see the increased likelihood of identification.
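
To make the idea concrete, the short Python sketch below counts how many records in a data set are unique on a chosen combination of attributes, and shows how that share grows when one more attribute is added. It is an illustration only: it is not the generative model used in the study or the logic behind the researchers' online tool, and the function name uniqueness_rate, the field names, and the toy records are all hypothetical. It also measures uniqueness only within the sample at hand, whereas the study estimates uniqueness in the wider population.

```python
from collections import Counter


def uniqueness_rate(records, quasi_identifiers):
    """Return the fraction of records whose combination of the given
    attributes appears exactly once in `records` (hypothetical helper)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0


# Made-up toy records with names already removed; not real data.
records = [
    {"gender": "F", "birth_date": "1980-03-14", "zip": "94105", "marital_status": "single"},
    {"gender": "F", "birth_date": "1980-03-14", "zip": "94105", "marital_status": "married"},
    {"gender": "M", "birth_date": "1975-11-02", "zip": "94105", "marital_status": "married"},
    {"gender": "M", "birth_date": "1990-07-21", "zip": "10001", "marital_status": "single"},
]

# Share of records that are unique on three attributes...
base = uniqueness_rate(records, ["gender", "birth_date", "zip"])
# ...and after one more attribute is known.
more = uniqueness_rate(records, ["gender", "birth_date", "zip", "marital_status"])

print(f"unique on 3 attributes: {base:.0%}")
print(f"unique on 4 attributes: {more:.0%}")
```

In this toy example, half of the records are unique on gender, birth date and ZIP code, and all of them become unique once marital status is added. A company could run a similar count on its own data as a rough first check, keeping in mind that a low sample uniqueness rate does not by itself establish that the data is anonymized under GDPR or de-identified under the CCPA.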

Although numerous prior studies have established that data anonymization is often reversible, the latest study demonstrates that technological advances have made it possible to de-anonymize data that previously might not have been re-identifiable. As a result, it is becoming increasingly difficult to truly de-identify a data set and thus satisfy the requirements of privacy laws such as GDPR and the CCPA.