Your data has been de-identified. You are anonymous. We have pseudonymised your information. Your private data is masked. The Personally Identifiable Information (PII) has been scrubbed.

What if I told you there’s a good chance none of that matters?

A research paper from Carnegie Mellon University back in 2000 showed that:

87% (216 million of 248 million) of the population in the United States [were uniquely identifiable] based only on [postcode], gender and date of birth.

Three pieces of seemingly non-PII data were enough to identify almost 9 out of 10 people in the US. A subsequent study on similar datasets by PARC (the Palo Alto Research Center) arrived at a 63% identification result which, whilst lower, doesn’t seem much less terrifying.

We’ve known about this statistical oddity for many years: we significantly over-estimate the amount of data required to uniquely identify us as individuals within a cohort. Yet it remains a routine gap in our organisational data and privacy strategies, because it just doesn’t sound right. Our gut tells us that we can give away a few bits of vague information and still remain anonymous.

Our gut is wrong.
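If that still doesn’t sound right, you can measure it on any dataset you hold. Here’s a minimal sketch in Python with pandas, using entirely made-up records, that counts how many rows are pinned down uniquely by postcode, gender and date of birth alone:

```python
import pandas as pd

# Hypothetical "de-identified" extract: no names, just demographics.
df = pd.DataFrame({
    "postcode": ["3000", "3000", "3121", "3121", "3000"],
    "gender":   ["F", "M", "F", "F", "M"],
    "dob":      ["1990-07-14", "1985-03-02", "1990-07-14", "1972-11-30", "1985-03-02"],
})

# Count rows per (postcode, gender, dob) combination; a combination that
# appears exactly once identifies exactly one person.
sizes = df.value_counts(["postcode", "gender", "dob"])
unique_rows = (sizes == 1).sum()
print(f"{unique_rows / len(df):.0%} of rows are uniquely identifiable")  # 60%
```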

These older studies focused primarily on demographic data, the most plentiful data collected at that point in history. Today, however, there are many more sources of individualised data available to would-be re-identifiers.

Browser fingerprinting (the most common method of web identification in an increasingly cookie-less world), mouse movement and finger tracking, invisible pixel tracking - all of these are routine on most of the major sites you’d visit. Then there are the permissions you grant to your apps - location/GPS data, lists of other installed apps, contact information, call/text/message logs. These methods are now more common than not, and they’re deeply enriching the dataset that is available on you, and unique to only you.

Now that we know what is out there, how can we cut through the crap of the terminology used by companies that try to convince us they’re keeping our information de-identified?

What techniques are used to de-identify data?

There are several different techniques and pieces of terminology used when referencing the act of data de-identification. They are:

Hashing - Hashing is the process of replacing a value (eg a phone number) with an algorithmically generated token via a one-way algorithm. A key feature of hashing is that it is repeatable (eg the phone number 12345678 will always hash to the same value). These hashing algorithms are “one way” in that they can’t be “un-hashed” to reveal the initial value. The most common example is the hashing of the passwords you use to log in to websites and apps. The website stores the hashed version of your password, so it can tell whether you’ve entered the correct password without ever actually knowing what your password is (it compares the hash of the password you enter to the hash it has stored on its side).
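A minimal sketch of this in Python using the standard hashlib module (note that real password storage also adds a per-user salt and a deliberately slow algorithm on top of the plain hash shown here):

```python
import hashlib

def hash_value(value: str) -> str:
    """Replace a value with its SHA-256 digest; the same input always
    yields the same token, and the digest can't be reversed."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

print(hash_value("12345678"))                            # same output on every run
print(hash_value("12345678") == hash_value("12345678"))  # True: repeatable
```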

Redacting or Pseudonymisation - Redacting takes information and replaces it with placeholder characters. Most redacting is done in a way that can’t be linked back after the fact - for example, we could replace the last name “Brown” with “XXXXX” and the last name “Ashok” with “XXXXX”. Pseudonymisation is similar, but swaps the value for an artificial stand-in (eg “Customer 1047”) so that records can still be linked together internally.
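A toy sketch of both approaches (the incrementing customer number here is purely an illustrative pseudonym scheme):

```python
import itertools

def redact(value: str) -> str:
    """Blank out a value entirely; 'Brown' and 'Ashok' both become 'XXXXX'."""
    return "X" * len(value)

# Assumed pseudonymisation scheme: a simple incrementing customer number.
_counter = itertools.count(1)
_pseudonyms = {}

def pseudonymise(value: str) -> str:
    """Swap a value for a consistent artificial stand-in."""
    if value not in _pseudonyms:
        _pseudonyms[value] = f"Customer {next(_counter)}"
    return _pseudonyms[value]

print(redact("Brown"), redact("Ashok"))              # XXXXX XXXXX
print(pseudonymise("Brown"), pseudonymise("Brown"))  # Customer 1 Customer 1
```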

Aggregating - By grouping cohorts or portions of data together in an aggregate you can maintain the statistical usefulness of the data but lose the individual grain that carried the personal detail. When looking at demographic details across a state, you could aggregate information into postcodes to assist in analysis without requiring the individual line records.
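A short sketch with pandas, rolling made-up individual rows up to postcode level:

```python
import pandas as pd

people = pd.DataFrame({
    "postcode": ["3000", "3000", "3121", "3121"],
    "income":   [72000, 65000, 88000, 91000],
})

# Individual line records are replaced with postcode-level statistics.
by_postcode = people.groupby("postcode")["income"].agg(["count", "mean"])
print(by_postcode)
```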

Generalising - You can remove some of the focus from specific identifiable variables to assist in de-identification, such as removing the day-of-month from a Date of Birth and only keeping month/year, or storing Age instead of Date of Birth.
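Both of the generalisations mentioned above are near one-liners; a sketch:

```python
from datetime import date

def to_month_year(dob: date) -> str:
    """Drop the day-of-month, keeping only month/year."""
    return dob.strftime("%m/%Y")

def to_age(dob: date, today: date) -> int:
    """Store an age rather than a full date of birth."""
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

dob = date(1990, 7, 14)
print(to_month_year(dob))             # 07/1990
print(to_age(dob, date(2024, 1, 1)))  # 33
```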

Suppressing / Removing - Many variables are simply dropped in the process of de-identification. If the data is not required for the purpose of evaluation or analysis, it may be removed without any impact on the resulting analysis, while reducing the identification vectors on a record.
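In practice this is just dropping fields before the data leaves your hands; a sketch with a made-up record:

```python
record = {"name": "Priya", "postcode": "3000", "dob": "1990-07-14", "diagnosis": "asthma"}

# Fields not needed for the analysis are suppressed outright.
SUPPRESSED_FIELDS = {"name", "dob"}
safe_record = {k: v for k, v in record.items() if k not in SUPPRESSED_FIELDS}
print(safe_record)  # {'postcode': '3000', 'diagnosis': 'asthma'}
```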

Adding Noise - A relatively niche method, though increasingly used by privacy-focused companies (Apple uses noise techniques heavily in their analytics). This method intentionally adds noise (ie rubbish data) in a way that will not impact the overall analytical results of the dataset, while making it less useful for identifying any individual - the noise acts as a “red herring” in re-identification processes.
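One common flavour of this idea is differential-privacy-style Laplace noise on counts. This sketch is generic rather than any particular vendor’s implementation, and the scale parameter is illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def noisy_count(true_count: int, scale: float = 2.0) -> float:
    """Perturb a count with Laplace noise: aggregate statistics stay roughly
    accurate across many queries, but any single record's contribution blurs."""
    return true_count + rng.laplace(loc=0.0, scale=scale)

print(noisy_count(120))  # eg 121.7 - close to the truth, but not exact
```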

This is far from a complete list, but it covers most of the methods of de-identification you’ll see out in the wild.

Why is re-identification a growing problem?

Re-identification is a valuable science for the simple reason that it’s much easier to influence an individual than it is to influence a group - if I know who you are, your desires, your preferences, your demographics, then I’m able to message you with a high level of personal persuasion. This is the less nefarious reason for the significant investment organisations have made in re-identification.

It can also lead to much less tolerable instances. If I want to sell you insurance and I can stitch together and re-identify a bunch of publicly available or purchasable information, combined with the information you provided me in your application, I end up with a model that gives me a complete picture of you: health records, medical predispositions, evidence of risk-taking behaviour, education transcripts, social circle and relationships, employment history, income, and spending habits such as gambling. I could create a very specific risk profile for just you - but I have also created a dangerous digital representation of everything about you, destroying your right to privacy and potentially influencing your ability to access services in the future (insurance, financial services, employment etc).
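The mechanics of that stitching are unremarkable - often just a join on quasi-identifiers. A hedged sketch with entirely made-up data:

```python
import pandas as pd

# Hypothetical "de-identified" insurance applications (names removed).
applications = pd.DataFrame({
    "postcode": ["3000", "3121"],
    "gender":   ["F", "M"],
    "dob":      ["1990-07-14", "1985-03-02"],
    "smoker":   [False, True],
})

# Hypothetical purchased broker data, which still carries names.
broker_data = pd.DataFrame({
    "name":     ["J. Citizen", "A. Nguyen"],
    "postcode": ["3000", "3121"],
    "gender":   ["F", "M"],
    "dob":      ["1990-07-14", "1985-03-02"],
})

# A plain join on the shared quasi-identifiers re-attaches identities
# to the "anonymous" application rows.
reidentified = applications.merge(broker_data, on=["postcode", "gender", "dob"])
print(reidentified[["name", "smoker"]])
```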

I have also effectively taken away your consent by stitching many individual consents together. Think how many times you have clicked “I Agree” or filled in a form with small pieces of information - did you ever believe you were consenting to that information being brought together in one big “You” repository? Most definitely not. And now that we’re carrying around devices that track our every movement and action, the richness of this information can be downright scary.

Have you entered into a government consulate building?

Have you gone to a psychological respite facility?

This is happening, and it’s big business. The Data Brokerage industry within Australia is booming and continues to expand despite the ACCC investigating. The initial draft Report on Data Brokers can be seen here and is now awaiting responses from brokers until mid-2024; however, the ACCC Chair has already foreshadowed the findings by stating:

“Australian consumers may not be aware that their information is being collected, stored and sold by third-party data brokers with whom they have no direct relationship.”

Your data is a commodity and successful re-identification methods mean that a lot of the “anonymisation” or “de-identification” techniques used by Data Brokers to tick the regulatory boxes don’t protect you or provide you with anonymity or privacy.

How do we fix it?

In two words: legislation and regulation.

Waiting for everyone to be good corporate citizens is not the answer - the potential rewards for tracking, data commodification and data re-identification are too lucrative to ignore unless the appropriate legislation and regulation is in place as a strong deterrent.

Legislation is always a lagging process, and nowhere is this more obvious than with explosive technology growth (such as Privacy, Data and AI in the last 5 years). Cooperation from industry and strong consumer and privacy advocacy must continue as a stop-gap measure while legislation permeates across the globe. In the immediate term it is incumbent on us as data professionals to help advise and educate the general public on the awareness they need to have in relation to data privacy and the possibilities of data re-identification.

Play out this scenario with your data citizens and business users. Surprise them with statistics on how easy it is to re-identify data even when consent has been provided piecemeal, and remind them that legislation and regulation have not yet caught up with the technology landscape. Inform them - education is the best safeguard they have.

After all, once their data is out there, re-identified, linked, and on-sold it’s nigh impossible to put the cat back in the bag.