Quasi-identifiers and the challenges of anonymising data

“…oh, no don’t worry, we’ve anonymised the data so it’s no problem. We’ve removed all the identifying information, we’ve removed the name and personal number.” – Far too common.

On 25 May 2018 the European Union General Data Protection Regulation (GDPR) becomes law in all EU member states. Among other things it requires both data controllers – the people who collect and process personal data – and data processors – those who supply infrastructure to store and process personal data on a controller’s behalf – to take continuous measures to ensure that personal data doesn’t fall into the wrong hands.

The requirements set by the law are neither groundbreaking nor excessive from a technical standpoint. For those of us working in security, the requirements are meat and potatoes:

Taking into account the state of the art, the costs of implementation and the nature, scope, context and purposes of processing as well as the risk of varying likelihood and severity for the rights and freedoms of natural persons, the controller and the processor shall implement appropriate technical and organisational measures to ensure a level of security appropriate to the risk, including inter alia as appropriate:

(a) the pseudonymisation and encryption of personal data;

(b) the ability to ensure the ongoing confidentiality, integrity, availability and resilience of processing systems and services;

(c) the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident;

(d) a process for regularly testing, assessing and evaluating the effectiveness of technical and organisational measures for ensuring the security of the processing. (Article 32 “General Data Protection Regulation” Regulation (EU) 2016/679)

This is what the IT security industry has been preaching since forever. Nothing new under the sun, except that it is now written into law, at least where personal data is concerned, and with a side of hefty fines.

Much can and will be said about the GDPR and the impact it will have on society from an IT security standpoint. In this blog post, however, we would like to address one retort that is, at least in our experience, common when it comes to securing platforms that handle personal data:

“No need, we have anonymised the data”

Now, if the data is truly anonymised then, of course, it is no longer personal data and the GDPR doesn’t apply (although securing your platform may still be a good idea). But more often than not, data that is thought to be anonymised turns out not to be. This is why the legislation instead uses the term pseudonymised, that is, given a false name: the direct identifiers are replaced, but the person behind the record may still be identifiable.
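To make the distinction concrete, here is a minimal Python sketch of pseudonymisation. The record layout, the field names, the key and the `pseudonymise` helper are all assumptions made for illustration, not something prescribed by the GDPR. The direct identifiers are replaced by a keyed hash, but every other attribute is left untouched, which is why the result is pseudonymised rather than anonymised.

```python
import hashlib
import hmac

# Hypothetical secret key; whoever holds it can recompute the pseudonyms.
SECRET_KEY = b"example-key-kept-separately"

def pseudonymise(record, identifier_fields=("name", "personal_number")):
    """Replace direct identifiers with one keyed pseudonym; keep the rest."""
    identity = "|".join(str(record[f]) for f in identifier_fields)
    digest = hmac.new(SECRET_KEY, identity.encode("utf-8"), hashlib.sha256).hexdigest()
    remaining = {k: v for k, v in record.items() if k not in identifier_fields}
    return {"pseudonym": digest[:16], **remaining}

record = {
    "name": "Anna Andersson",            # made-up example data
    "personal_number": "19800101-0000",  # made-up example data
    "age": 38,
    "occupation": "Dentist",
    "municipality": "Arjeplog",
}
print(pseudonymise(record))
```

The name and personal number are gone from the output, but age, occupation and municipality are still there in the clear.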

Identifiers

Identifiers are those attributes that can be used to directly identify a person. A name or personal number are prime examples.

The GDPR removes a few grey areas when it comes to identifiers. For instance, it makes it clear that technical and online identifiers are indeed identifiers and thus personal data. Log files containing IP addresses, IMEI numbers and the like therefore contain personal data and need to be handled appropriately.

Quasi-identifiers

Quasi-identifiers are a set of attributes that can be used to identify a person indirectly. The main purpose of an identifier (like a name or personal number) is to identify a person. The main purpose of a quasi-identifier however is not to identify a person, but it is possible to identify a person using it.

Quasi-identifiers are attributes that, in combination with other quasi-identifiers, are unique to a single individual. Which attributes act as quasi-identifiers may vary from person to person, depending on how rare each attribute is or how rare the combination of attributes is.

An example of such attributes:

  • Age
  • Occupation
  • Municipality (Swedish: kommun)
  • Gender

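These are enough to uniquely identify approximately 1% of the Swedish population, and 85% can be narrowed down to a group of 256 individuals.
(Rough estimate based on the SCB tables “Anställda 16-64 år med bostad i regionen (nattbef) efter län, yrke (3-siffrig SSYK 2012), ålder och kön. År 2014” [employees aged 16-64 residing in the region (night-time population) by county, occupation (3-digit SSYK 2012), age and sex, 2014] and “Folkmängden efter region, civilstånd, ålder och kön. År 1968 – 2015” [population by region, marital status, age and sex, 1968-2015].)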
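A rough way to check how identifying a combination of attributes is in your own data is to count how many records share each combination. The Python sketch below uses a handful of made-up records and hypothetical helpers (`group_sizes`, `share_in_groups_up_to`); it is not the calculation behind the SCB-based estimate above.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("age", "occupation", "municipality", "gender")

def group_sizes(records, attributes=QUASI_IDENTIFIERS):
    """Count how many records share each combination of attribute values."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

def share_in_groups_up_to(records, max_size, attributes=QUASI_IDENTIFIERS):
    """Fraction of records whose combination is shared by at most max_size records."""
    sizes = group_sizes(records, attributes)
    hits = sum(1 for r in records if sizes[tuple(r[a] for a in attributes)] <= max_size)
    return hits / len(records)

# Made-up example data.
records = [
    {"age": 38, "occupation": "Dentist", "municipality": "Arjeplog", "gender": "F"},
    {"age": 38, "occupation": "Teacher", "municipality": "Stockholm", "gender": "F"},
    {"age": 38, "occupation": "Teacher", "municipality": "Stockholm", "gender": "F"},
    {"age": 52, "occupation": "Teacher", "municipality": "Stockholm", "gender": "M"},
]
print(share_in_groups_up_to(records, 1))  # share of records that are unique: 0.5
print(share_in_groups_up_to(records, 2))  # share in groups of at most two: 1.0
```

On a real dataset, a large share of records in small groups is a warning sign that removing names alone has not anonymised anything.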

To distinguish between the 256 individuals within such a group, only 8 bits of information are needed (2^8 = 256). That is, a unique set of 8 likes/dislikes, or approximately 3.5 star ratings (assuming a five-step scale, where each rating carries about 2.3 bits).
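The arithmetic behind these figures can be sketched in a few lines of Python. The sketch assumes that a like/dislike carries at most one bit and that a star rating uses a five-step scale, so that it carries at most log2(5) ≈ 2.32 bits. The five-step scale is our assumption, and real answers are rarely evenly distributed, so these are lower bounds on how many answers are needed.

```python
import math

def bits_needed(group_size):
    """Bits of information required to single out one member of a group."""
    return math.log2(group_size)

def answers_needed(group_size, options_per_answer):
    """Number of answers with the given number of options that carry those bits."""
    return bits_needed(group_size) / math.log2(options_per_answer)

print(bits_needed(256))                  # 8.0 bits
print(answers_needed(256, 2))            # 8.0 likes/dislikes
print(round(answers_needed(256, 5), 2))  # 3.45 five-step star ratings
```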

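In fact, the attributes age, occupation, municipality and gender are not even needed. In a population of roughly 10 million, a unique set of 10 star ratings (5^10 ≈ 9.8 million combinations) or about 24 likes/dislikes (2^24 ≈ 16.8 million combinations) is enough.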

How quasi-identifiers can be used to identify an individual

Quasi-identifiers cannot be used to directly identify an individual (then they would be identifiers) but instead they can be used to find the same individual in another dataset – where the user is identified.

If the set of attributes is unique to an individual and the same set of attributes is present elsewhere, the quasi-identifiers can be used to link the two records together across the two datasets and thereby establish the identity.

(Figure: linking the same individual across two datasets using quasi-identifiers.)
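The linking step can be sketched in a few lines of Python. Both datasets, the attribute names and the `link` helper are made up for illustration; the point is only that a combination which is unique in both datasets acts as a join key.

```python
from collections import defaultdict

def index_by(records, attributes):
    """Group records by their combination of attribute values."""
    index = defaultdict(list)
    for r in records:
        index[tuple(r[a] for a in attributes)].append(r)
    return index

def link(anonymised, identified, attributes):
    """Pair records whose attribute combination is unique in both datasets."""
    left = index_by(anonymised, attributes)
    right = index_by(identified, attributes)
    matches = []
    for key, rows in left.items():
        if len(rows) == 1 and len(right.get(key, [])) == 1:
            matches.append((rows[0], right[key][0]))
    return matches

# "Anonymised" dataset: no names, but quasi-identifiers and sensitive data.
anonymised = [
    {"age": 38, "municipality": "Arjeplog", "likes": "Piña Coladas", "diagnosis": "A"},
    {"age": 41, "municipality": "Stockholm", "likes": "football", "diagnosis": "B"},
    {"age": 41, "municipality": "Stockholm", "likes": "football", "diagnosis": "C"},
]
# External dataset where the person is identified, e.g. a public profile.
identified = [
    {"name": "Anna Andersson", "age": 38, "municipality": "Arjeplog", "likes": "Piña Coladas"},
]

for anon, known in link(anonymised, identified, ("age", "municipality", "likes")):
    print(known["name"], "->", anon["diagnosis"])
```

In this made-up example the first record is re-identified, while the two Stockholm records share their combination and stay ambiguous.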

The scary part is that it doesn’t even need to be the same set of attributes that is unique for every individual. One set of attributes may uniquely identify one individual, while an entirely different set uniquely identifies another.
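To illustrate, the following Python sketch searches, for each record, for the smallest combination of attributes that makes it unique. The data and the `smallest_unique_subset` helper are hypothetical, and the brute-force search is only practical for a handful of attributes.

```python
from itertools import combinations

def smallest_unique_subset(target, records, attributes):
    """Return the first smallest attribute combination that only `target` has."""
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            same = [r for r in records if all(r[a] == target[a] for a in subset)]
            if len(same) == 1:
                return subset
    return None  # not unique even when all attributes are used

ATTRIBUTES = ("age", "occupation", "municipality")
# Made-up example data.
records = [
    {"age": 38, "occupation": "Dentist", "municipality": "Stockholm"},
    {"age": 38, "occupation": "Teacher", "municipality": "Stockholm"},
    {"age": 52, "occupation": "Teacher", "municipality": "Stockholm"},
    {"age": 52, "occupation": "Teacher", "municipality": "Arjeplog"},
]
for r in records:
    print(r, "->", smallest_unique_subset(r, records, ATTRIBUTES))
```

Here the four records are singled out by four different combinations: occupation alone, age plus occupation, age plus municipality, and municipality alone. Dropping any single column would therefore not protect everyone.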

And whether or not it is possible to identify an individual based on these identifiers depends on external data. This makes it very difficult to anonymise data.

“But if it is dependent on external data surely it is not my responsibility that the data is identifiable? If someone tweets these quasi-identifiers and thereby makes themselves identifiable in my data that must be their problem?”

The problem here is that individuals will not know which data uniquely identifies them. And even if they do, they may agree to publish that they like Piña Coladas and football without agreeing to publish all the other data that you may have collected on them.

There will be more blog posts to follow on the topic of anonymisation techniques and quasi-identifiers. In the next part we will go over some anonymisation techniques and where they can go wrong. Please also read the news post about our engagement at the conference.

Attribution:

  • Person icon: Björn Andersson, Creative Commons Attribution 3.0
  • Gymnasium icon: designed by Iconathlon as a collaboration among Edward Boatman, Mike Clare and Jessica Durkin, public domain
  • Game controller: Edward Boatman, Creative Commons Attribution 3.0
  • Coconut: Iain Hector, Creative Commons Attribution 3.0 United States
  • Controller icon: Nicereddy, Creative Commons CC0 1.0 Universal Public Domain Dedication
  • SAFF Championship 2011: Ahmad Faisal, Creative Commons Attribution-Share Alike 3.0 Unported
  • The thumb graphic from the Facebook “like” button: Enoc vt, public domain (this image of simple geometry is ineligible for copyright because it consists entirely of information that is common property and contains no original authorship)

The authors:

Kristoffer Arvidson:
Kristoffer works as a solution architect at Basalt AB and is at heart a full-stack .NET developer who started out in web development before moving on to desktop and client-server solutions. Kristoffer loves new technology and has tinkered with everything from building circuit boards based on PIC microcontrollers, his own alarm systems, smart homes and IoT solutions before the word IoT was even invented, to building his own CRM systems in ASP, designing complete IT systems for large companies and developing applications used to manage our waste and sanitation. He also loves development methodology and security. Kristoffer is an architect who sees the big picture and, with his broad technical background, sees everything from the end user to the security and maintainability of a system. Today Kristoffer works with integration solutions for critical societal functions and the Swedish Armed Forces. Kristoffer is an experienced speaker.

Patrick Bladh:
Patrick has for several years worked with the architecture and development of IT systems with special security requirements. He has been programming since childhood and likes to spend his free time with a soldering iron at the ready. He is something of a tinfoil h.. security nerd and appreciates a good CTF now and then. Today he is a security architect at Basalt AB, which builds IT systems for critical societal functions. In addition to the architect role, he also carries out security reviews and penetration and vulnerability testing on a consulting basis.