Test-Datasets and EU Liability – mixed law/IT

Stuart Ritchie


Here is a modified version of a question I’m about to answer on Quora. Warning: as it says on the tin, this post is both law and IT. Don’t say you weren’t warned…

Can copies of postcode, gender and age data be used in testing without violating the Data Protection Act or the GDPR?

Such questions are becoming more and more frequent, as people begin to panic with the realization that two decades of previously unlawful behavior is now going to get the treatment it deserves. Regrettably, I am unpersuaded by others’ answers. So it’s time I gave readers the dubious benefit of my own.

The headline answer, as so frequently is the case, is the lawyer’s old reliable answer: "Yes, but".

Let’s start with the business (and personal security) problem, and why savage punishment will be fully deserved. Imagine what power you might have when you stumble across, in test data, the exact location and exact time (to the minute) of many Persons of Interest in, say, 3–12 months’ time. Including a serving head of government, a household-name Nobel prize-winner, assorted household-name celebs, business leaders, etc. And it’s all just sitting there saying "sell me to people to whom this might be of interest".

This is not fantasy. Just my best test data anecdote, from a single test dataset I requested and received on a gig about 20 years ago. It wasn’t even a large dataset, only about 5–10 million people. I didn’t even search for anyone, I just stumbled across them while dealing with knotty exceptions as I tuned my multi-feed ETL false positive/false negative algorithms. I wasn’t even an employee of the company, and certainly none of the potential targets were informed that I had my grubby mitts on their schedules (and might be minded to sell them to interested groups in the Middle East).

Ok, let’s turn to the legal basics here.

Starting proposition since 1995: all personal data processing is unlawful by default.

So you have to prove it’s lawful. Otherwise you’re wide open to fines and, more recently, existential class actions.

How do you prove it’s lawful? Easy. Identify all processes – noting that third-country transfers count as an extra process. For each, identify the purpose (singular not plural: the more purposes you pick, the more fun the lawyers will have with you). For each process and purpose, pick one of the six "legal bases" (or "pathways to righteousness" as evocatively described by one Board who "got it" after a recent presentation): always the same six, recently restated in Articles 6(1)(a) through (f) of the GDPR, published in the Official Journal on May 4, 2016. Consent, data subject contract, statutory obligation, vital interest (i.e. saving lives), public interest (desperate), or legitimate interests (even more desperate if you’re cavalier about it; try looking at WP29 Opinion 6/2014). If you have an EPA (enterprise privacy architecture) that’s been built with or checked by lawyers, this is not difficult. But for test data, the basis probably will be one of the more plausible legitimate interests.

Make it not just plausible, but legally provable. Why? Because it’s public information, and…

…Proposition from May 25, 2018: The legal burden of proof is on defendants, not plaintiffs. They don’t have to prove a positive. You have to prove a negative. Why is this important? Ask any tort trial lawyer…

Pro tip: by May 2018, replace all your Business-Speak privacy "policies" (which are little more than circles you’ve painted on yourself) and Notices with sui generis-dataset-targeted Article 7/13/14 Notifications and consents – without legalese and ideally less than a single A4 page of simple text. Why? Because otherwise regulators or data subjects can detect breach in seconds. They don’t even need to investigate you or ask you questions prior to fining or suing you, and you’ll have the onerous legal task of proving your own public statements wrong, threatening to make the entire case an open misère and thus misery. You’re loading the weapon, putting a round up the spout, taking the safety catch off, handing it to the supervisors and data subjects, and saying "Please Shoot Me!" What do you think they’ll do?

[EDIT: for an example of how compliance might be achieved, take a look at this site’s Notifications as generated automatically from our Article 30 processing records. I leave gap analyses against others’ Notifications as an exercise for the reader!]

Is notification/consent compliance possible? Of course it is. I do it all the time within my GDPR courses, which in consequence I suspect are the only such courses in the world that themselves are compliant with… the GDPR. Sometimes it’s fun to be smug and self-righteous.

Technical Alternative

But for test data you can avoid all that legal stuff. There’s a simple technical solution. But first, here are two solutions that everyone seems to push. Arguably, both necessarily fail the legal tests if the judge is switched-on:

    1. Anonymization. See for example the lower court judgment in Vidal-Hall v Google, on which I’ve posted ad nauseam.
    2. Pseudonymization aka "data masking". Don’t be fooled by either description. No matter what the GDPR says, pseudonymization is relatively easy for motivated people to crack, and usually will be cracked by the next data science algorithm bus (built by motivated people) that comes along. It therefore fails the GDPR "state of the art" test at the latest when someone publishes the crack, and it fails the proportionality test too, because the risk is higher than sky-high and the technical measure is too weak. Pseudonymize by all means to support individual subject access requests, or for related matters such as portability implementation (as I noted in my submission to WP29 on the right to portability at 31–36 [For legal reasons (no not defamation or litigation!), and with apologies, I temporarily have withdrawn publication of this submission.]) if you think it’s proportionate, but don’t even think about relying upon it for test datasets (or non-test datasets) unless breach liability doesn’t matter to you…
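To make the masking point concrete, here is a minimal Python sketch of one weak scheme only. It is my illustration, not a claim about any particular product: it assumes a hypothetical date-of-birth column that has been "masked" with an unsalted SHA-256 hash. The names `mask` and `build_lookup` are invented for the example. Real masking tools vary, and real attacks more often work by linking several fields together, but the sketch shows why a small value space offers little protection.

```python
import hashlib
from datetime import date, timedelta


def mask(value: str) -> str:
    """Naive 'data masking': an unsalted SHA-256 digest of the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_lookup(start: date, end: date) -> dict:
    """Hash every candidate date once; the table maps digest -> original."""
    table = {}
    day = start
    while day <= end:
        iso = day.isoformat()
        table[mask(iso)] = iso
        day += timedelta(days=1)
    return table


# What the (hypothetical) test dataset would contain for one person.
masked_dob = mask("1975-03-14")

# The attacker never sees the original, only the digest, yet can
# enumerate every plausible date of birth in well under a second.
lookup = build_lookup(date(1900, 1, 1), date(2020, 12, 31))
recovered = lookup.get(masked_dob)
print(recovered)  # prints 1975-03-14
```

Postcode, gender and age are similarly small value spaces, so the same enumeration would apply to them under this kind of masking.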

The technical solution that in my view most likely will avoid breach is this: in memory, prior to writing the test dataset, kill off all the personally identifiable data (note to US techies and lawyers: this is NOT the same thing as PII as variously defined in the US states etc), replacing it with arbitrarily randomized strings created algorithmically to test appropriate use cases (thus best data-driven from use-case metadata). Maybe head each string with a QA-friendly heuristic 1-character alphanumeric metadata prefix indicating what type of personal data the string represents (assuming a choice of 62 types of data, i.e. 26 upper-case letters, 26 lower-case letters and 10 digits, is sufficient; that said, my own privacy metadata taxonomy is larger than that, so by all means go to two characters). There are other subtleties for, say, RDBMS because of referential integrity etc, and you’ll thus require some large in-memory "duplication tables", but I leave those as exercises for the reader.
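For the techies, the approach above might be sketched in Python roughly as follows. Treat it as a sketch under stated assumptions, not a definitive implementation. The prefix letters, the column names and the `Randomizer` and `scrub` helpers are all hypothetical. I am also assuming that a "duplication table" means an in-memory map from each original value to its replacement, so that a value used as a join key in several tables is replaced consistently. The sketch generates uniformly random strings and does not attempt use-case-driven generation.

```python
import secrets
import string

ALPHANUMERIC = string.ascii_letters + string.digits  # 62 characters

# Hypothetical prefixes: one character per type of personal data.
TYPE_PREFIX = {"name": "N", "email": "E", "postcode": "P"}


class Randomizer:
    """Swaps personal values for random strings; nothing is written to disk."""

    def __init__(self, length=12):
        self.length = length
        self._table = {}    # assumed "duplication table": (type, original) -> replacement
        self._used = set()  # replacements already issued, to keep them unique

    def replace(self, data_type, original):
        key = (data_type, original)
        if key not in self._table:
            while True:
                body = "".join(secrets.choice(ALPHANUMERIC) for _ in range(self.length))
                candidate = TYPE_PREFIX[data_type] + body
                if candidate not in self._used:
                    break
            self._used.add(candidate)
            self._table[key] = candidate
        return self._table[key]


def scrub(rows, personal_columns, randomizer):
    """Return new rows in which every personal column holds a replacement."""
    scrubbed = []
    for row in rows:
        clean = {}
        for column, value in row.items():
            if column in personal_columns:
                clean[column] = randomizer.replace(personal_columns[column], value)
            else:
                clean[column] = value
        scrubbed.append(clean)
    return scrubbed


# Two hypothetical tables that share an email address as a join key.
customers = [{"id": 1, "name": "Ann Example", "email": "ann@example.com"}]
orders = [{"order_id": 10, "email": "ann@example.com", "total": 25}]

randomizer = Randomizer()
test_customers = scrub(customers, {"name": "name", "email": "email"}, randomizer)
test_orders = scrub(orders, {"email": "email"}, randomizer)
del randomizer  # discard the table; keeping it would amount to a re-identification key
```

Because both tables went through the same `Randomizer`, the scrubbed orders still join to the scrubbed customers on the replacement email. The table is deleted before anything is written out, since a persisted mapping would turn the exercise back into pseudonymization.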

    All facts and opinions set out above, including but not limited to spoken content and attachments/links, are provided for informational purposes only as a non-legal service to the public, and do not constitute legal advice or a substitute for legal counsel, and do not create any lawyer-client relationship, nor do they constitute advertising or provision of a legal service, nor are they the opinion of this web site or of its owner. The author is a co-founder of GDPR360 but does not speak in that capacity or in the capacity of a lawyer.