When are de-identified data not anonymous?

Firstly, although it is not too difficult to de-identify data that provide only a time-limited snapshot of a population's health - such as the data which health services use to compile monthly management statistics of numbers of operations, consumption of drugs and the like - it is effectively impossible to de-identify longitudinal records, that is, records which link together all (or even many) of the health care encounters in a patient's life. Someone wishing to abuse the database to investigate a business or political rival, for example, is likely to know some facts about the target of investigation (that he broke his ankle playing football on the 14th October 1974, that he was absent from Iceland from 1978 to 1982 doing postgraduate work, and so on) and to wish to know other facts (such as whether he has ever been treated for alcoholism or for psychiatric disorders). In many cases, the known facts will enable the target patient to be identified despite the use of a pseudonym in the database itself [5] [6].
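To make this linkage attack concrete, the following is a minimal sketch in Python of how an attacker might proceed; the record layout, field names and data are invented for illustration and do not reflect any actual database design.

    # Minimal sketch of a linkage attack on pseudonymised longitudinal
    # records. The record layout and field names are illustrative only.

    records = [
        # Each pseudonymous record links many encounters over a lifetime.
        {"pseudonym": "p001",
         "encounters": [("1974-10-14", "ankle fracture"),
                        ("1990-03-02", "alcoholism treatment")],
         "years_abroad": (1978, 1982)},
        {"pseudonym": "p002",
         "encounters": [("1974-10-14", "ankle fracture")],
         "years_abroad": None},
        # ... thousands more ...
    ]

    def matches_known_facts(rec):
        """Return True if a record fits everything the attacker already knows."""
        broke_ankle = any(date == "1974-10-14" and "ankle" in dx
                          for date, dx in rec["encounters"])
        abroad = rec["years_abroad"] == (1978, 1982)
        return broke_ankle and abroad

    candidates = [r for r in records if matches_known_facts(r)]
    if len(candidates) == 1:
        # The pseudonym is broken: read off the facts the attacker wanted.
        target = candidates[0]
        print(target["pseudonym"], [dx for _, dx in target["encounters"]])

The point of the sketch is that each known fact acts as a filter; a handful of such filters applied to lifelong records will usually leave exactly one candidate.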

For this reason, a database of longitudinal medical records must be considered to be personal health information; although some of the patients may be protected by the use of pseudonyms, many will not be. In a database which also contains genealogies, individuals will be even easier to identify; one could search for people with four uncles, two aunts, eight great-uncles, seven great-aunts, and so on, and if the data for several generations are available then most groups of siblings could be identified.
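The genealogical attack can be sketched in the same way; the family-structure encoding below is invented for illustration, but the principle - that a count of relatives of each kind is a near-unique fingerprint - carries over to any genealogy format.

    # Minimal sketch of re-identification through family structure.
    from collections import Counter

    genealogy = {
        # person -> list of relatives as (relation, person) pairs
        "p101": [("uncle", "p200"), ("uncle", "p201"), ("uncle", "p202"),
                 ("uncle", "p203"), ("aunt", "p204"), ("aunt", "p205")],
        "p102": [("uncle", "p200"), ("aunt", "p204")],
    }

    def family_fingerprint(person):
        """Count relatives of each kind: (4 uncles, 2 aunts, ...) is rare."""
        return tuple(sorted(Counter(rel for rel, _ in genealogy[person]).items()))

    # An attacker who knows the target has four uncles and two aunts
    # searches for that fingerprint; with several generations of data
    # the match is usually unique.
    known = (("aunt", 2), ("uncle", 4))
    hits = [p for p in genealogy if family_fingerprint(p) == known]
    print(hits)  # -> ['p101']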

This point - that the database contains identifiable medical information - was readily conceded by DeCODE management on the 12th October during a briefing at the Medical Association [4], although a subsequent press release claimed that the concession had not been made [7].

So countries whose health services maintain central databases of medical records, such as New Zealand, do not consider pseudonyms to be sufficient protection. There are also stringent controls on use. The New Zealand system, as noted above, limits access to a small group of health service statisticians, limits the type of enquiry that can be made, and rejects any enquiry which would be answered by reference to the records of fewer than six patients [8].
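A minimum-query-set-size rule of this kind is easy to state in code. The sketch below assumes a simple counting-query interface; the threshold of six follows the New Zealand rule, but the record fields are invented.

    # Minimal sketch of a minimum query set size gate.
    MIN_QUERY_SET = 6

    def answer_count_query(records, predicate):
        """Answer a counting query only if enough records match."""
        matching = [r for r in records if predicate(r)]
        if len(matching) < MIN_QUERY_SET:
            raise PermissionError("query set too small; enquiry rejected")
        return len(matching)

    records = [{"age": a, "diagnosis": d}
               for a, d in [(34, "asthma"), (51, "asthma"), (45, "asthma"),
                            (62, "asthma"), (29, "asthma"), (40, "asthma"),
                            (38, "depression")]]

    print(answer_count_query(records, lambda r: r["diagnosis"] == "asthma"))  # 6
    # A query about the single depression patient would be rejected.

As the next paragraphs show, this rule on its own is far from sufficient.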

There is a large literature on such mechanisms, or `inference security' as the subject is called. The basic ideas were originally developed by the world's census bureaux to prevent statistical enquiries made of census databases from being abused in ways that could leak information about identifiable individuals. The subject is of critical and growing interest to medical research organisations as well, and is being actively promoted by data protection authorities in Europe and elsewhere [9]. The standard introductory textbook is [10].

It is not sufficient merely to require that enquiries be based on a minimum size of query set; one must also ensure that combinations of queries cannot be used to identify individuals. For example, it might be possible to make one enquiry about the target plus ten other individuals, and a second about the ten others (see [10] for many more complex examples and powerful attack techniques). Common protection mechanisms include logging and analysing queries, adding noise to the underlying data, making each query depend on a pseudorandomly selected fixed-size subset of the data, and suppressing records with particular data values (such as census records indicating very high incomes, or, in the medical context, prescriptions for AZT). None of these techniques will prevent all possible inference attacks, and whether a system provides an adequate level of protection depends closely on the nature of the application.
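The two-query attack just described - known in the literature as a tracker or differencing attack - can be sketched as follows. Both queries individually satisfy a minimum query set size of six, yet their difference reveals one patient; the data and names are invented.

    # Minimal sketch of the differencing ('tracker') attack.
    MIN_QUERY_SET = 6

    def count_alcoholics(records, names):
        matching = [r for r in records if r["name"] in names]
        if len(matching) < MIN_QUERY_SET:
            raise PermissionError("query set too small")
        return sum(r["alcoholism"] for r in matching)

    records = ([{"name": "target", "alcoholism": True}] +
               [{"name": "other%d" % i, "alcoholism": i % 2 == 0}
                for i in range(10)])

    others = {"other%d" % i for i in range(10)}
    with_target = count_alcoholics(records, others | {"target"})  # 11 records
    without_target = count_alcoholics(records, others)            # 10 records
    print(with_target - without_target)  # 1 => the target is an alcoholic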

Systems that use de-identified data fall into two broad categories. In the first, the data are processed once and for all to remove identifiers and then released for arbitrary processing by untrusted programs. An example of this is given by the US census, which has in the past distributed a tape containing the record of one household in every thousand, with the names and exact geographical locations removed. In the second, the data are held in a trusted system and only a restricted set of enquiries are permitted; an example of this is the New Zealand medical records system mentioned above.
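The first category - de-identify once, then release - might be sketched as follows; this is loosely modelled on the one-in-a-thousand census sample mentioned above, and the field names and sampling interface are illustrative assumptions, not the census bureau's actual procedure.

    # Minimal sketch of 'pre-process then release' de-identification.
    import random

    def release_sample(households, rate=0.001):
        """Strip direct identifiers and release a small random sample."""
        sample = [h for h in households if random.random() < rate]
        return [{"region": h["region"],            # coarse location only
                 "household_size": h["household_size"],
                 "income_band": h["income_band"]}
                for h in sample]                   # names and addresses dropped

    households = [{"name": "...", "address": "...", "region": "North",
                   "household_size": 3, "income_band": "B"}
                  for _ in range(100000)]
    print(len(release_sample(households)))  # roughly 100 records

Once such a tape is released, the publisher has no further control; all the protection must be built in before release.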

In both kinds of system, effective de-identification depends on detailed knowledge of the application. For example, I recently evaluated on behalf of the BMA a proposed system for collecting de-identified data from pharmacies for resale to drug companies. In this case the privacy of doctors as well as patients had to be protected. The original design had proposed grouping doctors into cells of about 20 doctors, within which they would be known as `doctor 1', `doctor 2', and so on. The system would provide total prescriptions of each drug per week. However, it was possible for an experienced drug salesman to look at the figures and say, for example, ``Doctor 7 must be Susan Smith, because she went skiing during the second week of February, and look at the drop-off in prescriptions there.'' So the system had to be redesigned to show percentage market share rather than absolute volumes (and with other controls as well).
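The following sketch illustrates why the redesign helps; all figures are invented and the reporting interface is an assumption for illustration. A week's absence collapses the absolute figures to a telltale zero, whereas a market-share report can simply be suppressed for that week.

    # Minimal sketch: absolute weekly volumes versus percentage market share.
    weekly_scripts = {
        # (doctor, week) -> {drug: number of prescriptions}
        ("doctor7", 5): {"drugA": 40, "drugB": 10},
        ("doctor7", 6): {},                 # away skiing: nothing prescribed
        ("doctor7", 7): {"drugA": 38, "drugB": 12},
    }

    def absolute_volumes(doctor, week):
        return weekly_scripts.get((doctor, week), {})

    def market_share(doctor, week):
        scripts = weekly_scripts.get((doctor, week), {})
        total = sum(scripts.values())
        if total == 0:
            return None                     # suppressed, not a telltale zero
        return {drug: 100 * n / total for drug, n in scripts.items()}

    print(absolute_volumes("doctor7", 6))   # {} -- the giveaway drop-off
    print(market_share("doctor7", 6))       # None -- suppressed
    print(market_share("doctor7", 5))       # {'drugA': 80.0, 'drugB': 20.0}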

The above prescription system is of the first kind (pre-process then release). The DeCODE proposal is of the second kind; the data held in the database are in many cases identifiable, and privacy depends on the mechanisms used to restrict queries. This makes it necessary to control the kind of programs which an enquirer can run on the database. For example, if the system merely compelled enquiry programs to read at least ten records, then an attacker who wished to find out about a target patient might write a program which read the target patient's record and those of nine others selected at random, and then returned the value `1' if the target were an alcoholic, `2' if he had received psychiatric treatment, `3' if both and `0' if neither. For this reason, arbitrary enquiries should not be permitted; the database user must not have access to a query language that is Turing powerful (the well-known term in computer science for a language as powerful as a general computer, in the sense that one may write arbitrary programs in it).
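The attack just described is easily written down; the sketch below assumes invented field names, but the structure - dutifully touching ten records while encoding the target's private attributes in the return value - is exactly as in the text.

    # Minimal sketch of an enquiry program that defeats a naive
    # 'must read at least ten records' rule.
    import random

    def malicious_enquiry(records, target_name):
        """Reads ten records, so it passes the naive check, yet its
        result depends only on the target patient."""
        target = next(r for r in records if r["name"] == target_name)
        decoys = random.sample(
            [r for r in records if r["name"] != target_name], 9)
        for r in decoys:              # touch nine other records for cover
            _ = r["name"]
        return ((1 if target["alcoholism"] else 0) +
                (2 if target["psychiatric"] else 0))   # 0, 1, 2 or 3

    records = [{"name": "p%d" % i, "alcoholism": False, "psychiatric": False}
               for i in range(20)]
    records.append({"name": "target", "alcoholism": True, "psychiatric": True})
    print(malicious_enquiry(records, "target"))  # -> 3: both facts leak

No access-count rule can stop this; only restricting what programs may be expressed at all - that is, denying the enquirer a Turing-powerful query language - closes the channel.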


Ross Anderson
1998-10-20