Database Anonymization. David Sánchez
one right foot, thus modifying his posterior belief about individuals to a great extent. In the second example, the use of auxiliary information makes things worse. Suppose that a statistical database teaches the average height of a group of individuals, and that it is not possible to learn this information in any other way. Suppose also that the actual height of a person is considered to be a sensitive piece of information. Let the attacker have the following side information: “Adam is one centimeter taller than the average English man.” Access to the database teaches Adam’s height, while having the side information but no database access teaches much less. Thus, Dalenius’ view of privacy is not feasible in the presence of background information (if any utility is to be provided).
The privacy criteria used in practice offer only limited disclosure control guarantees. Two main views of privacy are used for microdata releases: anonymity (it should not be possible to re-identify any individual in the published data) and confidentiality or secrecy (access to the released data should not reveal confidential information related to any specific individual).
The confidentiality view of privacy is closer to Dalenius’ proposal, the main difference being that it limits the amount of information provided by the data set rather than the change between prior and posterior beliefs about an individual. There are several approaches to attain confidentiality. A basic example of an SDC technique that gives confidentiality is noise addition. By adding random noise to a confidential data item, we mask its value: we report a value drawn from a random distribution rather than the actual value. The amount of noise added determines the level of confidentiality.
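As a minimal sketch of noise addition, the following Python fragment masks a list of confidential values with zero-mean Gaussian noise. The function name `add_noise`, the choice of a Gaussian distribution, and the sample salaries are illustrative assumptions, not part of the text; the standard deviation `sigma` plays the role of the amount of noise.

```python
import random

def add_noise(values, sigma, seed=None):
    """Return a masked copy of `values` with zero-mean Gaussian noise added.

    `sigma` (the noise standard deviation) is the knob that trades
    confidentiality against accuracy; `seed` only makes runs repeatable.
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

# Illustrative (made-up) confidential values.
salaries = [30000.0, 42000.0, 58000.0, 61000.0]

# Small sigma: reported values stay close to the true ones (weak masking).
masked_low = add_noise(salaries, sigma=500.0, seed=1)

# Large sigma: reported values deviate much more (stronger masking, less utility).
masked_high = add_noise(salaries, sigma=10000.0, seed=1)
```

With the same seed, the deviations in `masked_high` are 20 times those in `masked_low`, which illustrates how the noise magnitude alone sets the level of masking.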
The anonymity view of privacy seeks to hide each individual in a group. This is indeed quite an intuitive view of privacy: the privacy of an individual is protected if we are not able to distinguish her from other individuals in a group. This view of privacy is commonly used in legal frameworks. For instance, the U.S. Health Insurance Portability and Accountability Act (HIPAA) of 1996 requires removing several attributes that could potentially identify an individual; in this way, the individual stays anonymous. However, we should keep in mind that if the value of the confidential attribute has a small variability within the group of indistinguishable individuals, disclosure still happens for these individuals: even if we are not able to tell which record belongs to each of the individuals, the low variability of the confidential attribute gives us a good estimate of its actual value.
The Health Insurance Portability and Accountability Act (HIPAA)
The Privacy Rule allows a covered entity to de-identify data by removing all 18 elements that could be used to identify the individual or the individual’s relatives, employers, or household members; these elements are enumerated in the Privacy Rule. The covered entity also must have no actual knowledge that the remaining information could be used alone or in combination with other information to identify the individual who is the subject of the information. Under this method, the identifiers that must be removed are the following:
• Names.
• All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geographical codes, except for the initial three digits of a ZIP code if, according to the current publicly available data from the Bureau of the Census:
– The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people.
– The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people are changed to 000.
• All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.
• Telephone numbers.
• Facsimile numbers.
• Electronic mail addresses.
• Social security numbers.
• Medical record numbers.
• Health plan beneficiary numbers.
• Account numbers.
• Certificate/license numbers.
• Vehicle identifiers and serial numbers, including license plate numbers.
• Device identifiers and serial numbers.
• Web universal resource locators (URLs).
• Internet protocol (IP) address numbers.
• Biometric identifiers, including fingerprints and voiceprints.
• Full-face photographic images and any comparable images.
• Any other unique identifying number, characteristic, or code, unless otherwise permitted by the Privacy Rule for re-identification.
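Some of the rules above are generalizations rather than plain removals, and can be sketched in a few lines of Python. The helper names and the set `HYPOTHETICAL_SPARSE_PREFIXES` are assumptions made for illustration; in practice the list of three-digit prefixes covering 20,000 or fewer people would have to be derived from current Census data. The sketch does not cover the additional requirement of suppressing the year in dates that reveal an age over 89.

```python
# Placeholder only: the real set must come from current Census data.
HYPOTHETICAL_SPARSE_PREFIXES = {"999"}

def generalize_zip(zip_code, sparse_prefixes):
    """Keep the first three ZIP digits, or '000' if that prefix is sparsely populated."""
    prefix = zip_code[:3]
    return "000" if prefix in sparse_prefixes else prefix

def generalize_date(iso_date):
    """Keep only the year of a date given as 'YYYY-MM-DD'."""
    return iso_date[:4]

def generalize_age(age):
    """Aggregate all ages over 89 into a single '90+' category."""
    return "90+" if age >= 90 else str(age)

# Made-up record used only to exercise the helpers.
record = {"zip": "12345", "admission_date": "1984-07-21", "age": 93}
deidentified = {
    "zip": generalize_zip(record["zip"], HYPOTHETICAL_SPARSE_PREFIXES),
    "admission_date": generalize_date(record["admission_date"]),
    "age": generalize_age(record["age"]),
}
```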
2.4 DISCLOSURE RISK IN MICRODATA SETS
When publishing a microdata file, the data collector must guarantee that no sensitive information about specific individuals is disclosed. Usually two types of disclosure are considered in microdata sets [44].
• Identity disclosure. This type of disclosure violates privacy viewed as anonymity. It occurs when the intruder is able to associate a record in the released data set with the individual that originated it. After re-identification, the intruder associates the values of the confidential attributes in the record with the re-identified individual. Two main approaches are usually employed to measure identity disclosure risk: uniqueness and record linkage.
– Uniqueness. Roughly speaking, the risk of identity disclosure is measured as the probability that rare combinations of attribute values in the released protected data are indeed rare in the original population the data come from.
– Record linkage. This is an empirical approach to evaluate the risk of disclosure. In this case, the data protector (also known as data controller) uses a record linkage algorithm (or several such algorithms) to link each record in the anonymized data with a record in the original data set. Since the protector knows the real correspondence between original and anonymized records, he can determine the percentage of correctly linked pairs, which he uses to estimate the number of re-identifications that might be obtained by a specialized intruder. If this number is unacceptably high, then more intense anonymization by the controller is needed before the anonymized data set is ready for release.
• Attribute disclosure. This type of disclosure violates privacy viewed as confidentiality. It occurs when access to the released data allows the intruder to determine the value of a confidential attribute of an individual with enough accuracy.
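The two identity disclosure measures can be illustrated with a small Python sketch. Both functions and all data are hypothetical. `sample_uniques` only finds records that are unique in the released sample, which is the starting point of a uniqueness analysis; estimating whether they are also rare in the population requires a statistical model that is not shown here. `linkage_rate` is one simple distance-based form of record linkage, assuming numerical attributes and assuming that record i of the anonymized data derives from record i of the original data.

```python
from collections import Counter

def sample_uniques(records, quasi_identifiers):
    """Return indices of records whose quasi-identifier combination appears once."""
    keys = [tuple(r[a] for a in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [i for i, k in enumerate(keys) if counts[k] == 1]

def linkage_rate(original, anonymized):
    """Fraction of anonymized records whose nearest original record is their true source.

    Uses squared Euclidean distance; ties are resolved in favor of the
    first (lowest-index) original record.
    """
    correct = 0
    for i, y in enumerate(anonymized):
        dists = [sum((a - b) ** 2 for a, b in zip(x, y)) for x in original]
        if dists.index(min(dists)) == i:
            correct += 1
    return correct / len(anonymized)

# Made-up released records: only the surgeon is unique on (job, zip).
released = [
    {"job": "accountant", "zip": "123"},
    {"job": "accountant", "zip": "123"},
    {"job": "surgeon", "zip": "123"},
]
unique_idx = sample_uniques(released, ["job", "zip"])

# Made-up numerical data: light masking keeps every record linkable,
# heavy masking leaves only one of three records correctly linked.
original = [(30.0, 1.0), (45.0, 2.0), (60.0, 3.0)]
light = [(31.0, 1.0), (44.0, 2.0), (61.0, 3.0)]
heavy = [(46.0, 2.0), (44.0, 2.0), (47.0, 2.0)]
rate_light = linkage_rate(original, light)
rate_heavy = linkage_rate(original, heavy)
```

A protector following the empirical approach would compare such a rate against an acceptable threshold and anonymize further if it is too high.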
The above two types of disclosure are independent. Even if identity disclosure happens, there may not be attribute disclosure if the confidential attributes in the released data set have been masked. On the other hand, attribute disclosure may still happen even without identity disclosure. For example, imagine that the salary is one of the confidential attributes and the job is a quasi-identifier attribute; if an intruder is interested in a specific individual whose job he knows to be “accountant” and there are several accountants in the data set (including the target individual), the intruder will be unable to re-identify the individual’s record based only on her job, but he will be able to lower-bound and upper-bound the individual’s salary (which lies between the minimum and the maximum salary of all the accountants in the data set). Specifically, attribute disclosure happens if the range of possible salary values for the matching records is narrow.
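The accountant example can be made concrete with a short Python sketch. The function `salary_bounds` and the salary figures are invented for illustration; the point is only that the intruder needs no re-identification to bound the target's salary.

```python
def salary_bounds(records, job):
    """Return (min, max) salary over all records matching `job`, or None if none match."""
    salaries = [r["salary"] for r in records if r["job"] == job]
    if not salaries:
        return None
    return min(salaries), max(salaries)

# Made-up released data: three indistinguishable accountants and one engineer.
released = [
    {"job": "accountant", "salary": 41000},
    {"job": "accountant", "salary": 43000},
    {"job": "accountant", "salary": 42500},
    {"job": "engineer", "salary": 70000},
]

low, high = salary_bounds(released, "accountant")
interval_width = high - low  # a narrow interval means attribute disclosure
```

Here the intruder cannot tell which of the three accountant records belongs to the target, yet learns that her salary lies within an interval only 2,000 units wide.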
2.5 MICRODATA ANONYMIZATION
To avoid disclosure, data collectors do not publish the original microdata set X, but a modified version Y of it. This data set Y is called the protected, anonymized, or sanitized version of X. Microdata protection methods can generate the protected data set by either masking the original data or generating synthetic data.
• Masking. The protected