Database Anonymization. David Sánchez
literature); (iii) categorical [26, 57] vs. continuous (the rest of the references cited in this paragraph). Also, depending on whether they deal with one or several attributes at a time, microaggregation methods can be classified into univariate and multivariate.
• Univariate methods deal with multi-attribute data sets by microaggregating one attribute at a time. Input records are sorted by the first attribute, then groups of successive k values of the first attribute are created and all values within that group are replaced by the group representative (e.g., centroid). The same procedure is repeated for the rest of the attributes. Notice that all attribute values of each record are moved together when sorting records by a particular attribute; hence, the relation between the attribute values within each record is preserved. This approach is known as individual ranking [16, 17]. Individual ranking just reduces the variability of attributes, thereby providing some anonymization. In [25] it was shown that individual ranking causes low information loss and, thus, its output better preserves analytical utility. However, the disclosure risk in the anonymized output remains unacceptably high [22].
• To deal with several attributes at a time, the trivial option is to map multi-attribute data sets to univariate data by projecting the former onto a single axis (e.g., using the sum of z-scores or the first principal component, see [16]) and then use univariate microaggregation on the univariate data. Another option avoiding the information loss due to single-axis projection is to use multivariate microaggregation able to deal with unprojected multi-attribute data [21]. If we define optimal microaggregation as finding a partition in groups of size at least k such that within-groups homogeneity is maximum, it turns out that, while optimal univariate microaggregation can be solved in polynomial time [41], unfortunately optimal multivariate microaggregation is NP-hard [70]. This justifies the use of heuristic methods for multivariate microaggregation, such as the MDAV (Maximum Distance to Average Vector, [20, 26]). In any case, multivariate microaggregation leads to higher information loss than individual ranking [25].
To illustrate these approaches, we next give the details of the MDAV heuristic algorithm for multivariate fixed group size microaggregation on unprojected continuous data.
1. Compute the average record x̄ of all records in the data set. Consider the most distant record xr to the average record x̄ (using the squared Euclidean distance).
2. Find the most distant record xs from the record xr considered in the previous step.
3. Form two groups around xr and xs, respectively. One group contains xr and the k – 1 records closest to xr. The other group contains xs and the k – 1 records closest to xs.
4. If there are at least 3k records which do not belong to any of the two groups formed in Step 3, go to Step 1 taking as new data set the previous data set minus the groups formed in the last instance of Step 3.
5. If there are between 3k – 1 and 2k records which do not belong to any of the two groups formed in Step 3: a) compute the average record x of the remaining records; b) find the most distant record xr from x̄ c) form a group containing xr and the k – 1 records closest to xr; d) form another group containing the rest of records. Exit the algorithm.
Конец ознакомительного фрагмента.
Текст предоставлен ООО «ЛитРес».
Прочитайте эту книгу целиком, купив полную легальную версию на ЛитРес.
Безопасно оплатить книгу можно банковской картой Visa, MasterCard, Maestro, со счета мобильного телефона, с платежного терминала, в салоне МТС или Связной, через PayPal, WebMoney, Яндекс.Деньги, QIWI Кошелек, бонусными картами или другим удобным Вам способом.