Anonymization Algorithms - Microaggregation and Clustering.ppt
《Anonymization Algorithms - Microaggregation and Clustering.ppt》由会员分享,可在线阅读,更多相关《Anonymization Algorithms - Microaggregation and Clustering.ppt(55页珍藏版)》请在麦多课文档分享上搜索。
1、Anonymization Algorithms - Microaggregation and Clustering,Li XiongCS573 Data Privacy and Anonymity,Anonymization using Microaggregation or Clustering,Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonym
2、ity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007,Anonymization MethodsPerturbative: distort the data statistics computed on the perturbed dataset should not differ
3、significantly from the original microaggregation, additive noise Non-perturbative: dont distort the data generalization: combine several categories to form new less specific category suppression: remove values of a few attributes in some records, or entire records,Types of data Continuous: attribute
4、 is numeric and arithmetic operations can be performed on it Categorical: attribute takes values over a finite set and standard arithmetic operations dont make sense Ordinal: ordered range of categories , min and max operations are meaningful Nominal: unordered only equality comparison operation is
5、meaningful,Measure tradeoffsk-Anonymity: a dataset satisfies k-anonymity for k 1 if at least k records exist for each combination of quasi-identifier valuesassuming k-anonymity is enough protection against disclosure risk, one can concentrate on information loss measures,Satisfying k-anonymity using
6、 generalization and suppression is NP-hard Computational cost of finding the optimal generalization How to determine the subset of appropriate generalizations semantics of categories and intended use of data e.g., ZIP code: 08201, 08205 - 0820* makes sense 08201, 05201 - 0*201 doesnt,Critique of Gen
7、eralization/Suppression,Problems cont. How to apply a generalization globally may generalize records that dont need it locally difficult to automate and analyze number of generalizations is even larger Generalization and suppression on continuous data are unsuitable a numeric attribute becomes categ
8、orical and loses its numeric semantics,Problems cont. How to optimally combine generalization and suppression is unknown Use of suppression is not homogenous suppress entire records or only some attributes of some records blank a suppressed value or replace it with a neutral value,Microaggregation/C
9、lustering,Two steps: Partition original dataset into clusters of similar records containing at least k records For each cluster, compute an aggregation operation and use it to replace the original records e.g., mean for continuous data, median for categorical data,Advantages: a unified approach, unl
10、ike combination of generalization and suppression Near-optimal heuristics exist Doesnt generate new categories Suitable for continuous data without removing their numeric semantics,Advantages cont.,Reduces data distortion K-anonymity requires an attribute to be generalized or suppressed, even if all
11、 but one tuple in the set have the same value. Clustering allows a cluster center to be published instead, “enabling us to release more information.”,Original Table,2-Anonymity with Generalization,Generalization allows pre-specified ranges,2-Anonymity with Clustering,Cluster centers (27,70, 37,115)
12、published,27=(25+27+29)/3 70=(50+60+100)/3,37=(35+39)/2 115=(110+120)/2,Another example: no common value among each attribute,Generalization vs. clustering,Generalized version of the table would need to suppress all attributes. Clustered Version of the table would publish the cluster center as (1, 1
13、, 1, 1), and the radius as 1.,Anonymization using Microaggregation or Clustering,Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anon
14、ymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007,Multivariate microaggregation algorithm MDAV-generic: Generic version of MDAV algorithm (Maximum Distance to Average Vector) from previous papers Works with any type of data (continuous
15、, ordinal, nominal), aggregation operator and distance calculation,MDAV-generic(R: dataset, k: integer)while |R| 3k1. compute average record x of all records in R2. find most distant record xr from x3. find most distant record xs from xr4. form two clusters from k-1 records closest to xr and k-1 clo
16、sest to xs5. Remove the clusters from R and run MDAV-generic onthe remaining datasetend whileif 3k-1 |R| 2k1. compute average record x of remaining records in R2. find the most distant record xr from x3. form a cluster from k-1 records closest to x4. form another cluster containing the remaining rec
17、ordselse (fewer than 2k records in R) form a new cluster from the remaining records,MDAV-generic for continuous attributes use arithmetic mean and Euclidean distance standardize attributes (subtract mean and divide by standard deviation) to give them equal weight for computing distances After MDAV-g
18、eneric, destandardize attributes xij is value of k-anonymized jth attribute for the ith record m10(j) and m2(j) are mean and variance of the k-anonymized jth attribute u10(j) and u2(j) are mean and variance of the original jth attribute,MDAV-generic for ordinal attributes The distance between two ca
19、tegories a and b in an attribute Vi: dord(a,b) = (|i| i b|) / |D(Vi)| i.e., the number of categories separating a and b divided by the number of categories in the attribute Nominal attributes The distance between two values is defined according to equality: 0 if theyre equal, else 1,Empirical Result
20、s,Continuous attributes From the U.S. Current Population Survey (1995) 1080 records described by 13 continuous attributes Computed k-anonymity for k = 3, ., 9 and quasi-identifiers with 6 and 13 attributes Categorical attributes From the U.S. Housing Survey (1993) Three ordinal and eight nominal att
21、ributes Computed k-anonymity for k = 2, ., 9 and quasi-identifiers with 3, 4, 8 and 11 attributes,IL measures for continuous attributes IL1 = mean variation of individual attributes in original and k-anonymous datasets IL2 = mean variation of attribute means in both datasets IL3 = mean variation of
22、attribute variances IL4 = mean variation of attribute covariances IL5 = mean variation of attribute Pearsons correlations IL6 = 100 times the average of IL1-6,MDAV-generic preserves means and variances The impact on the non-preserved statistics grows with the quasi-identifier length, as one would ex
23、pect For a fixed-quasi-identifier length, the impact on the non-preserved statistics grows with k,IL measures for categorical attributes Dist: direct comparison of original and protected values using a categorical distance CTBIL: mean variation of frequencies in contingency tables for original and p
24、rotected data (based on another paper by Domingo-Ferrer and Torra) ACTBIL: CTBIL divided by the total number of cells in all considered tables EBIL: Entropy-based information loss (based on another paper by Domingo-Ferrer and Torra),Ordinal attribute protection using median,Ordinal attribute protect
25、ion using convex median,Anonymization using Microaggregation or Clustering,Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- ANONYMIZATIONALGORITHMSMICROAGGREGATIONANDCLUSTERINGPPT
链接地址:http://www.mydoc123.com/p-378427.html