When Does Randomization Fail to Protect Privacy?
Wenliang (Kevin) Du
Department of EECS, Syracuse University

Random Perturbation

- Agrawal and Srikant’s SIGMOD paper.
- Y = X + R: random noise R is added to the original data X, and only the disguised data Y is released.

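A minimal sketch of this disguising step in Python (the Gaussian noise and the sizes below are illustrative assumptions; the slide only fixes Y = X + R):

import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 1000, 8, 1.0                  # records, attributes, noise std (assumed)

X = rng.normal(size=(n, m))                 # original data X
R = rng.normal(scale=sigma, size=(n, m))    # random noise R, independent of X
Y = X + R                                   # disguised data Y is what gets released
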
Random Perturbation

- Most of the security analysis methods based on randomization treat each attribute separately. Is that enough?
- Does the relationship among the data affect privacy?

As we all know …

- We can’t perturb the same number several times.
- If we do, the original value can be estimated:
  - Let t be the original data.
  - Disguised data: t + R1, t + R2, …, t + Rm
  - Let Z = [(t + R1) + … + (t + Rm)] / m
  - Mean: E(Z) = t
  - Variance: Var(Z) = Var(R) / m

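A quick numerical check of this estimate (t, the noise scale, and m are arbitrary illustrative values):

import numpy as np

rng = np.random.default_rng(1)
t, sigma, m = 42.0, 5.0, 1000                     # true value, noise std, releases

disguised = t + rng.normal(scale=sigma, size=m)   # t + R1, ..., t + Rm
Z = disguised.mean()
print(Z)   # close to t = 42, since E(Z) = t and Var(Z) = Var(R)/m = 25/1000
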
This looks familiar …

- This is the data set (x, x, x, x, x, x, x, x).
- Random perturbation: (x + r1, x + r2, …, x + rm)
- We know this is NOT safe.
- Observation: the data set is highly correlated.

Let’s Generalize!

- Data set: (x1, x2, x3, …, xm)
- If the correlation among the data attributes is high, can we use it to improve our estimation (from the disguised data)?

Introduction

- A heuristic approach toward privacy analysis
- Principal Component Analysis (PCA)
- PCA-based data reconstruction
- Experiment results
- Conclusion and future work

Privacy Quantification: A Heuristic Approach

- Our goal: to find a best-effort algorithm that reconstructs the original data, based on the available information.
- Definition:

  PM = (1 / (n·m)) · Σ_{i=1}^{n} Σ_{j=1}^{m} L(D_{i,j}, D*_{i,j})

  where D is the original n × m data set, D* the reconstructed one, and L a per-entry distance.

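A sketch of this metric, assuming L is the per-entry absolute difference (the slide leaves the distance function L unspecified):

import numpy as np

def privacy_metric(D, D_star):
    # PM = (1/(n*m)) * sum_{i,j} L(D_ij, D*_ij), with L taken here as the
    # absolute difference; D is the original n-by-m data, D_star the reconstruction.
    n, m = D.shape
    return np.abs(D - D_star).sum() / (n * m)
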
How to use the correlation?

- High correlation → data redundancy
- Data redundancy → compression
- Our goal: lossy compression
  - We do want to lose information, but
  - we don’t want to lose too much data,
  - we do want to lose the added noise.

PCA Introduction

- The main use of PCA: reduce the dimensionality while retaining as much information as possible.
- 1st PC: containing the greatest amount of variation.
- 2nd PC: containing the next largest amount of variation.

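A small numpy illustration of this ordering, using two synthetic attributes built from a shared component (all numbers are illustrative):

import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=1000)
# Two highly correlated attributes: each is z plus a little independent noise.
X = np.column_stack([z + 0.1 * rng.normal(size=1000),
                     z + 0.1 * rng.normal(size=1000)])

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
print(eigvals[::-1] / eigvals.sum())   # 1st PC carries nearly all the variation
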
Original Data
[figure]

After Dimension Reduction
[figure]

For the Original Data

- They are correlated.
- If we remove 50% of the dimensions, the actual information loss might be less than 10%.

For the Random Noises

- They are not correlated.
- Their variance is evenly distributed across all directions.
- If we remove 50% of the dimensions, the actual noise loss should be 50%.

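A quick check of this contrast on the noise side, using i.i.d. Gaussian noise (sizes are illustrative):

import numpy as np

rng = np.random.default_rng(3)
R = rng.normal(size=(5000, 10))        # uncorrelated noise over 10 attributes

eigvals = np.linalg.eigvalsh(np.cov(R, rowvar=False))
print(eigvals[::-1] / eigvals.sum())   # roughly 1/10 in every direction, so
# dropping half the dimensions drops about half the noise, while for highly
# correlated data it would drop far less than half the information.
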
Data Reconstruction

- Applying PCA:
  - Find the principal components: C = Q Λ Q^T
  - Set Q̂ to be the first p columns of Q.
  - Reconstruct the data:

    X̂ = Y Q̂ Q̂^T = (X + R) Q̂ Q̂^T = X Q̂ Q̂^T + R Q̂ Q̂^T

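A minimal sketch of this reconstruction step, given a covariance matrix C for the data (how to estimate C from the disguised data is covered two slides below):

import numpy as np

def pca_reconstruct(Y, C, p):
    # Eigendecomposition C = Q Lambda Q^T; numpy returns eigenvalues in
    # ascending order, so flip the eigenvectors to put the largest first.
    eigvals, Q = np.linalg.eigh(C)
    Q = Q[:, ::-1]
    Q_hat = Q[:, :p]                  # first p principal components
    return Y @ Q_hat @ Q_hat.T        # X_hat = Y Q_hat Q_hat^T
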
Random Noise R

- How does R Q̂ Q̂^T affect the accuracy?
- Theorem:

  Var(R Q̂ Q̂^T) = (p / m) · Var(R)

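An empirical check of the theorem (illustrative sizes, i.i.d. Gaussian noise, and a random orthonormal Q):

import numpy as np

rng = np.random.default_rng(4)
m, p, sigma = 10, 3, 2.0
R = rng.normal(scale=sigma, size=(100_000, m))

Q, _ = np.linalg.qr(rng.normal(size=(m, m)))   # random orthonormal basis
Q_hat = Q[:, :p]

print((R @ Q_hat @ Q_hat.T).var())   # about (p/m) * Var(R) = 0.3 * 4 = 1.2
print(p / m * sigma**2)
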
How to Conduct PCA on Disguised Data?

- Estimating the covariance matrix:

  Cov(Y_i, Y_j) = Cov(X_i + R_i, X_j + R_j)
                = Cov(X_i, X_j) + σ²   for i = j
                = Cov(X_i, X_j)        for i ≠ j

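A sketch of this estimate, assuming the attacker knows the noise variance σ² (as in the Agrawal–Srikant setting, where the noise distribution is published so that data distributions can be reconstructed):

import numpy as np

def estimate_cov_X(Y, sigma):
    # Cov(Y) matches Cov(X) off the diagonal; each diagonal entry is
    # inflated by sigma^2, so subtract it back out.
    C = np.cov(Y, rowvar=False)
    return C - sigma**2 * np.eye(C.shape[0])

Feeding this estimate into the pca_reconstruct sketch above completes the attack pipeline.
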
Experiment 1: Increasing the Number of Attributes
[figures: Normal Distribution, Uniform Distribution]

Experiment 2: Increasing the Number of Principal Components
[figures: Normal Distribution, Uniform Distribution]

Experiment 3: Increasing the Standard Deviation of Noises
[figures: Normal Distribution, Uniform Distribution]

Conclusions

- Privacy analysis based on individual attributes is not sufficient: correlation can disclose information.
- PCA can filter out some randomness from a highly correlated data set.
- When does randomization fail? Answer: when the data correlation is high.
- Can it be cured?

Future Work

- How to improve the randomization to reduce the information disclosure?
  - Making the random noises correlated?
- How to combine PCA with univariate data reconstruction?