Social computing

Social Computing in Big Data Era – Privacy
Preservation and Fairness Awareness
Xintao Wu
Nov 19, 2015
Drivers of Data Computing
Reliability
Security
Privacy
Usability
6A’s
Anytime
Anywhere
Access to
Anything by
Anyone
Authorized
4V’s
Volume
Velocity
Variety
Veracity
AVC Denial Log Analysis
Volume and Velocity: 1 million log files per day, each with thousands of entries
S3, Hive and EMR on AWS
Social Media
Customer Analytics
Unstructured text (e.g., blog, tweet)
Product and review
Transaction database
Structured profile
name  sex  age  disease  salary    id  sex  age  address  income
Ada   F    18   cancer   25k       5   F    Y    NC       25k
Bob   M    25   heart    110k      3   M    Y    SC       110k
…
Network topology
(friendship, followship, interaction)
Retweet sequence
Entity resolution
Patterns
Temporal/spatial
Scalability
Visualization
Sentiment
Privacy
Variety, Veracity
10GB tweets per day
Belk and Lowe’s
UNCC Chancellor’s
special fund
A Single View to the Customer
Banking
Finance
Social
Media
Our
Known
History
Customer
Gaming
Entertain
Purchase
Outline
 Introduction
 Privacy Preserving Social Network Analysis
 Input perturbation
 Output perturbation
 Anti-discrimination Learning
Privacy Breach Cases
 Nydia Velázquez (1994)
 Medical record on her suicide attempt was disclosed
 AOL Search Log (2006)
 Anonymized release of 650K users’ search histories lasted
for less than 24 hours
 NetFlix Contest (2009)
 $1M contest was cancelled due to privacy lawsuit
 23andMe (2013)
 Genetic testing was ordered to discontinue by FDA due to
genetic privacy
Acxiom
 Privacy
 In 2003, EPIC alleged that Acxiom provided consumer
information to the US Army "to determine how information from
public and private records might be analyzed to help defend
military bases from attack."
 In 2013, Acxiom was among nine companies that the FTC
investigated to see how they collect and use consumer data.
 Security
 In 2003, more than 1.6 billion customer records were stolen
during the transmission of information to and from Acxiom's
clients.
Privacy Regulation -- Forrester
Most restricted
Effectively no restrictions
Restricted
Some restrictions
Minimal restrictions
No legislation or no information
Privacy Protection Laws
 USA
 HIPAA for health care
 Gramm-Leach-Bliley Act of 1999 for financial institutions
 COPPA for children's online privacy
 State regulations, e.g., California Senate Bill 1386
 Canada
 PIPEDA 2000 - Personal Information Protection and Electronic Documents Act
 European Union
 Directive 95/46/EC - Provides guidelines for member state legislation and forbids
sharing data with states that do not protect privacy
 Contractual obligations
 Individuals should have notice about how their data is used and have opt-out
choices
Privacy Preserving Data Mining
ssn    race   …  age  sex  income  …  disease
28223  Asian  …  20   M    85k     …  Cancer
28223  Asian  …  30   F    70k     …  Flu
28262  Black  …  20   M    120k    …  Heart
28261  White  …  26   M    23k     …  Cancer
.      .      …  .    .    .       …  .
28223  Asian  …  20   M    110k    …  Flu
name zip
69% unique on zip and birth date
87% with zip, birth date and gender
Generalization (k-anonymity, l-diversity, t-closeness)
Randomization
Privacy Preserving Data Mining
Social Network Data
Data miner
Data owner
release
name    sex  age  disease  salary    id  sex  age  disease  salary
Ada     F    18   cancer   25k       5   F    Y    cancer   25k
Bob     M    25   heart    110k      3   M    Y    heart    110k
Cathy   F    20   cancer   70k       6   F    Y    cancer   70k
Dell    M    65   flu      65k       1   M    O    flu      65k
Ed      M    60   cancer   300k      7   M    O    cancer   300k
Fred    M    24   flu      20k       2   M    Y    flu      20k
George  M    22   cancer   45k       9   M    Y    cancer   45k
Harry   M    40   flu      95k       4   M    M    flu      95k
Irene   F    45   heart    70k       8   F    M    heart    70k
Threat of Re-identification
Attacker
attack
id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k
Privacy breaches
Identity disclosure
Link disclosure
Attribute disclosure
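A minimal sketch of how identity and attribute disclosure can be checked on the released table. It assumes the attacker knows each person's sex and age group (treated here as the quasi-identifiers); the function name `equivalence_classes` and the wording of the risk notes are illustrative, not taken from the slides.

```python
from collections import defaultdict

# Released table from the slides: (id, sex, age group, disease, salary).
released = [
    (5, "F", "Y", "cancer", "25k"),
    (3, "M", "Y", "heart", "110k"),
    (6, "F", "Y", "cancer", "70k"),
    (1, "M", "O", "flu", "65k"),
    (7, "M", "O", "cancer", "300k"),
    (2, "M", "Y", "flu", "20k"),
    (9, "M", "Y", "cancer", "45k"),
    (4, "M", "M", "flu", "95k"),
    (8, "F", "M", "heart", "70k"),
]

def equivalence_classes(rows):
    """Group the sensitive values by the assumed quasi-identifiers (sex, age group)."""
    groups = defaultdict(list)
    for _id, sex, age, disease, _salary in rows:
        groups[(sex, age)].append(disease)
    return groups

groups = equivalence_classes(released)

# The table is k-anonymous for k = size of the smallest equivalence class.
k = min(len(diseases) for diseases in groups.values())
print("k =", k)

for key, diseases in sorted(groups.items()):
    if len(diseases) == 1:
        note = "unique record: identity disclosure risk"
    elif len(set(diseases)) == 1:
        note = "all records share one disease: attribute disclosure risk"
    else:
        note = ""
    print(key, diseases, note)
```

Under these assumptions the two young female records both carry "cancer", so an attacker who knows only that Ada is a young woman in the data learns her disease without identifying her row.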
Privacy Preservation in Social Network Analysis
• Input Perturbation
• K-anonymity
• Generalization
• Randomization
Our Work
 Feature preservation randomization
 Spectrum preserving randomization (SDM08)
 Markov chain based feature preserving randomization
(SDM09)
 Reconstruction from randomized graph (SDM10)
 Link privacy (from the attacker perspective)
 Exploiting node similarity feature (PAKDD09 Best Student
Paper Runner-up Award)
 Exploiting graph space via Markov chain (SDM09)
Output Perturbation
Data miner
Data owner
Query f
Query result + noise
name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k
Cannot be used to derive whether any
individual is included in the database
Differential Guarantee [Dwork, TCC06]
name   disease
Ada    cancer
Bob    heart
Cathy  cancer
Dell   flu
Ed     cancer
Fred   flu
K: f = count(#cancer); f(x) + noise = 3 + noise

name   disease
Ada    cancer
Bob    heart
Cathy  cancer
Dell   flu
Ed     cancer
Fred   flu
K: f = count(#cancer); f(x’) + noise = 2 + noise
achieving Opt-Out
Differential Privacy
ε is a privacy parameter:
smaller ε = stronger privacy
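For reference, the standard statement of ε-differential privacy can be written with the mechanism K and the neighboring databases x and x’ used on the earlier slide. The symbol S (any set of possible outputs) is notation assumed here.

```latex
\Pr[K(x) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[K(x') \in S]
\quad \text{for all } S \text{ and all } x, x' \text{ differing in one record.}
```

When ε is close to 0 the two output distributions are nearly identical, which is why a smaller ε gives stronger privacy.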
Calibrating Noise
 Laplace distribution
 Sensitivity of function
 global sensitivity
 local sensitivity
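A minimal sketch of noise calibration with the Laplace distribution, assuming the standard Laplace mechanism (noise scale = sensitivity / ε). The helper names `laplace_noise` and `laplace_mechanism` are illustrative; the disease column is the nine-record table from the slides.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace noise of scale sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

# Disease column of the nine records shown on the slides.
diseases = ["cancer", "heart", "cancer", "flu", "cancer",
            "flu", "cancer", "flu", "heart"]
true_count = diseases.count("cancer")

rng = random.Random(42)  # fixed seed only to make the demo repeatable
for epsilon in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=epsilon, rng=rng)
    print(f"epsilon={epsilon:>4}: noisy count = {noisy:.2f} (true = {true_count})")
```

A count query has sensitivity 1, so the only tuning knob is ε: small ε means a wide noise distribution and a less accurate answer.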
Sensitivity
L-1 distance for
vector output
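The usual definition of global sensitivity, and the Laplace mechanism calibrated to it, can be written as follows. The symbols GS_f and d (the output dimension) are notation assumed here; x and x’ range over databases that differ in one record.

```latex
GS_f \;=\; \max_{x,\,x'} \; \lVert f(x) - f(x') \rVert_1 ,
\qquad
K(x) \;=\; f(x) + \bigl(\mathrm{Lap}(GS_f / \varepsilon)\bigr)^{d}
```

Each of the d output coordinates receives independent Laplace noise with scale GS_f / ε.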
name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k

Function f       sensitivity
Count(#cancer)   1
Sum(salary)      u (domain upper bound)
Avg(salary)      u/n
Data mining tasks can be decomposed to a
sequence of simple functions.
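A sketch of answering the three simple functions from the table with one overall privacy budget. It assumes sequential composition (the ε values of the individual queries add up), an even budget split, a salary upper bound u = 300 (in thousands), and a publicly known n; none of these choices are stated on the slide.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

# Columns of the nine records shown on the slides (salary in thousands).
salaries_k = [25, 110, 70, 65, 300, 20, 45, 95, 70]
diseases = ["cancer", "heart", "cancer", "flu", "cancer",
            "flu", "cancer", "flu", "heart"]

U = 300              # assumed domain upper bound u for salary
n = len(salaries_k)  # assumed to be public

# Each query: (true answer, sensitivity), following the table on the slide.
queries = {
    "Count(#cancer)": (diseases.count("cancer"), 1),
    "Sum(salary)": (sum(salaries_k), U),
    "Avg(salary)": (sum(salaries_k) / n, U / n),
}

def answer_all(queries, total_epsilon, rng):
    """Answer every query with an equal share of the total budget."""
    eps_each = total_epsilon / len(queries)
    return {
        name: value + laplace_noise(sensitivity / eps_each, rng)
        for name, (value, sensitivity) in queries.items()
    }

rng = random.Random(7)  # fixed seed only to make the demo repeatable
answers = answer_all(queries, total_epsilon=1.0, rng=rng)
for name, noisy in answers.items():
    print(f"{name}: true = {queries[name][0]:.2f}, noisy = {noisy:.2f}")
```

The more simple functions a mining task is decomposed into, the smaller each share of ε becomes and the noisier every individual answer gets.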
Challenge in OSN
Degree sequence: sensitivity Δ = 2, so noise from Lap(2/ε) is needed
[1,1,3,3,3,3,2]
[1,1,3,3,2,2,2]
# of triangles: sensitivity Δ = n-2, so huge noise is needed
High sensitivity!
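The two sensitivities can be checked on a small graph. The sketch below assumes edge-level neighbors (two graphs that differ in exactly one edge); the 7-node worst-case graph is constructed for illustration and is not the graph drawn on the slide.

```python
from itertools import combinations

def degree_sequence(n, edges):
    """Degree of each node 0..n-1 in an undirected graph."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def count_triangles(n, edges):
    """Count node triples whose three connecting edges are all present."""
    edge_set = {frozenset(e) for e in edges}
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# The two degree sequences from the slide differ by 2 in L1 distance.
print(l1_distance([1, 1, 3, 3, 3, 3, 2], [1, 1, 3, 3, 2, 2, 2]))

# Worst case for triangles: nodes 0 and 1 are each linked to all other nodes.
n = 7
base = [(0, w) for w in range(2, n)] + [(1, w) for w in range(2, n)]
with_edge = base + [(0, 1)]  # neighboring graph: one extra edge

degree_change = l1_distance(degree_sequence(n, base), degree_sequence(n, with_edge))
triangle_change = count_triangles(n, with_edge) - count_triangles(n, base)
print("degree sequence change:", degree_change)
print("triangle count change:", triangle_change, "= n - 2")
```

One edge changes exactly two degrees by 1 regardless of graph size, while the same edge can close a triangle with every other node, so the triangle count needs noise that grows with n.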
Advanced Mechanisms
 Possible theoretical approaches
 Smooth sensitivity
 Exponential mechanism
 Functional mechanism
 Sampling
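Of the approaches listed, the exponential mechanism is the simplest to sketch in a few lines. The example assumes the task is to release the most common disease in the nine-record table, with the count of each disease as the utility score (sensitivity 1); the task and the function names are illustrative, not from the slides.

```python
import math
import random

def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    top = max(utilities.values())  # shift by the max for numerical stability
    weights = {
        candidate: math.exp(epsilon * (u - top) / (2.0 * sensitivity))
        for candidate, u in utilities.items()
    }
    total = sum(weights.values())
    return {candidate: w / total for candidate, w in weights.items()}

def exponential_mechanism(utilities, epsilon, sensitivity, rng):
    """Sample one candidate according to the exponential mechanism."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    candidates = list(probs)
    return rng.choices(candidates, weights=[probs[c] for c in candidates], k=1)[0]

# Utility of each candidate answer = number of records with that disease.
counts = {"cancer": 4, "flu": 3, "heart": 2}

probs = exponential_mechanism_probs(counts, epsilon=1.0, sensitivity=1)
for disease, p in probs.items():
    print(f"P(release {disease}) = {p:.3f}")

rng = random.Random(3)  # fixed seed only to make the demo repeatable
print("released:", exponential_mechanism(counts, epsilon=1.0, sensitivity=1, rng=rng))
```

Unlike the Laplace mechanism, the output here is a category rather than a number, so no noise has to be added to a value with high sensitivity.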
Our Work
 DP-preserving cluster coefficient (ASONAM12)
 DP-preserving spectral graph analysis (PAKDD13)
 Linear-refinement of DP-preserving query answering
(PAKDD13 Best Application Paper)
 DP-preserving graph generation based on degree
correlation (TDP13)
 Regression model fitting under differential privacy and
model inversion attack (IJCAI 15)
 DP-preservation for deep auto-encoders (AAAI 16)
SMASH (NIH R01GM103309)
Genetic Privacy
(NSF 1502273 and 1523115)
BIBM13 Best Paper Award
Outline
 Introduction
 Privacy Preserving Social Network Analysis
 Input perturbation
 Output perturbation
 Anti-discrimination Learning
What is discrimination?

Discrimination refers to unjustified distinctions of
individuals based on their membership in a certain
group.

Federal Laws and regulations disallow discrimination on
several grounds:

Gender, Age, Marital Status, Sexual Orientation, Race,
Religion or Belief, Disability or Illness ……

These attributes are referred to as the protected
attributes.
Predictive Learning
Finding evidence of discrimination
Historical Data
Test Data
Classifier
Building non discriminatory classifiers
Result
Motivating Example
name   sex  age  program  acceptance
Ada    F    18   cancer   +
Bob    M    25   heart    -
Cathy  F    20   cancer   +
Ed     M    60   cancer   -
Fred   M    24   flu      -
…
Suppose 2000 applicants, 1000 M and 1000 F
Acceptance ratio 36% M vs. 24% F
Do we have discrimination here?
Discrimination Discovery

Assuming a causal Bayesian network that faithfully represents the data.
Protected attribute: c+, c-
Decision attribute: e+, e-
∆P = P(e+|c+) − P(e+|c−)

Discriminatory effect if ∆P > τ, where τ is a threshold for discrimination depending on
law (e.g., 5%).
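A small worked instance of the ∆P test, using the figures from the motivating example (1000 male and 1000 female applicants, 36% vs. 24% accepted). It assumes c+ denotes male applicants and e+ denotes acceptance, and it uses raw frequencies only; the slide's causal Bayesian network is not modeled here.

```python
def risk_difference(accepted_pos, total_pos, accepted_neg, total_neg):
    """Return ΔP = P(e+|c+) - P(e+|c-) computed from group counts."""
    return accepted_pos / total_pos - accepted_neg / total_neg

TAU = 0.05  # example threshold from the slide (5%)

# 36% of 1000 male applicants and 24% of 1000 female applicants accepted.
delta_p = risk_difference(360, 1000, 240, 1000)

print(f"ΔP = {delta_p:.3f}")
print("discriminatory effect flagged" if delta_p > TAU else "below threshold")
```

A raw ∆P above τ is only a starting point: whether it reflects discrimination depends on how the protected attribute influences the decision, which is the question the deck's "Do we have discrimination here?" slide raises.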
Motivating Examples
Case I: ∆P = 0.1
Case II: ∆P = -0.01
Motivating Examples
Case II: ∆P = -0.01
Case III: ∆P = 0.104
Discrimination Analysis
• Discrimination is treatment or consideration of, or making a
distinction in favor of or against, a person or thing based on the
group, class, or category to which that person or thing is perceived to
belong to rather than on individual merit. (Wikipedia)
• Tweets discrimination analysis aims to detect whether a tweet
contains discrimination against gender, race, age, etc.
A Typical Deep Learning Pipeline for Text Classification
Word → Word Representation → semantic composition (Deep Learning Model) → Text Representation → Softmax Classifier
Deep learning models: Multilayer Perceptron, Recursive Neural Network, Recurrent Neural Network, Convolutional Neural Network
[Figure: Tweet (want … girls learn photography) → Word Embeddings → LSTM-RNN → Mean Pooling → Tweet Representation → Logistic Regression → y]
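A minimal forward-pass sketch of the pipeline in the figure: word embeddings, an LSTM, mean pooling over the hidden states, and logistic regression. All weights are random and untrained, the dimensions are arbitrary, and the class and function names (`TinyLSTM`, `classify`) are illustrative; only the four tokens visible on the slide are used.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Return W x + b for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyLSTM:
    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: init(hidden_dim, input_dim + hidden_dim, rng) for g in "ifoc"}
        self.b = {g: [0.0] * hidden_dim for g in "ifoc"}

    def step(self, x, h, c):
        xh = x + h  # concatenate the input vector and previous hidden state
        i = [sigmoid(z) for z in affine(self.W["i"], xh, self.b["i"])]
        f = [sigmoid(z) for z in affine(self.W["f"], xh, self.b["f"])]
        o = [sigmoid(z) for z in affine(self.W["o"], xh, self.b["o"])]
        g = [math.tanh(z) for z in affine(self.W["c"], xh, self.b["c"])]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

    def run(self, xs):
        """Return the hidden state after each input vector."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        states = []
        for x in xs:
            h, c = self.step(x, h, c)
            states.append(h)
        return states

def mean_pool(states):
    """Average the hidden states to get one tweet representation."""
    return [sum(col) / len(states) for col in zip(*states)]

def classify(tweet, embeddings, lstm, w, b):
    """Logistic regression on the mean-pooled LSTM states."""
    xs = [embeddings[t] for t in tweet.lower().split() if t in embeddings]
    rep = mean_pool(lstm.run(xs))
    return sigmoid(sum(wi * ri for wi, ri in zip(w, rep)) + b)

rng = random.Random(0)
vocab = ["want", "learn", "photography", "girls"]
embeddings = {word: [rng.uniform(-1, 1) for _ in range(4)] for word in vocab}
lstm = TinyLSTM(input_dim=4, hidden_dim=3, rng=rng)
w = [rng.uniform(-1, 1) for _ in range(3)]

p = classify("want learn photography girls", embeddings, lstm, w, b=0.0)
print(f"P(discriminatory) = {p:.3f} (untrained weights, illustration only)")
```

In practice the embeddings, LSTM weights, and classifier would be trained jointly on labeled tweets; this sketch only shows how data flows through the stages.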
Summary
1. Preserving Privacy Values
2. Educating Robustly and
Responsibly
3. Big Data and Discrimination
4. Law Enforcement & Security
5. Data as a Public Resource
Acknowledgement
Collaborators:
• UNCC: Aidong Lu, Xinghua Shi, Yong Ge
• Oregon: Jun Li, Dejing Dou
• PeaceHealth: Brigitte Piniewski
• UIUC: Tao Xie
DPL members:
• UNCC: PhD graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying.
PhD students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)
• UofA: Lu Zhang (postdoc), Yongkai Wu, Cheng Si, Miao Xie, Shuhan Yuan
Funding support:
Genome Wide Association Study