Social Computing in Big Data Era – Privacy Preservation and Fairness Awareness
Xintao Wu
Nov 19, 2015

Drivers of Data Computing
• Reliability
• Security
• Privacy
• Usability
6A's: Anytime, Anywhere Access to Anything by Anyone Authorized

4V's
• Volume
• Velocity
• Variety
• Veracity

AVC Denial Log Analysis
• Volume and Velocity: 1 million log files per day, each with thousands of entries
• S3, Hive and EMR on AWS

Social Media Customer Analytics
• Unstructured text (e.g., blog, tweet)
• Product and review
• Transaction database
• Structured profile

  name  sex  age  disease  salary
  Ada   F    18   cancer   25k
  Bob   M    25   heart    110k
  …

  id  sex  age  address  income
  5   F    Y    NC       25k
  3   M    Y    SC       110k
  …

• Network topology (friendship, followship, interaction)
• Retweet sequence, Entity resolution, Patterns, Temporal/spatial, Scalability, Visualization, Sentiment, Privacy
• Variety, Veracity: 10GB tweets per day
• Belk and Lowe's; UNCC Chancellor's special fund

A Single View to the Customer
Diagram labels: Banking, Finance, Social Media, Our Known History, Customer, Gaming, Entertain, Purchase

Outline
• Introduction
• Privacy Preserving Social Network Analysis
  • Input perturbation
  • Output perturbation
• Anti-discrimination Learning

Privacy Breach Cases
• Nydia Velázquez (1994): Medical record on her suicide attempt was disclosed
• AOL Search Log (2006): Anonymized release of 650K users' search histories lasted for less than 24 hours
• Netflix Contest (2009): $1M contest was cancelled due to a privacy lawsuit
• 23andMe (2013): FDA ordered the genetic testing to be discontinued due to genetic privacy

Acxiom
Privacy
• In 2003, EPIC alleged that Acxiom provided consumer information to the US Army "to determine how information from public and private records might be analyzed to help defend military bases from attack."
• In 2013, Acxiom was among nine companies that the FTC investigated to see how they collect and use consumer data.
Security
• In 2003, more than 1.6 billion customer records were stolen during the transmission of information to and from Acxiom's clients.

Privacy Regulation -- Forrester
Legend: Most restricted; Restricted; Some restrictions; Minimal restrictions; Effectively no restrictions; No legislation or no information

Privacy Protection Laws
• USA
  • HIPAA for health care
  • Gramm-Leach-Bliley Act of 1999 for financial institutions
  • COPPA for children's online privacy
  • State regulations, e.g., California Senate Bill 1386
• Canada
  • PIPEDA 2000: Personal Information Protection and Electronic Documents Act
• European Union
  • Directive 95/46/EC: provides guidelines for member state legislation and forbids sharing data with states that do not protect privacy
• Contractual obligations
  • Individuals should have notice about how their data is used and have opt-out choices

Privacy Preserving Data Mining

  ssn  name  zip    race   …  age  sex  income  …  disease
             28223  Asian  …  20   M    85k     …  Cancer
             28223  Asian  …  30   F    70k     …  Flu
             28262  Black  …  20   M    120k    …  Heart
             28261  White  …  26   M    23k     …  Cancer
             .      .      …  .    .    .       …  .
             28223  Asian  …  20   M    110k    …  Flu

• 69% unique on zip and birth date
• 87% with zip, birth date and gender
• Generalization (k-anonymity, l-diversity, t-closeness)
• Randomization

Social Network Data
Data owner → release → Data miner

Data owner's table:

  name    sex  age  disease  salary
  Ada     F    18   cancer   25k
  Bob     M    25   heart    110k
  Cathy   F    20   cancer   70k
  Dell    M    65   flu      65k
  Ed      M    60   cancer   300k
  Fred    M    24   flu      20k
  George  M    22   cancer   45k
  Harry   M    40   flu      95k
  Irene   F    45   heart    70k

Released table:

  id  sex  age  disease  salary
  5   F    Y    cancer   25k
  3   M    Y    heart    110k
  6   F    Y    cancer   70k
  1   M    O    flu      65k
  7   M    O    cancer   300k
  2   M    Y    flu      20k
  9   M    Y    cancer   45k
  4   M    M    flu      95k
  8   F    M    heart    70k

Threat of Re-identification
An attacker attacks the released table (id, sex, age, disease, salary).

Privacy breaches
• Identity disclosure
• Link disclosure
• Attribute disclosure

Privacy Preservation in Social Network Analysis
• Input Perturbation
• K-anonymity
• Generalization
• Randomization

Our Work
Feature preservation randomization
• Spectrum preserving randomization (SDM08)
• Markov chain based feature preserving randomization (SDM09)
Reconstruction from randomized graph (SDM10)
Link privacy (from the attacker perspective)
• Exploiting node similarity feature (PAKDD09 Best Student Paper Runner-up Award)
• Exploiting graph space via Markov chain (SDM09)

Output Perturbation
Data miner sends query f → Data owner returns query result + noise

  name    sex  age  disease  salary
  Ada     F    18   cancer   25k
  Bob     M    25   heart    110k
  Cathy   F    20   cancer   70k
  Dell    M    65   flu      65k
  Ed      M    60   cancer   300k
  Fred    M    24   flu      20k
  George  M    22   cancer   45k
  Harry   M    40   flu      95k
  Irene   F    45   heart    70k

The noisy result cannot be used to derive whether any individual is included in the database.

Differential Guarantee [Dwork, TCC06]

Database x:

  name   disease
  Ada    cancer
  Bob    heart
  Cathy  cancer
  Dell   flu
  Ed     cancer
  Fred   flu

Database x′: the same table with one cancer record removed (opt-out).

• Mechanism K answers f = count(#cancer)
• On x: f(x) + noise = 3 + noise
• On x′: f(x′) + noise = 2 + noise
• The two noisy answers are hard to tell apart, achieving Opt-Out

Differential Privacy
• K is ε-differentially private if, for all neighboring databases x and x′ (differing in one record) and every set S of outputs, Pr[K(x) ∈ S] ≤ e^ε · Pr[K(x′) ∈ S]
• ε is a privacy parameter: smaller ε = stronger privacy

Calibrating Noise
• Laplace distribution
• Sensitivity of function
  • global sensitivity
  • local sensitivity

Sensitivity
• L-1 distance for vector output
• Computed over the table shown under Output Perturbation:

  Function f      Sensitivity
  Count(#cancer)  1
  Sum(salary)     u (domain upper bound)
  Avg(salary)     u/n

• Data mining tasks can be decomposed into a sequence of simple functions.

Challenge in OSN
• Degree sequence: Δ = 2, so noise from Lap(2/ε) is needed
  • Example: [1,1,3,3,3,3,2] vs. [1,1,3,3,2,2,2]
• Number of triangles: Δ = n−2, so huge noise is needed
• High sensitivity!
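
The output-perturbation slides above (count query, Laplace noise, sensitivity) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice ε = 0.5, the fixed seed, and the decision that Ed is the record who opts out are all assumptions made for illustration. Only the six-row table, the count query, and the two degree sequences come from the slides.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def l1_distance(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def count_cancer(db):
    """The query f = count(#cancer) from the slides."""
    return sum(1 for _, disease in db if disease == "cancer")


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace noise with scale sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)


# Database x from the Differential Guarantee slide.
x = [("Ada", "cancer"), ("Bob", "heart"), ("Cathy", "cancer"),
     ("Dell", "flu"), ("Ed", "cancer"), ("Fred", "flu")]
# Neighboring database x': one cancer record opts out (picking Ed is an assumption).
x_prime = [row for row in x if row[0] != "Ed"]

epsilon = 0.5            # assumed privacy budget, not from the slides
rng = random.Random(42)  # fixed seed only to make the sketch repeatable

# A count changes by at most 1 when one record is added or removed.
noisy_x = laplace_mechanism(count_cancer(x), 1, epsilon, rng)
noisy_x_prime = laplace_mechanism(count_cancer(x_prime), 1, epsilon, rng)

# L1 distance between the two degree sequences shown on the slide.
degree_distance = l1_distance([1, 1, 3, 3, 3, 3, 2], [1, 1, 3, 3, 2, 2, 2])

print("true counts:", count_cancer(x), count_cancer(x_prime))
print("noisy counts:", noisy_x, noisy_x_prime)
print("degree sequence L1 distance:", degree_distance)
```

The noise scale grows with sensitivity/ε, which is why a query such as the triangle count, with sensitivity n−2, needs far more noise than a simple count.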

Advanced Mechanisms
Possible theoretical approaches
• Smooth sensitivity
• Exponential mechanism
• Functional mechanism
• Sampling

Our Work
• DP-preserving clustering coefficient (ASONAM12)
• DP-preserving spectral graph analysis (PAKDD13)
• Linear-refinement of DP-preserving query answering (PAKDD13 Best Application Paper)
• DP-preserving graph generation based on degree correlation (TDP13)
• Regression model fitting under differential privacy and model inversion attack (IJCAI 15)
• DP-preservation for deep auto-encoders (AAAI 16)

SMASH (NIH R01GM103309)

Genetic Privacy (NSF 1502273 and 1523115)
BIBM13 Best Paper Award

Outline
• Introduction
• Privacy Preserving Social Network Analysis
  • Input perturbation
  • Output perturbation
• Anti-discrimination Learning

What is discrimination?
• Discrimination refers to unjustified distinctions of individuals based on their membership in a certain group.
• Federal laws and regulations disallow discrimination on several grounds: Gender, Age, Marital Status, Sexual Orientation, Race, Religion or Belief, Disability or Illness, …
• These attributes are referred to as the protected attributes.

Predictive Learning
• Finding evidence of discrimination
• Building non-discriminatory classifiers
Diagram labels: Historical Data, Classifier, Test Data, Result

Motivating Example

  name   sex  age  program  acceptance
  Ada    F    18   cancer   +
  Bob    M    25   heart    −
  Cathy  F    20   cancer   +
  Ed     M    60   cancer   −
  Fred   M    24   flu      −
  …

• Suppose 2000 applicants, 1000 M and 1000 F
• Acceptance ratio: 36% M vs. 24% F
• Do we have discrimination here?

Discrimination Discovery
• Assume a causal Bayesian network that faithfully represents the data.
• Protected attribute: values c+, c−
• Decision attribute: values e+, e−
• ∆P = P(e+|c+) − P(e+|c−)
• There is a discriminatory effect if ∆P > τ, where τ is a threshold for discrimination depending on the law (e.g., 5%).
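
As a worked instance of the ∆P formula above, the following sketch plugs in the numbers from the Motivating Example. Treating male as c+ and female as c− is an assumption, as are the function and variable names. It computes only the raw difference in acceptance rates and does not model the causal Bayesian network the slide assumes.

```python
def risk_difference(accepted_pos, total_pos, accepted_neg, total_neg):
    """Delta P = P(e+ | c+) - P(e+ | c-), estimated from raw counts."""
    return accepted_pos / total_pos - accepted_neg / total_neg


# Motivating Example: 1000 M and 1000 F applicants, 36% vs. 24% accepted.
# Assumption: c+ = male, c- = female.
delta_p = risk_difference(360, 1000, 240, 1000)

tau = 0.05  # example threshold from the slide (5%)
flagged = delta_p > tau

print("Delta P (roughly 0.12):", delta_p)
print("exceeds threshold tau:", flagged)
```

Here the raw ∆P is well above the 5% threshold, but a raw gap alone does not settle whether the distinction is unjustified.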

Motivating Examples
• Case I: ∆P = 0.1
• Case II: ∆P = −0.01 (figure label: ∆P + +)
• Case III: ∆P = 0.104 (figure label: ∆P − +)

Discrimination Analysis
• Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong rather than on individual merit. (Wikipedia)
• Tweet discrimination analysis aims to detect whether a tweet contains discrimination against gender, race, age, etc.

A Typical Deep Learning Pipeline for Text Classification
word → Word Representation → semantic composition (Deep Learning Model) → Text Representation → Softmax Classifier
Deep learning models:
• Multilayer Perceptron
• Recursive Neural Network
• Recurrent Neural Network
• Convolutional Neural Network

Tweet pipeline
Tweet (… want … girls learn photography) → Word Embeddings → LSTM-RNN → Mean Pooling → Tweet Representation → Logistic Regression → y

Summary
1. Preserving Privacy Values
2. Educating Robustly and Responsibly
3. Big Data and Discrimination
4. Law Enforcement & Security
5. Data as a Public Resource

Acknowledgement
Collaborators:
• UNCC: Aidong Lu, Xinghua Shi, Yong Ge
• Oregon: Jun Li, Dejing Dou
• PeaceHealth: Brigitte Piniewski
• UIUC: Tao Xie
DPL members:
• UNCC: PhD graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying. PhD students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)
• UofA: Lu Zhang (postdoc), Yongkai Wu, Cheng Si, Miao Xie, Shuhan Yuan
Funding support:

Genome Wide Association Study