Data Mining Technologies for Digital Libraries & Web Information Systems
Ramakrishnan Srikant

Talk Outline
Taxonomy Integration (WWW 2001, with R. Agrawal)
Searching with Numbers
Privacy-Preserving Data Mining

Taxonomy Integration
B2B electronics portal: 2000 categories, 200K datasheets
[Figure: Master Catalog tree (ICs with subcategories DSP, Mem., Logic; documents a to f) beside the New Catalog (Cat1, Cat2; documents x, y, z, w)]

Taxonomy Integration (2)
After integration:
[Figure: Master Catalog tree (ICs: DSP, Mem., Logic) with the new documents x, y, z, w placed alongside documents a to f]

Goal
Use affinity information in new catalog.
– Products in same category are similar.
Accuracy boost depends on match between two categorizations.

Problem Statement
Given
– master categorization M: categories C1, C2, …, Cn; set of documents in each category
– new categorization N: categories S1, S2, …, Sn; set of documents in each category
Find the category in M for each document in N
– Standard Alg: Estimate Pr(Ci | d)
– Enhanced Alg: Estimate Pr(Ci | d, S)

Naive Bayes Classifier
Estimate probability of document d belonging to class Ci
$\Pr(C_i \mid d) = \frac{\Pr(C_i)\,\Pr(d \mid C_i)}{\Pr(d)}$
where
$\Pr(C_i) = \frac{\text{number of documents in } C_i}{\text{total number of documents}}$
$\Pr(d \mid C_i) = \prod_{t \in d} \Pr(t \mid C_i)$
$\Pr(t \mid C_i) = \frac{\text{number of occurrences of } t \text{ in } C_i}{\text{total words in } C_i}$

Enhanced Naïve Bayes
Standard: $\Pr(C_i \mid d) \propto \Pr(C_i)\,\Pr(d \mid C_i)$
Enhanced: $\Pr(C_i \mid d, S) \propto \Pr(C_i \mid S)\,\Pr(d \mid C_i)$
How do we estimate Pr(Ci | S)?
– Apply standard Naïve Bayes to get number of documents in S that are classified into Ci
– Incorporate weight w reflecting match between two taxonomies.
Only affects classification of borderline documents.
– For w = 0, default to standard classifier.

Enhanced Naïve Bayes (2)
$\Pr(C_i \mid S) \propto |C_i| \times (\#\,\text{docs in } S \text{ predicted to be in } C_i)^w$
$\Pr(C_i \mid S) = \frac{|C_i| \times (\#\,\text{docs in } S \text{ predicted to be in } C_i)^w}{\sum_j |C_j| \times (\#\,\text{docs in } S \text{ predicted to be in } C_j)^w}$
Use tuning set to determine w.
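The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the category sizes, the predicted counts, and the per-category likelihoods Pr(d | Ci) are all made-up assumptions, and the likelihoods are simply passed in rather than computed from word counts.

```python
def category_prior_given_source(master_sizes, predicted_counts, w):
    """Pr(Ci | S), proportional to |Ci| * (# docs in S predicted to be in Ci) ** w.

    master_sizes: {category: |Ci|} for the master catalog (hypothetical data).
    predicted_counts: {category: # docs of S that standard Naive Bayes put in Ci}.
    """
    raw = {c: master_sizes[c] * (predicted_counts.get(c, 0) ** w)
           for c in master_sizes}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}


def posterior(prior, likelihood):
    """Normalize prior[c] * likelihood[c] over all categories c."""
    unnorm = {c: prior[c] * likelihood[c] for c in prior}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


# Hypothetical numbers: two master categories of equal size, and a source
# category S whose documents were mostly predicted to be DSP parts.
sizes = {"DSP": 100, "Mem": 100}
counts = {"DSP": 9, "Mem": 1}
likelihood = {"DSP": 0.4, "Mem": 0.6}  # assumed Pr(d | Ci) for a borderline doc

standard = posterior(category_prior_given_source(sizes, counts, 0), likelihood)
enhanced = posterior(category_prior_given_source(sizes, counts, 1), likelihood)
print(max(standard, key=standard.get), max(enhanced, key=enhanced.get))
```

With w = 0 the count term becomes 1 for every category, so the prior reduces to the category-size prior of the standard classifier. With w = 1 the borderline document in this toy example moves to the category that most of its siblings in S were assigned to.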
Intuition behind Algorithm

Standard Algorithm
      Computer Peripheral   Digital Camera
P1    20%                   80%
P2    40%                   60%
P3    60%                   40%

Enhanced Algorithm
      Computer Peripheral   Digital Camera
P1    15%                   85%
P2    30%                   70%
P3    45%                   55%

Electronic Parts Dataset
[Figure: Accuracy Improvement on Pangea Data. x-axis: Weight (1 to 200); y-axis: Accuracy (60 to 100); series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base]
1150 categories; 37,000 documents

Yahoo & OpenDirectory
5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software
– Typical match: 69%, 15%, 3%, 3%, 1%, ….
Merging Yahoo into OpenDirectory
– 30% fewer errors (14.1% absolute difference in accuracy)
Merging OpenDirectory into Yahoo
– 26% fewer errors (14.3% absolute difference)

Summary
New algorithm for taxonomy integration.
– Exploits affinity information in the new (source) taxonomy categorizations.
– Can do substantially better, and never does significantly worse than standard Naïve Bayes.
Open Problems: SVM, Decision Tree, ...

Talk Outline
Taxonomy Integration
Searching with Numbers (WWW 2002, with R. Agrawal)
Privacy-Preserving Data Mining

Motivation
A large fraction of the useful web consists of specification documents.
– <attribute name, value> pairs embedded in text.
Examples:
– Data sheets for electronic parts.
– Classified ads.
– Product catalogs.

Search Engines treat Numbers as Strings
Search for 6798.32 (lunar nutation cycle)
– Returns 2 pages on Google
– However, search for 6798.320 yielded no page on Google (and all other search engines)
Current search technology is inadequate for retrieving specification documents.

Data Extraction is hard
Synonyms for attribute names and units.
– "lb" and "pounds", but no "lbs" or "pound".
Attribute names are often missing.
– No "Speed", just "MHz Pentium III"
– No "Memory", just "MB SDRAM"
Sample specification text:
• 850 MHz Intel Pentium III
• 192 MB RAM
• 15 GB Hard Disk
• DVD Recorder: Included;
• Windows Me
• 14.1 inch display
• 8.0 pounds

Searching with Numbers
[Figure: documents such as "IBM ThinkPad 750 MHz Pentium 3, 196 MB DRAM, …" and "Dell Computer 700 MHz Celeron, 256 MB SDRAM, …" are loaded into a Database; number-only queries such as "800 200" and "800 200 3 lb" return IBM ThinkPad (750 MHz, 196 MB) … and Dell (700 MHz, 256 MB)]

Reflectivity
If we get a close match on numbers, how likely is it that we have correctly matched attribute names?
– Likelihood ~ Non-reflectivity (of data)
Non-overlapping attributes => Non-reflective.
– Memory: 64 - 512 Mb, Disk: 10 - 40 Gb
Correlations or Clustering => Low reflectivity.
– Memory: 64 - 512 Mb, Disk: 10 - 100 Gb

Reflectivity: Examples
[Figure: three scatter plots titled Non-Reflective, High Reflectivity, and Low Reflectivity; both axes run from 0 to 50]

Reflectivity: Definition
Let
– D: dataset
– $n_i$: co-ordinates of point $x_i$
– reflections($x_i$): permutations of $n_i$
– $\eta(n_i)$: # of points within distance r of $n_i$
– $\rho(n_i)$: # of reflections within distance r of $n_i$
$\text{Non-Reflectivity} = \frac{1}{|D|} \sum_{x_i \in D} \frac{\eta(n_i)}{\rho(n_i)}$

Algorithm
How to compute match score (rank) of a document for a given query?
How to limit the number of documents for which the match score is computed?

Match Score of a Document
Select k numbers from D yielding minimum distance between Q and D.
Relative distance for each term:
$f(q_i, n_j) = \frac{|q_i - n_j|}{|q_i + \epsilon|}$
Euclidean distance ($L_p$ norm) to combine term distances:
$F(Q, D) = \left( \sum_{i=1}^{k} f(q_i, n_{j_i})^p \right)^{1/p}$

Bipartite Graph Matching
Map problem to Bipartite Graph Matching
– k source nodes: corr. to query numbers
– m target nodes: corr. to document numbers
– An edge from each source to k nearest targets. Assign weight $f(q_i, n_j)^p$ to the edge $(q_i, n_j)$.
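A minimal Python sketch of the match score defined above. It is an assumption-laden illustration: instead of bipartite graph matching it simply tries every ordered choice of k distinct document numbers, which is only practical for tiny inputs. The function names, the value of eps, and the sample numbers are hypothetical.

```python
from itertools import permutations
import math


def term_distance(q, n, eps=1e-9):
    """Relative distance f(q, n) = |q - n| / |q + eps|; eps is an assumed guard."""
    return abs(q - n) / abs(q + eps)


def match_score(query, doc_numbers, p=2.0):
    """F(Q, D): smallest L_p-combined distance over all ways of assigning
    each query number to a distinct document number (brute force)."""
    k = len(query)
    if len(doc_numbers) < k:
        return math.inf  # not enough numbers in the document to match
    best = math.inf
    for chosen in permutations(doc_numbers, k):
        total = sum(term_distance(q, n) ** p for q, n in zip(query, chosen))
        best = min(best, total ** (1.0 / p))
    return best


# Hypothetical query with two numbers and a document containing three.
score = match_score([20, 60], [10, 25, 75])
print(round(score, 4))
```

In this toy case the best assignment pairs 20 with 25 and 60 with 75, each at relative distance 0.25. A matching algorithm on the bipartite graph finds the same minimum without enumerating every assignment.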
[Figure: bipartite graph. Document numbers: 10, 25, 75. Query numbers: 20, 60. Edge weights: .5, .25, .58, .25]

Limiting the Set of Documents
Similar to the score aggregation problem [Fagin, PODS 96]
Proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01]

Limiting the set of documents
Sorted lists (number/distance: documents):
Query term 20     25/.25: D1 D5 D7 D9     10/.5: D2 D3            35/.75: D4 D6 D8
Query term 60     66/.1: D2 D7            75/.25: D1 D3 D4 D5     25/.58: D6 D8 D9
k conceptual sorted lists, one for each query term
Do round robin access to the lists. For each document found, compute its distance F(D,Q)
Let $n_i$ := number last looked at for query term $q_i$
Let $\tau := \left( \sum_{i=1}^{k} f(q_i, n_i)^p \right)^{1/p}$
Halt when t documents found whose distance <= $\tau$
$\tau$ is lower bound on distance of unseen documents

Empirical Results
[Figure: Precision (0 to 100) versus Query Size (1 to 5) for datasets Trans, Wine, Auto, DRAM, Credit, LCD, Glass, Proc, Housing]

Empirical Results (2)
Screen Shot

Incorporating Hints
Use simple data extraction techniques to get hints, e.g.
• 256 MB SDRAM memory
  Unit Hint: MB
  Attribute Hint: SDRAM, memory
Names/Units in query matched against Hints.

Summary
Allows querying using only numbers or numbers + hints.
Data can come from raw text (e.g. product descriptions) or databases.
End run around data extraction.
– Use simple extractor to generate hints.
Open Problems: integration with keyword search.

Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
– Motivation
– Classification
– Associations

Growing Privacy Concerns
Popular Press:
– Economist: The End of Privacy (May 99)
– Time: The Death of Privacy (Aug 97)
Govt. legislation:
– European directive on privacy protection (Oct 98)
– Canadian Personal Information Protection Act (Jan 2001)
Special issue on internet privacy, CACM, Feb 99
S.
Garfinkel, "Database Nation: The Death of Privacy in 21st Century", O'Reilly, Jan 2000

Privacy Concerns (2)
Surveys of web users
– 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99)
– 82% said having privacy policy would matter (Freebies & Privacy: What net users think, July 99)

Technical Question
Fear:
– "Join" (record overlay) was the original sin.
– Data mining: new, powerful adversary?
The primary task in data mining: development of models about aggregated data.
Can we develop accurate models without access to precise information in individual data records?

Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
– Motivation
– Private Information Retrieval
– Classification (SIGMOD 2000, with R. Agrawal)
– Associations

Web Demographics
Volvo S40 website targets people in 20s
– Are visitors in their 20s or 40s?
– Which demographic groups like/dislike the website?

Solution Overview
[Figure: each user's record (e.g. 30 | 70K | ..., 50 | 40K | ...) passes through a Randomizer (e.g. 65 | 20K | ..., 25 | 60K | ...); the distributions of Age and Salary are reconstructed and passed to Data Mining Algorithms, which produce the Model]

Reconstruction Problem
Original values x1, x2, ..., xn
– from probability distribution X (unknown)
To hide these values, we use y1, y2, ..., yn
– from probability distribution Y
Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
Estimate the probability distribution of X.
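A minimal Python sketch of the reconstruction problem stated above, restricted to a small grid of candidate values and solved by repeatedly applying Bayes' rule from a uniform starting guess. Everything specific here is an assumption for illustration: the Gaussian noise and its standard deviation, the grid, the synthetic ages, the fixed iteration count, and the function names.

```python
import math
import random


def gaussian_pdf(z, sigma):
    """Density of the (assumed) Gaussian noise distribution Y at z."""
    return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def reconstruct(randomized, grid, noise_pdf, iterations=50):
    """Estimate Pr(X = a) for each a in grid from the values x_i + y_i."""
    n = len(randomized)
    # Precompute f_Y(w_i - a) for every observation w_i and grid value a.
    kernel = [[noise_pdf(w - a) for a in grid] for w in randomized]
    f_x = [1.0 / len(grid)] * len(grid)  # start from the uniform distribution
    for _ in range(iterations):
        updated = [0.0] * len(grid)
        for row in kernel:
            # Bayes' rule: posterior over grid values for this one observation.
            weights = [k * f for k, f in zip(row, f_x)]
            z = sum(weights)
            for idx, wt in enumerate(weights):
                updated[idx] += wt / (z * n)  # average posterior over all points
        f_x = updated
    return f_x


random.seed(0)
sigma = 10.0
true_ages = [30 if random.random() < 0.7 else 60 for _ in range(500)]  # synthetic
randomized = [x + random.gauss(0.0, sigma) for x in true_ages]
grid = [20, 30, 40, 50, 60, 70]
estimate = reconstruct(randomized, grid, lambda z: gaussian_pdf(z, sigma))
for age, prob in zip(grid, estimate):
    print(age, round(prob, 3))
```

Only the randomized values and the noise density enter `reconstruct`; the true ages are used solely to generate the synthetic input.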
Intuition (Reconstruct single point)
Use Bayes' rule for density functions
[Figure: Age axis from 10 to 90 with the randomized value V marked; curves: original distribution for Age, and probabilistic estimate of original value of V]

Reconstructing the Distribution
Combine estimates of where point came from for all the points:
– Gives estimate of original distribution.
[Figure: estimated distribution over the Age axis from 10 to 90]

Reconstruction: Bootstrapping
$f_X^0$ := Uniform distribution
j := 0 // Iteration number
repeat
– $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^j(z)\, dz}$ (Bayes' rule)
– j := j+1
until (stopping criterion met)
Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.

Seems to work well!
[Figure: Number of People (0 to 1200) versus Age (20 to 60); series: Original, Randomized, Reconstructed]

Recap: Why is privacy preserved?
Cannot reconstruct individual values accurately.
Can only reconstruct distributions.

Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
– Motivation
– Private Information Retrieval
– Classification
– Associations (KDD 2002, with A. Evfimievski, R. Agrawal & J. Gehrke)

Association Rules
Given:
– a set of transactions
– each transaction is a set of items
Association Rule: 30% of transactions that contain Book1 and Book5 also contain Book20; 5% of transactions contain these items.
– 30% : confidence of the rule.
– 5% : support of the rule.
Find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
Can be used to generate recommendations.
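The support and confidence definitions above can be checked on a toy dataset. This Python sketch uses made-up transactions, built so that the rule {Book1, Book5} => Book20 has 5% support and 30% confidence; the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# 60 made-up transactions: 3 contain all three books, 7 contain only
# Book1 and Book5, and the remaining 50 contain an unrelated book.
transactions = (
    [{"Book1", "Book5", "Book20"}] * 3
    + [{"Book1", "Book5"}] * 7
    + [{"Book7"}] * 50
)

rule_support = support(transactions, {"Book1", "Book5", "Book20"})
rule_confidence = confidence(transactions, {"Book1", "Book5"}, {"Book20"})
print(rule_support, rule_confidence)
```

A rule miner would keep this rule only if both values meet the user-specified minimum support and minimum confidence.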
Recommendations Overview
[Figure: Alice (Book 1, Book 11, Book 21; Book 1, Book 7, Book 21) and Bob (Book 5, Book 25; Book 3, Book 25) send transactions to the Recommendation Service, which performs Support Recovery, finds Associations, and returns Recommendations]

Private Information Retrieval
Retrieve 1 of n documents from a digital library without the library knowing which document was retrieved.
Trivial solution: Download entire library.
Can you do better?
– Yes, with multiple servers.
– Yes, with single server & computational privacy.
Problem introduced in [Chor et al, FOCS 95]

Uniform Randomization
Given a transaction,
– keep item with 20% probability,
– replace with a new random item with 80% probability.
Appears to give around 80% privacy…
– 80% chance that an item in the randomized transaction was not in the original transaction.

Privacy Breach
Example: 10 M transactions of size 3 with 1000 items:
– 100,000 (1%) have {x, y, z}: all three items are kept with probability $0.2^3 = 0.008$, giving 800 transactions (99.99% of the randomized transactions that contain {x, y, z})
– 9,900,000 (99%) have zero items from {x, y, z}: all three items appear with probability $6 \times (0.8/1000)^3 \approx 3 \times 10^{-9}$, giving 0.03 transactions (<< 1), i.e. 0.01%
80% privacy “on average,” but not for all items!

Solution
“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.”
G.K. Chesterton
Insert many false items into each transaction.
Hide true itemsets among false ones.
No free lunch: Need more transactions to discover associations.

Related Work
S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule Mining”, VLDB 2002.
Protecting privacy across databases:
– Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”, Crypto 2000.
– J. Vaidya and C.W. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data”, KDD 2002.

Summary
Have your cake and mine it too!
– Preserve privacy at the individual level, but still build accurate models.
– Can do both classification & association rules.
Open Problems: Clustering, Lower bounds on discoverability versus privacy, Faster algorithms, …

Slides available from ...
www.almaden.ibm.com/cs/people/srikant/talks.html

Backup

Lowest Discoverable Support
LDS is s.t., when predicted, is $4\sigma$ away from zero.
Roughly, LDS is proportional to $1/\sqrt{|T|}$
[Figure: LDS vs. number of transactions; |t| = 5, = 50%. x-axis: Number of transactions, millions (1 to 100); y-axis: LDS, % (0 to 1.2); series: 1-itemsets, 2-itemsets, 3-itemsets]

LDS vs. Breach Level
[Figure: x-axis: Privacy Breach Level, % (30 to 90); y-axis: LDS, % (0 to 2.5); |t| = 5, |T| = 5 M]

Basic 2-server Scheme
Each server returns XOR of green bits.
Client XORs bits returned by server.
Communication complexity: O(n)
[Figure: database bits numbered 1 to 8]

Sqrt(n) Algorithm
[Figure: database bits numbered 1 to 8, arranged in blocks]
Each server returns bitwise XOR of specified blocks.
Client XORs the 2 blocks & selects desired bits.
Each block has sqrt(n) elements => 4*sqrt(n) communication complexity.
Server computation time still O(n)

Computationally Private IR
Use pseudo-random function + mask to generate sets.
Quadratic residuosity.
Difficulty of deciding whether a small prime divides $\varphi(m)$
– m: composite integer of unknown factorization
– $\varphi(m)$: Euler totient fn, i.e., # of positive integers <= m that are relatively prime to m.

Extensions
Retrieve documents (blocks), not bits.
– If n <= l, comm. complexity 4l.
– If n <= $l^2/4$, comm. complexity 8l.
Lower communication complexity.
Select documents using keywords.
Protect data privacy.
Preprocessing to reduce computation time.
Computationally-private information retrieval with single server.

Potential Privacy Breaches
Distribution is a spike.
– Example: Everyone is of age 40.
Some randomized values are only possible from a given range.
– Example: Add U[-50,+50] to age and get 125 => True age is >= 75.
– Not an issue with Gaussian.

Potential Privacy Breaches (2)
Most randomized values in a given interval come from a given interval.
– Example: 60% of the people whose randomized value is in [120,130] have their true age in [70,80].
– Implication: Higher levels of randomization will be required.
Correlations can make previous effect worse.
– Example: 80% of the people whose randomized value of age is in [120,130] and whose randomized value of income is [...] have their true age in [70,80].

Work in Statistical Databases
Provide statistical information without compromising sensitive information about individuals (surveys: AW89, Sho82)
Techniques
– Query Restriction
– Data Perturbation
Negative Results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89]

Statistical Databases: Techniques
Query Restriction
– restrict the size of query result (e.g. FEL72, DDS79)
– control overlap among successive queries (e.g. DJL79)
– suppress small data cells (e.g. CO82)
Output Perturbation
– sample result of query (e.g. Den80)
– add noise to query result (e.g. Bec80)
Data Perturbation
– replace db with sample (e.g. LST83, LCL85, Rei84)
– swap values between records (e.g. Den82)
– add noise to values (e.g. TYW84, War65)

Statistical Databases: Comparison
We do not assume original data is aggregated into a single database.
Concept of reconstructing original distribution.
– Adding noise to data values problematic without such reconstruction.