Data Mining Technologies
for Digital Libraries
& Web Information Systems
Ramakrishnan Srikant
Talk Outline
• Taxonomy Integration (WWW 2001, with R. Agrawal)
• Searching with Numbers
• Privacy-Preserving Data Mining
Taxonomy Integration
• B2B electronics portal: 2000 categories, 200K datasheets
[Figure: Master Catalog — ICs with subcategories DSP (a, b), Mem. (c), and Logic (d, e, f); New Catalog — ICs with subcategories Cat1 (x, y) and Cat2 (z, w)]
Taxonomy Integration (2)
• After integration:
[Figure: ICs with subcategories DSP (a, b, x), Mem. (y, c), and Logic (d, e, f, z, w)]
Goal
• Use affinity information in the new catalog.
– Products in the same category are similar.
• The accuracy boost depends on the match between the two categorizations.
Problem Statement
• Given
– master categorization M: categories C1, C2, …, Cn, with a set of documents in each category
– new categorization N: categories S1, S2, …, Sn, with a set of documents in each category
• Find the category in M for each document in N
– Standard Alg: Estimate Pr(Ci | d)
– Enhanced Alg: Estimate Pr(Ci | d, S)
Naive Bayes Classifier
• Estimate the probability of document d belonging to class Ci:

    Pr(Ci | d) = Pr(Ci) Pr(d | Ci) / Pr(d)

• Where:

    Pr(Ci) = (number of documents in Ci) / (total number of documents)

    Pr(d | Ci) = ∏_{t ∈ d} Pr(t | Ci)

    Pr(t | Ci) = (# of occurrences of t in Ci) / (total words in Ci)
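The estimates above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's code: the function names and the add-one smoothing (used to avoid zero probabilities for unseen words) are my own additions.

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class: {class: [list of token lists]} -> model tuple."""
    prior, word_counts, totals = {}, {}, {}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    for c, docs in docs_by_class.items():
        prior[c] = len(docs) / n_docs                    # Pr(Ci)
        word_counts[c] = Counter(t for d in docs for t in d)
        totals[c] = sum(word_counts[c].values())         # total words in Ci
    return prior, word_counts, totals, len(vocab)

def classify(model, doc):
    prior, word_counts, totals, v = model
    scores = {}
    for c in prior:
        # log Pr(Ci) + sum over terms of log Pr(t | Ci), with add-one smoothing
        s = math.log(prior[c])
        for t in doc:
            s += math.log((word_counts[c][t] + 1) / (totals[c] + v))
        scores[c] = s
    return max(scores, key=scores.get)
```

Working in log space avoids underflow when multiplying many small term probabilities.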
Enhanced Naïve Bayes
• Standard: Pr(Ci | d) ∝ Pr(Ci) Pr(d | Ci)
• Enhanced: Pr(Ci | d, S) ∝ Pr(Ci | S) Pr(d | Ci)
• How do we estimate Pr(Ci | S)?
– Apply standard Naïve Bayes to get the number of documents in S that are classified into Ci.
– Incorporate a weight w reflecting the match between the two taxonomies.
• Only affects the classification of borderline documents.
– For w = 0, defaults to the standard classifier.
Enhanced Naïve Bayes (2)

    Pr(Ci | S) = |Ci| · (# docs in S predicted to be in Ci)^w
                 / Σ_j [ |Cj| · (# docs in S predicted to be in Cj)^w ]

• Use a tuning set to determine w.
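The weighted prior can be sketched directly from the formula above (function and parameter names are mine). Note that w = 0 makes every count term equal 1, so the prior reduces to one proportional to |Ci|, matching the "defaults to the standard classifier" remark.

```python
def enhanced_prior(class_sizes, predicted_in_s, w):
    """Pr(Ci | S) proportional to |Ci| * (# docs in S predicted in Ci)^w.

    class_sizes:    {Ci: |Ci|}
    predicted_in_s: {Ci: count} from running the standard classifier
                    over the documents of new category S.
    """
    raw = {c: class_sizes[c] * (predicted_in_s.get(c, 0) ** w)
           for c in class_sizes}
    z = sum(raw.values())                 # normalize over all classes j
    return {c: v / z for c, v in raw.items()}
```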
Intuition behind Algorithm

Standard Algorithm:
            Computer Peripheral   Digital Camera
    P1      20%                   80%
    P2      40%                   60%
    P3      60%                   40%

Enhanced Algorithm:
            Computer Peripheral   Digital Camera
    P1      15%                   85%
    P2      30%                   70%
    P3      45%                   55%
Electronic Parts Dataset
[Chart: Accuracy Improvement on Pangea Data — accuracy (60–100%) vs. weight (1–200) for the Perfect, 90-10, 80-20, GaussianA, GaussianB, and Base category matches]
• 1150 categories; 37,000 documents
Yahoo & OpenDirectory
• 5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software
– Typical match: 69%, 15%, 3%, 3%, 1%, …
• Merging Yahoo into OpenDirectory
– 30% fewer errors (14.1% absolute difference in accuracy)
• Merging OpenDirectory into Yahoo
– 26% fewer errors (14.3% absolute difference)
Summary
• New algorithm for taxonomy integration.
– Exploits the affinity information in the new (source) taxonomy's categorization.
– Can do substantially better than standard Naïve Bayes, and never does significantly worse.
• Open Problems: SVM, Decision Tree, ...
Talk Outline
• Taxonomy Integration
• Searching with Numbers (WWW 2002, with R. Agrawal)
• Privacy-Preserving Data Mining
Motivation
• A large fraction of the useful web consists of specification documents.
– <attribute name, value> pairs embedded in text.
• Examples:
– Data sheets for electronic parts.
– Classified ads.
– Product catalogs.
Search Engines treat Numbers as Strings
• Search for 6798.32 (lunar nutation cycle)
– Returns 2 pages on Google.
– However, a search for 6798.320 yielded no page on Google (or any other search engine).
• Current search technology is inadequate for retrieving specification documents.
Data Extraction is hard
• Synonyms for attribute names and units.
– "lb" and "pounds", but no "lbs" or "pound".
• Attribute names are often missing.
– No "Speed", just "MHz Pentium III".
– No "Memory", just "MB SDRAM".

Example datasheet:
• 850 MHz Intel Pentium III
• 192 MB RAM
• 15 GB Hard Disk
• DVD Recorder: Included
• Windows Me
• 14.1 inch display
• 8.0 pounds
Searching with Numbers
[Figure: the query "800 200" (or "800 200 3 lb") is run against a database of specification documents — "IBM ThinkPad: 750 MHz Pentium 3, 196 MB DRAM, …" and "Dell Computer: 700 MHz Celeron, 256 MB SDRAM, …" — and returns IBM ThinkPad (750 MHz, 196 MB) first, then Dell (700 MHz, 256 MB)]
Reflectivity
• If we get a close match on numbers, how likely is it that we have correctly matched attribute names?
– Likelihood ∝ non-reflectivity (of the data).
• Non-overlapping attributes ⇒ non-reflective.
– Memory: 64–512 MB, Disk: 10–40 GB
• Correlations or clustering ⇒ low reflectivity.
– Memory: 64–512 MB, Disk: 10–100 GB
Reflectivity: Examples
[Figure: three scatter plots on 0–50 × 0–50 axes, illustrating non-reflective data, high reflectivity, and low reflectivity]
Reflectivity: Definition
• Let
– D: dataset
– ni: co-ordinates of point xi
– reflections(xi): permutations of ni
– ρ(ni): # of points within distance r of ni
– ρ̂(ni): # of reflections within distance r of ni

    Non-reflectivity = (1/|D|) · Σ_{xi ∈ D} ρ(ni) / ρ̂(ni)
Algorithm
• How to compute the match score (rank) of a document for a given query?
• How to limit the number of documents for which the match score is computed?
Match Score of a Document
• Select k numbers from D yielding the minimum distance between Q and D.
• Relative distance for each term:

    f(qi, nj) = |qi − nj| / (|qi| + ε)

• Euclidean distance (Lp norm) to combine term distances:

    F(Q, D) = ( Σ_{i=1}^{k} f(qi, n_ji)^p )^{1/p}
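The score F(Q, D) can be sketched by brute force over all assignments of the k query numbers to distinct document numbers, which is fine for small k. This is an illustrative sketch, not the talk's algorithm: the function names and the ε guard against qi = 0 are my own assumptions.

```python
from itertools import permutations

def term_dist(q, n, eps=1e-9):
    # relative distance f(q, n); the eps guard against q = 0 is an assumption
    return abs(q - n) / (abs(q) + eps)

def match_score(query, doc_numbers, p=2):
    """F(Q, D): minimum, over assignments of the query numbers to
    distinct document numbers, of the Lp-combined term distances."""
    best = float("inf")
    for perm in permutations(doc_numbers, len(query)):
        s = sum(term_dist(q, n) ** p for q, n in zip(query, perm))
        best = min(best, s ** (1 / p))
    return best
```

For the slide's example, the query (800, 200) scores the ThinkPad-like document (750, 196, 15) much closer than the Dell-like one (700, 256).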
Bipartite Graph Matching
• Map the problem to bipartite graph matching:
– k source nodes, corresponding to query numbers.
– m target nodes, corresponding to document numbers.
– An edge from each source to its k nearest targets.
– Assign weight f(qi, nj)^p to the edge (qi, nj).
[Figure: query numbers 25 and 60 matched against document numbers 10, 20, and 75, with edge weights .5, .25, .58, and .25]
Limiting the Set of Documents
• Similar to the score aggregation problem [Fagin, PODS 96].
• The proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01].
Limiting the set of documents
[Figure: query terms 60 and 20, each with a sorted list of document numbers and their distances — e.g. 66/.1: D2 D7; 25/.25: D1 D5 D7 D9; 10/.5: D2 D3; 75/.25: D1 D3 D4 D5; 35/.75: D4 D6 D8; 25/.58: D6 D8 D9]
• k conceptual sorted lists, one for each query term.
• Do round-robin access to the lists. For each document found, compute its distance F(D, Q).
• Let ni := the number last looked at for query term qi.
• Let

    τ := ( Σ_{i=1}^{k} f(qi, ni)^p )^{1/p}

• Halt when t documents are found whose distance ≤ τ.
• τ is a lower bound on the distance of unseen documents.
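The round-robin threshold loop can be sketched as follows. This is a simplified illustration of the TA-style halt, not the paper's algorithm: it assumes every document appears in every (non-empty) list, and all names are mine.

```python
def threshold_search(lists, t, p=2):
    """lists[i]: (term distance, doc id) pairs sorted by distance, one
    list per query term. Returns up to t docs, halting once t docs have
    full distance <= tau, the Lp combination of the last distance seen
    in each list (a lower bound on the distance of unseen docs)."""
    k = len(lists)
    pos = [0] * k
    seen = {}                     # doc id -> {list index: distance}
    scores = {}                   # doc id -> full distance F(D, Q)
    while True:
        progressed = False
        for i in range(k):        # round-robin: one step down each list
            if pos[i] < len(lists[i]):
                d, doc = lists[i][pos[i]]
                pos[i] += 1
                seen.setdefault(doc, {})[i] = d
                progressed = True
        for doc, dists in seen.items():
            if len(dists) == k and doc not in scores:
                scores[doc] = sum(v ** p for v in dists.values()) ** (1 / p)
        tau = sum(lists[i][pos[i] - 1][0] ** p for i in range(k)) ** (1 / p)
        found = sorted((s, doc) for doc, s in scores.items() if s <= tau)
        if len(found) >= t or not progressed:
            return [doc for _, doc in found[:t]]
```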
Empirical Results
[Chart: precision (0–100%) vs. query size (1–5) for the Trans, Wine, Auto, DRAM, Credit, LCD, Glass, Proc, and Housing datasets]
Empirical Results (2)
• Screen shot.
Incorporating Hints
• Use simple data extraction techniques to get hints, e.g. for "256 MB SDRAM memory":
– Unit hint: MB
– Attribute hint: SDRAM, memory
• Names/units in the query are matched against hints.
Summary
• Allows querying using only numbers, or numbers + hints.
• Data can come from raw text (e.g. product descriptions) or databases.
• End run around data extraction.
– Use a simple extractor to generate hints.
• Open Problems: integration with keyword search.
Talk Outline
• Taxonomy Integration
• Searching with Numbers
• Privacy-Preserving Data Mining
– Motivation
– Classification
– Associations
Growing Privacy Concerns
• Popular press:
– Economist: The End of Privacy (May 99)
– Time: The Death of Privacy (Aug 97)
• Govt. legislation:
– European directive on privacy protection (Oct 98)
– Canadian Personal Information Protection Act (Jan 2001)
• Special issue on internet privacy, CACM, Feb 99
• S. Garfinkel, "Database Nation: The Death of Privacy in the 21st Century", O'Reilly, Jan 2000
Privacy Concerns (2)
• Surveys of web users:
– 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99)
– 82% said having a privacy policy would matter (Freebies & Privacy: What net users think, July 99)
Technical Question
• Fear:
– "Join" (record overlay) was the original sin.
– Data mining: a new, powerful adversary?
• The primary task in data mining: development of models about aggregated data.
• Can we develop accurate models without access to precise information in individual data records?
Talk Outline
• Taxonomy Integration
• Searching with Numbers
• Privacy-Preserving Data Mining
– Motivation
– Private Information Retrieval
– Classification (SIGMOD 2000, with R. Agrawal)
– Associations
Web Demographics
• Volvo S40 website targets people in their 20s.
– Are visitors in their 20s or 40s?
– Which demographic groups like/dislike the website?
Solution Overview
[Figure: each client randomizes its record before sending it (30 | 70K | … becomes 65 | 20K | …; 50 | 40K | … becomes 25 | 60K | …); the server reconstructs the distributions of Age and Salary from the randomized records and feeds them to data mining algorithms to build a model]
Reconstruction Problem
• Original values x1, x2, ..., xn
– from probability distribution X (unknown)
• To hide these values, we use y1, y2, ..., yn
– from probability distribution Y
• Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
• Estimate the probability distribution of X.
Intuition (Reconstruct single point)
• Use Bayes' rule for density functions.
[Figure: age axis from 10 to 90 with a randomized value V — the original distribution for Age vs. the probabilistic estimate of the original value of V]
Reconstructing the Distribution
• Combine the estimates of where each point came from, over all the points:
– Gives an estimate of the original distribution.
[Figure: reconstructed distribution over the age axis from 10 to 90]
Reconstruction: Bootstrapping
• fX^0 := uniform distribution
• j := 0 // iteration number
• repeat

    fX^{j+1}(a) := (1/n) Σ_{i=1}^{n} [ fY((xi + yi) − a) · fX^j(a) ] / [ ∫ fY((xi + yi) − z) · fX^j(z) dz ]    (Bayes' rule)

– j := j + 1
• until (stopping criterion met)
• Converges to the maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
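A discretized sketch of this iteration, with the integral replaced by a sum over candidate bins (all names are mine; this illustrates the update rule, not the paper's implementation):

```python
def reconstruct(z, noise_pdf, bins, iters=50):
    """z: the observed xi + yi values; noise_pdf: the (known) density
    of Y; bins: candidate values a for X. Returns estimated {a: fX(a)}."""
    n, m = len(z), len(bins)
    fx = [1.0 / m] * m                       # fX^0 := uniform
    for _ in range(iters):
        new = [0.0] * m
        for zi in z:
            # posterior Pr(a | zi) proportional to fY(zi - a) * fX^j(a)
            post = [noise_pdf(zi - a) * fx[j] for j, a in enumerate(bins)]
            s = sum(post)
            if s > 0:
                for j in range(m):
                    new[j] += post[j] / s    # Bayes' rule, normalized
        fx = [v / n for v in new]            # average over all n points
    return dict(zip(bins, fx))
```

With true values spiked at a single age and uniform noise added, the estimate concentrates its mass back on the true bin even though no individual value can be recovered.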
Seems to work well!
[Chart: number of people (0–1200) vs. age (20–60) for the original, randomized, and reconstructed distributions]
Recap: Why is privacy preserved?
• Cannot reconstruct individual values accurately.
• Can only reconstruct distributions.
Talk Outline
• Taxonomy Integration
• Searching with Numbers
• Privacy-Preserving Data Mining
– Motivation
– Private Information Retrieval
– Classification
– Associations (KDD 2002, with A. Evfimievski, R. Agrawal & J. Gehrke)
Association Rules
• Given:
– a set of transactions,
– each transaction is a set of items.
• Association rule: 30% of transactions that contain Book1 and Book5 also contain Book20; 5% of transactions contain these items.
– 30%: confidence of the rule.
– 5%: support of the rule.
• Find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
• Can be used to generate recommendations.
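The support/confidence definitions can be sketched with a naive miner that counts all small itemsets directly (the function names are mine, and real miners such as Apriori prune the candidate space instead of enumerating it):

```python
from itertools import combinations
from collections import Counter

def mine_rules(transactions, min_support, min_confidence, max_len=3):
    """Count itemset supports, then emit rules X -> y that satisfy the
    user-specified minimum support and minimum confidence."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, max_len + 1):
            for s in combinations(items, size):
                counts[s] += 1
    rules = []
    for itemset, c in counts.items():
        if len(itemset) < 2 or c / n < min_support:
            continue
        for y in itemset:                     # rule (itemset - y) -> y
            x = tuple(i for i in itemset if i != y)
            if c / counts[x] >= min_confidence:
                rules.append((x, y, c / n, c / counts[x]))
    return rules
```

On data mirroring the slide's example (5% of transactions contain {Book1, Book5, Book20}, 15% contain {Book1, Book5}), the rule Book1, Book5 -> Book20 comes out with support 5% and confidence 1/3.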
Recommendations Overview
[Figure: Alice's transactions (Book 1, Book 11, Book 21) are randomized to (Book 1, Book 7, Book 21) before reaching the recommendation service; Bob's (Book 5, Book 25) become (Book 3, Book 25); the service recovers itemset supports, mines associations, and sends back recommendations]
Private Information Retrieval
• Retrieve 1 of n documents from a digital library without the library knowing which document was retrieved.
• Trivial solution: download the entire library.
• Can you do better?
– Yes, with multiple servers.
– Yes, with a single server & computational privacy.
• Problem introduced in [Chor et al, FOCS 95].
Uniform Randomization
• Given a transaction,
– keep each item with 20% probability,
– replace it with a new random item with 80% probability.
• Appears to give around 80% privacy…
– 80% chance that an item in the randomized transaction was not in the original transaction.
Privacy Breach Example
• 10M transactions of size 3 with 1000 items:
– 100,000 (1%) contain {x, y, z}. Each survives randomization with probability 0.2³ = 0.008, giving 800 randomized transactions that still contain {x, y, z}.
– 9,900,000 (99%) contain zero items from {x, y, z}. Each produces {x, y, z} with probability 6 · (0.8/1000)³ ≈ 3 · 10⁻⁹, giving 0.03 such transactions (<< 1).
• So a randomized transaction containing {x, y, z} came from an original containing {x, y, z} with probability 99.99% (vs. 0.01%).
• 80% privacy "on average," but not for all items!
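The numbers in this example can be checked with a few lines of arithmetic (variable names are mine):

```python
# Parameters from the example: 10M transactions of size 3, 1000 items,
# each item kept with probability 0.2, else replaced by a random item.
n_transactions = 10_000_000
p_keep, n_items = 0.2, 1000

# Transactions that contain {x, y, z}: all three items must survive.
with_xyz = 0.01 * n_transactions
survive = with_xyz * p_keep ** 3                     # about 800

# Transactions with no item from {x, y, z}: all three items must be
# replaced (prob 0.8 each) and the replacements must hit x, y, z in
# some order (3! orders, each replacement hitting 1 of 1000 items).
without_xyz = 0.99 * n_transactions
fake = without_xyz * 6 * ((1 - p_keep) / n_items) ** 3   # about 0.03

# Posterior: randomized {x, y, z} really came from an original {x, y, z}.
posterior = survive / (survive + fake)               # about 99.99%
```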
Solution
"Where does a wise man hide a leaf? In the forest.
But what does he do if there is no forest?"
"He grows a forest to hide it in."
— G.K. Chesterton
• Insert many false items into each transaction.
• Hide true itemsets among false ones.
• No free lunch: need more transactions to discover associations.
Related Work
• S. Rizvi, J. Haritsa, "Privacy-Preserving Association Rule Mining", VLDB 2002.
• Protecting privacy across databases:
– Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining", Crypto 2000.
– J. Vaidya and C.W. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data", KDD 2002.
Summary
• Have your cake and mine it too!
– Preserve privacy at the individual level, but still build accurate models.
– Can do both classification & association rules.
• Open Problems: clustering, lower bounds on discoverability versus privacy, faster algorithms, …
• Slides available from www.almaden.ibm.com/cs/people/srikant/talks.html
Backup
Lowest Discoverable Support
[Chart: LDS (0–1.2%) vs. number of transactions (1–100 million) for 1-, 2-, and 3-itemsets; |t| = 5, breach level = 50%]
• LDS is the support such that, when predicted, it is 4σ away from zero.
• Roughly, LDS is proportional to 1/√T.
LDS vs. Breach Level
[Chart: LDS (0–2.5%) vs. privacy breach level (30–90%); |t| = 5, |T| = 5 M]
Basic 2-server Scheme
• Each server returns the XOR of the green bits.
• The client XORs the bits returned by the servers.
• Communication complexity: O(n)
[Figure: a database of bits at positions 1–8, with a random subset highlighted in green]
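The 2-server scheme can be sketched as follows. This is a minimal illustration with my own names: the "green bits" of the figure are a random index set S sent to one server, while the other server gets S with position i flipped, so each set on its own looks uniformly random.

```python
import functools
import random

def xor_all(bits):
    return functools.reduce(lambda a, b: a ^ b, bits, 0)

def query_2server(db, i, rng):
    """Retrieve db[i] without either (non-colluding) server learning i."""
    n = len(db)
    s1 = {j for j in range(n) if rng.random() < 0.5}  # random subset
    s2 = s1 ^ {i}                                     # differs exactly at i
    a1 = xor_all(db[j] for j in s1)                   # server 1's reply
    a2 = xor_all(db[j] for j in s2)                   # server 2's reply
    return a1 ^ a2                # everything cancels except bit i
```

The correctness argument is one line: XORing the two replies XORs the bits in the symmetric difference of the two sets, which is exactly {i}.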
Sqrt(n) Algorithm
[Figure: the n bits arranged into blocks, positions 1–8]
• Each server returns the bitwise XOR of the specified blocks.
• The client XORs the 2 blocks & selects the desired bits.
• Each block has sqrt(n) elements => 4*sqrt(n) communication complexity.
• Server computation time is still O(n).
Computationally Private IR
• Use a pseudo-random function + mask to generate the sets.
• Quadratic residuosity.
• Difficulty of deciding whether a small prime divides φ(m):
– m: composite integer of unknown factorization
– φ(m): Euler totient fn, i.e., # of positive integers <= m that are relatively prime to m.
Extensions
• Retrieve documents (blocks), not bits.
– If n <= l, comm. complexity 4l.
– If n <= l²/4, comm. complexity 8l.
• Lower communication complexity.
• Select documents using keywords.
• Protect data privacy.
• Preprocessing to reduce computation time.
• Computationally-private information retrieval with a single server.
Potential Privacy Breaches
• Distribution is a spike.
– Example: everyone is of age 40.
• Some randomized values are only possible from a given range.
– Example: add U[−50, +50] to age and get 125 ⇒ true age is ≥ 75.
– Not an issue with Gaussian noise.
Potential Privacy Breaches (2)
• Most randomized values in a given interval come from a given interval.
– Example: 60% of the people whose randomized value is in [120, 130] have their true age in [70, 80].
– Implication: higher levels of randomization will be required.
• Correlations can make the previous effect worse.
– Example: 80% of the people whose randomized value of age is in [120, 130] and whose randomized value of income is in [...] have their true age in [70, 80].
Work in Statistical Databases
• Provide statistical information without compromising sensitive information about individuals (surveys: AW89, Sho82).
• Techniques:
– Query Restriction
– Data Perturbation
• Negative results: cannot give high-quality statistics and simultaneously prevent partial disclosure of individual information [AW89].
Statistical Databases: Techniques
• Query Restriction
– restrict the size of a query result (e.g. FEL72, DDS79)
– control overlap among successive queries (e.g. DJL79)
– suppress small data cells (e.g. CO82)
• Output Perturbation
– sample the result of a query (e.g. Den80)
– add noise to the query result (e.g. Bec80)
• Data Perturbation
– replace the db with a sample (e.g. LST83, LCL85, Rei84)
– swap values between records (e.g. Den82)
– add noise to values (e.g. TYW84, War65)
Statistical Databases: Comparison
• We do not assume the original data is aggregated into a single database.
• Concept of reconstructing the original distribution.
– Adding noise to data values is problematic without such reconstruction.