Rules of Thumb for Information Acquisition from Large and Redundant Data (presentation)

advertisement
Version April 21, 2011
Rules of Thumb
for Information Acquisition
from Large and Redundant Data
Wolfgang Gatterbauer
33rd European Conference on Information Retrieval (ECIR'11)
Database group
University of Washington
http://UniqueRecall.com
Information Acquisition from Redundant Data
Pareto principle (80-20 rule)
e.g. business
e.g. software
e.g. health care
Information acquisition
e.g. web harvesting
e.g. words in a corpus
e.g. used first names
20% causes
clients
bugs
patients
80% effect
sales
errors
HC resources
? 20% data
(instances)
web data
all words
individual names
? 80% information
(concepts)
web information
different words
different names
Motivating question:
Can we learn 80% of the information,
by looking at only 20% of the data?
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
Information Acquisition from Redundant Data
Information
Dissemination
Au
"Unique"
information
Information Acquisition
Information Retrieval,
Information
Information Extraction
Integration
A
Available
Data
Redundancy
distribution k
B
Retrieved and
extracted data
Recall r
Bu
Acquired
information
Expected sample
distribution k
Expected unique recall ru
Three assumptions
• no disambiguity in data
• random sampling w/o replacement
• very large data sets
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
3
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
4
A Simple Balls-and-Urn Sampling Model
frequency of i-th most often appearing information (color)
Redundancy ki
Redundancy
ki
6
6
Data
(# balls: a=15)
5
5
4
4
3
3
2
2
1
1
1
2
3
4
5
1
Information i
(# colors: au=5)
Redundancy
Distribution k
k=(6,3,3,2,1)
Wolfgang Gatterbauer
Sampled Data
(# balls: b=3)
2
3
4
5
Sampled Information i
(# Colors: bu=2)
Recall r
r=3/15=0.2
Unique recall ru=2/5=0.4
Sample Redundancy
distribution k=(2,1)
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
5
A model for sampling in the Limit of large data sets
6
5
4
3
2
1
a=15
a3=0.4
1 2 3 4 5
k=(6,3,3,2,1)
k2-3=3
6
5
4
3
2
1
a=30
a3=0.4
1 2 3 4 5 6 7 8 9 10
k=(6,6,3,3,3,3,2,2,1,1)
k3-6=3
vertical perspective
Wolfgang Gatterbauer
6
5
4
3
2
1
a∞
a3=0.4
1 2 3 4 5
a=(0.2,0.2,0.4,0,0,0.2)
...
Rules of Thumb for Information Acquisition from Redundant Data
horizontal perspective
http://uniqueRecall.com
6
The Intuition for constant redundancy k
lim
a∞
r=0.5
Unique recall
k=const=3
3
 Indep. sampling with p=r
2
Expected sample distribution
2
1
ru 1
0
2=(3 choose 2)0.52(1-0.5)1=0.375
 Binomial distribution
ru = 1-(1-0.5)3=0.875
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
7
The Intuition for arbitrary redundancy distributions
Stratified sampling
Redundancy k
a6= 0.2
lim
a∞
6
r=0.5
5
4
a3=0.4
3
a2= 0.2
2 2
a1= 0.2
1
0
0.2
ru= a6[ 1-(1-r)6 ] + a3[ 1-(1-r)3 ]
Wolfgang Gatterbauer
0.6
0.8
1
+ a3[ 1-(1-r)2 ] + a1[ 1-(1-r)1 ]
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
8
A horizontal Perspective for Sampling
r=0.5
6
Expected sample
redundancy k1=3
5
4
3
2
1
0
au=5
6
6
5
5
4
4
3
3
2
2
1
1
0
0
0.2
Wolfgang Gatterbauer
0.4
0.6
0.8
1
au=100
0
0
0.2
0.4
0.6
0.8
1
au=20
Horizontal layer of
redundancy 1=0.8
0
0.2
0.4
Rules of Thumb for Information Acquisition from Redundant Data
0.6
0.8
bu=800
1
au=1000
http://uniqueRecall.com
9
Unique Recall for arbitrary redundancy distributions
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
10
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
11
Three formulations of Power law distributions
redundancy ki
redundancy
frequency ak
Zipf-Mandelbrot
Power-law probability
Zipf [1932]
Stumpf et al. [PNAS’05]
complementary
cumulative
frequency k
Pareto
Mitzenmacher [IM’04]
Adamic [TR’00]
Commonly assumed to be different representations of the same distribution
Mitzenmacher [Internet Math.’04]
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
12
Unique Recall with Power laws
log-log plot
Wolfgang Gatterbauer
For =1
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
13
Unique Recall with Power laws
For =1
Rule of Thumb 1: When sampling 20% of data from a Power-law
distribution, we expect to learn less than 40% of the information
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
14
Invariants under Sampling
Given our model: Which redundancy distribution
remains invariant under sampling?
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
15
Invariant is a power-law! Hence, the tail of all
power laws remains invariant under sampling
ruk=k/k
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
16
Also, the power law tail breaks in
Rule of Thumb 2: When sampling data from a Power-law
then the core of the sample distribution follows
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
17
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
18
Sampling from Real Data
Incoming links for one domain
Tag distribution on delicious.com
Rule of Thumb 2:
works, but only for small area
Rule of Thumb 1:
not good!
Rule of Thumb 2:
works!
Rule of Thumb 1:
perfect!
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
19
Theory on Sampling from Power-Laws
Stumpf et al. [PNAS’05]
"Here, we show that random
subnets sampled from scalefree networks are not
themselves scale-free."
Wolfgang Gatterbauer
This paper
"Here, we show that there is one powerlaw family that is invariant under sampling,
and the core of other power-law function
remains invariant under sampling too.
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
Some other related work
Population sampling
• goal: estimate size of population
• e.g. mark-recapture animal sampling
• sampling of small fraction w/ replacement
Downey et al. [IJCAI’05]
• urn model to estimate the probability
that extracted information is correct.
• random sampling with replacement
Ipeirotis et al. [Sigmod’06]
• decision framework to search or to crawl
• random sampling without replacement
with known population sizes
Bar-Yossef, Gurevich
[WWW’06]
• biased methods to sample from search
engine's index to estimate index size
Stumpf et al. [PNAS’05]
• show that random subnets sampled from
scale-free networks are not scale-free
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
Summary 1/2
Inf. Dissemination
• A simple model of
information acquisition
from redundant data
- no disambiguity
- random sampling
w/o replacement
Au
Information
Inf. Acquisition
IR & IE
Inf. Integration
A
B
Bu
Available
Retrieved
Acquired
Data
Data
information
Redundancy
Sample
Recall r
distribution k
distribution k
Unique recall ru
ru(r)
• Full analytic
solution
Normalized
Redundancy
distribution a
- large data
ruk(r)
ruk=k/k
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
22
Summary 2/2
3 different power-laws
Unique recall for power-laws
Invariant distribution
Sampling from power-laws
• Rule of thumb 1:
- 80/20  40/20
- sensitive to exact
power-law root
• Rule of thumb 2:
- power-law core
remains invariant
ruk  r
http://uniqueRecall.com
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
23
backup
24
Geometric interpretation of k(, k, r)
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
BACKUP
http://uniqueRecall.com
25
Information Acquisition from Redundant Data
3 pieces of data, containing 2 pieces of (“unique”) information*
Data
Information
(instances)
(concepts)
e.g. used first names in a group
individual names / different names
e.g. words of a corpus:
word appearances / vocabulary
e.g. web harvesting:
web data / web information
Motivating question:
Can we learn 80% of the information,
by looking at only 20% of the data?
*data interpreted as redundant representation of information Capurro, Hjørland [ARIST’03]
Wolfgang Gatterbauer
Rules of Thumb for Information Acquisition from Redundant Data
http://uniqueRecall.com
26
Download