Version April 21, 2011 Rules of Thumb for Information Acquisition from Large and Redundant Data Wolfgang Gatterbauer 33rd European Conference on Information Retrieval (ECIR'11) Database group University of Washington http://UniqueRecall.com Information Acquisition from Redundant Data Pareto principle (80-20 rule) e.g. business e.g. software e.g. health care Information acquisition e.g. web harvesting e.g. words in a corpus e.g. used first names 20% causes clients bugs patients 80% effect sales errors HC resources ? 20% data (instances) web data all words individual names ? 80% information (concepts) web information different words different names Motivating question: Can we learn 80% of the information, by looking at only 20% of the data? Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Information Acquisition from Redundant Data Information Dissemination Au "Unique" information Information Acquisition Information Retrieval, Information Information Extraction Integration A Available Data Redundancy distribution k B Retrieved and extracted data Recall r Bu Acquired information Expected sample distribution k Expected unique recall ru Three assumptions • no disambiguity in data • random sampling w/o replacement • very large data sets Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 3 Outline • A horizontal sampling model • The role of power-laws • Real data & Discussion Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 4 A Simple Balls-and-Urn Sampling Model frequency of i-th most often appearing information (color) Redundancy ki Redundancy ki 6 6 Data (# balls: a=15) 5 5 4 4 3 3 2 2 1 1 1 2 3 4 5 1 Information i (# colors: au=5) Redundancy Distribution k k=(6,3,3,2,1) Wolfgang Gatterbauer Sampled Data (# balls: b=3) 2 3 4 5 Sampled Information i (# Colors: bu=2) Recall r r=3/15=0.2 Unique recall ru=2/5=0.4 Sample Redundancy distribution k=(2,1) Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 5 A model for sampling in the Limit of large data sets 6 5 4 3 2 1 a=15 a3=0.4 1 2 3 4 5 k=(6,3,3,2,1) k2-3=3 6 5 4 3 2 1 a=30 a3=0.4 1 2 3 4 5 6 7 8 9 10 k=(6,6,3,3,3,3,2,2,1,1) k3-6=3 vertical perspective Wolfgang Gatterbauer 6 5 4 3 2 1 a∞ a3=0.4 1 2 3 4 5 a=(0.2,0.2,0.4,0,0,0.2) ... Rules of Thumb for Information Acquisition from Redundant Data horizontal perspective http://uniqueRecall.com 6 The Intuition for constant redundancy k lim a∞ r=0.5 Unique recall k=const=3 3 Indep. sampling with p=r 2 Expected sample distribution 2 1 ru 1 0 2=(3 choose 2)0.52(1-0.5)1=0.375 Binomial distribution ru = 1-(1-0.5)3=0.875 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 7 The Intuition for arbitrary redundancy distributions Stratified sampling Redundancy k a6= 0.2 lim a∞ 6 r=0.5 5 4 a3=0.4 3 a2= 0.2 2 2 a1= 0.2 1 0 0.2 ru= a6[ 1-(1-r)6 ] + a3[ 1-(1-r)3 ] Wolfgang Gatterbauer 0.6 0.8 1 + a3[ 1-(1-r)2 ] + a1[ 1-(1-r)1 ] Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 8 A horizontal Perspective for Sampling r=0.5 6 Expected sample redundancy k1=3 5 4 3 2 1 0 au=5 6 6 5 5 4 4 3 3 2 2 1 1 0 0 0.2 Wolfgang Gatterbauer 0.4 0.6 0.8 1 au=100 0 0 0.2 0.4 0.6 0.8 1 au=20 Horizontal layer of redundancy 1=0.8 0 0.2 0.4 Rules of Thumb for Information Acquisition from Redundant Data 0.6 0.8 bu=800 1 au=1000 http://uniqueRecall.com 9 Unique Recall for arbitrary redundancy distributions Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 10 Outline • A horizontal sampling model • The role of power-laws • Real data & Discussion Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 11 Three formulations of Power law distributions redundancy ki redundancy frequency ak Zipf-Mandelbrot Power-law probability Zipf [1932] Stumpf et al. [PNAS’05] complementary cumulative frequency k Pareto Mitzenmacher [IM’04] Adamic [TR’00] Commonly assumed to be different representations of the same distribution Mitzenmacher [Internet Math.’04] Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 12 Unique Recall with Power laws log-log plot Wolfgang Gatterbauer For =1 Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 13 Unique Recall with Power laws For =1 Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 14 Invariants under Sampling Given our model: Which redundancy distribution remains invariant under sampling? Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 15 Invariant is a power-law! Hence, the tail of all power laws remains invariant under sampling ruk=k/k Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 16 Also, the power law tail breaks in Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 17 Outline • A horizontal sampling model • The role of power-laws • Real data & Discussion Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 18 Sampling from Real Data Incoming links for one domain Tag distribution on delicious.com Rule of Thumb 2: works, but only for small area Rule of Thumb 1: not good! Rule of Thumb 2: works! Rule of Thumb 1: perfect! Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 19 Theory on Sampling from Power-Laws Stumpf et al. [PNAS’05] "Here, we show that random subnets sampled from scalefree networks are not themselves scale-free." Wolfgang Gatterbauer This paper "Here, we show that there is one powerlaw family that is invariant under sampling, and the core of other power-law function remains invariant under sampling too. Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Some other related work Population sampling • goal: estimate size of population • e.g. mark-recapture animal sampling • sampling of small fraction w/ replacement Downey et al. [IJCAI’05] • urn model to estimate the probability that extracted information is correct. • random sampling with replacement Ipeirotis et al. [Sigmod’06] • decision framework to search or to crawl • random sampling without replacement with known population sizes Bar-Yossef, Gurevich [WWW’06] • biased methods to sample from search engine's index to estimate index size Stumpf et al. [PNAS’05] • show that random subnets sampled from scale-free networks are not scale-free Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Summary 1/2 Inf. Dissemination • A simple model of information acquisition from redundant data - no disambiguity - random sampling w/o replacement Au Information Inf. Acquisition IR & IE Inf. Integration A B Bu Available Retrieved Acquired Data Data information Redundancy Sample Recall r distribution k distribution k Unique recall ru ru(r) • Full analytic solution Normalized Redundancy distribution a - large data ruk(r) ruk=k/k Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 22 Summary 2/2 3 different power-laws Unique recall for power-laws Invariant distribution Sampling from power-laws • Rule of thumb 1: - 80/20 40/20 - sensitive to exact power-law root • Rule of thumb 2: - power-law core remains invariant ruk r http://uniqueRecall.com Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 23 backup 24 Geometric interpretation of k(, k, r) Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data BACKUP http://uniqueRecall.com 25 Information Acquisition from Redundant Data 3 pieces of data, containing 2 pieces of (“unique”) information* Data Information (instances) (concepts) e.g. used first names in a group individual names / different names e.g. words of a corpus: word appearances / vocabulary e.g. web harvesting: web data / web information Motivating question: Can we learn 80% of the information, by looking at only 20% of the data? *data interpreted as redundant representation of information Capurro, Hjørland [ARIST’03] Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 26