Rules of Thumb for Information Acquisition from Large and Redundant Data (presentation)

Version April 21, 2011 Rules of Thumb for Information Acquisition from Large and Redundant Data Wolfgang Gatterbauer 33rd European Conference on Information Retrieval (ECIR'11) Database group University of Washington http://UniqueRecall.com Information Acquisition from Redundant Data Pareto principle (80-20 rule) e.g. business e.g. software e.g. health care Information acquisition e.g. web harvesting e.g. words in a corpus e.g. used first names 20% causes clients bugs patients 80% effect sales errors HC resources ? 20% data (instances) web data all words individual names ? 80% information (concepts) web information different words different names Motivating question: Can we learn 80% of the information, by looking at only 20% of the data? Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Information Acquisition from Redundant Data Information Dissemination Au "Unique" information Information Acquisition Information Retrieval, Information Information Extraction Integration A Available Data Redundancy distribution k B Retrieved and extracted data Recall r Bu Acquired information Expected sample distribution k Expected unique recall ru Three assumptions • no disambiguity in data • random sampling w/o replacement • very large data sets Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 3 Outline • A horizontal sampling model • The role of power-laws • Real data & Discussion Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 4 A Simple Balls-and-Urn Sampling Model frequency of i-th most often appearing information (color) Redundancy ki Redundancy ki 6 6 Data (# balls: a=15) 5 5 4 4 3 3 2 2 1 1 1 2 3 4 5 1 Information i (# colors: au=5) Redundancy Distribution k k=(6,3,3,2,1) Wolfgang Gatterbauer Sampled Data (# balls: b=3) 2 3 4 5 Sampled Information i (# Colors: bu=2) Recall r r=3/15=0.2 Unique recall ru=2/5=0.4 Sample Redundancy distribution k=(2,1) Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 5 A model for sampling in the Limit of large data sets 6 5 4 3 2 1 a=15 a3=0.4 1 2 3 4 5 k=(6,3,3,2,1) k2-3=3 6 5 4 3 2 1 a=30 a3=0.4 1 2 3 4 5 6 7 8 9 10 k=(6,6,3,3,3,3,2,2,1,1) k3-6=3 vertical perspective Wolfgang Gatterbauer 6 5 4 3 2 1 a∞ a3=0.4 1 2 3 4 5 a=(0.2,0.2,0.4,0,0,0.2) ... Rules of Thumb for Information Acquisition from Redundant Data horizontal perspective http://uniqueRecall.com 6 The Intuition for constant redundancy k lim a∞ r=0.5 Unique recall k=const=3 3  Indep. sampling with p=r 2 Expected sample distribution 2 1 ru 1 0 2=(3 choose 2)0.52(1-0.5)1=0.375  Binomial distribution ru = 1-(1-0.5)3=0.875 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 7 The Intuition for arbitrary redundancy distributions Stratified sampling Redundancy k a6= 0.2 lim a∞ 6 r=0.5 5 4 a3=0.4 3 a2= 0.2 2 2 a1= 0.2 1 0 0.2 ru= a6[ 1-(1-r)6 ] + a3[ 1-(1-r)3 ] Wolfgang Gatterbauer 0.6 0.8 1 + a3[ 1-(1-r)2 ] + a1[ 1-(1-r)1 ] Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 8 A horizontal Perspective for Sampling r=0.5 6 Expected sample redundancy k1=3 5 4 3 2 1 0 au=5 6 6 5 5 4 4 3 3 2 2 1 1 0 0 0.2 Wolfgang Gatterbauer 0.4 0.6 0.8 1 au=100 0 0 0.2 0.4 0.6 0.8 1 au=20 Horizontal layer of redundancy 1=0.8 0 0.2 0.4 Rules of Thumb for Information Acquisition from Redundant Data 0.6 0.8 bu=800 1 au=1000 http://uniqueRecall.com 9 Unique Recall for arbitrary redundancy distributions Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 10 Outline • A horizontal sampling model • The role of power-laws • Real data & Discussion Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 11 Three formulations of Power law distributions redundancy ki redundancy frequency ak Zipf-Mandelbrot Power-law probability Zipf [1932] Stumpf et al. [PNAS’05] complementary cumulative frequency k Pareto Mitzenmacher [IM’04] Adamic [TR’00] Commonly assumed to be different representations of the same distribution Mitzenmacher [Internet Math.’04] Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 12 Unique Recall with Power laws log-log plot Wolfgang Gatterbauer For =1 Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 13 Unique Recall with Power laws For =1 Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 14 Invariants under Sampling Given our model: Which redundancy distribution remains invariant under sampling? Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 15 Invariant is a power-law! Hence, the tail of all power laws remains invariant under sampling ruk=k/k Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 16 Also, the power law tail breaks in Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 17 Outline • A horizontal sampling model • The role of power-laws • Real data & Discussion Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 18 Sampling from Real Data Incoming links for one domain Tag distribution on delicious.com Rule of Thumb 2: works, but only for small area Rule of Thumb 1: not good! Rule of Thumb 2: works! Rule of Thumb 1: perfect! Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 19 Theory on Sampling from Power-Laws Stumpf et al. [PNAS’05] "Here, we show that random subnets sampled from scalefree networks are not themselves scale-free." Wolfgang Gatterbauer This paper "Here, we show that there is one powerlaw family that is invariant under sampling, and the core of other power-law function remains invariant under sampling too. Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Some other related work Population sampling • goal: estimate size of population • e.g. mark-recapture animal sampling • sampling of small fraction w/ replacement Downey et al. [IJCAI’05] • urn model to estimate the probability that extracted information is correct. • random sampling with replacement Ipeirotis et al. [Sigmod’06] • decision framework to search or to crawl • random sampling without replacement with known population sizes Bar-Yossef, Gurevich [WWW’06] • biased methods to sample from search engine's index to estimate index size Stumpf et al. [PNAS’05] • show that random subnets sampled from scale-free networks are not scale-free Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Summary 1/2 Inf. Dissemination • A simple model of information acquisition from redundant data - no disambiguity - random sampling w/o replacement Au Information Inf. Acquisition IR & IE Inf. Integration A B Bu Available Retrieved Acquired Data Data information Redundancy Sample Recall r distribution k distribution k Unique recall ru ru(r) • Full analytic solution Normalized Redundancy distribution a - large data ruk(r) ruk=k/k Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 22 Summary 2/2 3 different power-laws Unique recall for power-laws Invariant distribution Sampling from power-laws • Rule of thumb 1: - 80/20  40/20 - sensitive to exact power-law root • Rule of thumb 2: - power-law core remains invariant ruk  r http://uniqueRecall.com Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 23 backup 24 Geometric interpretation of k(, k, r) Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data BACKUP http://uniqueRecall.com 25 Information Acquisition from Redundant Data 3 pieces of data, containing 2 pieces of (“unique”) information* Data Information (instances) (concepts) e.g. used first names in a group individual names / different names e.g. words of a corpus: word appearances / vocabulary e.g. web harvesting: web data / web information Motivating question: Can we learn 80% of the information, by looking at only 20% of the data? *data interpreted as redundant representation of information Capurro, Hjørland [ARIST’03] Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 26

Rules of Thumb for Information Acquisition from Large and Redundant Data (presentation)

Related documents

Products

Support

Rules of Thumb for Information Acquisition from Large and Redundant Data (presentation)

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib