Searching the Web
Junghoo Cho
UCLA Computer Science

Information Galore
[Figure: information sources, including a biblio server, a legacy database, and plain text files]

Information Overload Problem

Solution
• Indexing approach
  – Google, Excite, AltaVista
• Integration approach
  – MySimon, BizRate

Indexing Approach
[Figure: central index]

Challenges
• Page selection and download
  – Which pages to download?
• Page and index update
  – How to update pages?
• Page ranking
  – Which pages are “important” or “relevant”?
• Scalability

Integration Approach
[Figure: a mediator on top of wrappers for Source 1, Source 2, ..., Source n]

Challenges
• Heterogeneous sources
  – Different data models: relational, object-oriented
  – Different schemas and representations: “Keanu Reeves” or “Reeves, K.”, etc.
• Limited query capabilities
• Mediator caching

Focus of the Talk
• Indexing approach
• How to keep pages up to date?

Outline of This Talk
How can we keep pages fresh?
• How does the Web change?
• What do we mean by “fresh” pages?
• How should we refresh pages?

Web Evolution Experiment
• How often does a Web page change?
• How long does a page stay on the Web?
• How long does it take for 50% of the Web to change?
• How do we model Web changes?

Experimental Setup
• February 17 to June 24, 1999
• 270 sites visited (with permission)
  – identified 400 sites with highest “PageRank”
  – contacted administrators
• 720,000 pages collected
  – 3,000 pages from each site daily
  – start at root, visit breadth first (get new & old pages)
  – ran only 9pm - 6am, 10 seconds between site requests

Average Change Interval
[Figure: bar chart. x-axis: average change interval (up to 1 day; 1 day to 1 week; 1 week to 1 month; 1 month to 4 months; over 4 months). y-axis: fraction of pages, 0.00 to 0.35]

Change Interval – By Domain
[Figure: bar chart by domain (com, netorg, edu, gov). x-axis: average change interval, same bins as above. y-axis: fraction of pages, 0 to 0.6]

Modeling Web Evolution
• Poisson process with rate $\lambda$
• $T$ is time to next event
• $f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$

Change Interval of Pages
[Figure: fraction of changes with a given interval, for pages that change every 10 days on average, compared with the Poisson model. x-axis: interval in days]

Change Metrics
• Freshness
  – Freshness of element $e_i$ at time $t$ is
    $$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
  – Freshness of the database $S$ at time $t$ is
    $$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$
    (Assume “equal importance” of pages)
[Figure: elements $e_i$ in the database and the corresponding elements on the web]

Change Metrics
• Age
  – Age of element $e_i$ at time $t$ is
    $$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
  – Age of the database $S$ at time $t$ is
    $$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$
    (Assume “equal importance” of pages)
[Figure: elements $e_i$ in the database and the corresponding elements on the web]

Change Metrics
• Time averages:
  $$\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt$$
  $$\bar{F}(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; t)\, dt$$
[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ plotted over time, with update and refresh events marked]

Trick Question
• Two-page database
  – e1 changes daily
  – e2 changes once a week
• Can visit one page per week
• How should we visit pages?
  – e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
  – e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]
  – e1 e1 e1 e1 e1 e1 ...
  – e2 e2 e2 e2 e2 e2 ...
  – ?
[Figure: e1 and e2 in the database and on the web]

Proportional Often Not Good!
• Visit fast-changing e1
  – get 1/2 day of freshness
• Visit slow-changing e2
  – get 1/2 week of freshness
• Visiting e2 is a better deal!

Optimal Refresh Frequency
Problem
Given $\lambda_1, \lambda_2, \ldots, \lambda_N$ and $f$, find $f_1, f_2, \ldots, f_N$, subject to
$$f = \sum_{i=1}^{N} f_i / N,$$
that maximize
$$\bar{F}(S) = \frac{1}{N} \sum_{i=1}^{N} \bar{F}(e_i)$$

Optimal Refresh Frequency
• Shape of curve is the same in all cases
• Holds for any change frequency distribution

Optimal Refresh for Age
• Shape of curve is the same in all cases
• Holds for any change frequency distribution

Comparing Policies
Policy         Freshness   Age
Proportional   0.12        400 days
Uniform        0.57        5.6 days
Optimal        0.62        4.3 days
Based on statistics from the experiment and a revisit frequency of once every month

Not Every Page is Equal!
• Some pages are “more important”
  – e1: accessed by users 10 times/day
  – e2: accessed by users 20 times/day
  $$\bar{F}(S) = \frac{1}{3} \left[ 1 \cdot \bar{F}(e_1) + 2 \cdot \bar{F}(e_2) \right]$$
• In general,
  $$\bar{F}(S) = \frac{\sum_{i=1}^{N} w_i \bar{F}(e_i)}{\sum_{i=1}^{N} w_i}$$

Weighted Freshness
[Figure: refresh frequency f, with one curve for w = 2 and one for w = 1]

Change Frequency Estimation
• How to estimate change frequency?
  – Naïve estimator: X/T
  – X: number of detected changes
  – T: monitoring period
  – 2 changes in 10 days: 0.2 times/day
• Incomplete change history
[Figure: timeline with a 1-day scale showing when the page changed, when the page was visited, and when a change was detected]

Improved Estimator
• Based on the Poisson model
  [Equation: estimate expressed with f, log, N, and X]
  – X: number of detected changes
  – N: number of accesses
  – f: access frequency
• 3 changes in 10 days: 0.36 times/day
• Accounts for “missed” changes

Improvement Significant?
• Application to a Web crawler
  – Visit pages once every week for 5 weeks
  – Estimate change frequency
  – Adjust revisit frequency based on the estimate
    » Uniform: do not adjust
    » Naïve: based on the naïve estimator
    » Ours: based on our improved estimator

Improvement from Our Estimator
Policy    Detected changes   Ratio to uniform
Uniform   2,147,589          100%
Naïve     4,145,582          193%
Ours      4,892,116          228%
(9,200,000 visits in total)

Summary
• Information overload problem
  – Indexing approach
  – Integration approach
• Page update
  – Web evolution experiment
  – Change metric
  – Refresh policy
  – Frequency estimator

Research Opportunity
• Efficient query processing?
• Automatic source discovery?
• Automatic data extraction?

Web Archive Project
• Can we store the history of the Web?
  – Web is ephemeral
  – Study of the evolution of the Web
• Challenges
  – Update policy?
  – Compression?
  – New storage structure?
  – New index structure?

The End
• Thank you for your attention
• For more information visit http://www.cs.ucla.edu/~cho/
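
Supplementary sketch: change frequency estimators
The two estimators from the frequency estimation slides can be illustrated with a short Python sketch. The naïve estimator is X/T, as stated in the talk. For the Poisson-based estimator, this sketch assumes the plain maximum-likelihood form f · log(N / (N − X)), which follows from treating each visit as detecting a change with probability 1 − e^(−λ/f); the expression used in the talk may differ, for example by small bias-correcting constants. The function names `naive_estimate` and `poisson_estimate` are hypothetical and chosen only for illustration.

```python
import math


def naive_estimate(changes_detected: int, monitoring_period: float) -> float:
    """Naive estimator X/T: detected changes per unit of monitoring time."""
    if monitoring_period <= 0:
        raise ValueError("monitoring period must be positive")
    return changes_detected / monitoring_period


def poisson_estimate(changes_detected: int, accesses: int, access_freq: float) -> float:
    """Poisson-based estimate of the change rate (assumed form, not confirmed by the talk).

    Under a Poisson change model with rate lam, a visit made 1/access_freq after
    the previous one sees no change with probability exp(-lam / access_freq).
    Estimating that probability by (N - X) / N and solving for lam gives
    lam = access_freq * log(N / (N - X)).
    """
    if accesses <= 0 or not 0 <= changes_detected <= accesses:
        raise ValueError("need accesses > 0 and 0 <= changes_detected <= accesses")
    unchanged = accesses - changes_detected
    if unchanged == 0:
        # Every visit detected a change, so the data give no finite upper bound.
        return math.inf
    return access_freq * math.log(accesses / unchanged)


if __name__ == "__main__":
    # 2 detected changes over a 10-day monitoring period (example from the talk).
    print(naive_estimate(2, 10))
    # 3 detected changes in 10 daily visits (example from the talk).
    print(round(poisson_estimate(3, 10, 1.0), 2))
```

For 3 detected changes in 10 daily visits, the naïve estimator gives 0.3 changes/day, while the assumed Poisson-based form gives about 0.36, matching the figure quoted in the talk. The higher value reflects changes that occurred between visits and were counted only once.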