Informetrics, Webometrics and Web Use metrics
Huimin Lu
10/21/2004

Outline
History
Article 1: Bibliometrics & WWW
Article 2: Bibliometrics of the WWW
Article 3: Authoritative Sources
Article 4: ParaSite
Conclusion

History
The term “bibliometrics” was introduced by Pritchard in 1969.
Pritchard’s explanation: “the application of mathematical and statistical methods to books and other media of communication”.

A1: Bibliometrics and the World Wide Web
By Don Turnbull
Bibliometrics
Bibliometric laws
Applying bibliometrics to the WWW
Metrics design

A1: Bibliometrics
Classic citation analysis
Refined classic bibliometrics
- Standard formula for impact: number of journal citations / number of citable articles published
- Basic formula for the immediacy index of influence: number of citations received by articles during the year / total number of citable articles published
Bibliographic coupling
- Measures the number of references two papers have in common, as a test of similarity
Cocitation analysis
- Measures the relations between cited documents
Common errors
- Lost co-authors, self-citation, similar author names, human error, etc.

A1: Bibliometric Laws
Bradford’s Law of Scattering
- Clustering method: terms $R a^n$ ($n = 0, 1, 2, \dots$; $a < 1$), with sum $\sum_{n=0}^{\infty} R a^n = R/(1-a)$
Lotka’s Law
- Inverse square: the number of authors with $n$ publications is proportional to $1/n^2$
Zipf’s Law
- Familiar words occur with high frequency: the $n$th most frequent word occurs about $k/n$ times

A1: Applying Bibliometrics to the Web
Web surveys
- Georgia Tech Graphics, Visualization, and Usability Web Surveys
Web servers
- Add programming logic
- Inaccurate data gathered: standard procedures are skipped, state information between usage hits is missed, and server hits themselves do not represent true usage.

A1: Metrics Design
Configure the Web server to gather comprehensive metrics
Manage log files
- Enhance reliability: back up regularly, store logs and log-analysis results, begin new logs on a timely basis, post results and log information for comparison.
- Log analysis tools: Analog, WWWStat, GetStats, Perl scripts.
- Standardization: Extended Log File Format by the World Wide Web Consortium (W3C)
Downie’s attempted analyses: user-based, request-based, byte-based
Optimal Web content setup and external bibliometric gathering

A2: Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace
By Ray R. Larson
Analysis of 30 GB of Web pages collected by the Inktomi “Web Crawler”
Cocitation analysis using the DEC AltaVista search engine

A2: Growth and Usage of the WWW

A2: Cocitation Analysis of the Web
Attempt: map the intellectual structure of the Web
Question: can cocitation techniques be applied to charting the contents of cyberspace?

A2: Methods
Selection of a core set of items for study
Retrieval of cocitation frequency information
Compilation of the raw cocitation frequency matrix
Correlation analysis to convert the raw frequencies into correlation coefficients
Multivariate analysis of the correlation matrix
Interpretation of the resulting “map” and validation

A2: Results

A3: Authoritative Sources in a Hyperlinked Environment
By Jon M. Kleinberg
A new method for automatically extracting certain types of information about a hypermedia environment from its link structure.

A3: Goal
Types of query search and their problems
- Specific queries: scarcity problem
- Broad-topic queries: abundance problem
- Similar-page queries
Synthesize the unreliable information contained in the presence of individual links to provide a set of authoritative pages relevant to an initial query.

A3: Common Approaches
Only S
- Define S to be the top k pages indexed by AltaVista
- Rank pages according to their in-degree
S -> T
- Define the same root set S
- Grow S to a larger base set T
- Rank pages by their in-degree

A3: Kleinberg’s Approach
Extract a small core community of hubs and authorities from T
Authoritative pages
- A novel type of quality measure for documents in hypermedia, obtained by algorithmic means
- Large in-degree and considerable overlap in the sets of pages that point to them
Hub pages
- Link to multiple relevant authoritative pages

A3: Algorithm and Output
Method: iteratively propagate “authority weight” and “hub weight” across the links of the web graph, converging simultaneously to steady states for both types of weights
Output: a pair of sets (X, Y), where X is a small set of authorities and Y is a small set of hubs, referred to by the author as a community of hubs and authorities
Claim: the algorithm identifies authoritative pages as members of dense bipartite communities in the link graph of the WWW.

A4: ParaSite: Mining Structural Information on the Web
By Ellen Spertus
Varieties of link information on the Web
How the Web differs from conventional hypertext
How the links can be exploited to build useful applications

A4: Classical Hypertext vs. Web
Classical hypertext                                            | Web
Links do not cross site or even document boundaries            | Links can cross site and document boundaries
Documents limited to a single topic                            | Multiple topics permitted in one web page
A manual answers each question in exactly one place or in none | An answer could appear any number of times on the Web
Hardly changes                                                 | Constantly changing

A4: Mining Links
Naïve Link Geometry
- A useful technique for finding pages on a given set of topics
Hypertext links (example)
- Categorized as upward, downward, crosswise, and outward
Directory links
- The directory structure relates pages in the absence of hypertext links
Structure within a page
- A page can be considered a tree of nodes, each with attached text and links embedded in the text
Other
- Domain names, relationships between concepts represented by words and phrases, paths traveled through Web sites by visitors

A4: Applications
Finding moved pages
- Exploiting hyperlinks
- Exploiting directory links
Finding related pages
- Collaborative filtering
- Given a set of similar pages the user already has, ParaSite finds the page A that links to the most of them and returns the other pages referenced by A.
A person finder

Conclusion
Information on the World Wide Web is increasing exponentially, and the Internet’s architecture is becoming more complicated.
Applying bibliometrics to the Web will help us control and manage Web information wisely.

Example of Hypertext Link
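
Supplementary sketch for A3: Algorithm and Output
The iterative propagation of authority and hub weights summarized on the A3 slides can be illustrated roughly as follows. This is a minimal sketch and not Kleinberg's own code: the function names `hits` and `normalize`, the fixed iteration count, and the toy graph are all assumptions made for illustration, and the construction of the base set T from a search engine is left out.

```python
from math import sqrt


def normalize(weights):
    """Scale weights to unit Euclidean length (unchanged if all are zero)."""
    norm = sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(links, iterations=50):
    """Return (authority, hub) weights for a link graph.

    links maps each page to the pages it points to (hypothetical input
    format); iterations is an assumed fixed number of update rounds.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: sum the hub weights of pages pointing to a page.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the (updated) authority weights of pages pointed to.
        new_hub = {
            page: sum(new_auth[dst] for dst in links.get(page, ()))
            for page in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


if __name__ == "__main__":
    # Toy graph: three hub-like pages pointing at candidate authorities.
    toy = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2", "x"]}
    authority, hub_weight = hits(toy)
    print(sorted(authority, key=authority.get, reverse=True)[:2])
    print(sorted(hub_weight, key=hub_weight.get, reverse=True)[:2])
```

The pages with the largest authority weights play the role of the set X on the slide, and those with the largest hub weights the role of Y; how many to keep from each list is a choice left to the caller.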