Caltech Theses Collection Usage Analysis

Ed Sponsler, George Porter, Betsy Coles
California Institute of Technology Library System

Three Kinds of Lies
• White Lies
• Damned Lies
• Statistics

The Devil’s in the Data’s Details

Examining the Data’s Details
• Study the data: What created it? Human? Computer? What does it mean?
• WRONG: How can the data address my questions?
• RIGHT: What questions can the data address?

Let’s Put Some Honesty into Statistics

Caltech Theses Facts
• First Digital Deposit: July 2001
• Number of Theses: 1208
• Software Used: VT ETDdb (but not for much longer)
• Campus Mandate: June 2002
• Defense Date Range: 1922 to present

Caltech Theses Statistics
• Data Source: Apache Web Logs
• What is an access?
• What can be ignored and why?
• What do human vs. robot accesses look like?
• What is a referrer? User Agent? Host IP? Requested Object?

Apache Combined Log Format
63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] "GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" 200 15767 "http://etd.caltech.edu/etd/available/etd-2182002-190040/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"

DeDupe
The dedupe filter counts each host at most once per thesis. Repeat requests from the same host are ignored, even if the request is for a different file from the same thesis, such as a different chapter.

The result of the dedupe filter is an access_log containing at most one log entry for each unique host that has accessed any file of a given thesis.
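The dedupe rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the filter used at Caltech: the regular expressions, the function names (parse_line, thesis_id, dedupe), and the assumption that the thesis ID is the "etd-..." path segment of the requested object are all hypothetical choices made for this sketch. The sample line is the one shown on the Apache Combined Log Format slide.

```python
import re

# Apache Combined Log Format: host, identity, user, [time], "request",
# status, bytes, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: the thesis ID is the "etd-..." directory in the requested path.
THESIS_PATTERN = re.compile(r'/(etd-[^/\s]+)/')

# Sample entry taken from the slide above.
SAMPLE = (
    '63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] '
    '"GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" '
    '200 15767 "http://etd.caltech.edu/etd/available/etd-2182002-190040/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def thesis_id(request):
    """Extract the thesis ID from a request string, or None if absent."""
    match = THESIS_PATTERN.search(request)
    return match.group(1) if match else None


def dedupe(lines):
    """Keep the first entry per (thesis, host) pair and drop later repeats."""
    seen = {}   # thesis ID -> set of host IPs already counted
    kept = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # not a Combined Log Format line
        tid = thesis_id(entry["request"])
        if tid is None:
            continue  # request is not for a thesis file
        hosts = seen.setdefault(tid, set())
        if entry["host"] not in hosts:
            hosts.add(entry["host"])
            kept.append(line)
    return kept
```

Calling dedupe() on a list of raw log lines returns the lines that survive the filter, in their original order. A second request from the same host for another file of the same thesis is dropped, while the same host requesting a different thesis is kept.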
DeDupe Data Structure

Theses ID | Host IP
etd-3493 | 131.212.13.22, 124.24.21.1, 145.46.55.6
etd-1139 | 131.212.13.22, 133.25.5.12, 154.21.78.9
etd-944 | 131.215.12.22, 133.42.3.99, 101.24.21.99

access_log
131.212.13.22 - - [21/Jul/2003:12
124.24.21.1 - - [12/Aug/2003:15
145.46.55.6 - - [05/Sep/2003:05
131.212.13.22 - - [20/Sep/2003:04
133.25.5.12 - - [28/Sep/2003:11
154.21.78.9 - - [03/Oct/2003:09
131.215.12.22 - - [05/Jan/2004:02
133.42.3.99 - - [09/Jan/2004:07
101.24.21.99 - - [14/Feb/2004:01

DeDupe Processing
[Bar chart: Apache log entries before and after DeDupe; vertical axis from 0 to 2,500,000]

Apache Status Codes
[Chart categories: OK, Partial Content, Not Modified, Forbidden, Not Found]

User Agents
[Chart categories: Internet Explorer, Netscape, Googlebot, Other Bots]

User Agents
• Known Human Users: 71% (Internet Explorer 60%, Netscape 11%)
• Bots/Harvesters/Other: 29% (Googlebot / Other: 14% / 15%)

Search Servers
[Chart categories: Google, Yahoo, MSN, AOL Netfind, Ask Jeeves, Other]

PDF Downloads from 7/1/2001 - 5/31/2004
[Chart]

Country of Origin Report
• GeoIP database contains IP blocks and their country of origin
• More useful and complete than top-level domain names (.edu, .de, .uk, etc.)

Geographic Analysis
153 countries represented

United States | 76294
China | 7943
Germany | 4763
United Kingdom | 4646
Canada | 3918
India | 3328
Japan | 3271
France | 2887
Italy | 2066
Taiwan | 2063
Korea | 1639
Spain | 1300
Australia | 1249
Netherlands | 1239
Iran | 1208
Malaysia | 1160
Hong Kong | 1007
Turkey | 961
Brazil | 860
Poland | 853
Singapore | 847
Russian Fed. | 812
Switzerland | 810
Sweden | 759
Israel | 743
Belgium | 735
Mexico | 724
Thailand | 648
Egypt | 542
Greece | 511
Romania | 480
Vietnam | 455
Indonesia | 451
Portugal | 438
Finland | 419
Philippines | 418

Most Popular Theses

Count | Defense Date
3322 | 2000-10-23
3199 | 2002-08-07
3174 | 2002-07-16
2457 | 2001-10-23
2153 | 2002-10-02
2120 | 2002-09-25
2098 | 2001-05-18
2073 | 2002-10-04
1959 | 2002-11-05
1848 | 2003-01-14
1675 | 2002-08-14
1614 | 2002-05-02
1486 | 2002-09-04
1378 | 2003-09-02
1304 | 2001-02-09
1296 | 2003-05-15
1176 | 2003-05-15
1134 | 2001-05-07
1130 | 2002-01-16
1124 | 2001-03-08
1123 | 2003-06-02
1091 | 2001-01-19
1087 | 2003-03-20

Most Popular Theses

Defense Date | Title (>1000 downloads)
2000-10-23 | Blocking Adhesion to Cell and Tissue Surfaces via Steric Stabilization with Graft Copolymers containing Poly(Ethylene Glycol) and Phenylboronic Acid
2002-08-07 | Electrochemical Sensors Based on DNA-Mediated Charge Transport Chemistry
2002-07-16 | Effects of Surface Modification on Charge-Carrier Dynamics at Semiconductor Interfaces
2001-10-23 | I. Seafloor Morphology of the Osbourn Trough and Kermadec Trench and II. Multiscale Dynamics of Subduction Zones
2002-10-02 | I. Structure-Function Analysis of the Mechanosensitive Channel of Large Conductance. II. Design of Novel Magnetic Materials using Crystal Engineering.
Most Popular Theses

Defense Date | Title
2002-09-25 | Modeling a Hox Gene Network: Stochastic Simulation with Experimental Perturbation
2001-05-18 | All-Optical Logic Circuits based on the Polarization Properties of Non-Degenerate Four-Wave Mixing
2002-10-04 | Site-specific incorporation of synthetic amino acids into functioning ion channels
2002-11-05 | Impact-Ionization Mass Spectrometry of Cosmic Dust
2003-01-14 | Force-Detected Nuclear Magnetic Resonance Independent of Field Gradients
2002-08-14 | Fast, High-Order Methods for Scattering by Inhomogeneous Media
2002-05-02 | Neural dynamics underlying complex behavior in a songbird
2002-09-04 | Spectroscopic Characterization of DNA-mediated Charge Transfer

Most Popular Theses

Defense Date | Title
2003-09-02 | Protein Engineering Through in vivo Incorporation of Phenylalanine Analogs
2001-02-09 | Synthesis, Passivation and Charging of Silicon Nanocrystals
2003-05-15 | Sensitizer-linked substrates as probes of heme enzyme structure and catalysis
2003-05-15 | Mirror Thermal Noise in Interferometric Gravitational Wave Detectors
2001-05-07 | Analysis and Design of Turbo-like Codes
2002-01-16 | Computational Enzyme Design
2001-03-08 | An Investigation of Ion Engine Erosion by Low Energy Sputtering
2003-06-02 | Laboratory Evolution of Cytochrome P450 Peroxygenase Activity
2001-01-19 | Passive Hypervelocity Boundary Layer Control Using an Acoustically Absorptive Surface
2003-03-20 | Mapping the cytochrome c folding landscape

Human / Robot Split
Human activity is identified by ‘MSIE’ or ‘Mozilla’ in the User Agent field of the apache_log

Referrers by Human Use
MSIE | Mozilla
• etd.caltech.edu: 33%
• www.google.com: 32%
• search.yahoo.com: 8%
• www.google.de: 3%
• all others: <2% (each)
• 492 total referrers

Most Active Robots Since April 2004

Googlebot | 3524
Googlebot/Test | 1100
TurnitinBot | 362
Wget | 252
msnbot | 162
DA | 41
Contype | 36
ia_archiver | 33
FAST-WebCrawler | 18
NPBot | 16
NetAnts | 16

Summary
• Keep Statistics Honest: understand and scrub your data before analysis
• Google is key for discovery
• Theses are popular because they are new and have useful content

Next Steps
• Compare download frequencies, not just totals
• Create local IP -> domain name database
• Adapt DeDupe to CODA EPrints Archives

Caltech Library System’s Online Digital Archives
Theses: http://etd.caltech.edu
All Archives: http://coda.caltech.edu
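
The human / robot split used in this deck (human activity identified by ‘MSIE’ or ‘Mozilla’ in the User Agent field) can be sketched as below. This is only an illustration of that rule: the names HUMAN_TOKENS, is_human, and split_counts are hypothetical, and the input is assumed to be a list of User Agent strings already extracted from the log. Some crawlers also place ‘Mozilla’ in their User Agent string, so a rule this simple may count a few robots as human.

```python
from collections import Counter

# Tokens the slides use to identify human activity in the User Agent field.
HUMAN_TOKENS = ("MSIE", "Mozilla")


def is_human(user_agent):
    """Return True if the User Agent string contains a human-browser token."""
    return any(token in user_agent for token in HUMAN_TOKENS)


def split_counts(user_agents):
    """Tally accesses as 'human' or 'robot' from their User Agent strings."""
    counts = Counter()
    for agent in user_agents:
        counts["human" if is_human(agent) else "robot"] += 1
    return counts
```

Applied to the User Agent field of a deduped access_log, split_counts() gives the two totals from which the percentage split can be computed.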