Caltech Theses Collection
Usage Analysis
Ed Sponsler
George Porter
Betsy Coles
California Institute of Technology
Library System
Three Kinds of Lies
• White Lies
• Damned Lies
• Statistics
The Devil’s in the Data’s
Details
Examining the Data’s
Details
• Study the data: What created it?
Human? Computer? What does it
mean?
• WRONG: How can the data
address my questions?
• RIGHT: What questions can the
data address?
Let’s Put Some Honesty
into Statistics
Caltech Theses Facts
• First Digital Deposit: July, 2001
• Number of Theses: 1208
• Software Used: VT ETDdb (but not
for much longer)
• Campus Mandate: June, 2002
• Defense Date Range: 1922 to
present
Caltech Theses Statistics
• Data Source: Apache Web Logs
• What is an access?
• What can be ignored and why?
• What do human v robot accesses look like?
• What is a referrer? User Agent? Host IP? Requested Object?
Apache Combined Log Format
63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] "GET
/etd/available/etd-12182002-190040/unrestricted/thesis.pdf
HTTP/1.1" 200 15767
"http://etd.caltech.edu/etd/available/etd-2182002-190040/"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET
CLR 1.0.3705)"
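The fields asked about on the previous slide (Host IP, Requested Object, referrer, User Agent) can be pulled out of a line like the one above with a regular expression. This is a minimal sketch, not the tooling used for this analysis; the names COMBINED_RE and parse_line are hypothetical, and the sample is the slide's log line joined back onto one line.

```python
import re

# Apache Combined Log Format: host, identity, user, [time], "request",
# status, bytes, "referrer", "user agent".
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The request looks like 'GET /path HTTP/1.1'; keep the path separately.
    parts = fields["request"].split()
    fields["path"] = parts[1] if len(parts) >= 2 else ""
    fields["status"] = int(fields["status"])
    return fields

# The example entry from the slide, written as a single line.
sample = (
    '63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] '
    '"GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" '
    '200 15767 "http://etd.caltech.edu/etd/available/etd-2182002-190040/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"'
)
entry = parse_line(sample)
```

Here entry["host"] is the Host IP, entry["path"] the Requested Object, and entry["referrer"] and entry["agent"] the referrer and User Agent.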
DeDupe
The dedupe filter ensures that a
host may access a thesis only one
time. Duplicate attempts are
ignored, even if the request is for a
different file from the same thesis,
such as a different Chapter.
DeDupe
The result of the dedupe filter is
an access_log containing at most
one log entry for each unique host
that has accessed any file of a
given thesis.
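The dedupe rule described above can be sketched as a filter that remembers which hosts have already been counted for each thesis. This is an illustration only, not the actual DeDupe code: the function name dedupe, the THESIS_ID_RE pattern, and the sample paths are assumptions, and skipping requests with no thesis ID in the path is also an assumption.

```python
import re

# Assumption: the thesis ID is the 'etd-...' directory segment of the request path.
THESIS_ID_RE = re.compile(r"/(etd-[^/]+)/")

def dedupe(entries):
    """Yield only the first (host, path) entry seen for each thesis/host pair."""
    seen = {}  # thesis ID -> set of host IPs already counted
    for host, path in entries:
        match = THESIS_ID_RE.search(path)
        if match is None:
            continue  # assumption: requests outside a thesis directory are dropped
        hosts = seen.setdefault(match.group(1), set())
        if host in hosts:
            continue  # duplicate: same host, same thesis (even a different file)
        hosts.add(host)
        yield host, path

# Hypothetical entries; file names are made up for illustration.
entries = [
    ("131.212.13.22", "/etd/available/etd-3493/unrestricted/chapter1.pdf"),
    ("131.212.13.22", "/etd/available/etd-3493/unrestricted/chapter2.pdf"),
    ("131.212.13.22", "/etd/available/etd-1139/unrestricted/thesis.pdf"),
    ("124.24.21.1", "/etd/available/etd-3493/unrestricted/chapter1.pdf"),
]
kept = list(dedupe(entries))
```

The second entry is ignored because the same host already accessed another chapter of the same thesis; the third is kept because it is a different thesis.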
DeDupe Data Structure
Theses ID | Host IP
etd-3493 | 131.212.13.22, 124.24.21.1, 145.46.55.6
etd-1139 | 131.212.13.22, 133.25.5.12, 154.21.78.9
etd-944 | 131.215.12.22, 133.42.3.99, 101.24.21.99
access_log
131.212.13.22 - - [21/Jul/2003:12
124.24.21.1 - - [12/Aug/2003:15
145.46.55.6 - - [05/Sep/2003:05
131.212.13.22 - - [20/Sep/2003:04
133.25.5.12 - - [28/Sep/2003:11
154.21.78.9 - - [03/Oct/2003:09
131.215.12.22 - - [05/Jan/2004:02
133.42.3.99 - - [09/Jan/2004:07
101.24.21.99 - - [14/Feb/2004:01
DeDupe Processing
[Bar chart: Apache Log Entries, Before and After DeDupe; vertical axis from 0 to 2,500,000]
Apache Status Codes
OK
Partial Content
Not Modified
Forbidden
Not Found
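The status names above correspond to the standard HTTP codes 200, 206, 304, 403 and 404. A small sketch of tallying them from parsed log entries follows; tally_status is a hypothetical helper and the input list is made-up sample data, not figures from this analysis.

```python
from collections import Counter
from http import HTTPStatus

def tally_status(codes):
    """Count status codes and label each with its standard reason phrase."""
    counts = Counter(codes)
    # HTTPStatus(code) raises ValueError for codes it does not know.
    return {
        f"{code} {HTTPStatus(code).phrase}": n
        for code, n in sorted(counts.items())
    }

# Made-up sample of status codes, for illustration only.
example = tally_status([200, 200, 206, 304, 404, 403, 200])
```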
User Agents
Internet Explorer
Netscape
Googlebot
Other Bots
User Agents
Internet Explorer | 60%
Netscape | 11%
Known Human Users | 71%
Googlebot | 14%
Other | 15%
Bots/Harvesters/Other | 29%
Search Servers
Google
Yahoo
MSN
AOL Netfind
Ask Jeeves
Other
PDF Downloads from
7/1/2001 - 5/31/2004
Country of Origin Report
GeoIP database contains IP
blocks and their country of
origin
More useful and complete than
top level domain names
(.edu, .de, .uk, etc)
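A lookup against a table of IP blocks can be sketched with a sorted list and a binary search. This is not the GeoIP library's API; BLOCKS, STARTS and country_of are hypothetical names, the blocks are reserved documentation ranges with placeholder labels rather than real GeoIP data, and the sketch assumes the blocks do not overlap.

```python
import bisect
import ipaddress

# Made-up sample blocks (reserved documentation ranges), NOT real GeoIP data.
BLOCKS = sorted(
    (int(net.network_address), int(net.broadcast_address), country)
    for net, country in [
        (ipaddress.ip_network("192.0.2.0/24"), "Country A"),
        (ipaddress.ip_network("198.51.100.0/24"), "Country B"),
        (ipaddress.ip_network("203.0.113.0/24"), "Country C"),
    ]
)
STARTS = [start for start, _end, _country in BLOCKS]

def country_of(ip):
    """Return the country label of the block containing ip, or None."""
    value = int(ipaddress.ip_address(ip))
    # Find the last block whose start address is <= the address.
    i = bisect.bisect_right(STARTS, value) - 1
    if i >= 0 and BLOCKS[i][0] <= value <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None
```

Unlike a top level domain lookup, this works directly on the Host IP in each log entry and does not depend on the host having a resolvable name.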
Geographic Analysis
153 countries represented
United States | 76294
China | 7943
Germany | 4763
United Kingdom | 4646
Canada | 3918
India | 3328
Japan | 3271
France | 2887
Italy | 2066
Taiwan | 2063
Korea | 1639
Spain | 1300
Australia | 1249
Netherlands | 1239
Iran | 1208
Malaysia | 1160
Hong Kong | 1007
Turkey | 961
Brazil | 860
Poland | 853
Singapore | 847
Russian Fed. | 812
Switzerland | 810
Sweden | 759
Israel | 743
Belgium | 735
Mexico | 724
Thailand | 648
Egypt | 542
Greece | 511
Romania | 480
Vietnam | 455
Indonesia | 451
Portugal | 438
Finland | 419
Philippines | 418
Most Popular Theses
Count | Defense Date
3322 | 2000-10-23
3199 | 2002-08-07
3174 | 2002-07-16
2457 | 2001-10-23
2153 | 2002-10-02
2120 | 2002-09-25
2098 | 2001-05-18
2073 | 2002-10-04
1959 | 2002-11-05
1848 | 2003-01-14
1675 | 2002-08-14
1614 | 2002-05-02
1486 | 2002-09-04
1378 | 2003-09-02
1304 | 2001-02-09
1296 | 2003-05-15
1176 | 2003-05-15
1134 | 2001-05-07
1130 | 2002-01-16
1124 | 2001-03-08
1123 | 2003-06-02
1091 | 2001-01-19
1087 | 2003-03-20
Most Popular Theses
(>1000 downloads)
Defense Date | Title
2000-10-23 | Blocking Adhesion to Cell and Tissue Surfaces via Steric Stabilization with Graft Copolymers containing Poly(Ethylene Glycol) and Phenylboronic Acid
2002-08-07 | Electrochemical Sensors Based on DNA-Mediated Charge Transport Chemistry
2002-07-16 | Effects of Surface Modification on Charge-Carrier Dynamics at Semiconductor Interfaces
2001-10-23 | I. Seafloor Morphology of the Osbourn Trough and Kermadec Trench and II. Multiscale Dynamics of Subduction Zones
2002-10-02 | I. Structure-Function Analysis of the Mechanosensitive Channel of Large Conductance. II. Design of Novel Magnetic Materials using Crystal Engineering.
Most Popular Theses
Defense Date | Title
2002-09-25 | Modeling a Hox Gene Network: Stochastic Simulation with Experimental Perturbation
2001-05-18 | All-Optical Logic Circuits based on the Polarization Properties of Non-Degenerate Four-Wave Mixing
2002-10-04 | Site-specific incorporation of synthetic amino acids into functioning ion channels
2002-11-05 | Impact-Ionization Mass Spectrometry of Cosmic Dust
2003-01-14 | Force-Detected Nuclear Magnetic Resonance Independent of Field Gradients
2002-08-14 | Fast, High-Order Methods for Scattering by Inhomogeneous Media
2002-05-02 | Neural dynamics underlying complex behavior in a songbird
2002-09-04 | Spectroscopic Characterization of DNA-mediated Charge Transfer
Most Popular Theses
Defense Date | Title
2003-09-02 | Protein Engineering Through in vivo Incorporation of Phenylalanine Analogs
2001-02-09 | Synthesis, Passivation and Charging of Silicon Nanocrystals
2003-05-15 | Sensitizer-linked substrates as probes of heme enzyme structure and catalysis
2003-05-15 | Mirror Thermal Noise in Interferometric Gravitational Wave Detectors
2001-05-07 | Analysis and Design of Turbo-like Codes
2002-01-16 | Computational Enzyme Design
2001-03-08 | An Investigation of Ion Engine Erosion by Low Energy Sputtering
2003-06-02 | Laboratory Evolution of Cytochrome P450 Peroxygenase Activity
2001-01-19 | Passive Hypervelocity Boundary Layer Control Using an Acoustically Absorptive Surface
2003-03-20 | Mapping the cytochrome c folding landscape
Human / Robot Split
Human activity identified by
‘MSIE’ or ‘Mozilla’ in the
User Agent field of the
apache_log
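The rule on this slide amounts to a substring test on the User Agent string. A minimal sketch, assuming the User Agent has already been extracted from each log entry; is_human and split_counts are hypothetical names, and the robot strings in the example are illustrative rather than taken from the logs.

```python
def is_human(user_agent):
    """Apply the slide's rule: 'MSIE' or 'Mozilla' marks a human-driven browser."""
    return "MSIE" in user_agent or "Mozilla" in user_agent

def split_counts(user_agents):
    """Return (human, robot) counts for an iterable of User Agent strings."""
    human = sum(1 for agent in user_agents if is_human(agent))
    return human, len(user_agents) - human

agents = [
    # The browser string from the Combined Log Format example.
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)",
    # Illustrative robot strings (assumed, not copied from the logs).
    "Googlebot/2.1",
    "Wget/1.8.2",
]
human, robot = split_counts(agents)
```

A robot that puts 'Mozilla' in its User Agent string would be counted as human by this rule, so the split is an approximation.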
Referrers by Human Use
MSIE | Mozilla
• etd.caltech.edu | 33%
• www.google.com | 32%
• search.yahoo.com | 8%
• www.google.de | 3%
• all others | <2% (each)
• 492 total referrers
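Percentages like these can be produced by reducing each referrer URL to its host name and counting. A sketch under assumptions: referrer_shares is a hypothetical helper, the sample referrers are made up apart from the URL from the log format example, and empty referrers (logged as "-") are skipped.

```python
from collections import Counter
from urllib.parse import urlsplit

def referrer_shares(referrers):
    """Map each referrer host to its rounded percentage of all usable referrers."""
    hosts = Counter()
    for ref in referrers:
        host = urlsplit(ref).hostname  # None for '-' or other non-URL values
        if host:
            hosts[host] += 1
    total = sum(hosts.values())
    return {host: round(100 * n / total) for host, n in hosts.most_common()}

# Sample referrer values for illustration only.
shares = referrer_shares([
    "http://etd.caltech.edu/etd/available/etd-2182002-190040/",
    "http://www.google.com/search?q=thesis",
    "http://etd.caltech.edu/",
    "-",
])
```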
Most Active Robots
Since April, 2004
Googlebot | 3524
Googlebot/Test | 1100
TurnitinBot | 362
Wget | 252
msnbot | 162
DA | 41
Contype | 36
ia_archiver | 33
FAST-WebCrawler | 18
NPBot | 16
NetAnts | 16
Summary
• Keep Statistics Honest:
understand and scrub your data
before analysis
• Google is key for discovery
• Theses are popular because they
are new and have useful content
Next Steps
• Compare download frequencies,
not just totals
• Create local IP -> domain name
database
• Adapt DeDupe to CODA EPrints
Archives
Caltech Library System’s
Online Digital Archives
Theses
http://etd.caltech.edu
All Archives
http://coda.caltech.edu