Counting on OpenDOAR

Peter Millington
SHERPA Technical Development Officer
CRC, University of Nottingham
peter.millington@nottingham.ac.uk
Background to OpenDOAR
• Created in 2005
– Lists over 2320 repositories (2013-07-02)
• Manually validated
– High quality…
– …but we didn’t like to talk about the record counts
• Counts not updated after the initial entry
– Unless prompted by users
• Fixed in 2012
– Record counts updated about every 2 weeks
http://www.opendoar.org/
Established counting methods
• Manual inspection
– Labour-intensive
• Counting OAI-PMH record identifiers
– Inefficient
• Handling big files
• Iterative
– Unreliable
• File size limits and timeouts
– Inaccurate
• Need to account for deleted records
How difficult can it be?
• SELECT COUNT(*) FROM repository;
– Still fast even with added complexity
– Statuses, Breakdown by date, etc.
• The number is often there on the web page
– Headline number, or
– “x to y of z” tally, or
– Adding up numbers on a “Browse by year” page
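To illustrate the first bullet, here is a minimal sketch of how cheap counting is for the repository's own database, using an in-memory SQLite table. The table name `repository` comes from the slide; the `status` and `year` columns and the sample rows are assumptions made only for this example.

```python
import sqlite3

# Hypothetical schema: only the table name "repository" appears on the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repository (id INTEGER PRIMARY KEY, status TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO repository (status, year) VALUES (?, ?)",
    [("live", 2011), ("live", 2012), ("deleted", 2012), ("live", 2012)],
)

# Headline count of all rows.
total = conn.execute("SELECT COUNT(*) FROM repository").fetchone()[0]

# Count restricted by status.
live = conn.execute(
    "SELECT COUNT(*) FROM repository WHERE status = 'live'"
).fetchone()[0]

# Breakdown by date.
by_year = conn.execute(
    "SELECT year, COUNT(*) FROM repository "
    "WHERE status = 'live' GROUP BY year ORDER BY year"
).fetchall()

print(total, live, by_year)
```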
OpenDOAR’s Strategy
• Avoid OAI-PMH whenever possible
• Use other m2m interfaces, if available/suitable
• Screen scrape numbers from web pages
• If all else fails, use manual methods
• Counts for “full texts” as well, where possible
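The order of preference above could be expressed as a simple fallback chain. This is only a sketch: the function name `first_count`, the method labels and the idea of returning `None` to signal "fall back to manual counting" are assumptions for illustration, not OpenDOAR's actual code.

```python
from typing import Callable, Iterable, Optional, Tuple

# A method is a (label, callable) pair; the callable returns a count or None.
Method = Tuple[str, Callable[[], Optional[int]]]

def first_count(methods: Iterable[Method]) -> Tuple[Optional[str], Optional[int]]:
    """Try each method in order and return the first count obtained."""
    for name, method in methods:
        try:
            count = method()
        except Exception:
            # Timeouts, HTTP errors and parse failures all mean "try the next one".
            continue
        if count is not None:
            return name, count
    # Nothing worked: the repository needs manual counting.
    return None, None

def broken_interface() -> Optional[int]:
    raise TimeoutError("no response")

# Hypothetical ordering: preferred methods first, OAI-PMH last.
result = first_count([
    ("m2m interface", broken_interface),
    ("screen scrape", lambda: 6727),
    ("OAI-PMH", lambda: 6700),
])
print(result)
```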
Some examples…
Generic n records
Documents avec texte intégral (Documents with full text)
229181
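A headline number like this can be picked up by looking for the first number that follows a known label. The sketch below assumes the label and number appear in that order in the page source; the function name `count_after_label` and the sample markup are hypothetical.

```python
import re
from typing import Optional

# A number is either grouped in thousands (1,239 or 229 181) or a plain run of digits.
NUMBER = r"(\d{1,3}(?:[,.\s]\d{3})+|\d+)"

def count_after_label(html: str, label: str) -> Optional[int]:
    """Return the first number found within 40 non-digit characters after `label`."""
    # Crude tag stripping, so the label and number may sit in different elements.
    text = re.sub(r"<[^>]+>", " ", html)
    match = re.search(re.escape(label) + r"\D{0,40}?" + NUMBER, text)
    if match is None:
        return None
    return int(re.sub(r"\D", "", match.group(1)))

sample = "<td>Documents avec texte intégral</td><td><b>229181</b></td>"
print(count_after_label(sample, "Documents avec texte intégral"))
```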
Generic x to y of z counters
DSpace Browse Counter is a special case
Showing results 1 to 20 of 6727
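A tally of this shape can be read with a single regular expression that keeps only the final number. This is a sketch assuming English wording with "to" (or a hyphen) and "of"; other wordings would need their own patterns.

```python
import re
from typing import Optional

# Matches "x to y of z" or "x-y of z", allowing comma thousands separators.
TALLY = re.compile(
    r"(\d[\d,]*)\s*(?:to|-)\s*(\d[\d,]*)\s+of\s+(\d[\d,]*)", re.IGNORECASE
)

def total_from_tally(text: str) -> Optional[int]:
    """Return z from an 'x to y of z' tally, or None if no tally is present."""
    match = TALLY.search(text)
    if match is None:
        return None
    return int(match.group(3).replace(",", ""))

print(total_from_tally("Showing results 1 to 20 of 6727"))
```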
DSpace totalCnt Add-on
NCKUR中的社群 [40782/74662] [ 全文筆數/總筆數 ]
(Communities in NCKUR [40782/74662] [ full-text records / total records ])
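The bracketed pair gives both figures at once: according to the legend on the page, the first number is the full-text count and the second is the total. A minimal sketch of reading the pair follows; the function name `full_text_and_total` is hypothetical.

```python
import re
from typing import Optional, Tuple

# First "[a/b]" pair of integers on the page.
PAIR = re.compile(r"\[\s*(\d+)\s*/\s*(\d+)\s*\]")

def full_text_and_total(text: str) -> Optional[Tuple[int, int]]:
    """Return (full_text_count, total_count) from a '[a/b]' pair, or None."""
    match = PAIR.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

line = "NCKUR中的社群 [40782/74662] [ 全文筆數/總筆數 ]"
print(full_text_and_total(line))
```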

Generic Sum of List Counters
EPrints count Browse List is a special case
Add up the numbers
in brackets
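Adding up the bracketed numbers is a one-line sum once they are extracted. The browse-list markup below is a hypothetical stand-in, not copied from a real EPrints page.

```python
import re

def sum_bracketed_counts(html: str) -> int:
    """Add up every integer that appears alone inside round brackets."""
    # Remove tags first so that "(120)" is matched wherever it sits in the markup.
    text = re.sub(r"<[^>]+>", " ", html)
    return sum(
        int(n.replace(",", "")) for n in re.findall(r"\((\d[\d,]*)\)", text)
    )

# Hypothetical "Browse by year" list.
browse = (
    '<li><a href="2013.html">2013</a> (120)</li>'
    '<li><a href="2012.html">2012</a> (1,045)</li>'
    '<li><a href="2011.html">2011</a> (74)</li>'
)
print(sum_bracketed_counts(browse))
```

The years themselves are not counted because they are not in brackets.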
EPrints V.3 Counter
http://eprints.nonesuch.ac.uk/cgi/counter
Number
of items
Generic Sum of Numbers
Add up the numbers
Generic HTML tag counting
Count item tags in
HTML source code
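Where a page lists every item but shows no total, the item tags themselves can be counted. This sketch uses Python's standard `html.parser`; the choice of `li` elements with a class of `item` is an assumption for the example, since the right tag depends on each repository's markup.

```python
from html.parser import HTMLParser
from typing import Optional

class TagCounter(HTMLParser):
    """Counts start tags with a given name and, optionally, a given CSS class."""

    def __init__(self, tag: str, css_class: Optional[str] = None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.count += 1

def count_tags(html: str, tag: str, css_class: Optional[str] = None) -> int:
    parser = TagCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count

page = (
    '<ul><li class="item">Paper A</li>'
    '<li class="item">Paper B</li>'
    '<li class="nav">Next page</li></ul>'
)
print(count_tags(page, "li", "item"))
```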
Counting multiple pages
• Separate pages per letter, document type, etc
• Issues with Greenstone – lack of predictability
OAI-PMH ListIdentifiers: Simple
http:// ... /oai?verb=ListIdentifiers&metadataPrefix=oai_dc
Count
these
No resumptionToken
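When the whole list arrives in one response, counting means counting the `<header>` elements, leaving out any marked `status="deleted"`. The sketch below works on an already downloaded response; the sample identifiers are made up.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace

def count_identifiers(xml_text: str) -> Tuple[int, int]:
    """Return (live, deleted) header counts from one ListIdentifiers response."""
    root = ET.fromstring(xml_text.strip())
    headers = root.findall(f".//{OAI}ListIdentifiers/{OAI}header")
    live = sum(1 for h in headers if h.get("status") != "deleted")
    return live, len(headers) - live

# Made-up response with two live records and one deleted record.
response = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header status="deleted"><identifier>oai:example.org:2</identifier></header>
    <header><identifier>oai:example.org:3</identifier></header>
  </ListIdentifiers>
</OAI-PMH>
"""
print(count_identifiers(response))
```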
OAI-PMH ListIdentifiers: Iterative
resumptionToken
for blocks of
identifiers
<resumptionToken>193114FUS</resumptionToken>
OAI-PMH completeListSize
<resumptionToken completeListSize="89805"
Bingo!
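The iterative method and the completeListSize shortcut can be sketched together. Here `fetch` is a hypothetical callable that returns the response for a given resumptionToken (or the first page for `None`); in practice it would issue the HTTP request. Note that completeListSize is optional in OAI-PMH, and the figure may include deleted records, so the shortcut is a trade-off between speed and accuracy.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace

def harvest_count(fetch: Callable[[Optional[str]], str],
                  trust_list_size: bool = True) -> int:
    """Count records by following resumptionTokens, or stop early if the
    server reports completeListSize and we choose to trust it."""
    token: Optional[str] = None
    total = 0
    while True:
        root = ET.fromstring(fetch(token))
        rt = root.find(f".//{OAI}resumptionToken")
        if trust_list_size and rt is not None and rt.get("completeListSize"):
            return int(rt.get("completeListSize"))
        # Count this block of identifiers, skipping deleted records.
        total += sum(1 for h in root.iter(f"{OAI}header")
                     if h.get("status") != "deleted")
        # An absent or empty resumptionToken marks the last block.
        token = (rt.text or "").strip() if rt is not None else ""
        if not token:
            return total

def _page(body: str) -> str:
    """Wrap made-up headers in a minimal ListIdentifiers response."""
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            "<ListIdentifiers>" + body + "</ListIdentifiers></OAI-PMH>")

H = "<header><identifier>oai:example.org:x</identifier></header>"
D = '<header status="deleted"><identifier>oai:example.org:y</identifier></header>'

PAGES = {
    None: _page(H + H + "<resumptionToken>193114FUS</resumptionToken>"),
    "193114FUS": _page(H + D + "<resumptionToken/>"),
}
SIZED = {
    None: _page(H + '<resumptionToken completeListSize="89805">'
                    "193114FUS</resumptionToken>"),
}

print(harvest_count(PAGES.get, trust_list_size=False))
print(harvest_count(SIZED.get))
```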
Twelve count harvesting methods
• Generic
– Generic n records
– Generic x to y of z counters
– Generic Sum of List Counters
– Generic HTML tag counting
– Generic Sum of Numbers
• DSpace
– DSpace Browse Counter
– DSpace totalCnt Add-on
• EPrints
– EPrints count Browse List
– EPrints V.3 Counter
• OAI-PMH ListIdentifiers
– Simple
– Iterative
– completeListSize
• Manual counting
Efficiency of the methods
[Bar chart: microseconds per item for each method, axis from 0 to 25,000. Methods in the order listed on the chart: OAI-PMH Iterative count, Generic HTML tag counting, OAI-PMH Simple count, DSpace Browse Counter, Generic x to y of z counter, DSpace totalCnt Add-on, EPrints V.3 Counter, Generic Sum of List Counters, EPrints count Browse List, OAI-PMH completeListSize, Generic n records, Generic Sum of Numbers. Annotations: "Big files", "Small files", "Iterative OAI-PMH so much slower".]
Relative Frequency of Methods
[Pie chart. Legend: DSpace Browse Counter, DSpace totalCnt Add-on, EPrints V.3 Counter, EPrints count Browse List, OAI-PMH completeListSize, OAI-PMH Simple count, OAI-PMH Iterative count, Generic n records, Generic Sum of List Counters, Generic HTML tag counting, Generic x to y of z counter, Generic Sum of Numbers, Manual counting. Slice values, unpaired with the legend: 41%, 18%, 11%, 8%, 6%, 5%, 4%, 3%, 3%, 1%, 0%, 0%, 0%.]
Ugent
Numbers galore
DSpace and EPrints
Easily scrapeable counts
Count harvesting issues
• No counts visible or harvestable
• Static counts – often approx. – e.g. “over 2m items”
• Connectivity issues
– Infrastructure limitations – e.g. heavy internet traffic
– HTTP 401 (unauthorised) & 403 (forbidden) errors
• Data hidden in include files (e.g. JavaScript)
– Not visible in View → Source code
• No direct URL known for the pages with counts
– Only accessible to human navigators
• Remodelled websites – requiring updated settings
Help OpenDOAR count your repository
• Display record counts on your home page
– Using distinctive wording & highlighting
– Ideally in <div id="[ID]"> or <span id="[ID]"> tags
• Ensure numbers can be seen in View → Source code
• Ensure pages & files are not blocked to robots
– Grant read-only access if necessary
• Implement OAI-PMH properly
– Return ListIdentifiers in chunks – not one big file
– Include completeListSize in the resumptionToken
• Tell us about any changes, so we can update settings
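To show why an `id` attribute helps, here is a sketch of a scraper that reads the number from one identified element and ignores every other number on the page. The id `record-count` and the sample markup are hypothetical; the sketch also assumes no unclosed void tags such as `<br>` inside the element.

```python
import re
from html.parser import HTMLParser
from typing import Optional

class IdText(HTMLParser):
    """Collects the text inside the element carrying a given id."""

    def __init__(self, element_id: str):
        super().__init__()
        self.element_id = element_id
        self.depth = 0          # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1     # nested element inside the target
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        pass                    # self-closed tags do not change nesting depth

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def count_by_id(html: str, element_id: str) -> Optional[int]:
    parser = IdText(element_id)
    parser.feed(html)
    parser.close()
    digits = re.sub(r"\D", "", "".join(parser.chunks))
    return int(digits) if digits else None

home = '<p>We hold <span id="record-count"><b>1,239</b></span> items since 2005.</p>'
print(count_by_id(home, "record-count"))
```

The year in the surrounding sentence is ignored because it lies outside the identified element, which is what makes this kind of markup robust to rewording.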
Ideas for the Future
• Comparing counts from OpenDOAR & ROAR
– E.g. Nottm ePrints: 1,239 < 1,277
– E.g. HAL-Inserm: 7,498 > 2,773
• OpenDOAR
– Growth charts
– Full text counts
• Extending OAI-PMH
– Statistical features
– Trial PSH