Counting on OpenDOAR
Peter Millington
SHERPA Technical Development Officer
CRC, University of Nottingham
peter.millington@nottingham.ac.uk

Background to OpenDOAR
• Created in 2005
  – Lists over 2320 repositories (2013-07-02)
• Manually validated
  – High quality…
  – …but we didn’t like to talk about the record counts
• Counts not updated after the initial entry
  – Unless prompted by users
• Fixed in 2012
  – Record counts updated about every 2 weeks

http://www.opendoar.org/

Established counting methods
• Manual inspection
  – Labour-intensive
• Counting OAI-PMH record identifiers
  – Inefficient
    • Handling big files
    • Iterative
  – Unreliable
    • File size limits and timeouts
  – Inaccurate
    • Need to account for deleted records

How difficult can it be?
• SELECT COUNT(*) FROM repository;
  – Still fast even with added complexity
  – Statuses, breakdown by date, etc.
• The number is often there on the web page
  – Headline number, or
  – “x to y of z” tally, or
  – Adding up numbers on a “Browse by year” page

OpenDOAR’s Strategy
• Avoid OAI-PMH whenever possible
• Use other m2m interfaces, if available/suitable
• Screen scrape numbers from web pages
• If all else fails, use manual methods
• Counts for “full texts” as well, where possible

Some examples…

Generic n records
• Documents avec texte intégral 229181 (French: “Documents with full text”)

Generic x to y of z counters
• DSpace Browse Counter is a special case
• Showing results 1 to 20 of 6727

DSpace totalCnt Add-on
• NCKUR中的社群 [40782/74662] [ 全文筆數/總筆數 ] (Chinese: “Communities in NCKUR”, “[ full-text count / total count ]”)

Generic Sum of List Counters
• EPrints count Browse List is a special case
• Add up the numbers in brackets

EPrints V.3 Counter
• http://eprints.nonesuch.ac.uk/cgi/counter
• Number of items

Generic Sum of Numbers
• Add up the numbers

Generic HTML tag counting
• Count item tags in HTML source code

Counting multiple pages
• Separate pages per letter, document type, etc.
• Issues with Greenstone
  – Lack of predictability
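The screen-scraping examples above can be illustrated with a few regular expressions. This is only a minimal sketch in Python, not OpenDOAR's actual harvester: the function names and patterns are assumptions made for illustration, and a real repository page would need its own settings.

```python
import re


def count_x_to_y_of_z(text):
    """Return z from an 'x to y of z' tally, or None if no tally is found."""
    match = re.search(r"(\d[\d,]*)\s+to\s+(\d[\d,]*)\s+of\s+(\d[\d,]*)", text)
    if match is None:
        return None
    # Strip thousands separators such as "6,727" before converting.
    return int(match.group(3).replace(",", ""))


def sum_bracketed_counts(text):
    """Add up every number that appears on its own in round brackets."""
    numbers = re.findall(r"\((\d[\d,]*)\)", text)
    return sum(int(n.replace(",", "")) for n in numbers)


def full_text_and_total(text):
    """Return (full_text, total) from a '[full/total]' pair, or None."""
    match = re.search(r"\[\s*(\d+)\s*/\s*(\d+)\s*\]", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Two strings taken from the slides, plus one hypothetical browse list.
print(count_x_to_y_of_z("Showing results 1 to 20 of 6727"))
print(full_text_and_total("NCKUR中的社群 [40782/74662] [ 全文筆數/總筆數 ]"))
print(sum_bracketed_counts("2013 (12) 2012 (30) 2011 (8)"))  # hypothetical list
```

The "x to y of z" and "[full/total]" patterns read a single headline figure, whereas the bracket sum depends on every list entry being present on the page, which is one reason counting across multiple pages is harder.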
OAI-PMH ListIdentifiers: Simple
• http:// ... /oai?verb=ListIdentifiers&metadataPrefix=oai_dc
• Count these (the identifiers in the response)
• No resumptionToken

OAI-PMH ListIdentifiers: Iterative
• resumptionToken for blocks of identifiers
• <resumptionToken>193114FUS</resumptionToken>

OAI-PMH completeListSize
• <resumptionToken completeListSize="89805"
• Bingo!

Twelve count harvesting methods
• Generic
  – Generic n records
  – Generic x to y of z counters
  – Generic Sum of List Counters
  – Generic HTML tag counting
  – Generic Sum of Numbers
• DSpace
  – DSpace Browse Counter
  – DSpace totalCnt Add-on
• EPrints
  – EPrints count Browse List
  – EPrints V.3 Counter
• OAI-PMH ListIdentifiers
  – Simple
  – Iterative
  – completeListSize
• Manual counting

Efficiency of the methods
[Bar chart: Microseconds/Item for each method, axis from 0 to 25000. Methods in the order listed: OAI-PMH Iterative count; Generic HTML tag counting; OAI-PMH Simple count; DSpace Browse Counter; Generic x to y of z counter; DSpace totalCnt Add-on; EPrints V.3 Counter; Generic Sum of List Counters; EPrints count Browse List; OAI-PMH completeListSize; Generic n records; Generic Sum of Numbers. Chart annotations: “Big files”, “Iterative”, “Small files”.]
• OAI-PMH so much slower

Relative Frequency of Methods
[Pie chart. Legend: DSpace Browse Counter; DSpace totalCnt Add-on; EPrints V.3 Counter; EPrints count Browse List; OAI-PMH completeListSize; OAI-PMH Simple count; OAI-PMH Iterative count; Generic n records; Generic Sum of List Counters; Generic HTML tag counting; Generic x to y of z counter; Generic Sum of Numbers; Manual counting. Slice labels: 3%, 0%, 5%, 0%, 8%, 41%, 18%, 0%, 1%, 6%, 4%, 3%, 11%.]

Ugent
• Numbers galore

DSpace and EPrints
• Easily scrapeable counts

Count harvesting issues
• No counts visible or harvestable
• Static counts
  – Often approximate, e.g. “over 2m items”
• Connectivity issues
  – Infrastructure limitations, e.g. heavy internet traffic
  – HTTP 401 (unauthorised) & 403 (forbidden) errors
• Data hidden in include files (e.g. JavaScript)
  – Not visible in View Source code
• No direct URL known for the pages with counts
  – Only accessible to human navigators
• Remodelled websites
  – Requiring updated settings

Help OpenDOAR count your repository
• Display record counts on your home page
  – Using distinctive wording & highlighting
  – Ideally in <div id="[ID]"> or <span id="[ID]"> tags
• Ensure numbers can be seen in View Source code
• Ensure pages & files are not blocked to robots
  – Grant read-only access if necessary
• Implement OAI-PMH properly
  – Return ListIdentifiers in chunks, not one big file
  – Include completeListSize in the resumptionToken
• Tell us about any changes, so we can update settings

Ideas for the Future
• Comparing counts from OpenDOAR & ROAR
  – E.g. Nottm ePrints: 1,239 < 1,277
  – E.g. HAL-Inserm: 7,498 > 2,773
• OpenDOAR
  – Growth charts
  – Full text counts
• Extending OAI-PMH
  – Statistical features
  – Trial PSH
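
As an illustration of the three OAI-PMH ListIdentifiers methods and of why completeListSize is so helpful, the following Python sketch summarises one ListIdentifiers response. It is a sketch under stated assumptions, not OpenDOAR's code: the function name and the sample XML are hypothetical (only the token and size values are borrowed from the slides), and fetching the response over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def summarise_list_identifiers(xml_text):
    """Return (live_count, complete_list_size, next_token) for one response.

    live_count         - headers in this block not flagged status="deleted"
    complete_list_size - the completeListSize attribute, or None if absent
    next_token         - resumptionToken text, or None when the list is done
    """
    root = ET.fromstring(xml_text)
    headers = root.findall(".//oai:ListIdentifiers/oai:header", OAI_NS)
    live_count = sum(1 for h in headers if h.get("status") != "deleted")

    complete_list_size = None
    next_token = None
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is not None:
        size_attr = token.get("completeListSize")
        if size_attr is not None:
            complete_list_size = int(size_attr)
        # An empty resumptionToken element marks the final block.
        next_token = (token.text or "").strip() or None
    return live_count, complete_list_size, next_token


# Hypothetical response: two headers, one of them deleted.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header status="deleted"><identifier>oai:example.org:2</identifier></header>
    <resumptionToken completeListSize="89805">193114FUS</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

print(summarise_list_identifiers(SAMPLE))
```

If complete_list_size comes back on the first response, a harvester can stop there. Otherwise it would have to repeat the request with each next_token and add up live_count, which is the slow iterative method. Whether a given repository's completeListSize includes deleted records is not something this sketch can determine.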