Sometimes, I just want to count things

Peter Millington
SHERPA Technical Development Officer
CRC, University of Nottingham
peter.millington@nottingham.ac.uk
http://crc.nottingham.ac.uk/

Actually, that’s a lie

• Just give me numbers for OpenDOAR
  – No. of items in ~1,800 repositories
  – Growth rates
  – Number of full texts v metadata-only records
• More generally (any database or resource)
  – No. of records in the database
  – No. of records by year, month, etc.
  – No. of records by category

How difficult can it be?

• Screen scraping?
  – Uh-uh-uh
• OAI-PMH – counting identifiers
  – BIG files – e.g. DSpace – Time out!
  – Iterative chunks – e.g. EPrints – Yawn
  – ‘completeListSize’ argument – If only…
• ORE is no better
  – Whatever…
• select count(*) from TABLE;
  – Duh!
• So back to screen scraping
  – Sigh

It should be as easy as …one…

• Simplicity
• Single SQL SELECT statement
  – Anything more is too complex and so too slow
• Single Call/File
  – No iteration
• Single simple schema
  – XML (+ optional JSON, and other renditions)

…two…

Target Performance - Rules of Two
<= 0.2 seconds – SQL execution
<= 2 seconds – Rendering the output file
<= 20 – Data points

…three

Maximum limits - Rules of Twenty (?)
<= 2 seconds – SQL execution
<= 20 seconds – Rendering the output file
<= 200 – Data points

Actions speak louder than words

• Protocol for Statistical Harvesting (PSH)
  – Base URL + verb + optional arguments
• Specification & Examples
  – http://opendoar.org/demos/psh_prototype.php
• Example Base URL:
  – http://opendoar.org/demos/psh.php

Simplest case - [base url]?verb=Count

<psh>
  <responseDate>2011-02-11T00:05:26Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems">
    <header>
      <setType />
      <setSpec />
      <setName />
      <datestamp />
      <numItems>1860</numItems>
    </header>
  </Count>
</psh>

Optional Count Arguments

• &countType – ‘units’ for counts
  – e.g. records, repositories, groups, genera, etc.
• &setType – some sort of category
  – e.g. subject, region, social class, etc.
• &dateUnit
  – e.g. decade, year, month
• &dateType
  – e.g. Date added, updated, performed, extinct, etc.
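
As a rough illustration of "Base URL + verb + optional arguments", here is a minimal Python sketch that assembles a Count request URL. The helper name build_psh_url is hypothetical. The argument values "year" and "dateAdded" are assumptions about what a given server accepts, so check the specification page for the real values before relying on them.

```python
from urllib.parse import urlencode

# Example base URL quoted in the slides.
BASE_URL = "http://opendoar.org/demos/psh.php"


def build_psh_url(base_url, verb="Count", **arguments):
    """Return base URL + verb + optional arguments as one request URL.

    Hypothetical helper: the argument names (countType, setType,
    dateUnit, dateType) come from the slides; accepted values are
    server-specific and are assumptions here.
    """
    params = {"verb": verb}
    # Leave out any argument passed as None so only chosen options are sent.
    params.update({name: value for name, value in arguments.items()
                   if value is not None})
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Simplest case: a single overall count.
    print(build_psh_url(BASE_URL))
    # Breakdown by year added (argument values are assumed).
    print(build_psh_url(BASE_URL, dateUnit="year", dateType="dateAdded"))
```

Everything is a single GET request with no iteration, so the whole client side reduces to building one query string.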

Breakdown by year added

<psh>
  <responseDate>2011-02-11T00:36:24Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems" dateType="dateAdded">
    <header>
      <setType />
      <setSpec />
      <setName />
      <datestamp>2008</datestamp>
      <numItems>298</numItems>
    </header>
    <header>
      <setType />
      <setSpec />
      <setName />
      <datestamp>2009</datestamp>
      <numItems>278</numItems>
    </header>

Other verbs

• Verbs for listing available argument values
  – ListSetTypes
  – ListDateUnits
  – ListDateTypes
  – ListCountTypes
• Help
  – Technical help
• Identify
  – Information about the resource

Some datasets to play with

• OpenDOAR – open access repositories
  – http://opendoar.org/demos/psh.php
• SHERPA/RoMEO – Publishers’ policies
  – http://www.sherpa.ac.uk/romeo/psh.php
• Folk Play Scripts database
  – http://mastermummers.org/scripts/psh.php
• Folk Play Groups & Events
  – http://mastermummers.org/groups/psh.php

How could this be improved?

http://opendoar.org/demos/psh_prototype.php
peter.millington@nottingham.ac.uk
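
For anyone trying the datasets, here is a minimal Python sketch of reading a Count response with the standard library. The function name parse_counts is hypothetical. The sample document reuses the two year rows from the breakdown example, with closing </Count> and </psh> tags added on the assumption that they mirror the simplest-case response.

```python
import xml.etree.ElementTree as ET

# Sample response: the two <header> rows come from the breakdown-by-year
# example; the closing tags are an assumption modelled on the simplest case.
SAMPLE_RESPONSE = """<psh>
  <responseDate>2011-02-11T00:36:24Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems" dateType="dateAdded">
    <header>
      <setType /><setSpec /><setName />
      <datestamp>2008</datestamp>
      <numItems>298</numItems>
    </header>
    <header>
      <setType /><setSpec /><setName />
      <datestamp>2009</datestamp>
      <numItems>278</numItems>
    </header>
  </Count>
</psh>"""


def parse_counts(xml_text):
    """Return one (setSpec, datestamp, numItems) tuple per <header>.

    Empty elements such as <setSpec /> are reported as None.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for header in root.iter("header"):
        set_spec = header.findtext("setSpec") or None
        datestamp = header.findtext("datestamp") or None
        num_items = int(header.findtext("numItems"))
        rows.append((set_spec, datestamp, num_items))
    return rows


if __name__ == "__main__":
    for set_spec, datestamp, num_items in parse_counts(SAMPLE_RESPONSE):
        print(set_spec, datestamp, num_items)
```

Because each data point is one flat <header> element, the client needs no paging or resumption logic, which is the point of the single-call design.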