Sometimes, I just want to count things

Peter Millington
SHERPA Technical Development Officer
CRC, University of Nottingham
peter.millington@nottingham.ac.uk
http://crc.nottingham.ac.uk/
Actually, that’s a lie
• Just give me numbers for OpenDOAR
– No. of items in ~1,800 repositories
– Growth rates
– Number of full texts vs. metadata-only records
• More generally (any database or resource)
– No. of records in the database
– No. of records by year, month, etc.
– No. of records by category
How difficult can it be?
• Screen scraping?
– Uh-uh-uh
• OAI-PMH – counting identifiers
– BIG files – e.g. DSpace
– Time out!
– Iterative chunks – e.g. EPrints – Yawn
– ‘completeListSize’ attribute (optional, on the resumptionToken) – If only…
• ORE is no better
– Whatever…
• select count(*) from TABLE; – Duh!
• So back to screen scraping – Sigh
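To make the OAI-PMH point concrete, here is a minimal sketch of reading a total from the first page of a ListIdentifiers response. The sample page, the example.org URL and the `quick_count` helper are hypothetical and written for illustration only. In OAI-PMH 2.0 the `completeListSize` attribute is optional, so the fallback branch is the one a harvester has to expect.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, hand-written first page of a ListIdentifiers response.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2011-02-11T00:00:00Z</responseDate>
  <request verb="ListIdentifiers">http://example.org/oai</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2010-01-01</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:2</identifier>
      <datestamp>2010-01-02</datestamp>
    </header>
    <resumptionToken completeListSize="1234" cursor="0">token-1</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def quick_count(first_page_xml):
    """Return (count, is_total) from the first ListIdentifiers page.

    is_total is False when the repository omits completeListSize and
    more pages remain, i.e. the harvester must keep iterating.
    """
    root = ET.fromstring(first_page_xml)
    headers = root.findall(f".//{OAI_NS}header")
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None:
        # No resumptionToken: the whole list fitted in one response.
        return len(headers), True
    size = token.get("completeListSize")
    if size is not None:
        # The repository volunteered the total: no iteration needed.
        return int(size), True
    # Only the size of this chunk is known so far.
    return len(headers), False


print(quick_count(SAMPLE_PAGE))
```

When the attribute is missing, the only way to get the total is to follow every resumptionToken to the end, which is the slow iteration complained about above.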
It should be as easy as …one…
• Simplicity
• Single SQL SELECT statement
– Anything more is too complex and so too slow
• Single Call/File
– No iteration
• Single simple schema
– XML (+ optional JSON, and other renditions)
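The "single SQL SELECT" rule can be sketched as follows. The `items` table, its `date_added` column and the three sample rows are assumptions made for illustration, and an in-memory SQLite database stands in for whatever database the resource really uses.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, date_added TEXT)")
conn.executemany(
    "INSERT INTO items (date_added) VALUES (?)",
    [("2008-03-01",), ("2008-11-15",), ("2009-06-30",)],
)


def count_all(conn):
    """Total number of records: one SELECT, one row back."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def count_by_year(conn):
    """Records per year added: still one SELECT, one row per year."""
    return conn.execute(
        "SELECT substr(date_added, 1, 4) AS year, COUNT(*) "
        "FROM items GROUP BY year ORDER BY year"
    ).fetchall()


print(count_all(conn))
print(count_by_year(conn))
```

Both functions issue exactly one statement and need no iteration on the client side, which is the property the slide asks for.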
…two…
Target Performance - Rules of Two
• <= 0.2 seconds – SQL execution
• <= 2 seconds – Rendering the output file
• <= 20 – Data points
…three
Maximum limits - Rules of Twenty (?)
• <= 2 seconds – SQL execution
• <= 20 seconds – Rendering the output file
• <= 200 – Data points
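One way to read the two sets of limits together is as a three-way check on a measured response. The sketch below assumes that reading. The dictionary keys, the `rate` function and its three labels are hypothetical names, not part of the proposal.

```python
# Limits taken from the two slides; key names are illustrative.
TARGET = {"sql_seconds": 0.2, "render_seconds": 2, "data_points": 20}
MAXIMUM = {"sql_seconds": 2, "render_seconds": 20, "data_points": 200}


def rate(measured):
    """Classify one measured response against the target and maximum limits."""
    if all(measured[key] <= TARGET[key] for key in TARGET):
        return "within target"
    if all(measured[key] <= MAXIMUM[key] for key in MAXIMUM):
        return "within maximum"
    return "over limit"


print(rate({"sql_seconds": 0.1, "render_seconds": 1.5, "data_points": 12}))
print(rate({"sql_seconds": 0.5, "render_seconds": 1.5, "data_points": 120}))
print(rate({"sql_seconds": 3.0, "render_seconds": 1.5, "data_points": 12}))
```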
Actions speak louder than words
• Protocol for Statistical Harvesting (PSH)
– Base URL + verb + optional arguments
• Specification & Examples
– http://opendoar.org/demos/psh_prototype.php
• Example Base URL:
– http://opendoar.org/demos/psh.php
Simplest case - [base url]?verb=Count
<psh>
  <responseDate>2011-02-11T00:05:26Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems">
    <header>
      <setType />
      <setSpec />
      <setName />
      <datestamp />
      <numItems>1860</numItems>
    </header>
  </Count>
</psh>
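A client needs very little code to read such a response. The sketch below parses the sample above with Python's standard library. The `parse_count` helper is a hypothetical name, and the code assumes the response carries no XML namespace, as in the sample shown.

```python
import xml.etree.ElementTree as ET

# The sample response from the slide.
SAMPLE = """<psh>
  <responseDate>2011-02-11T00:05:26Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems">
    <header>
      <setType />
      <setSpec />
      <setName />
      <datestamp />
      <numItems>1860</numItems>
    </header>
  </Count>
</psh>"""


def parse_count(xml_text):
    """Return (countType, rows), with one dict per <header> element."""
    root = ET.fromstring(xml_text)
    count = root.find("Count")
    rows = []
    for header in count.findall("header"):
        rows.append({
            # Empty elements such as <setSpec /> are reported as None.
            "setSpec": header.findtext("setSpec") or None,
            "datestamp": header.findtext("datestamp") or None,
            "numItems": int(header.findtext("numItems")),
        })
    return count.get("countType"), rows


print(parse_count(SAMPLE))
```

Because every result is a `header` with the same child elements, the same loop should also handle responses that contain one `header` per year or per category.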
Optional Count Arguments
• &countType – ‘units’ for counts
– e.g. records, repositories, groups, genera, etc.
• &setType – some sort of category
– e.g. subject, region, social class, etc.
• &dateUnit
– e.g. decade, year, month
• &dateType
– e.g. Date added, updated, performed, extinct, etc.
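Putting the base URL, the verb and the optional arguments together might look like the sketch below. The `count_url` helper is hypothetical, and the argument values in the usage line (`year`, `dateAdded`) are assumptions. The values a given resource actually accepts would have to be looked up for that resource.

```python
from urllib.parse import urlencode

# The four optional Count arguments named on the slide.
COUNT_ARGS = ("countType", "setType", "dateUnit", "dateType")


def count_url(base_url, **args):
    """Build a Count request URL, rejecting argument names not on the slide."""
    unknown = sorted(set(args) - set(COUNT_ARGS))
    if unknown:
        raise ValueError("unknown Count argument(s): " + ", ".join(unknown))
    # Keep a fixed argument order so the same request always yields the same URL.
    query = [("verb", "Count")]
    query += [(name, args[name]) for name in COUNT_ARGS if name in args]
    return base_url + "?" + urlencode(query)


# Assumed argument values, for illustration only.
print(count_url("http://opendoar.org/demos/psh.php",
                dateUnit="year", dateType="dateAdded"))
```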
Breakdown by year added
<psh>
  <responseDate>2011-02-11T00:36:24Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems" dateType="dateAdded">
    <header>
      <setType />
      <setSpec />
      <setName />
      <datestamp>2008</datestamp>
      <numItems>298</numItems>
    </header>
    <header>
      <setType />
      <setSpec />
      <setName />
      <datestamp>2009</datestamp>
      <numItems>278</numItems>
    </header>
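Once the per-year counts have been read out of such a response, growth figures are simple arithmetic. The sketch below uses only the two yearly figures visible in the excerpt above. The `yearly_change` helper is a hypothetical name.

```python
def yearly_change(counts):
    """Given (year, num_items) pairs in year order, return
    (year, absolute change, percentage change) against the previous year."""
    changes = []
    for (_, previous), (year, current) in zip(counts, counts[1:]):
        delta = current - previous
        # Percentage change is undefined when the previous year had no items.
        percent = delta / previous * 100 if previous else None
        changes.append((year, delta, percent))
    return changes


# The two years shown in the excerpt.
counts = [("2008", 298), ("2009", 278)]

for year, delta, percent in yearly_change(counts):
    print(f"{year}: {delta:+d} items ({percent:+.1f}%)")
```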
Other verbs
• Verbs for listing available argument values
– ListSetTypes
– ListDateUnits
– ListDateTypes
– ListCountTypes
• Help – Technical help
• Identify – Information about the resource
Some datasets to play with
• OpenDOAR – open access repositories
– http://opendoar.org/demos/psh.php
• SHERPA/RoMEO – Publishers’ policies
– http://www.sherpa.ac.uk/romeo/psh.php
• Folk Play Scripts database
– http://mastermummers.org/scripts/psh.php
• Folk Play Groups & Events
– http://mastermummers.org/groups/psh.php
How could this be improved?
http://opendoar.org/demos/psh_prototype.php
peter.millington@nottingham.ac.uk