Crawl Operators’ Workshop
Roger G. Coram
www.bl.uk

Topics
• ExternalGeoLocationDecideRule
• Sheets
  – IpAddressSetDecideRule

ExternalGeoLocationDecideRule
• Legal Deposit legislation passed in April 2013.
• The Legal Deposit Libraries (Non-Print Works) Regulations 2013:
  – 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
    • (b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”

Geolocation
• ExternalGeoLocationDecideRule requires:
  – A list of ISO 3166-1 country codes to be included in the crawl
    • GB, FR, DE, etc.
  – An implementation of ExternalGeoLookupInterface.

ExternalGeoLookupInterface
• Our implementation is based on MaxMind’s GeoLite2 database.
• Freely available under ‘Creative Commons Attribution-ShareAlike 3.0 Unported License’.
• Only ~30MB; can be held in memory.

crawler-beans.cxml
Configuration example:

<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>

<!-- ... ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>

Results
• Short test crawl (1,000,000 seeds) produced:
  – 89,500,755 URLs in total.
  – 26,072 non-UK URLs which would not otherwise have been in scope.
    • 137 distinct hosts.

IP-based Sheets
“Hi,
“I'm a senior system administrator for Webfusion / 123-reg.
“We're currently experiencing lots of requests from crawler1.bl.uk to sites hosted on 81.21.76.62, this is part of our Parking platform, which links into Yahoo to allow customers to park domains and earn money.”
• Large number of hosts on a single machine.
• Need a way to reduce the load on a specific IP address.

Sheets
• “Sheets provide the ability to replace default settings on a per domain basis.”
  – Allow you to change any value on any named bean for a specific set of URLs.
• Actually quite flexible:
  – SurtPrefixesSheetAssociation
    • Applied by matching SURT prefixes.
  – DecideRuledSheetAssociation
    • Applied by evaluating a series of DecideRules.
      – IpAddressSetDecideRule

1. crawler-beans.cxml
Configuration example:

<!-- Sheet with stricter politeness settings. -->
<bean id="extraPolite" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="disposition.delayFactor" value="8.0"/>
      <entry key="disposition.minDelayMs" value="10000"/>
      <entry key="disposition.maxDelayMs" value="60000"/>
      <entry key="disposition.respectCrawlDelayUpToSeconds" value="60"/>
    </map>
  </property>
</bean>

<!-- Sheet capping the number of fetch responses per server. -->
<bean id="crawlLimited" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="quotaEnforcer.serverMaxFetchResponses" value="25"/>
    </map>
  </property>
</bean>

2. crawler-beans.cxml
Configuration example:

<!-- Apply both sheets to any URL served from the listed IP address. -->
<bean class="org.archive.crawler.spring.DecideRuledSheetAssociation">
  <property name="rules">
    <bean class="org.archive.modules.deciderules.IpAddressSetDecideRule">
      <property name="ipAddresses">
        <set>
          <value>81.21.76.62</value>
        </set>
      </property>
      <property name="decision" value="ACCEPT"/>
    </bean>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>extraPolite</value>
      <value>crawlLimited</value>
    </list>
  </property>
</bean>

Thank you
GitHub: https://github.com/ukwa/bl-heritrix-modules
MaxMind: http://dev.maxmind.com/geoip/geoip2/geolite2/
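
Appendix: decision logic sketch

The two rules covered in these slides boil down to simple set-membership tests, which the following Python sketch tries to illustrate. It is not the Heritrix or bl-heritrix-modules code: the names `TOY_GEO_TABLE`, `lookup_country`, `geo_decision` and `sheets_for` are hypothetical, the lookup table is a made-up stand-in for the GeoLite2 database (using documentation-only address ranges), and the "ACCEPT, otherwise no opinion" behaviour is an assumption about how a predicate-style DecideRule behaves.

```python
import ipaddress

# Hypothetical stand-in for a GeoLite2 lookup: network -> ISO 3166-1 code.
# The ranges are documentation-only addresses with invented country codes.
TOY_GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "GB",
    ipaddress.ip_network("198.51.100.0/24"): "FR",
    ipaddress.ip_network("203.0.113.0/24"): "US",
}


def lookup_country(ip):
    """Return the country code for an IP address, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, code in TOY_GEO_TABLE.items():
        if addr in network:
            return code
    return None


def geo_decision(ip, country_codes):
    """ACCEPT when the IP geolocates to a listed country.

    Assumption: a non-matching rule expresses no opinion ("NONE"),
    leaving the outcome to the other rules in the scope.
    """
    return "ACCEPT" if lookup_country(ip) in country_codes else "NONE"


def sheets_for(ip, ip_addresses, target_sheet_names):
    """Return the sheets to overlay for a URL served from the given IP."""
    return list(target_sheet_names) if ip in ip_addresses else []
```

With this sketch, `geo_decision("192.0.2.10", {"GB"})` accepts the address, while an address outside the toy table yields no opinion. `sheets_for("81.21.76.62", {"81.21.76.62"}, ["extraPolite", "crawlLimited"])` returns both sheet names, mirroring the association in the second configuration example.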