Heritrix workshop

Crawl Operators’ Workshop
Roger G. Coram
Topics
• ExternalGeoLocationDecideRule
• Sheets
– IpAddressSetDecideRule
www.bl.uk
ExternalGeoLocationDecideRule
• Legal Deposit legislation passed in April 2013.
• The Legal Deposit Libraries (Non-Print Works) Regulations 2013:
– 18 (1) “…a work published on line shall be treated as published in the United Kingdom if: … (b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”
Geolocation
• ExternalGeoLocationDecideRule requires:
– A list of ISO 3166-1 two-letter country codes to be included in the crawl
• GB, FR, DE, etc.
– An implementation of ExternalGeoLookupInterface.
ExternalGeoLookupInterface
• Our implementation is based on MaxMind’s GeoLite2 database.
• Freely available under the ‘Creative Commons Attribution-ShareAlike 3.0 Unported License’.
• Only ~30MB; can be held in memory.
crawler-beans.cxml
Configuration example:
<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>
<!-- ... ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>
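For the rule to affect the crawl, it has to be referenced from the scope's rule list. The fragment below is a minimal sketch of how that might look, assuming the standard Heritrix 3 `scope` bean built on DecideRuleSequence; the neighbouring rules and their order are illustrative assumptions, not the configuration used for the crawl described here.

```xml
<!-- Sketch only: the surrounding rules and their order are assumptions. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- Begin by rejecting everything... -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule"/>
      <!-- ...accept URLs that match the seed-derived SURT prefixes... -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"/>
      <!-- ...and also accept URLs whose host geolocates to a listed country. -->
      <ref bean="externalGeoLookupRule"/>
      <!-- Any further REJECT rules (hop limits, path depth, etc.) would follow here. -->
    </list>
  </property>
</bean>
```

Because later rules in a sequence override earlier ones, placing the geolocation rule after the SURT rule only widens the scope; it does not remove anything the SURT prefixes already accepted.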
Results
• Short test crawl (1,000,000 seeds) produced:
– 89,500,755 URLs in total.
– 26,072 non-UK URLs which would not otherwise have been in scope.
• 137 distinct hosts.
IP-based Sheets
“Hi,
“I'm a senior system administrator for Webfusion / 123-reg.
“We're currently experiencing lots of requests from crawler1.bl.uk to sites hosted on 81.21.76.62, this is part of our Parking platform, which links into Yahoo to allow customers to park domains and earn money.”
• Large number of hosts on a single machine.
• Need a way to reduce the load on a specific IP address.
Sheets
• “Sheets provide the ability to replace default settings on a per domain basis.”
– Allow you to change any value on any named bean for a specific set of URLs.
• Actually quite flexible:
– SurtPrefixesSheetAssociation
• Applied by matching SURT prefixes.
– DecideRuledSheetAssociation
• Applied according to a series of DecideRules, e.g. IpAddressSetDecideRule.
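For comparison with the IP-based approach, the fragment below sketches the SURT-based association. It assumes a sheet named `extraPolite` is defined elsewhere in crawler-beans.cxml, and the SURT prefix shown is a hypothetical placeholder rather than a site from this crawl.

```xml
<!-- Sketch only: the SURT prefix below is a hypothetical example. -->
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <!-- Matches example.co.uk and all of its subdomains. -->
      <value>http://(uk,co,example,</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>extraPolite</value>
    </list>
  </property>
</bean>
```

A SURT association suits cases where the affected sites share a domain. When many unrelated domains share one machine, as in the parking-platform complaint, matching on the IP address is the more direct option.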
1. crawler-beans.cxml
Configuration example:
<bean id="extraPolite" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="disposition.delayFactor" value="8.0"/>
      <entry key="disposition.minDelayMs" value="10000"/>
      <entry key="disposition.maxDelayMs" value="60000"/>
      <entry key="disposition.respectCrawlDelayUpToSeconds" value="60"/>
    </map>
  </property>
</bean>
<bean id="crawlLimited" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="quotaEnforcer.serverMaxFetchResponses" value="25"/>
    </map>
  </property>
</bean>
2. crawler-beans.cxml
Configuration example:
<bean class="org.archive.crawler.spring.DecideRuledSheetAssociation">
  <property name="rules">
    <bean class="org.archive.modules.deciderules.IpAddressSetDecideRule">
      <property name="ipAddresses">
        <set>
          <value>81.21.76.62</value>
        </set>
      </property>
      <property name="decision" value="ACCEPT"/>
    </bean>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>extraPolite</value>
      <value>crawlLimited</value>
    </list>
  </property>
</bean>
Thank you
GitHub: https://github.com/ukwa/bl-heritrix-modules
MaxMind: http://dev.maxmind.com/geoip/geoip2/geolite2/