Crawl Operators Workshop

IIPC GA 2014 – Paris
Kristinn Sigurðsson
Scope
• A sequence of DecideRules
– All rules are processed
– Each rule will either be a match or not
• If not a match, the rule will PASS (have no effect)
• If it matches, it will either
– ACCEPT (means that the URI should be ruled in scope)
– REJECT (means that the URI should be ruled out of scope)
• The last rule that does not PASS “wins”
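Conceptually, the evaluation works like the following illustrative sketch (simplified, not the actual Heritrix source); rules stands for the configured rule list and curi for the candidate URI:

DecideResult result = DecideResult.NONE; // NONE represents PASS
for (DecideRule rule : rules) {
    DecideResult d = rule.decisionFor(curi);
    if (d != DecideResult.NONE) {
        result = d; // the last rule that does not PASS wins
    }
}
boolean inScope = (result == DecideResult.ACCEPT);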
Common example
• REJECT RejectDecideRule
– Default position: Not in scope
– Nothing gets through unless some rule explicitly decides it should
• ACCEPT SurtPrefixedDecideRule
– Rules items in scope based on their domain
– Often uses the seeds list as a source for allowed domains
• REJECT TooManyHopsDecideRule
– Throws out items that are too far from the seeds
• ACCEPT TransclusionDecideRule
– Gets embeds on domains that are otherwise not in scope
• REJECT MatchesListRegexDecideRule
– Filters out known bad actors
– Regular expressions used to match problem URIs
– This rule can also be configured to ACCEPT, but it is rarely used that way in scoping
• ACCEPT PrerequisiteAcceptDecideRule
– Regardless of anything else, we still want to get any prerequisites (dns, robots.txt)
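The assembled sequence can be sanity-checked from the H3 scripting console; a sketch, assuming the DecideRuleSequence has the default bean id "scope":

rules = appCtx.getBean("scope").getRules();
for (rule : rules) {
    rawOut.println(rule.getClass().getSimpleName());
}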
HopsPathMatchesRegexDecideRule
• For RSS crawling, it was really important to have a tightly controlled scope
– Just the embeds
– Still need to follow redirects though
– Works like MatchesRegexDecideRule
– Except the regular expression is applied to the “hoppath” from the seed
– .R?((E{0,2})|XE?)
• Allows one redirect, then up to two levels of embeds, or a speculative embed followed by an embed
– Can also be used to avoid excessive speculative embeds
• ^[^X]*X[^X]*X[^X]*X[^X]*X.*$
– Rejects hoppaths that contain four or more speculative embeds (see the sketch below)
• Can also use this with sheets/overrides to affect only select sites with known issues
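Because the rule applies an ordinary java.util.regex pattern to the hoppath string, the expressions can be tried out outside Heritrix; the hoppaths below are made up (one letter per hop, e.g. L=link, E=embed, X=speculative embed, R=redirect):

import java.util.regex.Pattern;

Pattern spec = Pattern.compile("^[^X]*X[^X]*X[^X]*X[^X]*X.*$");
System.out.println(spec.matcher("LLXEX").matches());    // false: two X hops, the rule PASSes
System.out.println(spec.matcher("LXLXEXLX").matches()); // true: four X hops, REJECTed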
More rules
• MatchesFilePatternDecideRule
– Helpful utility version of MatchesRegexDecideRule
– Has pre-compiled regular expressions that match common filetypes
• PathologicalPathDecideRule
– Rejects paths with more than X identical, consecutive path segments
• X defaults to 2
• ScriptedDecideRule
– Allows the operator to specify arbitrary conditions, expressed as a BeanShell script
• BeanShell scripts are also used in the H3 scripting console
• http://www.beanshell.org/
– Offers great flexibility
– For regularly used actions it may be better to create a custom DecideRule
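For illustration, a script for ScriptedDecideRule might look like the sketch below. This assumes the rule binds the candidate CrawlURI to the variable object and expects the script to return a DecideResult (NONE meaning PASS); the 500-character threshold is made up:

import org.archive.modules.deciderules.DecideResult;

// object is assumed to be the CrawlURI under consideration
String uri = object.toString();
int q = uri.indexOf('?');
if (q >= 0 && uri.length() - q - 1 > 500) {
    return DecideResult.REJECT; // reject URIs with very long query strings
}
return DecideResult.NONE; // PASS: leave the decision to the other rules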
Keep in mind
• Some DecideRules operate on content
– E.g. ContentLengthDecideRule
• These will not work for scoping
Getting more advanced
Adding a regex to a DecideRule
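For example, a new pattern can be appended to a running job’s MatchesListRegexDecideRule from the scripting console. A sketch, assuming the rule was given the bean id "rejectRegexRule" in the CXML; the crawler-trap pattern is made up:

rule = appCtx.getBean("rejectRegexRule");
rule.getRegexList().add(java.util.regex.Pattern.compile(".*/calendar/.*"));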
Working with sheets
• Add SURT to Sheet override
appCtx.getBean("sheetOverlaysManager")
    .addSurtAssociation("[SURT]", "[The sheet's bean name]")
• Add rule to sheet
appCtx.getBean("[Sheet ID]").map.put("[KEY]", "[VALUE]");
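Put together, a made-up example that lowers the hop limit for a single site, assuming a sheet bean named "lowHops" is defined in the CXML and the TooManyHopsDecideRule bean is named "tooManyHops":

appCtx.getBean("sheetOverlaysManager")
    .addSurtAssociation("http://(com,example,", "lowHops");
appCtx.getBean("lowHops").map.put("tooManyHops.maxHops", 3);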
Canonicalization
• Check canonicalization of a URL
rawOut.println(appCtx.getBean("canonicalizationPolicy").canonicalize("URL"));
• Add a RegexRule canonicalization rule
org.archive.modules.canonicalize.RegexRule rule =
    new org.archive.modules.canonicalize.RegexRule();
rule.setRegex(java.util.regex.Pattern.compile("regex"));
rule.setFormat("format"); // Optional! Defaults to "$1"
appCtx.getBean("preparer").canonicalizationPolicy.rules.add(rule);
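A concrete, made-up instance: stripping utm_ tracking parameters so that such URLs canonicalize to the clean form:

org.archive.modules.canonicalize.RegexRule rule =
    new org.archive.modules.canonicalize.RegexRule();
rule.setRegex(java.util.regex.Pattern.compile("^(.+)\\?utm_.*$"));
rule.setFormat("$1"); // keep everything before the utm_ query string
appCtx.getBean("preparer").canonicalizationPolicy.rules.add(rule);

// Verify the effect:
rawOut.println(appCtx.getBean("canonicalizationPolicy")
    .canonicalize("http://example.com/page?utm_source=feed"));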
Getting really advanced
• Creating custom modules for Heritrix isn’t that difficult, assuming a modest amount of Java knowledge
– Use a recent version of Eclipse
– Create a Maven project with a “provided” dependency on Heritrix
Custom modules cont.
• Then set up a run configuration that uses the org.archive.crawler.Heritrix class from the dependency
Custom modules cont.
• Then just write new modules in your project and wire them in via the CXML configuration as usual
• CXML configuration
– https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Jobs+and+Profiles
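To give a feel for it, below is a sketch of a custom rule built on Heritrix’s PredicatedDecideRule base class; the class name, package, and threshold are all made up. Once on the classpath, it is wired into the scope by adding a <bean> element for it to the DecideRuleSequence’s rules list in the CXML:

package org.example.heritrix;

import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.PredicatedDecideRule;

// Hypothetical rule: REJECTs URIs whose query string exceeds maxQueryLength
public class LongQueryStringDecideRule extends PredicatedDecideRule {

    private static final long serialVersionUID = 1L;

    // Settable from the CXML: <property name="maxQueryLength" value="256" />
    protected int maxQueryLength = 256;
    public int getMaxQueryLength() { return maxQueryLength; }
    public void setMaxQueryLength(int max) { this.maxQueryLength = max; }

    public LongQueryStringDecideRule() {
        setDecision(DecideResult.REJECT);
    }

    @Override
    protected boolean evaluate(CrawlURI curi) {
        String uri = curi.toString();
        int q = uri.indexOf('?');
        return q >= 0 && uri.length() - q - 1 > maxQueryLength;
    }
}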
Robots
• How can you ignore robots.txt, but still keep track of which URLs would have been excluded by it
– Without consulting the robots.txt file post-crawl?
• PreconditionEnforcer
– calculateRobotsOnly
– Each URL that would have been excluded gets an annotation in the crawl.log
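The flag can also be flipped from the scripting console; a sketch, assuming the PreconditionEnforcer has the default bean id "preconditions":

appCtx.getBean("preconditions").setCalculateRobotsOnly(true);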