Crawl operators workshop
IIPC GA 2014 – Paris
Kristinn Sigurðsson

Scope
• A sequence of DecideRules
  – All rules are processed
  – Each rule will either be a match or not
    • If not a match, the rule will PASS (have no effect)
    • If it matches, it will either
      – ACCEPT (means that the URI should be ruled in scope)
      – REJECT (means that the URI should be ruled out of scope)
  – The last rule that does not PASS “wins”

Common example
• REJECT RejectDecideRule
  – Default position: not in scope
  – Nothing gets through unless some rule explicitly decides it should
• ACCEPT SurtPrefixedDecideRule
  – Rules items in based on their domain
  – Often uses the seeds list as a source for allowed domains
• REJECT TooManyHopsDecideRule
  – Throws out items that are too far from the seeds
• ACCEPT TransclusionDecideRule
  – Gets embeds on domains that are otherwise not in scope
• REJECT MatchesListRegexDecideRule
  – Filters out known bad actors
  – Regular expressions used to match problem URIs
  – This rule can also be configured to ACCEPT, but it is rarely used that way in scoping
• ACCEPT PrerequisiteAcceptDecideRule
  – Regardless of anything else, we still want to get any prerequisites (dns, robots.txt)

HopsPathMatchesRegexDecideRule
• For RSS crawling, it was really important to have a tightly controlled scope
  – Just the embeds
  – Still need to follow redirects though
• Works like MatchesRegexDecideRule
  – Except the regular expression is applied to the “hop path” from the seed
  – .R?((E{0,2})|XE?)
    • Allows one redirect, then up to two levels of embeds, or a speculative embed followed by an embed
• Can also be used to avoid excessive speculative embeds
  – ^[^X]*X[^X]*X[^X]*X[^X]*X.*$
    • Allows a maximum of 4 speculative embeds on the hop path
• Can also be used with sheets/overrides to affect only select sites with known issues

More rules
• MatchesFilePatternDecideRule
  – Helpful utility version of MatchesRegexDecideRule
  – Has pre-compiled regular expressions that match common filetypes
• PathologicalPathDecideRule
  – Rejects paths with more than X identical path segments
    • X is 2 by default
• ScriptedDecideRule
  – Allows the operator to specify arbitrary conditions, expressed as a BeanShell script
    • BeanShell scripts are also used in the H3 scripting console
    • http://www.beanshell.org/
  – Offers great flexibility
  – For regularly used actions it may be better to create a custom DecideRule

Keep in mind
• Some DecideRules operate on content
  – E.g. ContentLengthDecideRule
• These will not work for scoping

Getting more advanced

Adding a regex to a decide rule

Working with sheets
• Add a SURT to a sheet override
    appCtx.getBean("sheetOverlaysManager").addSurtAssociation("[SURT]", "[The sheet's bean name]");
• Add a rule to a sheet
    appCtx.getBean("[Sheet ID]").map.put("[KEY]", "[VALUE]");

Canonicalization
• Check the canonicalization of a URL
    rawOut.println(appCtx.getBean("canonicalizationPolicy").canonicalize("URL"));
• Add a RegexRule canonicalization rule
    org.archive.modules.canonicalize.RegexRule rule =
        new org.archive.modules.canonicalize.RegexRule();
    rule.setRegex(java.util.regex.Pattern.compile("regex"));
    rule.setFormat("format"); // Optional! Defaults to "$1"
    appCtx.getBean("preparer").canonicalizationPolicy.rules.add(rule);

Getting really advanced
• Creating custom modules for Heritrix isn’t that difficult, assuming a modest amount of Java knowledge
  – Use a recent version of Eclipse
  – Create a Maven project with a “provided” dependency on Heritrix
  – See the sketch below for what a simple custom DecideRule can look like
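As an illustration of how small such a module can be, here is a minimal sketch of a custom DecideRule, assuming the Heritrix 3 PredicatedDecideRule API (when evaluate() returns true the configured decision is applied, otherwise the rule PASSes). The class name, package and the maxLength property are invented for this example and are not part of Heritrix.

    package org.example.modules; // hypothetical package for the sketch

    import org.archive.modules.CrawlURI;
    import org.archive.modules.deciderules.DecideResult;
    import org.archive.modules.deciderules.PredicatedDecideRule;

    // Illustrative rule: REJECTs any URI whose string form is longer than maxLength.
    public class TooLongUriDecideRule extends PredicatedDecideRule {

        public TooLongUriDecideRule() {
            // PredicatedDecideRule returns this decision when evaluate() is true,
            // and has no effect (PASS) otherwise.
            setDecision(DecideResult.REJECT);
        }

        // Configurable from the CXML like any other bean property.
        protected int maxLength = 2000; // arbitrary default, for illustration only
        public int getMaxLength() {
            return maxLength;
        }
        public void setMaxLength(int maxLength) {
            this.maxLength = maxLength;
        }

        @Override
        protected boolean evaluate(CrawlURI uri) {
            return uri.toString().length() > maxLength;
        }
    }

Once wired into the scope’s list of rules via the CXML configuration, such a rule behaves like any of the bundled DecideRules described above.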
Custom modules cont.
• Then set up a run configuration that uses the org.archive.crawler.Heritrix class from the dependency

Custom modules cont.
• Then just write the new modules in your project and wire them in via the CXML configuration as normal
• CXML configuration
  – https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Jobs+and+Profiles

Robots
• How can you ignore robots.txt, but still keep track of which URLs were excluded by it?
  – Without consulting the robots.txt file post-crawl
• PreconditionEnforcer
  – calculateRobotsOnly
  – Each URL that would have been excluded gets an annotation in the crawl.log
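To turn this on from the scripting console rather than by editing the job’s CXML, something along these lines should work. This is only a sketch: it assumes the PreconditionEnforcer bean is named "preconditions", as in the default profile, and that calculateRobotsOnly is exposed as an ordinary bean property with a standard setter.

    // Fetch the PreconditionEnforcer and switch it to "calculate only" mode:
    // robots.txt is still fetched and evaluated, but not enforced, and URIs that
    // would have been excluded are merely annotated in the crawl.log.
    enforcer = appCtx.getBean("preconditions");
    enforcer.setCalculateRobotsOnly(true);
    rawOut.println("calculateRobotsOnly enabled on " + enforcer);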