Creating a Web Crawler in 3 Steps

Issac Goldstand
isaac@cpan.org
Mirimar Networks
http://www.mirimar.net/


The 3 steps
• Creating the user agent
• Creating the content parser
• Tying it together


Step 1 – Creating the User Agent
• libwww-perl (LWP)
• OO interface for creating user agents that interact with remote websites and web applications
• We will look at LWP::RobotUA


Creating the LWP object
• User agent
• Cookie jar
• Timeout


Robot UA extras
• Robot rules
• Delay
• use_sleep


Implementation of Step 1

  use LWP::RobotUA;

  # First, create the user agent - MyBot/1.0
  my $ua = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org');
  $ua->delay(15/60);    # 15-second delay (delay() is in minutes)
  $ua->use_sleep(1);    # Sleep if delayed


Step 2 – Creating the content parser
• HTML::Parser
• Event-driven parser mechanism
• OO and function-oriented interfaces
• Hooks to functions at certain points


Subclassing HTML::Parser
• Biggest issue is non-persistence
• CGI authors may be used to this, but it still makes for many caveats
• You must implement your own state-preservation mechanism


Implementation of Step 2

  package My::LinkParser;        # Parser class
  use base qw(HTML::Parser);

  use constant START    => 0;    # Define simple constants
  use constant GOT_NAME => 1;

  sub state {                    # Simple access methods
    return $_[0]->{STATE};
  }

  sub author {
    return $_[0]->{AUTHOR};
  }


Implementation of Step 2 (cont)

  sub reset {                    # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
  }

  sub start {                    # Parser hook
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
      $self->{STATE}  = GOT_NAME;
      $self->{AUTHOR} = $attr->{content};
    }
  }


Shortcut – HTML::SimpleLinkExtor
• Simple package to extract links from HTML
• Handles many link types – we only want HREF-type links


Step 3 – Tying it together
• Simple application
• Instantiate objects
• Enter request loop
• Spit data out somewhere
• Add parsed links to queue


Implementation of Step 3

  for (my $i = 0; $i < 10; $i++) {           # Parse loop
    my $response = $ua->get(pop @urls);      # Get HTTP response
    if ($response->is_success) {             # If response is OK
      $p->reset;
      $p->parse($response->content);         # Parse for author
      $p->eof;
      if ($p->state == 1) {                  # If state is GOT_NAME
        $authors{$p->author}++;              # then add author count
      } else {
        $authors{'Not Specified'}++;         # otherwise add default count
      }
      $linkex->parse($response->content);    # Parse for links
      unshift @urls, $linkex->a;             # and add links to queue
    }
  }


End result

  #!/usr/bin/perl
  use strict;

  use LWP::RobotUA;
  use HTML::Parser;
  use HTML::SimpleLinkExtor;

  my @urls;       # List of URLs to visit
  my %authors;

  # First, create & set up the user agent
  my $ua = LWP::RobotUA->new('AuthorBot/1.0', 'isaac@cpan.org');
  $ua->delay(15/60);    # 15-second delay
  $ua->use_sleep(1);    # Sleep if delayed

  my $p      = My::LinkParser->new;          # Create parsers
  my $linkex = HTML::SimpleLinkExtor->new;

  $urls[0] = "http://www.beamartyr.net/";    # Initialize list of URLs


End result

  for (my $i = 0; $i < 10; $i++) {           # Parse loop
    my $response = $ua->get(pop @urls);      # Get HTTP response
    if ($response->is_success) {             # If response is OK
      $p->reset;
      $p->parse($response->content);         # Parse for author
      $p->eof;
      if ($p->state == 1) {                  # If state is GOT_NAME
        $authors{$p->author}++;              # then add author count
      } else {
        $authors{'Not Specified'}++;         # otherwise add default count
      }
      $linkex->parse($response->content);    # Parse for links
      unshift @urls, $linkex->a;             # and add links to queue
    }
  }

  print "Results:\n";                        # Print results
  map {print "$_\t$authors{$_}\n"} keys %authors;
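
The My::LinkParser class used above is listed on the next two slides. As a quick standalone check, a minimal sketch along these lines (assuming that package has already been compiled into the program, and using a made-up HTML snippet and author name) exercises the same parse/eof/state flow on a string instead of a live page:

  # Minimal sketch - exercises My::LinkParser on a static string.
  # Assumes the My::LinkParser package shown below is already loaded.
  use strict;

  my $p = My::LinkParser->new;
  $p->reset;    # initialize STATE/AUTHOR
  $p->parse('<html><head><meta name="Author" content="Jane Doe"></head></html>');
  $p->eof;

  if ($p->state == My::LinkParser::GOT_NAME()) {
    print "Author: ", $p->author, "\n";    # prints "Author: Jane Doe"
  } else {
    print "No author meta tag found\n";
  }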

End result

  package My::LinkParser;        # Parser class
  use base qw(HTML::Parser);

  use constant START    => 0;    # Define simple constants
  use constant GOT_NAME => 1;

  sub state {                    # Simple access methods
    return $_[0]->{STATE};
  }

  sub author {
    return $_[0]->{AUTHOR};
  }

  sub reset {                    # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
  }


End result

  sub start {                    # Parser hook
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
      $self->{STATE}  = GOT_NAME;
      $self->{AUTHOR} = $attr->{content};
    }
  }


What's missing?
• Full URLs for relative links (see the URI sketch after the final slide)
• Non-HTTP links
• Queues & caches
• Persistent storage
• Link (and data) validation


In review
• Create a robot user agent to crawl websites nicely
• Create parsers to extract data from sites, and links to the next sites
• Create a simple program to parse a queue of URLs


Thank you!

For more information:
Issac Goldstand
isaac@cpan.org
http://www.beamartyr.net/
http://www.mirimar.net/
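
A possible starting point for the first two items on the "What's missing?" slide (full URLs for relative links, and filtering out non-HTTP links) is the URI module from CPAN. This is only a sketch: absolutize() is a hypothetical helper, and in the crawler above its arguments would come from $response->base and $linkex->a.

  use strict;
  use URI;

  # Hypothetical helper: resolve possibly-relative links against the page's
  # base URL and keep only http/https links.
  sub absolutize {
    my ($base, @links) = @_;
    my @absolute;
    for my $link (@links) {
      my $uri = URI->new_abs($link, $base);    # resolve relative to $base
      next unless defined $uri->scheme
               && $uri->scheme =~ /^https?$/;  # skip mailto:, ftp:, javascript:, ...
      push @absolute, $uri->as_string;
    }
    return @absolute;
  }

  # In the crawl loop, the queue line might then become something like:
  #   unshift @urls, absolutize($response->base, $linkex->a);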