Creating a Web Crawler in 3 Easy Steps

Creating a Web Crawler in 3 Steps
Issac Goldstand
isaac@cpan.org
Mirimar Networks
http://www.mirimar.net/
The 3 steps
• Creating the User Agent
• Creating the content parser
• Tying it together
Step 1 – Creating the User Agent
• libwww-perl (LWP)
• OO interface for creating user agents that interact with remote websites and web applications
• We will look at LWP::RobotUA
Creating the LWP Object
• User agent
• Cookie jar
• Timeout (see the sketch below)
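A minimal sketch of how these three options can be set when constructing a plain LWP::UserAgent; the agent string, cookie file name, and timeout value are only examples:

use LWP::UserAgent;
use HTTP::Cookies;

# Illustrative values - pick your own agent string, cookie file and timeout
my $ua = LWP::UserAgent->new(
    agent      => 'MyBot/1.0',                  # User agent
    cookie_jar => HTTP::Cookies->new(
                      file     => 'cookies.txt',
                      autosave => 1),           # Cookie jar
    timeout    => 30,                           # Timeout (seconds)
);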
Robot UA extras
• Robot rules (see the sketch below)
• Delay
• use_sleep
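The robots.txt rules object can also be supplied explicitly; a sketch, assuming you want the rules cached in a DBM file between runs (the file name is arbitrary):

use LWP::RobotUA;
use WWW::RobotRules::AnyDBM_File;

# Hypothetical persistent rules cache; 'robotrules.db' is an arbitrary file name
my $rules = WWW::RobotRules::AnyDBM_File->new('MyBot/1.0', 'robotrules.db');
my $ua    = LWP::RobotUA->new(
    agent => 'MyBot/1.0',
    from  => 'isaac@cpan.org',
    rules => $rules,        # Robot rules
);
$ua->delay(15/60);          # Delay between requests to the same host (in minutes)
$ua->use_sleep(1);          # Sleep automatically when the delay applies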
Implementation of Step 1
use LWP::RobotUA;

# First, create the user agent - MyBot/1.0
my $ua = LWP::RobotUA->new('MyBot/1.0',
                           'isaac@cpan.org');
$ua->delay(15/60);   # 15 seconds delay
$ua->use_sleep(1);   # Sleep if delayed
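A short usage sketch for the agent above; the URL is just a placeholder:

# Hypothetical request - replace the URL with a real one
my $response = $ua->get('http://www.example.com/');
if ($response->is_success) {
    print $response->decoded_content;
}
else {
    warn "Request failed: " . $response->status_line . "\n";
}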
Step 2 – Creating the content parser
• HTML::Parser
• Event-driven parser mechanism
• OO and function-oriented interfaces
• Hooks to functions at certain points (see the sketch below)
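Before subclassing, here is a minimal sketch of the plain event-driven interface, registering a start-tag hook directly; the handler name and the sample HTML are made up:

use HTML::Parser;

# Hypothetical handler: print the href of every <a> start tag seen
sub start_tag {
    my ($tagname, $attr) = @_;
    print "$attr->{href}\n" if $tagname eq 'a' && $attr->{href};
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start_tag, 'tagname, attr' ],   # Hook for start tags
);
$parser->parse('<p><a href="http://www.example.com/">example</a></p>');
$parser->eof;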
Subclassing HTML::Parser
• Biggest issue is non-persistence
• CGI authors may be used to this, but it still creates many caveats
• You must implement your own state-preservation mechanism
Implementation of Step 2
package My::LinkParser;       # Parser class
use base qw(HTML::Parser);

# Define simple constants
use constant START    => 0;
use constant GOT_NAME => 1;

# Simple access methods
sub state {
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}
Implementation of Step 2 (cont)
sub reset {
    # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}

sub start {
    # Parser hook, called for each start tag
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
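A quick usage sketch of the class above; the HTML snippet and author name are made up:

# Hypothetical usage of My::LinkParser
my $parser = My::LinkParser->new;
$parser->reset;
$parser->parse('<head><meta name="Author" content="Jane Doe"></head>');
$parser->eof;
if ($parser->state == My::LinkParser::GOT_NAME()) {
    print "Author: ", $parser->author, "\n";
}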
Shortcut – HTML::SimpleLinkExtor
• Simple package to extract links from HTML
• Handles many kinds of links – we only want HREF-type links (see the sketch below)
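A small sketch of the shortcut in use; the HTML snippet is made up:

use HTML::SimpleLinkExtor;

# Hypothetical snippet: collect only the HREF-type links
my $extor = HTML::SimpleLinkExtor->new;
$extor->parse('<a href="/page1">one</a> <img src="/logo.png">');
my @href_links = $extor->a;       # links from <a href="..."> tags only
my @all_links  = $extor->links;   # every link the module recognizes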
Step 3 – Tying it together
• Simple application
• Instantiate objects
• Enter request loop
• Spit the data out somewhere
• Add parsed links to the queue
Implementation of Step 3
for (my $i = 0; $i < 10; $i++) {                 # Parse loop
    my $response = $ua->get(pop @urls);          # Get HTTP response
    if ($response->is_success) {                 # If response is OK
        $p->reset;
        $p->parse($response->content);           # Parse for author
        $p->eof;
        if ($p->state == 1) {                    # If state is GOT_NAME
            $authors{$p->author}++;              # then add author count
        } else {
            $authors{'Not Specified'}++;         # otherwise add default count
        }
        $linkex->parse($response->content);      # parse for links
        unshift @urls, $linkex->a;               # and add links to queue
    }
}
End result
#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;

my @urls;      # List of URLs to visit
my %authors;

# First, create & set up the user agent
my $ua = LWP::RobotUA->new('AuthorBot/1.0', 'isaac@cpan.org');
$ua->delay(15/60);   # 15 seconds delay
$ua->use_sleep(1);   # Sleep if delayed

# Create parsers
my $p      = My::LinkParser->new;
my $linkex = HTML::SimpleLinkExtor->new;

$urls[0] = "http://www.beamartyr.net/";   # Initialize list of URLs
End result
for (my $i = 0; $i < 10; $i++) {                 # Parse loop
    my $response = $ua->get(pop @urls);          # Get HTTP response
    if ($response->is_success) {                 # If response is OK
        $p->reset;
        $p->parse($response->content);           # Parse for author
        $p->eof;
        if ($p->state == 1) {                    # If state is GOT_NAME
            $authors{$p->author}++;              # then add author count
        } else {
            $authors{'Not Specified'}++;         # otherwise add default count
        }
        $linkex->parse($response->content);      # parse for links
        unshift @urls, $linkex->a;               # and add links to queue
    }
}

print "Results:\n";                              # Print results
map {print "$_\t$authors{$_}\n"} keys %authors;
End result
package My::LinkParser;       # Parser class
use base qw(HTML::Parser);

# Define simple constants
use constant START    => 0;
use constant GOT_NAME => 1;

# Simple access methods
sub state {
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}

sub reset {
    # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}
End result
sub start {
    # Parser hook, called for each start tag
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
What’s missing?
• Full URLs for relative links (see the sketch below)
• Non-HTTP links
• Queues & caches
• Persistent storage
• Link (and data) validation
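A sketch of how the first two gaps might be closed with the URI module; the base URL and sample links are only examples:

use URI;

# Hypothetical clean-up: absolutize relative links and drop non-HTTP ones
my $base = 'http://www.beamartyr.net/';
for my $link ('/about.html', 'mailto:isaac@cpan.org', 'http://www.mirimar.net/') {
    my $uri = URI->new_abs($link, $base);     # Full URL for a relative link
    next unless $uri->scheme =~ /^https?$/;   # Skip non-HTTP links
    print "$uri\n";
}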
In review
• Create a robot user agent to crawl websites nicely
• Create parsers to extract data from sites, and links to the next sites
• Create a simple program to work through a queue of URLs
Thank you!
For more information:
Issac Goldstand
isaac@cpan.org
http://www.beamartyr.net/
http://www.mirimar.net/