I'm a jerk. I failed to give credit to Kieren Diment who helped me with some of the finer points of this script. Thanks Kieren!
Web spiders are not a new idea.
A well-written spider can gather a large amount of data quickly, with no labor beyond writing the spider itself.
I've put together a spider for a recent project I've been working on, using Web::Scraper.
So here we go!
Let's walk through the listing.
We're using Smart::Comments so we don't have to scatter print statements through the code.
#!/usr/bin/env perl
use Smart::Comments;
use warnings;
use strict;
use Web::Scraper;
use URI;
use YAML qw/ Dump /;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use URI::Query;
use URI::Escape;
use IO::File;
For starters, we grab the URL we want to hit and scrape using WWW::Mechanize.
### On your marks
my $yaml;
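# The URL must contain the placeholder __HERE__, which the loop replaces with the current letter.
my $start_url = WWW::Mechanize::Link->new( { url => 'your_url_to_scrape' } );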
### Get Set
my $mech = WWW::Mechanize->new;
Here, we loop through our page, and the subsequent pages if there is pagination.
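### GO!
my @letters = ('A' .. 'Z');    # assumption: the original listing does not show how @letters is built
foreach my $l (@letters) {
    my $base_url = $start_url->url;
    $base_url =~ s/__HERE__/currentLetter=$l/;
    my $page = 1;
    ### Letter: $l
    while ($base_url) {
        ### Page: $page
        $mech->get($base_url);
        my $next = $mech->find_link( text_regex => qr/^Next$/i );
        # Follow the "Next" link if there is one; otherwise stop after this page
        $base_url = $next ? $next : undef;
        $page++;
        my @gold = scrape_some( 'gold', $mech );
        # assumption: the other two listing types use these class names
        my @free        = scrape_some( 'free',        $mech );
        my @nearly_free = scrape_some( 'nearly_free', $mech );
        # bailout condition: nothing on this page, so nothing on subsequent pages for this letter
        undef $base_url if ( !@gold && !@free && !@nearly_free );
        my @information = ( @gold, @free, @nearly_free );
        open my $OUT, ">>", "full.yml" or die "Cannot open full.yml: $!";
        print $OUT Dump(@information);
        close $OUT;
    }
}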
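The control flow of that loop can be tried without touching the network. The following is only a sketch: the %site hash, its URLs, and the crawl function are hypothetical stand-ins for the real site and for the $mech->get and find_link calls, but the two exit conditions (no "Next" link, or an empty page) are the same ones used above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the site: each page lists its items and the next page's URL
my %site = (
    '/list?page=1' => { items => [qw(a b)], next => '/list?page=2' },
    '/list?page=2' => { items => [qw(c)],   next => '/list?page=3' },
    '/list?page=3' => { items => [],        next => '/list?page=4' },
    '/list?page=4' => { items => [qw(z)],   next => undef },
);

sub crawl {
    my ($url) = @_;
    my @found;
    my $pages = 0;
    while ( defined $url ) {
        my $current = $site{$url} or last;    # stand-in for $mech->get($url)
        $pages++;
        my @items = @{ $current->{items} };
        last unless @items;                   # bailout: an empty page ends the run
        push @found, @items;
        $url = $current->{next};              # undef when there is no "Next" link
    }
    return ( $pages, @found );
}

my ( $pages, @found ) = crawl('/list?page=1');
print "visited $pages pages, found: @found\n";
```

Page 4 is never fetched here, because the empty page 3 triggers the bailout first.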
This is the actual Web::Scraper code. Wonderfully, all we have to do is give it some CSS selectors, and Web::Scraper grabs the data we ask for from each matching element. TEXT is the text content of the matched element. @href is the value of its href attribute, i.e. the link URL.
### All done!
sub scrape_some {
    my ( $list_type, $mech ) = @_;
    my @contractors;    # return value
    my $want = scraper {
        process "li.$list_type", "contractors[]" => scraper {
            process ".listing_link", name    => 'TEXT';
            process ".address",      address => 'TEXT';    # need to split this up into address, state, postcode
            process ".phone",        phone   => 'TEXT';
            process ".links",        website => '@href';
        };
    };
    my $names = $want->scrape( $mech->content, $mech->uri );
    # The results live under the key named in the scraper: "contractors"
    my @ppl = @{ $names->{contractors} || [] };
    foreach my $p (@ppl) {
        if ( exists $p->{website} ) {
            # The real site URL is carried in the link's webSite query parameter
            my $true_url        = URI->new( $p->{website} );
            my $query           = URI::Query->new( $true_url->query );
            my $site_from_query = uri_unescape( $query->hash_arrayref->{webSite}->[0] );
            $p->{website} = $site_from_query;
        }
        $p->{type} = $list_type;
        push @contractors, $p;    # push each record once, not the whole list on every pass
    }
    return @contractors;
}
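To see what comes back from a scraper like this, here is a small sketch run against an inline HTML string instead of a live page. The markup, the names in it, and the example.com base URL are all made up for illustration; only the process/scrape calls mirror the listing above.

```perl
use strict;
use warnings;
use Web::Scraper;
use URI;

# Hypothetical markup standing in for one page of listings
my $html = <<'HTML';
<ul>
  <li class="gold">
    <a class="listing_link" href="/listing/1">Acme Roofing</a>
    <span class="phone">555-0100</span>
  </li>
  <li class="gold">
    <a class="listing_link" href="/listing/2">Bolt Electrical</a>
    <span class="phone">555-0101</span>
  </li>
</ul>
HTML

my $listing = scraper {
    # "contractors[]" collects one hash per matching <li>
    process "li.gold", "contractors[]" => scraper {
        process ".listing_link", name  => 'TEXT';     # text inside the element
        process ".listing_link", link  => '@href';    # value of the href attribute
        process ".phone",        phone => 'TEXT';
    };
};

# The second argument is the base URL used to resolve relative links
my $result = $listing->scrape( \$html, URI->new('http://example.com/') );

for my $c ( @{ $result->{contractors} || [] } ) {
    print "$c->{name} | $c->{phone} | $c->{link}\n";
}
```

The result is a hash reference keyed by the names given to process, which is why the records are read from $result->{contractors}.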
This could definitely be refined. However, having written very few web crawlers before, I found that Web::Scraper saved the day.




Glad you like Web::Scraper!
One refinement I could suggest is the loop to update 'website'. If the attr is '@href' you're guaranteed to get a URI object instead of a string, so you don't need to call URI->new() yourself. You can also pass in an array reference with the attribute followed by callback functions that filter the value you get, so combining them all:
process ".links", website => [ '@href', sub {
    # $_ is the URI object
    my $query = URI::Query->new($_->query);
    return uri_unescape(...);
} ];
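For anyone who wants to try the filter form in isolation, a self-contained sketch follows. The markup and URLs are hypothetical, and it reads the parameter with URI's own query_form method rather than URI::Query, purely to keep the dependencies down; query_form already unescapes the value, so no separate uri_unescape call is needed in this variant.

```perl
use strict;
use warnings;
use Web::Scraper;
use URI;

# Hypothetical markup: the link is a redirect URL with the real site in "webSite"
my $html = '<ul><li class="gold"><a class="links" '
    . 'href="/redirect?id=7&amp;webSite=http%3A%2F%2Facme.example%2F">Website</a>'
    . '</li></ul>';

my $listing = scraper {
    process "li.gold", "contractors[]" => scraper {
        process ".links", website => [ '@href', sub {
            # $_ is the URI object built from @href and the base URL
            my %param = $_->query_form;
            return $param{webSite};    # already unescaped by query_form
        } ];
    };
};

# A base URL is passed so that @href is resolved into a URI object
my $result  = $listing->scrape( \$html, URI->new('http://example.com/') );
my $website = $result->{contractors}[0]{website};
print "$website\n";
```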
Thanks for the code.
Do take a look in the google biterscripting group at the following thread http://groups.google.com/group/biterscripting/browse_thread/thread/c2d3e7d953b7dc10 .
That also discusses how to scrape data from web pages. Look at the scripts page.txt and pageloop.txt toward the bottom of the thread.
Richard
Thanks Miyagawa!
This code does no justice to Web::Scraper. I'm not an experienced web spiderer, so thank you very much for the optimizations.
With any luck I'll be able to prod at a small app to enter an arbitrary URL and grab a given set of data from said site.
+1 to you for writing such a great module.
Richard:
Thanks for the info, I'll certainly check this out and hopefully talk to you more about this sort of thing.
-Devin
I'm glad I saw this, big thanks!