Spiderin' - Scraping Information with Web::Scraper

I'm a jerk. I failed to give credit to Kieren Diment who helped me with some of the finer points of this script. Thanks Kieren!

Web spiders are not a new idea.

A well-written spider can gather a great deal of data quickly, with no manual labor beyond writing the spider itself.

I've put together a spider for a recent project I've been working on, using Web::Scraper.

So here we go!

Let's walk through the listing.

We're using Smart::Comments so we don't have to scatter print statements around: any comment that starts with ### is printed as the script runs.

#!/usr/bin/env perl
use Smart::Comments;
use warnings;
use strict;
use Web::Scraper;
use URI;
use YAML qw/ Dump /;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use URI::Query;
use URI::Escape;
use IO::File;

For starters, we set up the URL we want to scrape and create the WWW::Mechanize object that will fetch it.

### On your marks
my $yaml;
# The URL should contain the __HERE__ placeholder that the loop substitutes.
my $start_url = WWW::Mechanize::Link->new( { url => 'your_url_to_scrape' } );

### Get Set
my $mech = WWW::Mechanize->new;

Here we loop over each page, following the "Next" link through any subsequent pages when the results are paginated.

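### GO!
my @letters = ( 'a' .. 'z' );    # assumption: the listing is indexed by letter
foreach my $l (@letters) {
    my $base_url = $start_url->url;
    $base_url =~ s/__HERE__/currentLetter=$l/;
    my $page = 1;
    ### Letter: $l

    while ($base_url) {
        ### Page: $page
        $mech->get($base_url);
        my $next = $mech->find_link( text_regex => qr/^Next$/i );

        # Follow the "Next" link if there is one; otherwise stop after this page.
        $base_url = $next ? $next->url_abs : undef;

        $page++;

        my @gold        = scrape_some( 'gold',        $mech );
        my @free        = scrape_some( 'free',        $mech );    # assumed class name
        my @nearly_free = scrape_some( 'nearly_free', $mech );    # assumed class name

        # Bailout: nothing on this page, so nothing on subsequent pages for this letter.
        undef $base_url if ( !@gold && !@free && !@nearly_free );
        my @information = ( @gold, @free, @nearly_free );
        open my $OUT, ">>", "full.yml" or die "Cannot open full.yml: $!";
        print $OUT Dump(@information);
        close $OUT;
    }
}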

This is the actual Web::Scraper code. Wonderfully, all we have to do is specify some CSS selectors and Web::Scraper will grab the data we ask for. TEXT is the text content of the matched element, and @href is the value of its href attribute, i.e. the actual link URL.

 
### All done!

sub scrape_some {
    my ( $list_type, $mech ) = @_;
    my @contractors;    # return value
    my $want = scraper {
        process "li.$list_type", "contractors[]" => scraper {
            process ".listing_link", name    => 'TEXT';
            process ".address",      address => 'TEXT';    # need to split this up into address, state, postcode
            process ".phone",        phone   => 'TEXT';
            process ".links",        website => '@href';
        };
    };
    my $names = $want->scrape( $mech->content, $mech->uri );

    # The results live under the "contractors" key named in the scraper above.
    my @ppl = @{ $names->{contractors} || [] };

    foreach my $p (@ppl) {
        if ( exists $p->{website} ) {
            # The real site address is carried in the link's webSite query parameter.
            my $true_url        = URI->new( $p->{website} );
            my $query           = URI::Query->new( $true_url->query );
            my $site_from_query = uri_unescape( $query->hash_arrayref->{webSite}->[0] );
            $p->{website} = $site_from_query;
        }
        $p->{type} = $list_type;
        push @contractors, $p;    # push this one record, not the whole list again
    }
    return @contractors;
}
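
To try the selectors without hitting a live site, here is a minimal sketch that runs the same kind of scraper against an inline HTML snippet. The markup, the directory.example and acme.example addresses, and the shape of the webSite link are all made up for illustration; only the selector names come from the listing above. As a variation, it uses URI's query_form rather than URI::Query to pull the webSite parameter out of the link.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;
use URI;

# Hypothetical markup, invented for this sketch; the real site's HTML is not shown in the post.
my $html = <<'HTML';
<ul>
  <li class="gold">
    <a class="listing_link" href="/listing/1">Acme Plumbing</a>
    <span class="phone">555-0100</span>
    <a class="links" href="http://directory.example/redirect?webSite=http%3A%2F%2Facme.example%2F">Website</a>
  </li>
  <li class="gold">
    <a class="listing_link" href="/listing/2">Bolt Electric</a>
    <span class="phone">555-0101</span>
  </li>
</ul>
HTML

# One hash per matching <li>, collected under the "contractors" key.
my $want = scraper {
    process "li.gold", "contractors[]" => scraper {
        process ".listing_link", name    => 'TEXT';     # text inside the element
        process ".phone",        phone   => 'TEXT';
        process ".links",        website => '@href';    # value of the href attribute
    };
};

# Scrape the string by reference; the second argument is the base URI for links.
my $result = $want->scrape( \$html, URI->new('http://directory.example/') );

for my $c ( @{ $result->{contractors} || [] } ) {
    my $site = 'none';
    if ( defined $c->{website} ) {
        # query_form returns the query parameters as decoded key/value pairs.
        my %param = URI->new( $c->{website} )->query_form;
        $site = $param{webSite} if defined $param{webSite};
    }
    print "$c->{name} | $c->{phone} | $site\n";
}
```

Because query_form hands back values that are already decoded, this variant needs no separate uri_unescape call. A listing without a .links element simply has no website key, which is why the loop checks for it first.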

This could definitely be refined, but since I haven't written many web crawlers in the past, Web::Scraper saved the day for me.

5 Comments

Glad you like Web::Scraper!

One refinement I could suggest is the loop to update 'website'. If the attr is '@href' you're guaranteed to get a URI object instead of a string, so you don't need to call URI->new() yourself. You can also pass in an array reference with callback functions to filter the value you get, so combining them all:

process ".links", website => [ '@href', sub {
    # $_ is the URI object
    my $query = URI::Query->new($_->query);
    return uri_unescape(...);
} ];

Thanks for the code.

Do take a look in the google biterscripting group at the following thread http://groups.google.com/group/biterscripting/browse_thread/thread/c2d3e7d953b7dc10 .


That also discusses how to scrape data from web pages. Look at the scripts page.txt and pageloop.txt toward the bottom of the thread.


Richard

Thanks Miyagawa!

This code does no justice to W::Scraper. I'm not an experienced web spiderer, so thank you very much for the optimizations.

With any luck I'll be able to prod at a small app to enter an arbitrary URL and grab a given set of data from said site.

+1 to you for writing such a great module.

Richard:

Thanks for the info. I'll certainly check this out and hopefully talk to you more about this sort of thing.

-Devin

I'm glad I saw this, big thanks!
