Getting Scrappy with Modern Perl

Modern Perl is all the rage these days, in case you hadn't noticed. But why, you might ask? To show why, I will demonstrate a scraper program that stores its results in an object database, using such juicy bits as Scrappy (a web scraper that in turn uses the Web::Scraper toolkit), KiokuDB (an object store), and of course Moose.

The Goal

Search craigslist for homes matching certain criteria. For example, we'll search for cabins in Montana for sale by owner that fall within a certain price range.

The CraigHome Class

The search results give rise to natural objects where one object is a single result (ad) of the search. The search we'll study is about finding a home (cabin). Let's call the class CraigHome. It is defined like a typical (simple) Moose class:

package CraigHome;
use Moose;
use namespace::autoclean;

has id => (
    is  => 'ro',
    isa => 'Int',
);

has text => (
    is  => 'ro',
    isa => 'Str',
);

has 'link' => (
    is  => 'ro',
    isa => 'URI',
);

has amount => (
    is  => 'ro',
    isa => 'Num',
);

has search_URL => (
    is  => 'ro',
    isa => 'Str',
);

has search_keywords => (
    is  => 'ro',
    isa => 'Str',
);

has search_name => (
    is  => 'ro',
    isa => 'Str',
);

__PACKAGE__->meta->make_immutable;
1;

Search and Store Results

We'll use a script to scrape the search results and then store them for later analysis.

Load needed modules

use strict;
use warnings;

use Scrappy qw/:syntax/;
use KiokuDB;
use KiokuDB::Backend::DBI;
use CraigHome;

Notice that we use Scrappy, KiokuDB, the DBI backend for KiokuDB and our custom class CraigHome.

Define the Search Parameters

The search parameters include:

  • a search name
  • one or more URLs to search
  • keywords to search for
  • maximum and minimum asking prices in dollars

my %search_definitions = (
    1 => {
        search_name => 'montana_cabin_by_owner',
        search_URIs => [
            'http://montana.craigslist.org/reo/',
            'http://missoula.craigslist.org/reo/', 
            'http://bozeman.craigslist.org/reo/'
        ],
        keywords  => [qw/ cabin /],
        max_price => 60000,
        min_price => 1,
    },
);

Connect to Kioku Data Store and Designate Columns for Searching

In this example, we use the DBI backend and connect to a SQLite database that holds the stored objects. In addition, we designate two attributes of our CraigHome object, search_name and amount, as columns that we want to search on. Doing so indexes the attributes (so to speak) and makes searches more efficient. Without the columns, one could always fetch all the objects and grep their attributes.

my $db = KiokuDB->connect(
    "dbi:SQLite:dbname=db/craighomes.db",
    create  => 1,
    columns => [
        search_name => {
            data_type   => "varchar",
            is_nullable => 0, 
        },
        amount => {
            data_type   => "int",
            is_nullable => 0, 
        },
    ]
);
my $scope_object = $db->new_scope;
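
For comparison, the grep alternative mentioned above might look like the sketch below. It is only an illustration: the Listing package is a hypothetical stand-in for CraigHome (so the snippet runs without Moose or KiokuDB), and the sample listings and the 50000 price ceiling are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for CraigHome with just the two accessors we filter on.
{
    package Listing;
    sub new         { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub search_name { return $_[0]{search_name} }
    sub amount      { return $_[0]{amount} }
}

# Made-up sample data standing in for "a list of all objects".
my @all_listings = (
    Listing->new( search_name => 'montana_cabin_by_owner', amount => 45000 ),
    Listing->new( search_name => 'montana_cabin_by_owner', amount => 58000 ),
    Listing->new( search_name => 'idaho_cabin_by_owner',   amount => 30000 ),
);

# Filter in memory: every object is inspected, so this costs more as the
# store grows, which is the motivation for the searchable columns.
my @matches = grep {
    $_->search_name eq 'montana_cabin_by_owner' && $_->amount <= 50000
} @all_listings;

print scalar(@matches), " matching listing(s)\n";
```

This works fine for a handful of listings. The column definitions only start to pay off once the store is large enough that loading every object becomes noticeable.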

Run the main logic

main();

sub main {
    init;
    user_agent random_ua;

    # Do the searches
    foreach my $search_definition ( values %search_definitions ) {

        # A search definition can have multiple search URLs
        # for the same keywords
        foreach my $search_URL ( @{ $search_definition->{search_URIs} } ) {
            process_search( $search_URL, $search_definition );
        }
    }
}

The main logic initializes a Scrappy agent then loops through each search defined (only 1 in this example) and then processes each search URL. In this example we search the (old) Montana page, then the city pages for Missoula and Bozeman.
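
One detail of the nested loop is worth a sketch: values on a hash returns the definitions in no particular order. That does not matter with a single search, but with several you may prefer a predictable order. The snippet below is a minimal illustration that needs neither Scrappy nor KiokuDB. The second search definition and the @jobs list are hypothetical additions, not part of the script above.

```perl
use strict;
use warnings;

my %search_definitions = (
    1 => {
        search_name => 'montana_cabin_by_owner',
        search_URIs => [
            'http://montana.craigslist.org/reo/',
            'http://missoula.craigslist.org/reo/',
            'http://bozeman.craigslist.org/reo/',
        ],
    },
    # Hypothetical second search, added only to show the ordering.
    2 => {
        search_name => 'bozeman_cabin_by_owner',
        search_URIs => ['http://bozeman.craigslist.org/reo/'],
    },
);

# Flatten the nested structure into [ search_name, URL ] pairs,
# visiting the definitions in numeric key order.
my @jobs;
for my $key ( sort { $a <=> $b } keys %search_definitions ) {
    my $definition = $search_definitions{$key};
    push @jobs,
      map { [ $definition->{search_name}, $_ ] } @{ $definition->{search_URIs} };
}

print "$_->[0] at $_->[1]\n" for @jobs;
```

Each pair corresponds to one call to process_search in the real script.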

Process the Search

This part fills out the search form, then extracts each listing from the results using Scrappy. Notice the barewords Scrappy uses: init, user_agent, get, form, loaded, etc. We then pass each listing to the process_listing function.

sub process_search {
    my $search_URL        = shift;
    my $search_definition = shift;

    get $search_URL;
    my $keywords_string = join ' ', @{ $search_definition->{keywords} };
    my $search_name = $search_definition->{search_name};
    form fields => {
        'minAsk' => $search_definition->{min_price},
        'maxAsk' => $search_definition->{max_price},
        'query'  => $keywords_string,
    };

    print "Processing search: ", $search_definition->{search_name},
      " at URL: ${search_URL} with keywords: $keywords_string\n";

    # Process each listing, looking for keyword match.
    if (loaded) {
        var listings => grab 'p a', { name => 'TEXT', link => '@href' };
        var listings_textos => grab 'p', { name => 'TEXT' };
        foreach my $listing ( list var->{listings} ) {
            process_listing( $listing, $search_URL, $keywords_string, $search_name );
        }
    }
}

Process each Listing

Now that we've plucked the listings we're interested in, let's store them for later analysis. Here we build an instance of our custom CraigHome class, and then persist (store) the object using KiokuDB.

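sub process_listing {
    my $listing         = shift;
    my $search_URL      = shift;
    my $keywords_string = shift;
    my $search_name     = shift;

    my $listing_amount = listing_amount($listing);
    my $listing_id     = listing_id($listing);

    # Skip links that are not priced listings: CraigHome requires a
    # numeric amount and an integer id, so undef would fail validation.
    return unless defined $listing_amount && defined $listing_id;

    my $listing_object = CraigHome->new(
        amount          => $listing_amount,
        text            => $listing->{name},
        'link'          => $listing->{link},
        id              => $listing_id,
        search_URL      => $search_URL,
        search_keywords => $keywords_string,
        search_name     => $search_name,
    );

    # Store the object under its listing id, but only the first time we see it.
    if ( is_new_listing_id($listing_id) ) {
        $db->store( $listing_id => $listing_object );
    }
}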

Now imagine writing something like this in the 1990s or even the early 2000s. Ouch. Modern Perl helps get shit done with grace and elegance.

Some helper Functions

For completeness, here are the helper functions that extract the listing amount and the listing id, and determine whether we already have the listing in the store.

sub listing_amount {
    my $listing = shift;
    my ($amount) = $listing->{name} =~ m{^\$(\d+)};

    return $amount;
}

sub listing_id {
    my $listing = shift;
    my ($listing_id) = $listing->{link} =~ m/(\d+)\.html$/;

    return $listing_id;
}

sub is_new_listing_id {
    my $listing_id = shift;

    return !$db->lookup($listing_id) ? 1 : 0;
}
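
To see what the two extraction helpers return, here is a small sketch that runs them against sample data. The two listings are hypothetical, shaped like the hashes that grab produces above (a name and a link). The helpers are copied from the script so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical listings; the second one has no leading dollar amount.
my @sample_listings = (
    {
        name => '$45000 Cozy cabin on 5 acres',
        link => 'http://montana.craigslist.org/reo/1234567890.html',
    },
    {
        name => 'Cabin near the river, call for price',
        link => 'http://montana.craigslist.org/reo/1234567891.html',
    },
);

# Same helpers as in the script.
sub listing_amount {
    my $listing = shift;
    my ($amount) = $listing->{name} =~ m{^\$(\d+)};
    return $amount;
}

sub listing_id {
    my $listing = shift;
    my ($listing_id) = $listing->{link} =~ m/(\d+)\.html$/;
    return $listing_id;
}

for my $listing (@sample_listings) {
    my $amount = listing_amount($listing);
    my $id     = listing_id($listing);

    # A failed match leaves the variable undefined, so check before printing.
    printf "%s => %s\n",
      defined $id     ? $id         : 'no id found',
      defined $amount ? "\$$amount" : 'no price found';
}
```

The second listing shows that listing_amount returns undef when the text does not start with a dollar sign followed by digits, so callers need to be prepared for a missing amount.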

Further Analysis

Now that we have the information in the store, we would like to report on it, i.e. retrieve the listing objects that we just stored. That part is left as an exercise for the reader, or stay tuned for a follow-up article.

Credits

Thanks to perigrin and doy for penetrating my thick skin with an inkling of knowledge about KiokuDB.

Disclaimer

This program is intended for personal use, not commercial interests.
