Modern Perl is all the rage these days, in case you hadn't noticed. But why, you might ask? To show why, I will demonstrate a scraper program that stores its results in an object database, using such juicy bits as Scrappy (a web scraper that in turn uses the Web::Scraper toolkit), KiokuDB (an object store), and of course Moose.
The Goal
Search craigslist for homes matching certain criteria. For example, we'll search for cabins in Montana, for sale by owner, that fall within a certain price range.
The CraigHome Class
The search results map naturally onto objects, where one object is a single result (ad) of the search. The search we'll study is about finding a home (cabin), so let's call the class CraigHome. It is defined like a typical (simple) Moose class:
package CraigHome;
use Moose;
use namespace::autoclean;
has id => (
    is  => 'ro',
    isa => 'Int',
);
has text => (
    is  => 'ro',
    isa => 'Str',
);
has 'link' => (
    is  => 'ro',
    isa => 'URI',
);
has amount => (
    is  => 'ro',
    isa => 'Num',
);
has search_URL => (
    is  => 'ro',
    isa => 'Str',
);
has search_keywords => (
    is  => 'ro',
    isa => 'Str',
);
has search_name => (
    is  => 'ro',
    isa => 'Str',
);
__PACKAGE__->meta->make_immutable;
1;
Search and Store Results
We'll use a script to scrape the search results and then store them for later analysis.
Load Needed Modules
use strict;
use warnings;
use Scrappy qw/:syntax/;
use KiokuDB;
use KiokuDB::Backend::DBI;
use CraigHome;
Notice that we use Scrappy, KiokuDB, the DBI backend for KiokuDB and our custom class CraigHome.
Define the Search Parameters
The search parameters include:
- a search name
- one or more URLs to search
- keywords to search for
- maximum and minimum dollar amounts
my %search_definitions = (
    1 => {
        search_name => 'montana_cabin_by_owner',
        search_URIs => [
            'http://montana.craigslist.org/reo/',
            'http://missoula.craigslist.org/reo/',
            'http://bozeman.craigslist.org/reo/'
        ],
        keywords  => [qw/ cabin /],
        max_price => 60000,
        min_price => 1,
    },
);
Connect to Kioku Data Store and Designate Columns for Searching
In this example, we use the DBI backend and connect to a SQLite database that holds the stored objects. In addition, we designate two attributes of our CraigHome object, search_name and amount, as columns that we want to search on. Doing so indexes the attributes (so to speak) to provide for more efficient searches. Without the columns, one could always just grep the object attributes from a list of all objects.
my $db = KiokuDB->connect(
    "dbi:SQLite:dbname=db/craighomes.db",
    create  => 1,
    columns => [
        search_name => {
            data_type   => "varchar",
            is_nullable => 0,
        },
        amount => {
            data_type   => "int",
            is_nullable => 0,
        },
    ]
);
my $scope_object = $db->new_scope;
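The remark above about grepping object attributes can be sketched in plain Perl. This is only an illustration under stated assumptions: the listings are made-up hashes standing in for CraigHome objects, and filter_listings is a hypothetical helper that is not part of the script. It shows the unindexed approach that the search columns are meant to make unnecessary.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for stored listings: plain hashes instead of
# CraigHome objects, so the sketch runs without Moose or KiokuDB.
my @listings = (
    { id => 101, search_name => 'montana_cabin_by_owner', amount => 45000 },
    { id => 102, search_name => 'montana_cabin_by_owner', amount => 75000 },
    { id => 103, search_name => 'idaho_cabin_by_owner',   amount => 30000 },
);

# Keep the listings from one named search at or below a price ceiling.
sub filter_listings {
    my ( $search_name, $max_amount, @candidates ) = @_;
    return grep {
        $_->{search_name} eq $search_name && $_->{amount} <= $max_amount
    } @candidates;
}

my @matches = filter_listings( 'montana_cabin_by_owner', 60000, @listings );
print "listing $_->{id} at $_->{amount}\n" for @matches;
```

With real objects the hash lookups would become accessor calls such as $_->search_name and $_->amount. Every object would still have to be loaded first, which is the cost the indexed columns avoid.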
Run the Main Logic
main();
sub main {
    init;
    user_agent random_ua;
    # Do the searches
    foreach my $search_definition ( values %search_definitions ) {
        # A search definition can have multiple search URLs
        # for the same keywords
        foreach my $search_URL ( @{ $search_definition->{search_URIs} } ) {
            process_search( $search_URL, $search_definition );
        }
    }
}
The main logic initializes a Scrappy agent, then loops through each search definition (only one in this example) and processes each of its search URLs. In this example we search the (old) Montana page, then the city pages for Missoula and Bozeman.
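To check the nested loop without touching the network, one option is a dry run that only lists what would be searched. The sketch below assumes a trimmed copy of the search definitions, and search_jobs is a hypothetical helper that does not exist in the script. It expands each definition into one job per URL, the same pairing that main hands to process_search.

```perl
use strict;
use warnings;

# Trimmed copy of the search definitions, for illustration only.
my %search_definitions = (
    1 => {
        search_name => 'montana_cabin_by_owner',
        search_URIs => [
            'http://montana.craigslist.org/reo/',
            'http://missoula.craigslist.org/reo/',
        ],
        keywords => [qw/ cabin /],
    },
);

# Hypothetical helper: expand the definitions into one job per search URL.
sub search_jobs {
    my (%definitions) = @_;
    my @jobs;
    # Sort the keys so the job order is predictable between runs.
    foreach my $key ( sort keys %definitions ) {
        my $definition = $definitions{$key};
        foreach my $url ( @{ $definition->{search_URIs} } ) {
            push @jobs,
              { search_name => $definition->{search_name}, url => $url };
        }
    }
    return @jobs;
}

my @jobs = search_jobs(%search_definitions);
print "would search $_->{url} for $_->{search_name}\n" for @jobs;
```

Sorting the keys is a choice made for the sketch. The script itself iterates with values, whose order Perl does not guarantee, which makes no difference with a single definition.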
Process the Search
This part fills out the search form, then extracts each listing from the results using Scrappy. Notice the barewords Scrappy uses: get, form, loaded, grab, var, list, etc. We then pass each listing to the process_listing function.
sub process_search {
    my $search_URL        = shift;
    my $search_definition = shift;
    get $search_URL;
    my $keywords_string = join ' ', @{ $search_definition->{keywords} };
    my $search_name     = $search_definition->{search_name};
    form fields => {
        'minAsk' => $search_definition->{min_price},
        'maxAsk' => $search_definition->{max_price},
        'query'  => $keywords_string,
    };
    print "Processing search: ", $search_definition->{search_name},
      " at URL: ${search_URL} with keywords: $keywords_string\n";
    # Process each listing, looking for keyword match.
    if (loaded) {
        var listings        => grab 'p a', { name => 'TEXT', link => '@href' };
        var listings_textos => grab 'p', { name => 'TEXT' };
        foreach my $listing ( list var->{listings} ) {
            process_listing( $listing, $search_URL, $keywords_string, $search_name );
        }
    }
}
Process Each Listing
Now that we've plucked the listings we're interested in, let's store them for later analysis. Here we build an instance of our custom CraigHome class, and then persist (store) the object using KiokuDB.
sub process_listing {
    my $listing         = shift;
    my $search_URL      = shift;
    my $keywords_string = shift;
    my $search_name     = shift;
    my $listing_amount  = listing_amount($listing);
    my $listing_id      = listing_id($listing);
    # Skip links that are not priced listings: an undefined id or amount
    # would fail the Int and Num type constraints in CraigHome.
    return unless defined $listing_id && defined $listing_amount;
    # Skip listings we have already stored.
    return unless is_new_listing_id($listing_id);
    my $listing_object = CraigHome->new(
        amount          => $listing_amount,
        text            => $listing->{name},
        'link'          => $listing->{link},
        id              => $listing_id,
        search_URL      => $search_URL,
        search_keywords => $keywords_string,
        search_name     => $search_name,
    );
    $db->store( $listing_id => $listing_object );
}
Now imagine writing something like this in the 1990s or even the early 2000s. Ouch. Modern Perl helps get shit done with grace and elegance.
Some Helper Functions
For completeness, here are the helper functions that extract the listing amount, extract the listing id, and determine whether we already have the listing in the store.
sub listing_amount {
    my $listing = shift;
    my ($amount) = $listing->{name} =~ m{^\$(\d+)};
    return $amount;
}
sub listing_id {
    my $listing = shift;
    my ($listing_id) = $listing->{link} =~ m/(\d+)\.html$/;
    return $listing_id;
}
sub is_new_listing_id {
    my $listing_id = shift;
    return !$db->lookup($listing_id) ? 1 : 0;
}
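To see what the two extraction helpers return, here is a small standalone run. The sample listings are made up for illustration (the titles, the id and the URLs are assumptions, not real ads). The two subs are copied from above so the sketch runs on its own.

```perl
use strict;
use warnings;

sub listing_amount {
    my $listing = shift;
    my ($amount) = $listing->{name} =~ m{^\$(\d+)};
    return $amount;
}

sub listing_id {
    my $listing = shift;
    my ($listing_id) = $listing->{link} =~ m/(\d+)\.html$/;
    return $listing_id;
}

# Made-up sample data: one priced listing and one link that is not an ad.
my $priced = {
    name => '$45000 Cozy cabin near Missoula',
    link => 'http://missoula.craigslist.org/reo/1234567890.html',
};
my $unpriced = {
    name => 'Cozy cabin, make an offer',
    link => 'http://missoula.craigslist.org/about/',
};

for my $listing ( $priced, $unpriced ) {
    my $amount = listing_amount($listing);
    my $id     = listing_id($listing);
    printf "amount=%s id=%s\n",
      defined $amount ? $amount : 'undef',
      defined $id     ? $id     : 'undef';
}
```

Both helpers return undef when their pattern does not match, for example when the title does not begin with a dollar amount or the link does not end in a numeric .html page. A caller may want to check for that before building a CraigHome object, since its Int and Num attributes will not accept undef.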
Further Analysis
Now that we have the information in the store, we would like to report on it, i.e. retrieve the listing objects that we just stored. That part is left as an exercise for the reader, or stay tuned for a follow-up article.
Credits
Thanks to perigrin and doy for penetrating my thick skin with an inkling of knowledge about KiokuDB.
Disclaimer
This program is intended for personal use, not commercial interests.



