A Little Job Scraper

(2011)

Oftentimes you reach a point in a project where it is handy to have some real data. So today I wrote a little program to grab one page's worth of Want Ads from the venerable Craigslist.

Once it had served its intended purpose, it seemed fun to tweak the program to keep track of new job postings on Craigslist. So... here's that.

This program just reads the pages you specify and scans for any URLs it hasn't seen before. If you run it via cron, say, once a day, it will give you the new postings for that day. Each new URL is recorded, so it doesn't notify you twice about the same job.
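
For example, a crontab entry along these lines will run it every morning (the path and script name here are just placeholders; cd-ing into a fixed directory first keeps the script's little data file in a predictable spot):

# run once a day at 8am
0 8 * * * cd /home/you/scraper && python cl_jobs.py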

In Python:

import urllib2, time
from BeautifulSoup import BeautifulSoup

# make utf-8 the default encoding so unicode text from the pages
# doesn't trip up implicit string conversions when printing
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# give up on slow or dead connections after 5 seconds
import socket
socket.setdefaulttimeout(5)

# pages to monitor
categories = [
 "http://knoxville.craigslist.org/sof/",
 "http://knoxville.craigslist.org/eng"
]

# data file for visited url list
dat = ".cl.exclude"

# build list of urls already visited
exclude = []
try:
    for line in open(dat).readlines():
        exclude.append(line.strip())
except IOError:
    # no data file yet (first run)
    pass


# get unseen urls from each category page
urls = []
for category in categories:
    try:
        page = urllib2.urlopen(category)
        soup = BeautifulSoup(page)
        for a in soup.findAll('a'):
            # must be a url
            if not a.has_key('href'): continue
            # must match current category (to exclude help pages/etc)
            if a['href'].find(category) == -1: continue
            # ok, keep this url
            urls.append(a['href'])
    except Exception, e:
        raise e

# visit each url to get the title and content
for url in urls:
    # skip if already seen
    if url in exclude: continue
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        title = soup.find("title").string
        body = soup.find("div", {"id": "userbody"}).string
        # do something interesting here, like email the list to yourself
        print url, title
    except Exception, e:
        raise e
    # scrape slowly
    time.sleep(10)

# write out the full list of urls seen this run
# note: there is no need to remember ALL the old urls. posting urls
# are unique and we only ever look at the first page of results,
# so it is safe to forget urls that have dropped off that page
fout = open(dat,'w')
for url in urls:
    fout.write(url+"\n")
fout.close()

Obviously, scraping is potentially rude. This one is pretty lightweight: it fetches each category page once per run, only visits postings it hasn't seen before, and waits 10 seconds between requests. Nevertheless, use at your own risk.

The best way to use this is probably tweaking it to email you about new jobs. I've omitted that code since:

  1. It's pretty well documented elsewhere
  2. Email originating from a home server will probably be rejected anyway
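
That said, if you do have a mail setup that will accept it, the general shape is only a few lines of the standard library's smtplib. Here's a rough sketch (the addresses and the localhost server are placeholders; postings would be the list of "url title" strings collected in the loop above instead of just printing them):

import smtplib
from email.mime.text import MIMEText

def mail_postings(postings, to_addr="you@example.com"):
    # postings: list of strings, one per new job
    msg = MIMEText("\n".join(postings))
    msg['Subject'] = "new craigslist postings"
    msg['From'] = to_addr
    msg['To'] = to_addr
    # assumes an SMTP server on localhost that will relay the mail
    server = smtplib.SMTP("localhost")
    server.sendmail(msg['From'], [msg['To']], msg.as_string())
    server.quit()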