Often times you reach a point in a project where it is handy to have some real data. So today I wrote a little program to grab one page worth of Want Ads from the venerable Craigslist.
Having served its intended purpose, it seemed fun to tweak the program to keep track of new job postings on craigslist. So.. here’s that..
This program just reads the pages you specify and scans for any URLs it hasn’t seen before. If you run it via cron, say, once a day, it will give you the new postings for that day. Each new url is recorded, so it doesn’t notify you twice about the same job.
In python:
import urllib2, time from BeautifulSoup import BeautifulSoup import sys reload(sys) sys.setdefaultencoding('utf-8') import socket socket.setdefaulttimeout(5) # pages to monitor categories = [ "http://knoxville.craigslist.org/sof/", "http://knoxville.craigslist.org/eng" ] # data file for visited url list dat = ".cl.exclude" # build list of urls already visited exclude = [] try: for line in open(dat).readlines(): exclude.append(line[:-1]) except: pass # get unseen urls from each category page urls = [] for category in categories: try: page = urllib2.urlopen(category) soup = BeautifulSoup(page) for a in soup.findAll('a'): # must be a url if not a.has_key('href'): continue # must match current category (to exclude help pages/etc) if a['href'].find(category) == -1: continue # ok, keep this url urls.append(a['href']) except Exception, e: raise e # visit each url to get the title and content for url in urls: # skip if already seen if a['href'] in exclude: continue try: page = urllib2.urlopen(url) soup = BeautifulSoup(page) title = soup.find("title").string body = soup.find("div", {"id": "userbody"}).string # do something interesting here, like email the list to yourself print url, title except Exception, e: raise e # scrape slowly time.sleep(10) # write list of all urls from this time # note: there is no need to remember ALL the old urls since # the urls are unique and we aren't dealing with pagination # it is safe to forget urls that are past the first page of results fout = open(dat,'w') for url in urls: fout.write(url+"\n") fout.close()
Obviously, scraping is potentially rude. This is pretty lightweight, since it only checks URLs it hasn’t seen before and waits 10 seconds between visits. Nevertheless, use at your own risk.
The best way to use this is probably tweaking it to email you about new jobs. I’ve omitted that code since it is:
- Pretty well documented elsewhere
- Email originating from a home server will probably be rejected anyway