Random Links

This is super old and super cool – an article from 2007 in the New York Times about using Amazon/EC2/S3/Hadoop to produce static PDFs from 71 years (4TB) of NYT archives. I don’t think I had even heard of Hadoop in 2007.

Posted in Programming, Random Links

A Little Trick For Dealing With Lots Of View Files

I’ve been messing around with the Kohana Framework a lot lately. The simplicity of using the View Factory has led me to have a ton of little view snippets and it gets tricky remembering where they all live. So I started adding this line to the top of all my views:

<?php if(Kohana::$environment !== 'production'): echo('<!-- '.__FILE__.' -->'); endif; ?>

When there’s a problem, I can hit “view-source” and instantly locate the offending file. Not exactly some brilliant revolution, but it does come in handy.

Three more thoughts:

  • A better solution is probably to subclass View and do this automatically, instead of adding the line to the top of every file
  • Doing this in production code could leave hints to attackers about the layout of your webserver. The environment check should mitigate that, but still be careful.
  • The HTML comment may cause a problem if you’re using it for AJAX to generate HTML on the fly – I need to look into that further
Posted in Programming

… And We’re Back

Sorry about the delay folks. I meant to take the site offline for a day or two, but it ended up being a couple of months! You know how it goes – you get distracted working on other stuff and kinda just forget.. Anyway, now it’s back and I’m planning to write a bit more frequently.

I nuked everything when I took it offline; however, at least a couple of posts were actually useful to people, so those are being reposted and back-dated to their original publication dates. A lot of the content was just me thinking around a problem without getting into the specifics – those posts have not been restored.

Looking at my writing, I noticed that I tend towards the abstract in discussing programming problems and don’t post enough code. Hoping to remedy that going forward. Stay tuned!

Posted in Uncategorized

Ranking Blogs by Readability

Further proof that I’m a dork: this afternoon, instead of working on my apps, I was screwing around with pydot and matplotlib making visualizations of user engagement (on another blog). At one point it occurred to me that it might be interesting to plot the reading grade level of various blogs. So here goes:

Determining reading level is tricky business because there are so many different kinds of texts. Most methods seem to boil down to some combination of sentence length, word length, number of syllables and some magic numbers: throw it all together and you’ve got a score. Different scoring systems measure slightly different things, but the result is usually either a grade level or some numeric measure of reading difficulty or ease. For more background, Wikipedia is happy to go into great depth: Flesch-Kincaid Readability Test. I admit to kinda skimming the article because there was no way I was gonna implement the various metrics anyway.
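To give a flavor of the "magic numbers", the Flesch-Kincaid grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. Here is a rough sketch of it in Python. The function names are mine, and the syllable counter is a crude vowel-group heuristic, so don't expect the numbers to match a real tool:

```python
import re


def count_syllables(word):
    """Crude heuristic: count runs of vowels, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Approximate Flesch-Kincaid grade level of a block of text."""
    # split on runs of sentence-ending punctuation, dropping empty pieces
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


print(flesch_kincaid_grade("The cat sat. The dog ran."))
```

Short sentences of one-syllable words come out below grade zero, which is a good reminder that these formulas were tuned on ordinary prose.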

Happily, there’s GNU style, a command line program which has already done the dirty work. Using the previous sentence, it outputs (abbreviated):

...
Kincaid: 6.0
ARI: 8.7
Coleman-Liau: 12.5
Flesch Index: 78.8/100
Fog Index: 6.0
Lix: 48.3 = school year 9
SMOG-Grading: 3.0
...

Hot.

The goal is to rank the reading difficulty of some blogs. So here’s the plan:

1. Get a list of blogs

2. Download each blog’s RSS feed

3. Run the combined content of each blog through style

4. Parse the results to get scores

5. Make pretty graphs

6. Draw unnecessarily broad conclusions

The blogs used in my experiment are: Gawker, TMZ, TheAwl, Treehugger, Engadget, ABC News, Huffington Post, Wired and Go Fug Yourself – a mix of news news, computer news, and celebrity news. The list is kinda random and kinda also pulled from the list given in the book Programming Collective Intelligence, which peripherally inspired this idea. Certainly it would be interesting to repeat the experiment on more blogs covering a wider range of subjects and intended for different audiences.

This first graph is the Fog Index, which corresponds to something like grade level.

Since this program was pieced together in a couple of hours this afternoon, there are plenty of deficiencies. For example, readability metrics tend to assume that you’re working with something like normal paragraphs. When you’re dealing with blogs, that’s not necessarily the case, as you get things like Top-10 lists and image-only posts. Annoyingly, many blogs only give a one-sentence summary in the feed, instead of the full content. Not having enough sample text throws off the metrics. My program doesn’t check for that or a host of other big and little things. On the other hand, the graphs kinda agree with what you’d expect, so I think there is some merit to the general method. Maybe something to explore later..
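One cheap guard against the too-little-text problem would be to skip any feed whose combined content falls below some word count before scoring it. A minimal sketch, not something my program actually does. The 500-word cutoff and the function names are made up for illustration, and the input is assumed to be a dict of feed URL to combined text:

```python
MIN_WORDS = 500  # hypothetical cutoff, pick whatever seems sane


def has_enough_text(text, min_words=MIN_WORDS):
    """True if the text has at least min_words whitespace-separated words."""
    return len(text.split()) >= min_words


def drop_short_feeds(full_feeds, min_words=MIN_WORDS):
    """Keep only the feeds with enough text to score meaningfully."""
    return dict((feed, text) for feed, text in full_feeds.items()
                if has_enough_text(text, min_words))


sample = {"http://example.com/full": "word " * 600,
          "http://example.com/summary": "Just a one sentence summary."}
print(sorted(drop_short_feeds(sample)))
```

It won't catch Top-10 lists or image-only posts, but it would at least keep summary-only feeds out of the graphs.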

And, here’s the source. I’ll be the first to admit that it is kind of a trainwreck and could use substantial cleanup, but it does work! You are free to use this code in any way you see fit. But if you do something stupid with it, that’s not my fault..

You need a small handful of things for this to work: python, matplotlib, numpy, feedparser, and also GNU style and probably Linux, though it might work on a Mac or Windows? Dunno..

# updated 7-20-2011 to parse feeds a bit more robustly
# based partially on Mining the Social Web
# http://www.amazon.com/dp/1449388345
 
# fetch feeds
import os
import urllib2
import feedparser
 
# pipe to GNU style
from subprocess import Popen, PIPE
 
# clean up html
from nltk import clean_html
from BeautifulSoup import BeautifulStoneSoup
 
# plotting things
import matplotlib.pyplot as plot
import numpy as na  # numpy.numarray is gone from newer NumPy releases; plain numpy provides array()
from pylab import get_cmap
 
# stfu unicode decode error
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
 
 
def cleanhtml(html):
    return BeautifulStoneSoup(clean_html(html), convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
 
def get_score(text):
    # label in style's output -> key in our score dict
    labels = {'Kincaid:': 'kincaid', 'ARI:': 'ari', 'Coleman-Liau:': 'coleman',
              'Flesch Index:': 'flesch', 'Fog Index:': 'fog', 'Lix:': 'lix',
              'SMOG-Grading:': 'smog'}
    score = dict((key, 0) for key in labels.values())
    sp = Popen(["style"], stdin=PIPE, stdout=PIPE)
    # communicate() writes the text, closes stdin and reads all of stdout
    result = sp.communicate(text)[0]
    for line in result.split("\n"):
        for label, key in labels.items():
            # use "in" rather than find() > 0, which misses a label at column 0
            if label in line:
                # keep only the number: drops "/100" (Flesch) and "= school year N" (Lix)
                value = line.split(label)[1].split('/')[0].split('=')[0]
                score[key] = float(value)
                break
    return score
 
 
 
full_feeds = {}
 
# fetch RSS for URLs in file "feeds.txt"
FEEDS = 'feeds.txt'
feeds = [line.strip() for line in open(FEEDS) if line.strip()]  # drop newlines and blank lines
for feed in feeds:
    fp = feedparser.parse(feed)
    blog_posts = []
    for e in fp.entries:
        if e.has_key('content'):
            blog_posts.append({'title': e.title, 'content': cleanhtml(e.content[0].value), 'link': e.links[0].href})
        elif e.has_key('summary_detail'):
            blog_posts.append({'title': e.title, 'content': cleanhtml(e.summary_detail.value), 'link': e.links[0].href})
    if blog_posts:
        text = ''.join(post['content'] for post in blog_posts)
        full_feeds[feed] = text
 
scores = {'kincaid': {}, 'ari': {}, 'coleman': {}, 'flesch': {}, 'fog': {}, 'lix': {}, 'smog': {} } 
 
# calculate reading ease score for each blog
for feed,text in full_feeds.items():
    score = get_score(text)
    name = feed.replace('http://','').replace('www','').replace('.com','').replace('feed','')
    name = name.split('/')[0]
    for k,v in score.items():
        scores[k][name] = v
    print "%s %s %s %s %s %s %s %s" % (name, score['kincaid'], score['ari'], score['coleman'], score['flesch'], score['fog'], score['lix'], score['smog'])
 
# plot results
color_map = get_cmap('gist_rainbow')
for kind,vals in scores.items():
    vals = [(k,v) for k,v in vals.items()]
    labels = [y[0] for y in vals]
    width = 0.5
 
    y1s = [y[1] for y in vals]
    x1s = na.array(range(len(y1s)))+width
    colors = [color_map(1.*i/len(x1s)) for i in range(len(x1s))]
 
    for x, y, c in zip(x1s, y1s, colors):
        plot.bar(x, y, width=width, color=c)
 
    plot.title(kind)
    plot.xticks(x1s + width/2, labels, rotation=270)
 
    # save, clear plot, clear axes
    plot.savefig("scores-%s.png" % kind)
    plot.clf()
    plot.cla()

Also, you need a file called “feeds.txt” to specify which feeds you want to compare. Here’s the one I used for this post:


http://feeds.feedburner.com/TheAwl?format=xml
http://feeds.gawker.com/gawker/full
http://www.tmz.com/rss.xml
http://feeds2.feedburner.com/celebuzz/Kggb
http://rss.slashdot.org/Slashdot/slashdot
http://feeds.huffingtonpost.com/huffingtonpost/raw_feed
http://blogs.abcnews.com/theblotter/index.rdf
http://www.engadget.com/rss.xml
http://www.treehugger.com/index.rdf

* I was getting some unusual results from CNN and BBC feeds with scores dramatically outside the range of the other blogs. I haven’t really looked into it yet, but just a note – this method seems pretty fragile.

Posted in Programming

Automatically Converting PNGs to JPEGs in WordPress

This took me entirely too long to figure out.. The goal was to reduce page load time and server load by converting PNGs to JPEGs. This method may lack a bit of subtlety, as it *always* converts PNGs to JPEGs. Maybe sometimes you actually do want a PNG. Something to consider for the future.. Anyway, dropping this in your functions.php or wrapping it in a plugin should do the trick:

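// wp_handle_upload is a filter, so register with add_filter (add_action happens to work too)
add_filter('wp_handle_upload', 'my_resample_handle_upload');
function my_resample_handle_upload($arr) {
	// leave everything that isn't a PNG untouched
	if($arr['type'] != 'image/png') {
		return $arr;
	}
	$file = $arr['file'];
	$url = $arr['url'];
	// swap the "png" extension for "jpg" in both the path and the URL
	$dst_file = substr($file, 0, -3) . 'jpg';
	$dst_url = substr($url, 0, -3) . 'jpg';
	list($width, $height) = getimagesize($file);
	$image = imagecreatefrompng($file);
	// copy onto a fresh truecolor canvas; it starts out black, so transparent areas come out black
	$new_image = imagecreatetruecolor($width, $height);
	imagecopyresampled($new_image, $image, 0, 0, 0, 0, $width, $height, $width, $height);
	imagejpeg($new_image, $dst_file);
	// free both GD images
	imagedestroy($image);
	imagedestroy($new_image);
	// point WordPress at the new JPEG instead of the uploaded PNG
	$arr['file'] = $dst_file;
	$arr['url'] = $dst_url;
	$arr['type'] = 'image/jpeg';
	return $arr;
}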

Seems to do the job. Lemme know if you run into difficulties or see problems with this method.

Posted in Programming, Wordpress

An Easy Munin PHP Plugin

I’ve been playing sysadmin lately, which is not my favorite thing in the world. Anyway, while trying to read up a bit on monitoring, I came across Munin – one of those apps I’ve heard of but never really gotten around to investigating properly. Well, it is pretty darn cool – and you can write your own plugins – and I love writing plugins.

I suppose it may be a bit more complicated, but from my perspective, it collects stats every 5 minutes and makes pretty graphs with rrdtool. In the simplest configuration, one computer is both the client (a munin-node, which collects its own stats) and the server (munin, which aggregates stats for all the clients). Then you leave it running for a few days, weeks or years and you’ll have a good idea of what your system should look like when things are running smoothly and when they are not. Hopefully this will allow you to anticipate and avoid problems, or at least understand what happened afterwards.

So that’s neat. Here’s a graph: (the empty area is where a server was rebooted and munin-node wasn’t set to restart automatically)

So every 5 minutes, munin-node activates itself and runs all the plugins you’ve got set up. Each plugin outputs a label and a number (or multiple labels and numbers), which the server collects and uses to make graphs. So writing your own plugin is super easy. Here is one in PHP that graphs the number of registered users in a WordPress database.

#!/usr/bin/php
<?php
 
// these are setup in the plugin config, or you could hardcode them
$db = getenv('DB');
$host = getenv('HOST');
$user = getenv('USER');
$pass = getenv('PASS');
$prefix = getenv('PREFIX');
 
// this is for munin's configuration tool
// could do something more complicated here
if(count($argv) == 2 && $argv[1] == 'autoconf') {
	exit('yes');
}
// this is for rrdtool to know how to label the graph
if(count($argv) == 2 && $argv[1] == 'config') {
	echo("graph_title Users\n");
	echo("graph_vlabel count\n");
	echo("graph_category Wordpress\n");
	echo("wordpress_users.label Registered Users\n");
	// GAUGE plots the value as-is; COUNTER would plot its rate of change instead
	echo("wordpress_users.type GAUGE\n");
	exit();
}
 
// this is the usual case, generating a label and value
mysql_connect($host, $user, $pass);
mysql_select_db($db);
$rs = mysql_query("SELECT COUNT(*) FROM {$prefix}users");
$rows = mysql_fetch_array($rs);
echo("wordpress_users.value {$rows[0]}\n");

Obviously you’ll need to have a command-line PHP available at /usr/bin/php.
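Munin doesn't care what language a plugin is written in, as long as it answers `config` with the graph settings and otherwise prints `name.value N`. Here is a rough Python sketch of the same protocol. The `spool_files` field name and the `SPOOL_DIR` environment variable are invented for the example, which just counts the entries in a directory:

```python
#!/usr/bin/env python
import os
import sys


def plugin_output(args, environ):
    """Return what the plugin should print for the given command line."""
    # "autoconf" asks whether the plugin can run on this machine
    if args[1:] == ["autoconf"]:
        return "yes\n"
    # "config" tells munin how to label and draw the graph
    if args[1:] == ["config"]:
        return ("graph_title Files in spool\n"
                "graph_vlabel count\n"
                "graph_category other\n"
                "spool_files.label Files\n"
                "spool_files.type GAUGE\n")
    # the usual case: one "field.value number" line per field
    directory = environ.get("SPOOL_DIR", ".")  # hypothetical setting
    return "spool_files.value %d\n" % len(os.listdir(directory))


if __name__ == "__main__":
    sys.stdout.write(plugin_output(sys.argv, os.environ))
```

Keeping the logic in a function that takes the arguments and environment makes it easy to poke at from a Python prompt before handing it to munin.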

To install it, put the file in the available plugins directory and symlink it to the active plugins directory:

cp /path/to/plugin /usr/share/munin/plugins/your_plugin_name
ln -s /usr/share/munin/plugins/your_plugin_name /etc/munin/plugins/your_plugin_name

So that’s pretty easy. There is also the plugin configuration where you set up the environment variables. This is particularly handy, as you could use multiple configurations of the same plugin to provide stats about different databases.

In this case, I just edited /etc/munin/plugin-conf.d/munin-node and created a new section matching my plugin’s name, but I think there are other ways to do it for more complex configurations:

[your_plugin_name]
env.DB db_name_here
env.HOST localhost
env.USER db_user_here
env.PASS db_pass_here
env.PREFIX table_prefix_here

And finally, you can test it to make sure it is working using munin-run. This is important because simply running the plugin by itself will not set up the environment variables. munin-run wasn’t on my path, but ended up in /usr/sbin:

/usr/sbin/munin-run your_plugin_name
wordpress_users.value 210

Oh yeah. When you’re done testing and want to add your plugin for real, you need to restart munin-node:

/etc/init.d/munin-node restart

So I think it’s a pretty cool and easy way to monitor just about anything. As always, there may be errors or inaccuracies, so munin experts, please offer corrections if you see a problem.

References:
My PHP plugin was based heavily on: http://thomasfischer.biz/?p=174

Posted in Programming

Making My Own WordPress Chartbeat Plugin

Instead of doing something useful this morning, I made my own little plugin using the Chartbeat API to display the most popular posts on a WordPress blog.

Note: There is really no reason to do this. The Chartbeat Plugin does this exact same thing and more. However, it was an entertaining exercise for me to practice writing WordPress plugins.

Also Note: This only works if you have signed up for Chartbeat and have an API Key.

The reason this is cool? Well, most of your “most popular posts” plugins need to make an extra call to the database to get/set a counter because WordPress doesn’t track page views by default. But if you’re using Chartbeat to track your blog’s performance, you can save some effort by using their numbers instead.

And with no further ado, here’s the code:

<?php
/*
Plugin Name: Ct Most Popular
Plugin URI: http://www.craiget.com
Description: Display most viewed posts using the Chartbeat API, exposes one function: ct_most_popular_plugin_widget(); 
Version: 0.1
Author: Craige
Author URI: http://craiget.com
License: For example and testing purposes. Not suggested for use on a real site.
*/
 
$ct_most_popular_plugin_version = "0.1";
 
$ct_most_popular_plugin_data = array();
 
// create a most_popular option
register_activation_hook(__FILE__, 'ct_most_popular_plugin_install');
function ct_most_popular_plugin_install()
{
	// the file-level variables are not visible inside a function, so pass the defaults directly
	add_option("ct_most_popular_plugin_data", array());
	add_option("ct_most_popular_plugin_version", "0.1");
	// schedule hourly update
	wp_schedule_event(time(), 'hourly', 'ct_most_popular_plugin_update_event');
}
 
// delete the most_popular option
register_deactivation_hook(__FILE__, 'ct_most_popular_plugin_uninstall');
function ct_most_popular_plugin_uninstall()
{
	delete_option("ct_most_popular_plugin_data");
	delete_option("ct_most_popular_plugin_version");
	// un-schedule hourly update
	wp_clear_scheduled_hook('ct_most_popular_plugin_update_event');
}
 
// appear under "Settings" on the admin page
add_action('admin_menu', 'ct_most_popular_plugin_menu');
function ct_most_popular_plugin_menu() {
	add_options_page('Ct Most Popular', 'Ct Most Popular', 'manage_options', '', 'ct_most_popular_plugin_options');
}
 
// init option values in db
add_action('admin_init', 'ct_most_popular_plugin_options_init' );
function ct_most_popular_plugin_options_init(){
	register_setting('ct_most_popular_plugin_options', 'ct_most_popular_plugin', 'ct_most_popular_plugin_validate' );
}
 
// sanitize and validate input
function ct_most_popular_plugin_validate($input) {
	$input['host'] =  wp_filter_nohtml_kses($input['host']);
	$input['chartbeat_api_key'] =  wp_filter_nohtml_kses($input['chartbeat_api_key']);
	$input['limit'] =  (int)($input['limit']);
	if($input['limit'] == 0) $input['limit'] = 10;
	return $input;
}
 
// display options page html
function ct_most_popular_plugin_options() {
	if (!current_user_can('manage_options'))  {
		wp_die(__('You do not have sufficient permissions to access this page.') );
	}
?>
<div class="wrap">
	<h2>Ct Most Popular Plugin Options Title</h2>
	<form method="post" action="options.php">
		<?php settings_fields('ct_most_popular_plugin_options'); ?>
		<?php $options = get_option('ct_most_popular_plugin'); ?>
		<table class="form-table">
		<tr valign="top">
			<th scope="row">Host</th>
			<td><input type="text" name="ct_most_popular_plugin[host]" value="<?php echo $options['host']; ?>" /></td>
			<td><i>ie, example.com</i></td>
		</tr>
		<tr valign="top">
			<th scope="row">Chartbeat API Key</th>
			<td><input type="text" name="ct_most_popular_plugin[chartbeat_api_key]" value="<?php echo $options['chartbeat_api_key']; ?>" /></td>
			<td><i><a href="http://chartbeat.com/apikeys/">http://chartbeat.com/apikeys/</a></i></td>
		</tr>
		<tr valign="top">
			<th scope="row">Limit</th>
			<td><input type="text" name="ct_most_popular_plugin[limit]" value="<?php echo $options['limit']; ?>" /></td>
			<td><i>number of items to show, 10</i></td>
		</tr>
		</table>
		<p class="submit">
			<input type="submit" class="button-primary" value="<?php _e('Save Changes') ?>" />
		</p>
		<p>
		This plugin uses the <a href="http://chartbeat.pbworks.com/">Chartbeat API</a> to show the most popular pages on your site, updated hourly.
		</p>
		<p>
		This plugin was created for my own amusement and to practice creating Wordpress plugins, it is <strong>NOT RECOMMENDED</strong> for use.
		</p>
		<p>
		Chartbeat has released a perfectly good plugin that does this and more: <a href="http://wordpress.org/extend/plugins/chartbeat/">http://wordpress.org/extend/plugins/chartbeat/</a>
		</p>
		<p>
		This plugin fetches new data once every hour using Wordpress's built-in <a href="http://codex.wordpress.org/Function_Reference/wp_schedule_event">scheduling hooks</a> to update the list of popular posts hourly.
		This keeps things self-contained, but doesn't provide much flexibility. You may want to use cron instead, which would require a little hacking.
		</p>
	</form>
</div>
<?php
}
 
// get popularity data from chartbeat, store in db
add_action('ct_most_popular_plugin_update_event', 'ct_most_popular_plugin_update_chartbeat');
function ct_most_popular_plugin_update_chartbeat() {
	// construct chartbeat call
	$options = get_option('ct_most_popular_plugin');
	$host = $options['host'];
	$apikey = $options['chartbeat_api_key'];
	$limit = $options['limit'];
	// build url, encoding each value so odd characters can't break the query string
	$url = 'http://api.chartbeat.com/toppages/?host=' . urlencode($host)
		. '&limit=' . (int)$limit
		. '&apikey=' . urlencode($apikey);
	// fetch data
	$data = file_get_contents($url);
	$data = json_decode($data, true);
	// exit if the request failed or not enough results came back
	if(!is_array($data) || count($data) < $limit)
		return;
	$result = array();
	for($i=0; $i<count($data); $i++) {
		if($data[$i]['path'] == "/")
			continue;
		$result[] = $data[$i];
	}
	$result = array_slice($result, 0, $limit);
	// store in db
	update_option("ct_most_popular_plugin_data", $result);
}
 
// add this function in your sidebar
function ct_most_popular_plugin_widget() {
	$data = get_option("ct_most_popular_plugin_data");
	echo('<ul>');
	foreach ($data as $post) {
		echo('<li>');
		echo('<a href="'.$post['path'].'">'.$post['visitors'].'-'.$post['i'].'</a>');
		echo('</li>');
	}
	echo('</ul>');
}

Go to “Settings” > “Ct Most Popular” to set your API Key and other options.

Updates occur once each hour.

You’ll almost certainly want to tweak the way the posts are displayed in the ct_most_popular_plugin_widget() function.
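For comparison, the two interesting steps of the update function (building the API URL and dropping the homepage from the results) look something like this in Python. It's only a sketch: the function names are mine, the sample data is invented, and the only response fields assumed are the ones the PHP code already uses (`path`, `visitors`, `i`).

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_toppages_url(host, apikey, limit):
    """Build the toppages URL with every value properly encoded."""
    query = urlencode([("host", host), ("limit", limit), ("apikey", apikey)])
    return "http://api.chartbeat.com/toppages/?" + query


def top_posts(pages, limit):
    """Drop the homepage entry and keep at most `limit` pages."""
    posts = [page for page in pages if page.get("path") != "/"]
    return posts[:limit]


# invented sample of what a decoded response might look like
sample = [{"path": "/", "visitors": 40, "i": "Home"},
          {"path": "/a-post/", "visitors": 12, "i": "A Post"}]
print(build_toppages_url("example.com", "YOUR_API_KEY", 10))
print(top_posts(sample, 10))
```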

Anyway.. just fooling around.. For all the frustration it has caused me.. Still gotta say, WordPress is pretty friggin’ cool.

Posted in Programming, Wordpress

Fetching Android Market Stats with Selenium RC

Finally.. I’ve got a reasonably decent way to pull Android Market stats. For some reason I keep coming back to this topic. This time, the way forward is to use Selenium RC, part of the Selenium browser testing suite.

My example will be in Python, but Selenium has bindings for several languages.

First of all, you gotta download Selenium RC from here: http://seleniumhq.org/download/

Then, extract it someplace you can remember. I’ve been putting things in ~/opt lately.

Okay, now create a new Python script, like so:

import sys
sys.path.append('/the/path/to/selenium-python-client-driver-1.0.1')
 
from selenium import selenium
 
email = 'YOUR_GOOGLE_LOGIN'
passwd = 'YOUR_PASSWORD'
 
s = selenium("localhost", 4444, "*firefox", "http://market.android.com")
s.start()
s.open("/publish/Home")
s.type("Email", email)
s.type("Passwd", passwd)
s.click("signIn")
s.wait_for_page_to_load("30000")
 
n = int(s.get_xpath_count("//div[@class='listingRow']"))
for i in range(3,n):
  try:
    title = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[1]/div[1]" % i)
    downloaded = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[2]/div[1]/span[1]" % i)
    installed = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[2]/div[2]/span[1]" % i)
    comments = s.get_text("xpath=(//div[@class='listingRow'])[%s]/table" % i)[1:-1]
    print title, downloaded, installed, comments
  except:
    pass

* Be sure to fill in YOUR_GOOGLE_LOGIN with your email (or whatever login) and the matching password.

This script is a bit of a trainwreck.. but it works and I don’t feel like screwing with it..

* Working with xpath in selenium-rc’s python binding feels really weird.. doesn’t seem to behave quite the way you would expect.

* Why does the iteration start at 3? I dunno.. there are some empty rows at the beginning I guess..

* Why is it wrapped in a try-except block? I dunno.. some empty rows at the end?

* It works on Ubuntu 10.04 / FF 3.6.3. Your mileage may vary. I wouldn’t be surprised if those xpath selectors needed more tweaking in some cases.

To run the script, you need to start the Selenium RC server. Go to the place you downloaded it:

cd /path/to/selenium
java -jar selenium-server.jar

Then, you should be able to run this script from a terminal and it will start firefox, log you in to the Android Developer Console, wait a few seconds til the Ajax all loads, then use xpath to scrape each row of data from the table and print it to the terminal.

From there it should be pretty simple to export the results into a CSV file or make pretty charts or whatever it is you wanna do.
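For the CSV route, something like the following would do. It's a sketch written for Python 3 (unlike the Python 2 script above), the function name is mine, and the sample row is invented. The column names simply mirror the four values the scraper prints:

```python
import csv
import io


def write_stats(rows, fileobj):
    """Write (title, downloaded, installed, comments) rows as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "downloaded", "installed", "comments"])
    writer.writerows(rows)


# an in-memory buffer here; pass open("stats.csv", "w", newline="") for a real file
buf = io.StringIO()
write_stats([("My App", "1200", "800", "15")], buf)
print(buf.getvalue())
```

Collect the rows into a list inside the scraping loop instead of printing them, then hand the list to the writer once at the end.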

It does pop up a window on the screen, which is kinda annoying. Cooler to run firefox headless, maybe some other time..

Posted in Programming

Can Clojure Find Me An Apartment?

This post was going to be about how I spent the better part of a day trying to get clojure and emacs and slime and the java classpath all working together.

The gist of it is this: I am an idiot sometimes. I spent most of an afternoon trying to figure out why it is an error to (use 'clojure.contrib). Earlier in the day, my classpath was set up wrong, so (use 'clojure.contrib.duck-streams) didn’t work. At some point, I stopped typing the whole thing, thinking that if 'clojure.contrib.duck-streams works, then so should the parent package 'clojure.contrib. A-ha! Save myself a bit of typing! Nope. That never works.. so, when I finally did get my classpath working, I didn’t know it because I was typing something that’s just plain wrong. Hilarious and Awesome, huh?

So, with everything finally working, I made my first little half-way real Clojure program.

Our current lease runs out in about 6 weeks, so me and my roommate need to find a new place to live – sounds like a job for Craigslist. There’s a problem though: in big cities, Craigslist is absolutely flooded with apartments and the search functions just aren’t that good. I have no interest in skimming hundreds or thousands of posts looking for that perfect combination of price/location/amenities (well, mostly price and location, actually), so why not let the computer do the work instead? Usually this would be a job for Python/BeautifulSoup, but in the interest of learning Clojure, here goes..

Following is what I’ve come up with so far for scraping apartments off Craigslist as gently as possible by filtering out links that don’t meet my criteria. Right now, this code only generates the list of matching links; it doesn’t actually follow them. If I continue further with this program, that will be Step 2, probably using http://lethain.com/entry/2009/nov/24/scalable-scraping-in-clojure/ for inspiration.

This is based on the Enlive library, which provides a very usable syntax for ripping through HTML (though I don’t quite understand it all yet). As I’m still a complete beginner with Clojure and functional programming in general, the following code is probably far from idiomatic and may look sloppy to you pros out there. Comments and suggestions are welcome!

;; import enlive
(use 'net.cgrand.enlive-html)
 
;; html helper
(defn fetch-url [url]
  (html-resource (java.net.URL. url)))
 
;; pulls link from paragraph
;; ie, (map get-link (select *cl* [:p]))
(defn get-link [p]
  (:href (:attrs (first (:content p)))))
 
;; pulls text of link from paragraph
(defn get-link-text [p]
  (:content (first (:content p))))
 
;; pulls text of parens following link
;; usually this is zipcode/location info
;; "", if absent
(defn get-paren-text [p]
  (let [content (:content p)]
    (if (< 2 (count content))
      (:content (nth content 2))
      "")))
 
;; pulls link/text/location into a map
(defn get-all [p]
  {:link (get-link p)
   :text (str (get-link-text p)
	      (get-paren-text p))})
 
;; some helpers to remove links we don't care about 
 
;; (affordable? "$800" 600 1000)  => true
;; (affordable? "$1500" 600 1000) => false
(defn affordable? [text min max]
  (let [price (second (re-find #"\$(\d+)" text))]
    (if price
      (let [price (Integer/parseInt price)]
	(and (<= min price)
	     (>= max price))))))
 
;; (has-kword? "downtown" (list "down")) => true
;; (has-kword? "down" (list "downtown")) => nil
(defn has-kword? [text kwords]
  (let [vals (map #(re-find (re-matcher (re-pattern %) text)) kwords)]
    (some #(not (= nil %)) vals)))
 
;; parameterizes a function to decide if a link is worth retrieving
;; this would be cooler if the criteria functions
;; came in as a list too.. but that makes my head
;; spin.. maybe later
(defn keep-link? [min max areas beds]
  (fn [{link :link text :text}]
    (let [text (.toLowerCase text)]
      (and link
	   (re-find #"/apa/" link)
	   (affordable? text min max)
	   (has-kword? text areas)
	   (has-kword? text beds)))))
 
;; some top level definitions
;; you may need to change these to get non-empty results
(def *url* "http://yourcity.craigslist.org/apa/")
(def *min-price* 100)
(def *max-price* 10000)
(def *areas* (list "downtown" "west side" "etc"))
(def *beds* (list "2br" "3br"))
(def my-keep-link? (keep-link? *min-price* *max-price* *areas* *beds*))
 
;; actually do the work
(filter my-keep-link? (map get-all (select (fetch-url *url*) [:p])))
 
;; References
;; 1) http://wiki.github.com/cgrand/enlive/
;; 2) http://github.com/swannodette/enlive-tutorial/
;; 3) Programming Clojure, Stuart Halloway
;; 4) lots and lots of Googling
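For anyone who reads Python more easily than Clojure, here is roughly the same filtering logic as a sketch. It also takes the criteria as a list of functions, which the Clojure version doesn't do. The names and sample posts are made up, and the keyword check is a plain substring test rather than a regex:

```python
import re


def affordable(text, min_price, max_price):
    """True if the first $NNN in the text falls within the price range."""
    match = re.search(r"\$(\d+)", text)
    if not match:
        return False
    return min_price <= int(match.group(1)) <= max_price


def has_keyword(text, keywords):
    """True if any keyword occurs somewhere in the text."""
    return any(keyword in text for keyword in keywords)


def make_keep_link(criteria):
    """Build a predicate that keeps apartment links passing every check."""
    def keep_link(post):
        link = post.get("link") or ""
        text = (post.get("text") or "").lower()
        return "/apa/" in link and all(check(text) for check in criteria)
    return keep_link


keep = make_keep_link([
    lambda t: affordable(t, 600, 1000),
    lambda t: has_keyword(t, ["downtown", "west side"]),
    lambda t: has_keyword(t, ["2br", "3br"]),
])

posts = [{"link": "/apa/123.html", "text": "$800 2BR Downtown loft"},
         {"link": "/apa/456.html", "text": "$1500 2br downtown"}]
print([p["link"] for p in posts if keep(p)])
```

Adding a new criterion is then just one more function in the list.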

On the whole, I’m liking Clojure a lot, but there is also a lot to learn.

(Shocking conclusion, I know!)

Posted in Programming

A Few Cool Videos From Google Tech Talks

I keep meaning to find some interesting podcasts and online lectures. There’s a ton of material out there, but so much of it sucks. Anyway, browsing the topic “What are the best Google Tech Talks” on Stack Overflow, I found the following, which I now link for your viewing pleasure:

XKCD visits Google – Very funny and interesting, but perhaps less enjoyable unless you’re an xkcd fanboy like me. Jump to 21:30 where xkcd answers a joking question from Donald Knuth.

PolyWorld: Using Evolution to Design Artificial Intelligence – An interesting A-Life experiment/visualization. Jump to 5:35 for some really neat video of an older program that evolves different body morphologies for efficient movement in a simulated physical environment. (I think this is the original work the speaker is citing)

The Next Generation of Neural Networks – The speaker flies through the intro material much too fast for me to understand with only a rudimentary knowledge of NN. Nevertheless, the demo at 21:35 is cool, as is the discussion around 31:40 of using these layered NN for document clustering and classification.

Posted in Random Links