Half-baked Objects and 10% ORM

I’ve used Object Relational Mapping (ORM) libraries on a few projects in the past. Without getting into the many, many details, ORM bridges the gap between data storage in a relational database and Object-Oriented Programming. Simply, instead of writing SQL queries, you let the ORM library write them for you. It’s great when it works out, but like all code generators, there are some potential downsides:

  • One more library to learn
  • May generate inefficient SQL (or more efficient, in some cases)
  • If there’s a problem, you may be taking a deep dive into the code to figure it out

Whether it’s worthwhile is simply a matter of getting more out of it than you put in. As an alternative, I’ve started using a technique to build Objects on-the-fly from multi-table joins. This doesn’t handle every case (not even close!), but it does handle the cases I need.

So suppose you’ve got a webpage with Users and Posts and Comments. Each Post can have multiple Comments, and a User can “Like” a comment. A normalized version looks something like this:

CREATE TABLE users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)
);
CREATE TABLE posts (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)
);
CREATE TABLE comments (
  id INT AUTO_INCREMENT PRIMARY KEY,
  post_id INT,
  content VARCHAR(50)
);
CREATE TABLE liked_comments (
  user_id INT,
  comment_id INT
);

Now on this webpage, you want to show all of a user’s Liked Comments. So you probably have a view template that loops over the comments, showing the comment text and a link back to the Post, something like this:

<?php foreach($comments as $comment): ?>
  <div class="comment">
    <p><?php echo($comment->content); ?></p>
    <p>On <a href="<?php echo($comment->post->link()); ?>"><?php echo($comment->post->name); ?></a></p>
  </div>
<?php endforeach ?>

Now the question is, where should the post permalink come from? I can think of at least 3 reasonable answers:

// 1. from a method on the comment
<a href="<?php echo($comment->post_link()); ?>"><?php echo($comment->post_name()); ?></a>
 
// 2. from a method on the post
<a href="<?php echo($comment->post->link()); ?>"><?php echo($comment->post->name()); ?></a>
 
// 3. from the template, using properties of the comment
<a href="/post/<?php echo($comment->post_id); ?>"><?php echo($comment->post_name); ?></a>

I would argue that the 2nd option is the best. In the 1st option, the Comment class needs methods to handle displaying a post, which seems unnatural and leads to duplication. In the 3rd option, the View is building the URL, which is a pain if you ever want to change it later, since you’d need to update all your views. The best thing is to let the Post know how to build its own permalink. The method might look like this:

// in Post class
public function link() {
  return "/post/" . $this->id;
}

So how to build a list of Comments, each with a nested Post object? Here’s one possibility:

SELECT
  comments.content AS content,
  posts.id AS post_id,
  posts.name AS post_name
FROM liked_comments
JOIN comments ON liked_comments.comment_id = comments.id
JOIN posts ON comments.post_id = posts.id
WHERE user_id = 1

So that’s fine, let’s say you instantiate a Comment for each row. Something like this:

<?php
$comments = array();
$rs = mysql_query($sql);
while($row = mysql_fetch_assoc($rs)) {
  $comments[] = new Comment($row);
}
?>

So that creates a list of comments for our View. All that’s missing is to instantiate a nested Post for each Comment. This can be done in the Comment constructor:

public function __construct($args=NULL) {
  if($args && is_array($args)) {
    if(array_key_exists('post_id', $args) && array_key_exists('post_name', $args)) {
      $this->post = new Post();
      $this->post->id = $args['post_id'];
      $this->post->name = $args['post_name'];
    }
  }
  // other stuff..
}

So when we instantiate a Comment and provide the appropriate keys (post_id and post_name), it will instantiate a Post for us. It’s not really a proper Post, but more of a half-baked object. It doesn’t have an author, content or other things you might expect in a Post; instead, it has just the two keys to know how to display its permalink.

Now this works fine, but having a bunch of hacked-up constructors isn’t very nice and we’re still requiring the Comment class to know something about the structure of Posts. A better alternative is to make a super class with a more generic constructor that can be used by any class to instantiate any other class (or classes) based only on the column names in the row. Here is the more generic version I am currently using:

// in a base class
function __construct($row, $params=NULL)
{
    foreach($row as $k=>$v) {
      $this->$k = $v;
    }
    $klass_map = NULL;
    if( $params ) {
        if(array_key_exists('klass_map', $params)) {
            $klass_map = $params['klass_map'];
        } 
    }
    $vars = get_object_vars($this);
    foreach($vars as $k=>$v)
    {
        $split = strpos($k, '_');
        if( $split === FALSE ) {
            continue;
        } else {
            $prefix = substr($k, 0, $split);
            $postfix = substr($k, $split+1);
            if( ! isset($this->{$prefix}) ) {
                if( $klass_map && array_key_exists($prefix, $klass_map) ) {
                    $this->{$prefix} = new $klass_map[$prefix];
                } else {
                    $this->{$prefix} = new stdClass;
                }
            }
            $this->{$prefix}->{$postfix} = $v;
            unset($this->$k);
        }
    }
    //echo('<pre>');
    //exit(print_r($this));
}

Well that looks a little more complicated. Basically, it just looks to see if there is an underscore in each property name, and if there is, it tries to instantiate that property as a class. A mapping tells it which prefixes go with which classes. For example:

$this->post_id becomes $this->post->id
$this->post_name becomes $this->post->name
$this->user_id becomes $this->user->id
$this->content just stays the same (no underscore)
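
For anyone who finds the PHP hard to follow, here is the same prefix-splitting idea as a minimal Python sketch. It is not a port of the constructor above, just an illustration: nest_row and Generic are names I made up for this example, and Generic stands in for stdClass.

```python
class Post:
    def link(self):
        # mirrors the PHP Post::link() shown earlier
        return "/post/%s" % self.id


class Generic:
    """Hypothetical stand-in for PHP's stdClass."""


def nest_row(row, klass_map=None):
    """Build an object from a row dict, nesting keys on their first underscore."""
    klass_map = klass_map or {}
    obj = Generic()
    for key, value in row.items():
        prefix, sep, rest = key.partition("_")
        if not sep:
            # no underscore: plain property, e.g. content
            setattr(obj, key, value)
            continue
        child = getattr(obj, prefix, None)
        if child is None:
            # first key with this prefix: create the nested object
            child = klass_map.get(prefix, Generic)()
            setattr(obj, prefix, child)
        setattr(child, rest, value)
    return obj


comment = nest_row(
    {"content": "Nice post", "post_id": 7, "post_name": "Hello"},
    {"post": Post},
)
print(comment.post.link())
```

The nested Post only carries id and name, which is all link() needs.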

So how to use that constructor? Something like this:

<?php
$params = array(
  'klass_map' => array(
    'post' => 'Post', // post_ prefix maps to Post class
   ),
);
$comments = array();
$rs = mysql_query($sql);
while($row = mysql_fetch_assoc($rs)) {
  $comments[] = new Comment($row, $params);
}
?>

The key observation is that $this->post is not a generic stdClass, but an instance of Post that has been created with only the properties we know we’re gonna need.

There are some obvious downfalls here:

First, using magic constructors can make things unnecessarily complicated and may cause conflicts with libraries that do their own magic. Adding/removing (unsetting) properties seems particularly hazardous.

Second, you have to write your SQL carefully so you get the column names and mappings you need. In particular, column names like “modified_on” would not behave as expected: the constructor would split it into $this->modified->on. It should be easy to tweak the generic constructor to be a bit more robust.
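
One possible tweak, sketched in Python rather than PHP: only split a name when its prefix is explicitly listed in the class map. The helper name split_key is made up for this example, and the trade-off is that unmapped prefixes (like user_ above) would no longer be nested into a generic object.

```python
def split_key(key, known_prefixes):
    """Return (prefix, rest) if the key starts with a mapped prefix, else (None, key)."""
    prefix, sep, rest = key.partition("_")
    if sep and rest and prefix in known_prefixes:
        return prefix, rest
    return None, key


known = {"post"}
print(split_key("post_id", known))
print(split_key("modified_on", known))
```

With this rule modified_on stays a plain property, because "modified" was never mapped to a class.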

Also, this really only handles the case of these nested 1:1 mappings. I think you could extend the idea, which is fairly useful by itself, but I would bet it gets complicated quickly as you head towards real ORM territory.

Despite the shortcomings, I’m finding this to be a convenient way to construct objects on-the-fly at the early prototyping stages of a project when I’m constantly renaming things and moving code around.

Posted in Programming

Worse Is Better

Some interesting essays from computer history: Worse Is Better. The original essay considers the success of C against the arguably superior Lisps, which failed to gain widespread popularity. It seems hard to predict whether a particular product will succeed using Worse Is Better – lots of times, worse just sucks – but it’s useful in retrospect to see why a product wins.

Posted in Random Links

Random Links

This is super old and super cool – an article from 2007 in the New York Times about using Amazon/EC2/S3/Hadoop to produce static PDFs from 71 years (4TB) of NYT archives. I don’t think I had even heard of Hadoop in 2007.

Posted in Programming, Random Links

A Little Trick For Dealing With Lots Of View Files

I’ve been messing around with the Kohana Framework a lot lately. The simplicity of using the View Factory has led me to have a ton of little view snippets and it gets tricky remembering where they all live. So I started adding this line to the top of all my views:

<?php if(Kohana::$environment !== 'production'): echo('<!-- '.__FILE__.' -->'); endif; ?>

When there’s a problem, I can hit “view-source” and instantly locate the offending file. Not exactly some brilliant revolution, but it does come in handy.

Three more thoughts:

  • A better solution is probably to subclass View and do this automatically, instead of adding a line to the top of every file
  • Doing this in production code could leave hints to attackers about the layout of your webserver. The environment check should mitigate that, but still be careful.
  • The HTML comment may cause a problem if you’re using it for AJAX to generate HTML on the fly – I need to look into that further
Posted in Programming

… And We’re Back

Sorry about the delay folks. I meant to take the site offline for a day or two, but it ended up being a couple of months! You know how it goes – you get distracted working on other stuff and kinda just forget.. Anyway, now it’s back and I’m planning to write a bit more frequently.

I nuked everything when I took it offline; however, at least a couple of posts were actually useful to people, so those are being reposted and back-dated to their original publication dates. A lot of the content was just me thinking around a problem without getting into the specifics – those posts have not been restored.

Looking at my writing, I noticed that I tend towards the abstract in discussing programming problems and don’t post enough code. Hoping to remedy that going forward. Stay tuned!

Posted in Uncategorized

Ranking Blogs by Readability

Further proof that I’m a dork: this afternoon, instead of working on my apps, I was screwing around with pydot and matplotlib making visualizations of user engagement (on another blog). At one point it occurred to me that it might be interesting to plot the reading grade level of various blogs. So here goes:

Determining reading level is tricky business because there are so many different kinds of texts. Most methods seem to boil down to some combination of sentence length, word length, number of syllables and some magic numbers: throw it all together and you’ve got a score. Different scoring systems measure slightly different things, but it usually ends up as either a grade level or some numeric measure of reading difficulty or ease. For more background, Wikipedia is happy to go into great depth: Flesch-Kincaid Readability Test. I admit to kinda skimming the article because there was no way I was gonna implement the various metrics anyway.
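
To make the "magic numbers" concrete, here is a small Python sketch of two of the standard formulas, Flesch-Kincaid grade level and the Gunning Fog index. The functions take pre-computed counts, since counting syllables and "complex" words is the hard part, and I am not claiming GNU style counts them exactly this way.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index; complex words are usually those with 3+ syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


# made-up counts, for illustration only: 100 words, 5 sentences
print(round(flesch_kincaid_grade(100, 5, 150), 2))
print(round(gunning_fog(100, 5, 10), 2))
```

Longer sentences and longer words both push the grade level up, which is really all these metrics know about.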

Happily, there’s GNU style, a command line program which has already done the dirty work. Using the previous sentence, it outputs (abbreviated):

...
Kincaid: 6.0
ARI: 8.7
Coleman-Liau: 12.5
Flesch Index: 78.8/100
Fog Index: 6.0
Lix: 48.3 = school year 9
SMOG-Grading: 3.0
...

Hot.

The goal is to rank the reading difficulty of some blogs. So here’s the plan:

1. Get a list of blogs

2. Download each blog’s RSS feed

3. Run the combined content of each blog through style

4. Parse the results to get scores

5. Make pretty graphs

6. Draw unnecessarily broad conclusions
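
Step 4 can be sketched on its own. This is a Python 3 variant using one regular expression instead of a chain of find/split calls. The sample text is just the abbreviated output shown above, and parse_scores is a name made up for this example.

```python
import re

# one pattern for every metric line; captures the label and the first number
SCORE_LINE = re.compile(
    r"^\s*(Kincaid|ARI|Coleman-Liau|Flesch Index|Fog Index|Lix|SMOG-Grading):"
    r"\s*([-+]?\d+(?:\.\d+)?)",
    re.MULTILINE,
)


def parse_scores(output):
    """Map each metric label found in the output to its score as a float."""
    return {name: float(value) for name, value in SCORE_LINE.findall(output)}


sample = """
Kincaid: 6.0
ARI: 8.7
Coleman-Liau: 12.5
Flesch Index: 78.8/100
Fog Index: 6.0
Lix: 48.3 = school year 9
SMOG-Grading: 3.0
"""

print(parse_scores(sample))
```

Anything after the first number on a line ("/100", "= school year 9") is simply ignored.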

The blogs used in my experiment are: Gawker, TMZ, TheAwl, Treehugger, Engadget, ABC News, Huffington Post, Wired and Go Fug Yourself – a mix of news news, computer news, and celebrity news. The list is kinda random and kinda also pulled from the list given in the book Programming Collective Intelligence, which peripherally inspired this idea. Certainly it would be interesting to repeat the experiment on more blogs covering a wider range of subjects and intended for different audiences.

This first graph is the Fog Index, which corresponds to something like grade level.

Since this program was pieced-together in a couple of hours this afternoon, there are plenty of deficiencies. For example, readability metrics tend to assume that you’re working with something like normal paragraphs. When you’re dealing with blogs, that’s not necessarily the case, as you get things like Top-10 lists and Image-only posts. Annoyingly, many blogs only give a one sentence summary in the feed, instead of the full content. Not having enough sample text throws off the metrics. My program doesn’t check for that or a host of other big and little things. On the other hand, the graphs kinda agree with what you’d expect, so I think there is some merit to the general method. Maybe something to explore later..
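
A cheap guard against the too-little-text problem might look like the Python sketch below. The thresholds are arbitrary numbers I picked for illustration, and enough_text is a made-up helper, not something in the program that follows.

```python
import re


def enough_text(text, min_words=100, min_sentences=5):
    """Rough check that a feed has enough prose to score meaningfully."""
    words = text.split()
    # naive sentence split on terminal punctuation; fine for a sanity check
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words) >= min_words and len(sentences) >= min_sentences


print(enough_text("One short summary."))
print(enough_text(" ".join(["This is a plain sentence."] * 30)))
```

Feeds that fail the check could be skipped before piping anything to style.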

And, here’s the source. I’ll be the first to admit that it is kind of a trainwreck and could use substantial cleanup, but it does work! You are free to use this code in any way you see fit. But if you do something stupid with it, that’s not my fault..

You need a small handful of things for this to work: python, matplotlib, numpy, feedparser, and also GNU style and probably Linux, though it might work on a Mac or Windows? Dunno..

# updated 7-20-2011 to parse feeds a bit more robustly
# based partially on Mining the Social Web
# http://www.amazon.com/dp/1449388345
 
# fetch feeds
import os
import urllib2
import feedparser
 
# pipe to GNU style
from subprocess import Popen, PIPE
 
# clean up html
from nltk import clean_html
from BeautifulSoup import BeautifulStoneSoup
 
# plotting things
import matplotlib.pyplot as plot
import numpy.numarray as na
from pylab import get_cmap
 
# stfu unicode decode error
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
 
 
def cleanhtml(html):
    return BeautifulStoneSoup(clean_html(html), convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
 
def get_score(text):
    score = {'kincaid': 0, 'ari': 0, 'coleman': 0, 'flesch': 0, 'fog': 0, 'lix': 0, 'smog': 0, }
    sp = Popen(["style"], stdin=PIPE, stdout=PIPE)
    sp.stdin.write(text)
    sp.stdin.close()
    result = sp.stdout.read()
    lines = result.split("\n")
    for line in lines:
        if line.find('Kincaid:') > 0:
            parts = line.split('Kincaid:')
            score['kincaid'] = float(parts[1])
        elif line.find('ARI:') > 0:
            parts = line.split('ARI:')
            score['ari'] = float(parts[1])
        elif line.find('Coleman-Liau:') > 0:
            parts = line.split('Coleman-Liau:')
            score['coleman'] = float(parts[1])
        elif line.find('Flesch Index:') > 0:
            parts = line.split('Flesch Index:')
            parts = parts[1].split('/')
            score['flesch'] = float(parts[0])
        elif line.find('Fog Index:') > 0:
            parts = line.split('Fog Index:')
            score['fog'] = float(parts[1])
        elif line.find('Lix:') > 0:
            parts = line.split('Lix:')
            parts = parts[1].split('=')
            score['lix'] = float(parts[0])
        elif line.find('SMOG-Grading:') > 0:
            parts = line.split('SMOG-Grading:')
            score['smog'] = float(parts[1])
        else:
            pass
    return score
 
 
 
full_feeds = {}
 
# fetch RSS for URLs in file "feeds.txt"
FEEDS = 'feeds.txt'
feeds = [line.strip() for line in open(FEEDS) if line.strip()]  # drop blank lines and trailing newlines
for feed in feeds:
    fp = feedparser.parse(feed)
    blog_posts = []
    for e in fp.entries:
        if e.has_key('content'):
            blog_posts.append({'title': e.title, 'content': cleanhtml(e.content[0].value), 'link': e.links[0].href})
        elif e.has_key('summary_detail'):
            blog_posts.append({'title': e.title, 'content': cleanhtml(e.summary_detail.value), 'link': e.links[0].href})
    if blog_posts:
        text = ''.join(post['content'] for post in blog_posts)
        full_feeds[feed] = text
 
scores = {'kincaid': {}, 'ari': {}, 'coleman': {}, 'flesch': {}, 'fog': {}, 'lix': {}, 'smog': {} } 
 
# calculate reading ease score for each blog
for feed,text in full_feeds.items():
    score = get_score(text)
    name = feed.replace('http://','').replace('www','').replace('.com','').replace('feed','')
    name = name.split('/')[0]
    for k,v in score.items():
        scores[k][name] = v
    print "%s %s %s %s %s %s %s %s" % (name, score['kincaid'], score['ari'], score['coleman'], score['flesch'], score['fog'], score['lix'], score['smog'])
 
# plot results
color_map = get_cmap('gist_rainbow')
for kind,vals in scores.items():
    vals = [(k,v) for k,v in vals.items()]
    labels = [y[0] for y in vals]
    width = 0.5
 
    y1s = [y[1] for y in vals]
    x1s = na.array(range(len(y1s)))+width
    colors = [color_map(1.*i/len(x1s)) for i in range(len(x1s))]
 
    for x, y, c in zip(x1s, y1s, colors):
        plot.bar(x, y, width=width, color=c)
 
    plot.title(kind)
    plot.xticks(x1s + width/2, labels, rotation=270)
 
    # save, clear plot, clear axes
    plot.savefig("scores-%s.png" % kind)
    plot.clf()
    plot.cla()

Also, you need a file called “feeds.txt” to specify which feeds you want to compare, here’s the one I used for this post:


http://feeds.feedburner.com/TheAwl?format=xml

http://feeds.gawker.com/gawker/full

http://www.tmz.com/rss.xml

http://feeds2.feedburner.com/celebuzz/Kggb

http://rss.slashdot.org/Slashdot/slashdot

http://feeds.huffingtonpost.com/huffingtonpost/raw_feed

http://blogs.abcnews.com/theblotter/index.rdf

http://www.engadget.com/rss.xml

http://www.treehugger.com/index.rdf

* I was getting some unusual results from CNN and BBC feeds with scores dramatically outside the range of the other blogs. I haven’t really looked into it yet, but just a note – this method seems pretty fragile.

Posted in Programming

Automatically Converting PNGs to JPEGs in WordPress

This took me entirely too long to figure out.. The goal was to reduce page load time and server load by converting PNGs to JPEGs. This method may lack a bit of subtlety, as it *always* converts PNGs to JPEGs. Maybe sometimes you actually do want a PNG. Something to consider for the future.. anyway, dropping this in your functions.php or wrapping it in a plugin should do the trick:

// wp_handle_upload is a filter: the callback receives the upload array and must return it
add_filter('wp_handle_upload', 'my_resample_handle_upload');
function my_resample_handle_upload($arr) {
	if($arr['type'] != 'image/png') {
		return $arr;
	}
	$file = $arr['file'];
	$url = $arr['url'];
	$dst_file = substr($file, 0, -3) . 'jpg';
	$dst_url = substr($url, 0, -3) . 'jpg';
	list($width, $height, $type, $attr) = getimagesize($file);
	$image = imagecreatefrompng($file);
	$new_image = imagecreatetruecolor($width, $height);
	imagecopyresampled($new_image, $image, 0, 0, 0, 0, $width, $height, $width, $height);
	imagejpeg($new_image, $dst_file);  
	imagedestroy($new_image);
	$arr['file'] = $dst_file;
	$arr['url'] = $dst_url;
	$arr['type'] = 'image/jpeg';
	return $arr;
}

Seems to do the job. Lemme know if you run into difficulties or see problems with this method.

Posted in Programming, Wordpress

An Easy Munin PHP Plugin

I’ve been playing sysadmin lately, which is not my favorite thing in the world. Anyway, while trying to read up a bit on monitoring, I came across Munin – one of those apps I’ve heard of but never really gotten around to investigating properly. Well, it is pretty darn cool – and you can write your own plugins – and I love writing plugins.

I suppose it may be a bit more complicated, but from my perspective, it collects stats every 5 minutes and makes pretty graphs with rrdtool. In the simplest configuration, one computer is both the client (a munin-node, which collects its own stats) and the server (munin, which aggregates stats for all the clients). Then you leave it running for a few days, weeks or years and you’ll have a good idea of what your system should look like when things are running smoothly and when they are not. Hopefully this will allow you to anticipate and avoid problems, or at least understand what happened afterwards.

So that’s neat. Here’s a graph: (the empty area is where a server was rebooted and munin-node wasn’t set to restart automatically)

So every 5 minutes, munin-node activates itself and runs all plugins you’ve got set up. Each plugin outputs a label and a number (or multiple labels and numbers), which the server collects and uses to make graphs. So writing your own plugin is super easy. Here is one in PHP that graphs the number of registered users in a WordPress database.

#!/usr/bin/php
<?php
 
// these are setup in the plugin config, or you could hardcode them
$db = getenv('DB');
$host = getenv('HOST');
$user = getenv('USER');
$pass = getenv('PASS');
$prefix = getenv('PREFIX');
 
// this is for munin's configuration tool
// could do something more complicated here
if(count($argv) == 2 && $argv[1] == 'autoconf') {
	exit('yes');
}
// this is for rrdtool to know how to label the graph
if(count($argv) == 2 && $argv[1] == 'config') {
	echo("graph_title Users\n");
	echo("graph_vlabel count\n");
	echo("graph_category Wordpress\n");
	echo("wordpress_users.label Registered Users\n");
	// GAUGE plots the value as-is; COUNTER would plot its rate of change
	echo("wordpress_users.type GAUGE\n");
	exit();
}
 
// this is the usual case, generating a label and value
mysql_connect($host, $user, $pass);
mysql_select_db($db);
$rs = mysql_query("SELECT COUNT(*) FROM {$prefix}users");
$rows = mysql_fetch_array($rs);
echo("wordpress_users.value {$rows[0]}\n");

Obviously you’ll need to have a command-line PHP available at /usr/bin/php.
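
The plugin protocol itself is tiny, so here is the same idea as a Python sketch with the database part left out. plugin_output is a made-up helper and the count of 210 is a placeholder; a real plugin would run the query at that point.

```python
import sys


def plugin_output(args, user_count):
    """Return what the plugin should print for the given command-line args."""
    if args == ["autoconf"]:
        # munin's configuration tool asks whether the plugin can run here
        return "yes\n"
    if args == ["config"]:
        # describes the graph: title, axis label, category, field label
        return (
            "graph_title Users\n"
            "graph_vlabel count\n"
            "graph_category Wordpress\n"
            "wordpress_users.label Registered Users\n"
        )
    # the usual case: one "field.value number" line per field
    return "wordpress_users.value %d\n" % user_count


if __name__ == "__main__":
    # placeholder value; query the database here in a real plugin
    sys.stdout.write(plugin_output(sys.argv[1:], 210))
```

Keeping the output in a function that returns a string makes it easy to check without munin-run.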

To install it, put the file in the available plugins directory and symlink it to the active plugins directory:

cp /path/to/plugin /usr/share/munin/plugins/your_plugin_name
ln -s /usr/share/munin/plugins/your_plugin_name /etc/munin/plugins/your_plugin_name

So that’s pretty easy. There is also the plugin configuration where you set up the environment variables. This is particularly handy, as you could use multiple configurations of the same plugin to provide stats about different databases.

In this case, I just edited /etc/munin/plugin-conf.d/munin-node and created a new section matching my plugin’s name, but I think there are other ways to do it for more complex configurations:

[your_plugin_name]
env.DB db_name_here
env.HOST localhost
env.USER db_user_here
env.PASS db_pass_here
env.PREFIX table_prefix_here

And finally, finally, you can test it to make sure it is working using munin-run. This is important because simply running it by itself will not set up the environment variables. It wasn’t on my path, but ended up in /usr/sbin:

/usr/sbin/munin-run your_plugin_name
wordpress_users.value 210

Oh yeah. When you’re done testing and want to add your plugin for real, you need to restart munin-node:

/etc/init.d/munin-node restart

So I think it’s a pretty cool and easy way to monitor just about anything. As always, there may be errors or inaccuracies, so munin experts, please offer corrections if you see a problem.

References:
My PHP plugin was based heavily on: http://thomasfischer.biz/?p=174

Posted in Programming

Making My Own WordPress Chartbeat Plugin

Instead of doing something useful this morning, I made my own little plugin using the Chartbeat API to display the most popular posts on a WordPress blog.

Note: There is really no reason to do this. The Chartbeat Plugin does this exact same thing and more. However, it was an entertaining exercise for me to practice writing WordPress plugins.

Also Note: This only works if you have signed up for Chartbeat and gotten an API Key.

The reason this is cool? Well, most of your “most popular posts” plugins need to make an extra call to the database to get/set a counter because WordPress doesn’t track page views by default. But if you’re using Chartbeat to track your blog’s performance, you can save some effort by using their numbers instead.

And with no further ado, here’s the code:

<?php
/*
Plugin Name: Ct Most Popular
Plugin URI: http://www.craiget.com
Description: Display most viewed posts using the Chartbeat API, exposes one function: ct_most_popular_plugin_widget(); 
Version: 0.1
Author: Craige
Author URI: http://craiget.com
License: For example and testing purposes. Not suggested for use on a real site.
*/
 
$ct_most_popular_plugin_version = "0.1";
 
$ct_most_popular_plugin_data = array();
 
// create a most_popular option
register_activation_hook(__FILE__, 'ct_most_popular_plugin_install');
function ct_most_popular_plugin_install()
{
	// variables declared at file level are not visible inside this function, so pass the defaults directly
	add_option("ct_most_popular_plugin_data", array());
	add_option("ct_most_popular_plugin_version", "0.1");
	// schedule hourly update
	wp_schedule_event(time(), 'hourly', 'ct_most_popular_plugin_update_event');
}
 
// delete the most_popular option
register_deactivation_hook(__FILE__, 'ct_most_popular_plugin_uninstall');
function ct_most_popular_plugin_uninstall()
{
	delete_option("ct_most_popular_plugin_data");
	delete_option("ct_most_popular_plugin_version");
	// un-schedule hourly update
	wp_clear_scheduled_hook('ct_most_popular_plugin_update_event');
}
 
// appear under "Settings" on the admin page
add_action('admin_menu', 'ct_most_popular_plugin_menu');
function ct_most_popular_plugin_menu() {
	// the 4th argument is the menu slug; it should be a unique, non-empty string
	add_options_page('Ct Most Popular', 'Ct Most Popular', 'manage_options', 'ct-most-popular', 'ct_most_popular_plugin_options');
}
 
// init option values in db
add_action('admin_init', 'ct_most_popular_plugin_options_init' );
function ct_most_popular_plugin_options_init(){
	register_setting('ct_most_popular_plugin_options', 'ct_most_popular_plugin', 'ct_most_popular_plugin_validate' );
}
 
// sanitize and validate input
function ct_most_popular_plugin_validate($input) {
	$input['host'] =  wp_filter_nohtml_kses($input['host']);
	$input['chartbeat_api_key'] =  wp_filter_nohtml_kses($input['chartbeat_api_key']);
	$input['limit'] =  (int)($input['limit']);
	if($input['limit'] == 0) $input['limit'] = 10;
	return $input;
}
 
// display options page html
function ct_most_popular_plugin_options() {
	if (!current_user_can('manage_options'))  {
		wp_die(__('You do not have sufficient permissions to access this page.') );
	}
?>
<div class="wrap">
	<h2>Ct Most Popular Plugin Options Title</h2>
	<form method="post" action="options.php">
		<?php settings_fields('ct_most_popular_plugin_options'); ?>
		<?php $options = get_option('ct_most_popular_plugin'); ?>
		<table class="form-table">
		<tr valign="top">
			<th scope="row">Host</th>
			<td><input type="text" name="ct_most_popular_plugin[host]" value="<?php echo $options['host']; ?>" /></td>
			<td><i>ie, example.com</i></td>
		</tr>
		<tr valign="top">
			<th scope="row">Chartbeat API Key</th>
			<td><input type="text" name="ct_most_popular_plugin[chartbeat_api_key]" value="<?php echo $options['chartbeat_api_key']; ?>" /></td>
			<td><i><a href="http://chartbeat.com/apikeys/">http://chartbeat.com/apikeys/</a></i></td>
		</tr>
		<tr valign="top">
			<th scope="row">Limit</th>
			<td><input type="text" name="ct_most_popular_plugin[limit]" value="<?php echo $options['limit']; ?>" /></td>
			<td><i>number of items to show, 10</i></td>
		</tr>
		</table>
		<p class="submit">
			<input type="submit" class="button-primary" value="<?php _e('Save Changes') ?>" />
		</p>
		<p>
		This plugin uses the <a href="http://chartbeat.pbworks.com/">Chartbeat API</a> to show the most popular pages on your site, updated hourly.
		</p>
		<p>
		This plugin was created for my own amusement and to practice creating Wordpress plugins, it is <strong>NOT RECOMMENDED</strong> for use.
		</p>
		<p>
		Chartbeat has released a perfectly good plugin that does this and more: <a href="http://wordpress.org/extend/plugins/chartbeat/">http://wordpress.org/extend/plugins/chartbeat/</a>
		</p>
		<p>
		This plugin fetches new data once every hour using Wordpress's built-in <a href="http://codex.wordpress.org/Function_Reference/wp_schedule_event">scheduling hooks</a> to update the list of popular posts hourly.
		This keeps things self-contained, but doesn't provide much flexibility. You may want to use cron instead, which would require a little hacking.
		</p>
	</form>
</div>
<?php
}
 
// get popularity data from chartbeat, store in db
add_action('ct_most_popular_plugin_update_event', 'ct_most_popular_plugin_update_chartbeat');
function ct_most_popular_plugin_update_chartbeat() {
	// construct chartbeat call
	$options = get_option('ct_most_popular_plugin');
	$host = $options['host'];
	$apikey = $options['chartbeat_api_key'];
	$limit = $options['limit'];
	// build url
	$url = 'http://api.chartbeat.com/toppages/?host=HOST&limit=LIMIT&apikey=APIKEY';
	$url = str_replace('HOST', $host, $url);
	$url = str_replace('APIKEY', $apikey, $url);
	$url = str_replace('LIMIT', $limit, $url);
	// fetch data
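	$data = file_get_contents($url);
	// exit if the request failed
	if($data === false)
		return;
	$data = json_decode($data, true);
	// exit if the response is not a list or not enough results came back
	if(!is_array($data) || count($data) < $limit)
		return;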
	$result = array();
	for($i=0; $i<count($data); $i++) {
		if($data[$i]['path'] == "/")
			continue;
		$result[] = $data[$i];
	}
	$result = array_slice($result, 0, $limit);
	// store in db
	update_option("ct_most_popular_plugin_data", $result);
}
 
// add this function in your sidebar
function ct_most_popular_plugin_widget() {
	$data = get_option("ct_most_popular_plugin_data");
	echo('<ul>');
	foreach ($data as $post) {
		echo('<li>');
		echo('<a href="'.$post['path'].'">'.$post['visitors'].'-'.$post['i'].'</a>');
		echo('</li>');
	}
	echo('</ul>');
}

Go to “Settings” > “Ct Most Popular” to set your API Key and other options.

Updates occur once each hour.

You’ll almost certainly want to tweak the way the posts are displayed in the ct_most_popular_plugin_widget() function.
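
The fetch-and-filter logic is easier to see without the WordPress plumbing around it, so here is a rough Python sketch of just that part. The endpoint is the one used in the plugin above (I have not checked whether it still exists), and toppages_url and popular_posts are made-up names.

```python
from urllib.parse import urlencode


def toppages_url(host, apikey, limit):
    """Build the request URL, letting urlencode handle any escaping."""
    query = urlencode({"host": host, "limit": limit, "apikey": apikey})
    return "http://api.chartbeat.com/toppages/?" + query


def popular_posts(pages, limit):
    """Drop the front page and keep at most `limit` entries."""
    return [page for page in pages if page.get("path") != "/"][:limit]


# made-up response data, shaped like the fields the widget reads
pages = [
    {"path": "/", "visitors": 40},
    {"path": "/post/1", "visitors": 12},
    {"path": "/post/2", "visitors": 9},
]
print(toppages_url("example.com", "YOUR_API_KEY", 10))
print(popular_posts(pages, 1))
```

Using urlencode instead of string replacement avoids surprises if the host or key ever contains characters that need escaping.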

Anyway.. just fooling around.. For all the frustration it has caused me.. Still gotta say, WordPress is pretty friggin’ cool.

Posted in Programming, Wordpress

Fetching Android Market Stats with Selenium RC

Finally.. I’ve got a reasonably decent way to pull Android Market stats. For some reason I keep coming back to this topic. This time, the way forward is to use Selenium RC, part of the Selenium browser testing suite.

My example will be in Python, but Selenium has bindings for several languages.

First of all, you gotta download Selenium RC from here: http://seleniumhq.org/download/

Then, extract it someplace you can remember. I’ve been putting things in ~/opt lately.

Okay, now create a new python script, like this:

import sys
sys.path.append('/the/path/to/selenium-python-client-driver-1.0.1')
 
from selenium import selenium
 
email = 'YOUR_GOOGLE_LOGIN'
passwd = 'YOUR_PASSWORD'
 
s = selenium("localhost", 4444, "*firefox", "http://market.android.com")
s.start()
s.open("/publish/Home")
s.type("Email", email)
s.type("Passwd", passwd)
s.click("signIn")
s.wait_for_page_to_load("30000")
 
n = int(s.get_xpath_count("//div[@class='listingRow']"))
for i in range(3,n):
  try:
    title = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[1]/div[1]" % i)
    downloaded = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[2]/div[1]/span[1]" % i)
    installed = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[2]/div[2]/span[1]" % i)
    comments = s.get_text("xpath=(//div[@class='listingRow'])[%s]/table" % i)[1:-1]
    print title, downloaded, installed, comments
  except:
    pass

* Be sure to fill in YOUR_GOOGLE_LOGIN with your email (or whatever login) and the matching password.

This script is a bit of a trainwreck.. but it works and I don’t feel like screwing with it..

* Working with xpath in selenium-rc’s python binding feels really weird.. doesn’t seem to behave quite the way you would expect.

* Why does the iteration start at 3? I dunno.. there are some empty rows at the beginning I guess..

* Why is it wrapped in a try-except block? I dunno.. some empty rows at the end?

* It works on Ubuntu 10.04 / FF 3.6.3. Your mileage may vary. I wouldn’t be surprised if those xpath selectors needed more tweaking in some cases.
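
One thing that may explain some of the weirdness: XPath positions are 1-based, while Python's range() stops one short of its upper bound, so range(3, n) never asks for row n. Whether that last row actually holds data on this page, I don't know. Here is a small sketch of building the indexes and locators explicitly (row_indexes and row_xpath are made-up helpers):

```python
def row_indexes(n, first=3):
    """1-based XPath positions from `first` through `n`, inclusive."""
    return range(first, n + 1)


def row_xpath(i, suffix=""):
    """Locator for the i-th listingRow div, plus an optional sub-path."""
    return "xpath=(//div[@class='listingRow'])[%d]%s" % (i, suffix)


for i in row_indexes(5):
    print(row_xpath(i, "/div[1]/div[1]"))
```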

To run the script, you need to start the Selenium RC server. Go to the place you downloaded it:

cd /path/to/selenium
java -jar selenium-server.jar

Then, you should be able to run this script from a terminal and it will start firefox, log you in to the Android Developer Console, wait a few seconds til the Ajax all loads, then use xpath to scrape each row of data from the table and print it to the terminal.

From there it should be pretty simple to export the results into a CSV file or make pretty charts or whatever it is you wanna do.
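
For the CSV part, a sketch in Python 3 using the standard csv module. The row values below are invented for illustration, and rows_to_csv is a made-up helper that returns the CSV text instead of writing a file.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (title, downloaded, installed, comments) tuples as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["title", "downloaded", "installed", "comments"])
    writer.writerows(rows)
    return buf.getvalue()


# invented sample row; the csv module quotes the value containing a comma
print(rows_to_csv([("My App", "1,234", "567", "8")]))
```

Letting the csv module do the quoting matters here, since download counts scraped from the page may contain commas.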

It does pop up a window on the screen, which is kinda annoying. Cooler to run firefox headless, maybe some other time..

Posted in Programming