Ranking Blogs by Readability

(2011)

Further proof that I'm a dork: this afternoon, instead of working on my apps, I was screwing around with pydot and matplotlib making visualizations of user engagement (on another blog). At one point it occurred to me that it might be interesting to plot the reading grade level of various blogs. So here goes:

Determining reading level is tricky business because there are so many different kinds of texts. Most methods seem to boil down to some combination of sentence length, word length, number of syllables, and some magic numbers: throw it all together and you've got a score. Different scoring systems measure slightly different things, but it usually ends up as either a grade level or some numeric measure of reading difficulty or ease. For more background, Wikipedia is happy to go into great depth: Flesch-Kincaid Readability Test. I admit to kinda skimming the article because there was no way I was gonna implement the various metrics anyway.
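For a concrete taste, the best-known of these is the Flesch-Kincaid grade level: 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. Here's a toy Python version; the syllable counter is a crude vowel-run heuristic I'm using purely for illustration, not anything a real implementation would use:

import re

def count_syllables(word):
    # crude approximation: count runs of consecutive vowels
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def kincaid_grade(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # 0.39 * words/sentence + 11.8 * syllables/word - 15.59
    return (0.39 * len(words) / len(sentences)
            + 11.8 * float(syllables) / len(words) - 15.59)

# very simple text can legitimately score below zero
print kincaid_grade("The cat sat on the mat. The dog slept in the sun.")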

Happily, there's GNU style, a command-line program that has already done the dirty work. Fed the previous sentence, it outputs (abbreviated):

...
Kincaid: 6.0
ARI: 8.7
Coleman-Liau: 12.5
Flesch Index: 78.8/100
Fog Index: 6.0
Lix: 48.3 = school year 9
SMOG-Grading: 3.0
...
Hot.

The goal is to rank the reading difficulty of some blogs. So here's the plan:

  1. Get a list of blogs

  2. Download each blog's RSS feed

  3. Run the combined content of each blog through style

  4. Parse the results to get scores

  5. Make pretty graphs

  6. Draw unnecessarily broad conclusions

The blogs used in my experiment are: Gawker, TMZ, TheAwl, Treehugger, EnGadget, ABC News, Huffington Post, Wired and Go Fug Yourself - a mix of straight news, computer news, and celebrity news. The list is kinda random, partly pulled from the list given in the book Programming Collective Intelligence, which peripherally inspired this idea. Certainly it would be interesting to repeat the experiment on more blogs covering a wider range of subjects and intended for different audiences.

This first graph is the Fog Index, which corresponds to something like grade level.

[Figure: scores-fog.png - Fog Index scores by blog]

Since this program was pieced together in a couple of hours this afternoon, there are plenty of deficiencies. For example, readability metrics tend to assume that you're working with something like normal paragraphs. When you're dealing with blogs, that's not necessarily the case: you get things like top-10 lists and image-only posts. Annoyingly, many blogs only give a one-sentence summary in the feed instead of the full content, and not having enough sample text throws off the metrics. My program doesn't check for that, or a host of other big and little things. On the other hand, the graphs kinda agree with what you'd expect, so I think there is some merit to the general method. Maybe something to explore later.
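If I were going to shore it up, the first thing I'd add is a minimum-sample check before scoring a feed. This is just a sketch of the idea, not part of the program below, and the 1000-word threshold is a number I picked arbitrarily:

MIN_WORDS = 1000  # arbitrary threshold, tune to taste

def enough_text(text, minimum=MIN_WORDS):
    # readability formulas get flaky on tiny samples, so skip feeds
    # that only hand out a sentence or two per post
    return len(text.split()) >= minimum

# hypothetical usage: drop feeds that are too short to score fairly
# full_feeds = dict((f, t) for f, t in full_feeds.items() if enough_text(t))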

And here's the source. I'll be the first to admit that it's kind of a trainwreck and could use substantial cleanup, but it does work! You are free to use this code in any way you see fit. But if you do something stupid with it, that's not my fault.

You need a small handful of things for this to work: Python, matplotlib, numpy, feedparser, nltk, BeautifulSoup, and GNU style, plus probably Linux, though it might work on a Mac or Windows? Dunno.

# updated 7-20-2011 to parse feeds a bit more robustly
# based partially on Mining the Social Web
# http://www.amazon.com/dp/1449388345

# fetch and parse RSS feeds
import feedparser

# pipe to GNU style
from subprocess import Popen, PIPE

# clean up html
from nltk import clean_html
from BeautifulSoup import BeautifulStoneSoup

# plotting things
import matplotlib.pyplot as plot
import numpy as np
from pylab import get_cmap

# stfu unicode decode error
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def cleanhtml(html):
    # strip tags with nltk, then decode any html entities with BeautifulSoup
    return BeautifulStoneSoup(clean_html(html),
        convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

def get_score(text):
    # pipe the text through GNU style and parse out each readability score
    labels = {'kincaid': 'Kincaid:', 'ari': 'ARI:', 'coleman': 'Coleman-Liau:',
              'flesch': 'Flesch Index:', 'fog': 'Fog Index:',
              'lix': 'Lix:', 'smog': 'SMOG-Grading:'}
    score = dict((key, 0) for key in labels)
    sp = Popen(["style"], stdin=PIPE, stdout=PIPE)
    # communicate() avoids the deadlock that write()/read() can hit on big texts
    result, _ = sp.communicate(text)
    for line in result.split("\n"):
        for key, label in labels.items():
            if label in line:
                # take the first number after the label; this also handles
                # "Flesch Index: 78.8/100" and "Lix: 48.3 = school year 9"
                value = line.split(label)[1].split('/')[0].split('=')[0]
                score[key] = float(value)
    return score

full_feeds = {}

# fetch RSS for URLs in file "feeds.txt"
FEEDS = 'feeds.txt'
for feed in open(FEEDS).readlines():
    feed = feed.strip()
    if not feed:
        continue
    fp = feedparser.parse(feed)
    blog_posts = []
    for e in fp.entries:
        # prefer the full content; some feeds only provide a summary
        if 'content' in e:
            blog_posts.append({'title': e.title,
                'content': cleanhtml(e.content[0].value),
                'link': e.links[0].href})
        elif 'summary_detail' in e:
            blog_posts.append({'title': e.title,
                'content': cleanhtml(e.summary_detail.value),
                'link': e.links[0].href})
    if blog_posts:
        # join with newlines so posts don't run together mid-sentence
        full_feeds[feed] = '\n'.join(post['content'] for post in blog_posts)

scores = {'kincaid': {}, 'ari': {}, 'coleman': {}, 'flesch': {}, 'fog': {}, 'lix': {}, 'smog': {}}

# calculate reading ease scores for each blog
for feed, text in full_feeds.items():
    score = get_score(text)
    # boil the feed URL down to a short label for the graphs
    name = feed.replace('http://', '').replace('www.', '').replace('.com', '').replace('feed', '')
    name = name.split('/')[0]
    for k, v in score.items():
        scores[k][name] = v
    print "%s %s %s %s %s %s %s %s" % (name, score['kincaid'], score['ari'],
        score['coleman'], score['flesch'], score['fog'], score['lix'], score['smog'])

# plot results: one bar chart per metric
color_map = get_cmap('gist_rainbow')
for kind, vals in scores.items():
    labels = vals.keys()
    y1s = vals.values()
    width = 0.5
    x1s = np.arange(len(y1s)) + width
    # give each blog its own color from the rainbow colormap
    colors = [color_map(1. * i / len(x1s)) for i in range(len(x1s))]

    for x, y, c in zip(x1s, y1s, colors):
        plot.bar(x, y, width=width, color=c)

    plot.title(kind)
    plot.xticks(x1s + width / 2, labels, rotation=270)

    # save, then clear the figure and axes for the next metric
    plot.savefig("scores-%s.png" % kind)
    plot.clf()
    plot.cla()

Also, you need a file called "feeds.txt" to specify which feeds you want to compare; here's the one I used for this post:

http://feeds.feedburner.com/TheAwl?format=xml
http://feeds.gawker.com/gawker/full
http://www.tmz.com/rss.xml
http://feeds2.feedburner.com/celebuzz/Kggb
http://rss.slashdot.org/Slashdot/slashdot
http://feeds.huffingtonpost.com/huffingtonpost/raw_feed
http://blogs.abcnews.com/theblotter/index.rdf
http://www.engadget.com/rss.xml
http://www.treehugger.com/index.rdf
* I was getting some unusual results from CNN and BBC feeds with scores dramatically outside the range of the other blogs. I haven't really looked into it yet, but just a note - this method seems pretty fragile.
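A cheap way to catch that kind of weirdness automatically would be to sanity-check scores before graphing them, flagging anything far from the pack. Here's a sketch, assuming the scores dict from the program above; the 2x-of-median cutoff is a number I made up:

def flag_outliers(metric_scores, factor=2.0):
    # flag blogs whose score is more than `factor` times the median,
    # or less than the median divided by `factor`
    values = sorted(metric_scores.values())
    median = values[len(values) // 2]
    return [name for name, v in metric_scores.items()
            if median and (v > factor * median or v < median / factor)]

print flag_outliers(scores['fog'])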