Scraping Android Market Stats with Python and MozRepl


A few weeks ago I was quite keen on the idea of gathering stats and creating charts to track the popularity of my Android apps. Alas, despite digging around in various packages and experimenting with cURL, I could never seem to log in programmatically to the Android Marketplace Developer Console. So I gave up and went back to working on my next app. Now I've come up with another reason to do some screen-scraping, so I thought I should give this another try.

Half the magic here belongs to a very cool Firefox plugin called MozRepl which lets you open a telnet connection to Firefox and interact with it via Javascript. Awesome, no?
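To make the telnet interaction concrete: MozRepl echoes the command you send and ends every reply with its "repl>" prompt, so the value you asked for sits between the two. Here's a small sketch of peeling both off a captured reply; the sample reply is fabricated for illustration, and the exact echo/prompt format may vary with MozRepl versions.

```python
def clean_reply(reply, prompt="repl>"):
    # drop the trailing "repl>" prompt, if present
    body = reply
    if body.endswith(prompt):
        body = body[:-len(prompt)]
    # the first line is just our own command echoed back
    lines = body.splitlines()
    return "\n".join(lines[1:]).strip()

# e.g. after sending "content.document.title\n" over telnet, the raw
# reply might look like this (made-up sample):
sample = 'content.document.title\n"Android Market"\nrepl>'
```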

All you have to do is ask MozRepl to go to the Developer Console, download the HTML, and run it through BeautifulSoup (the rest of the magic) to extract the data.

It turns out to be just slightly trickier because MozRepl needs to talk to Python via Telnet. I suppose this script could be set up in cron to grab stats a couple of times each day. I think I'm just gonna run it manually every once in a while.

import BeautifulSoup, re, time
import os, telnetlib
# Install MozRepl Plugin
# Setup MozRepl to start automatically with FF, check that port number is 4242
# Login to Developer Console once manually so login credentials get saved

# Create a new profile and set this accordingly
profile = 'my_firefox_profile'

# go to Developer Console using new profile
url = ''
os.system("firefox -no-remote -P %s %s &" % (profile, url))
time.sleep(5) #give Firefox a few seconds to start and load the page

#connect to MozRepl, ask Firefox for the page HTML, and read the reply
t = telnetlib.Telnet("localhost", 4242)
t.read_until("repl>") #wait for the prompt
t.write("content.document.body.innerHTML\n")
body = t.read_until("repl>")
t.close()

#is there a better way to do this?
os.system("killall -9 firefox")

#yank stats out of HTML
now = time.strftime("%Y-%m-%d %H:%M:%S")
soup = BeautifulSoup.BeautifulSoup(body)
table = soup.find("div", { "class" : "listingTable" })
for row in table.findAll('div', {'class':'listingRow'}):
 app = row.find("div", { "class" : "listingApp" })
 rating = row.find("div", { "class" : "listingRating" })
 stats = row.find("div", { "class" : "listingStats" })
 if app and rating and stats:
   #NOTE: extraction below is reconstructed; adjust to whatever markup
   #the console actually serves
   name = ''.join(app.findAll(text=True)).strip()
   counts = re.findall(r'[\d,]+', ''.join(stats.findAll(text=True)))
   total = counts[0]
   active = counts[1]
   nratings = ''.join(rating.findAll(text=True)).strip()[1:-1] # "(28)" -> "28"
   stars = len(rating.findAll(attrs={'style':re.compile("scroll -78px")}))
   print now, name, total, "total", active, "active", nratings, "ratings", stars, "stars"
#that's it, now maybe save these to a CSV or a log file..
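If you do want the CSV route, here's a minimal sketch of appending each run's rows to a file with the stdlib csv module. The field order is just my own choice, mirroring the print statement above, not anything the console defines.

```python
import csv

def append_stats(path, rows):
    # rows: list of (timestamp, name, total, active, nratings, stars) tuples
    f = open(path, "a")
    writer = csv.writer(f)
    writer.writerows(rows)
    f.close()
```

Each cron run (or manual run) just appends, so the file accumulates a time series you can chart later.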

I debated whether to show my actual numbers. Here you go, enjoy:

2009-04-03 17:45:15 Measure Stuff 4 total 1 active 2 ratings 1 stars
2009-04-03 17:45:15 Measure Stuff Lite 3006 total 995 active 28 ratings 2 stars
2009-04-03 17:45:15 RGB Probe 4 total 2 active 2 ratings 1 stars
2009-04-03 17:45:15 Thumb Maze 112 total 39 active 8 ratings 3 stars
2009-04-03 17:45:15 Thumb Maze Lite 16313 total 8813 active 172 ratings 3 stars
Uh oh, those numbers are not very good at all! So far my plan to live off Android looks doomed, but maybe things will pick up in the future. Two of the apps appear twice because there is a paid version and a free one. Can you tell which is which? =). Also, I think there is something wrong with RGB Probe. I've gotten a couple of e-mails saying the download failed.
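Since the whole point was charting these numbers over time, here's a sketch of parsing the log lines above back into tuples you could feed to a plotting library. The regex simply mirrors the format of the script's print statement.

```python
import re

# one log line: "<timestamp> <app name> N total N active N ratings N stars"
LINE = re.compile(
    r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) (.+?) "
    r"(\d+) total (\d+) active (\d+) ratings (\d+) stars$")

def parse_line(line):
    m = LINE.match(line.strip())
    if not m:
        return None
    when, name, total, active, nratings, stars = m.groups()
    return (when, name, int(total), int(active), int(nratings), int(stars))
```

The non-greedy name group works because the app name is always followed by the first "N total" pair.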

So I hope folks will find this script useful. Obviously, use of this code is completely at your own risk. Screen scrapers are an arguably questionable enterprise, so don't blame me if you hose your Firefox profile or Google gets mad at you.

Also, if anyone knows the cURL incantation that will do the same thing sans Firefox, I'd love to hear it. I kept getting a 302 response and never quite figured it out. I've taken several suggestions based on other Google services that 'should work', but for some reason don't.

There are certainly pros and cons to screen scraping through the browser; I'll only point out two advantages. First, you get 'real' Javascript executed right in Firefox. With many of the big data sites being Ajax-heavy, simply fetching the HTML without executing the JS only gets you halfway there. Second, it is possible to detect and block screen scrapers by looking for unusual or suspicious request patterns. I don't know if any sites actually do this, but it could be done. For example, a simple fetch via wget looks different to a server than a fetch with Firefox, and the difference goes beyond User-Agent strings. The CSS, images, Javascript, and such will also be fetched in a particular way, and a server can look for anything unusual in the order or timing with which resources are requested. Sound crazy? You're right! It probably is, and I'm not sure anybody actually does this. In fact, it very possibly wouldn't work well at all in practice; for one thing, it could screw up text-only browsers. But I think it is still within the realm of possibility..

Now for balance, two downsides: First, the browser needs a window to run in. This means it is kinda slow, hijacks your computer for a few seconds, and doesn't really lend itself to parallelization. Second, tools like cURL and wget and many language-specific libraries are practically standard, while this approach depends on a full Firefox install plus an extra plugin.