Your Android Developer Account Will Live Forever

I’ve been shutting down my app business since it doesn’t really make enough money to be worth the hassle of properly running a business, filing taxes, etc. Part of that process is closing all my accounts, including the Android Developer account. Well, apparently that is not possible. Following a couple of rounds of emails to Google, they say the account cannot be archived or deleted. The best you can do is to unpublish all your apps and change your password to something random. I was met with a similar surprise at the end of last year when I tried to close an account with an advertiser – “YOU WANT TO DO WHAT?!?” – they have thousands of clients, but apparently no one had ever asked to close their account before.

Uncool.

Posted in Uncategorized | Leave a comment

Automatically Save HTML Of Every Page You Visit

For the last couple of weeks, I’ve been thinking about the best way to capture the HTML of every webpage I visit. Sure, you can always write a screen scraper or bot, but I guess I wanted something a little more organic.

The right answer to this problem is probably a caching proxy. Alternatively, tcpdump or some cleverness with copying the temporary files from the browser cache might also work. However, I think there’s a strong case for using the browser directly: first, you get nice cleaned-up HTML, and second, you get JavaScript execution (handy if there is Ajax stuff on the page or if you want to use jQuery for pre-processing the HTML).

Basically, you want this:

  • Load a webpage as normal
  • Inject an additional script to..
  • Grab the DOM as a string
  • POST to a webserver to save it for processing later (avoiding cross-domain rules)

In Firefox you’ve got Greasemonkey and User Scripts. These work in Chrome too, but it seems like the cross-domain restriction may be problematic. I didn’t investigate too much further after reading that there might be a problem. Happily, if you write a proper full-on Chrome Extension, you can specify exceptions to the cross-domain rules.

So, the following is the script I pieced together this morning. It’s a Chrome extension that grabs the source of every page you load (using jQuery’s DOM methods). Then it POSTs to your local webserver. My example below is pretty minimal just to demonstrate that it works. Maybe someday I’ll package it as a real extension, make it configurable and release it, but, you know, probably not.

Use at your own risk and all the usual disclaimers. Also, you should probably lock down the permissions and matches attributes to only run on your local server against the pages you’re interested in.

manifest.json

{
  "name": "Capture HTML and POST to local server",
  "version": "0.0.1",
  "description": "Capture HTML and POST to local server",
  "permissions": [
    "http://*/*"
  ],
  "content_scripts": [
    {
      "matches": ["http://*/*"],
      "js" : ["jquery.min.js","contentscript.js"],
      "run_at":"document_end"
    }
  ],
  "background_page": "background.html"
}

contentscript.js

function captureHTML() {
    var html = '<html>' + $('html').html() + '</html>';
    chrome.extension.sendRequest({html: html}, function(response) {
        alert(response.result);
    });
}
captureHTML();

background.html

<html>
<head>
<script type="text/javascript" src="jquery.min.js"></script>
<script type="text/javascript">// <![CDATA[
 
    chrome.extension.onRequest.addListener(
        function(request, sender, sendResponse) {
            var html = request.html;
            var url = 'http://localhost/recv.php';
            var data = {html:html};
            $.post(url, data, function(result) {
                sendResponse({result: result});                    
            });
    });
 
// ]]></script>
</head>
</html>

recv.php

<?php
$html = $_POST['html'];
$result = strlen($html);
echo ($result);
error_log($html);

Also, you will need to download a copy of the latest minified jQuery and save it into the extension folder as jquery.min.js. The PHP receiver needs to go somewhere on your local server; be sure to set the matching path in background.html.
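
If PHP isn’t handy, the receiving end could probably be a few lines of Python instead. This is only a rough sketch and not something from the extension above: the function names, the “captures” folder and port 8000 are all made up for illustration, and the URL in background.html would have to point at wherever this ends up listening.

```python
import hashlib
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


def extract_html(body):
    """Pull the 'html' field out of a form-encoded POST body (what $.post sends)."""
    fields = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    return fields.get("html", [""])[0]


def save_capture(html, out_dir="captures"):
    """Write the page to out_dir, named by a hash of its content; return the path."""
    os.makedirs(out_dir, exist_ok=True)
    name = hashlib.sha1(html.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(out_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path


class CaptureHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        html = extract_html(self.rfile.read(length))
        save_capture(html)
        # like recv.php, answer with the length of what was received
        reply = str(len(html)).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def serve(port=8000):
    """Listen on localhost only (hypothetical port); blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), CaptureHandler).serve_forever()
```

Calling serve() starts the listener. Binding to 127.0.0.1 keeps it reachable only from your own machine, which seems like the right default for something that receives every page you look at.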

So it seems to work. I think it’s kinda fun. If you know a better way to do this, please let me know.

Posted in Programming | Leave a comment

Extracting Table Data From PDFs with OCR

PDF is the ideal format for things you don’t want anybody to read.

Kidding.. sort of.. I am a bit biased against PDFs. Though I reluctantly admit their usefulness in a very few situations, mostly they’re just annoying. For a recent project, I wanted to extract a bunch of data from PDF documents (several hundred pages). All the data was nicely arranged in table format, as if it had been exported from Excel or something. Why the original Excel documents were not made available remains a mystery. Unfortunately, Select All – Copy – Paste completely mangled the text, but happily, it was possible to wrangle the data from the PDFs via OCR and some Python scripting.

The script below works like this:

  • Take a PDF file
  • Split it into separate pages
  • Convert each page into an image file (pixels)
  • Locate the horizontal and vertical lines on each page (long runs of black pixels)
  • Segment the image into cells using the line coordinates
  • Clean up each cell (remove borders, threshold to black and white)
  • Perform OCR on each cell
  • Assemble results into a 2D array
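
The line-finding step in the list above boils down to measuring the longest run of black pixels in each row (or column). Here is a stripped-down sketch of just that idea, using plain nested lists of grayscale values instead of a real image; the names and the “0 means black” convention are assumptions for the example.

```python
def longest_run(pixels, is_black=lambda p: p == 0):
    """Length of the longest unbroken run of black pixels in a sequence."""
    best = current = 0
    for p in pixels:
        if is_black(p):
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


def find_line_rows(grid, min_run):
    """Indices of rows containing a black run of at least min_run pixels."""
    return [y for y, row in enumerate(grid) if longest_run(row) >= min_run]


def find_line_cols(grid, min_run):
    """Same as find_line_rows, but scanning columns (grid must be rectangular)."""
    return find_line_rows([list(col) for col in zip(*grid)], min_run)


grid = [
    [255, 0, 0, 0, 255],
    [0, 255, 0, 255, 0],
    [0, 0, 0, 0, 0],
]
print(find_line_rows(grid, 3))  # rows 0 and 2 contain long runs
```

Tracking the best run inside the black branch means a line that touches the edge of the page is still counted.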

Optical Character Recognition is pretty amazing stuff, but it isn’t always perfect. To get the best possible results, it helps to use the cleanest input you can. In my initial experiments, I found that performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the software compressed all whitespace into a single empty space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like “^” on each cell boundary – something the OCR would still recognize and that I could use later to split the resulting strings.

Instead, I decided to OCR each cell individually. While slower, this seemed cleaner, more flexible, and easier to debug.

So here’s the code. There are a few dependencies:

  • Recent-ish Python
  • PIL (Python Imaging Library)
  • Tesseract OCR (I am using v3, but I think v2 will work too)
  • ImageMagick (to split PDFs into multiple pages)

It is slightly tuned to the particular files I was interested in (for example, it expects the cell borders to be solid black). It is also pretty slow – so if you need to process a massive number of pages, this won’t work for you. Also, it expects to operate in the directory you run it from and it expects there to be a subdirectory called “working” for temporary files. I suppose I should make the script do that automatically.. lazy, I guess..

from PIL import Image, ImageOps
import subprocess, sys, os, glob
 
# minimum run of adjacent pixels to call something a line
H_THRESH = 300
V_THRESH = 300
 
def get_hlines(pix, w, h):
    """Get start/end pixels of lines containing horizontal runs of at least THRESH black pix"""
    hlines = []
    for y in range(h):
        x1, x2 = (None, None)
        black = 0
        run = 0
        for x in range(w):
            if pix[x,y] == (0,0,0):
                black = black + 1
                # compare against None: x == 0 is a valid start but is falsy
                if x1 is None: x1 = x
                x2 = x
                # update here so a run touching the right edge still counts
                if black > run:
                    run = black
            else:
                black = 0
        if run > H_THRESH:
            hlines.append((x1,y,x2,y))
    return hlines
 
def get_vlines(pix, w, h):
    """Get start/end pixels of lines containing vertical runs of at least THRESH black pix"""
    vlines = []
    for x in range(w):
        y1, y2 = (None,None)
        black = 0
        run = 0
        for y in range(h):
            if pix[x,y] == (0,0,0):
                black = black + 1
                # compare against None: y == 0 is a valid start but is falsy
                if y1 is None: y1 = y
                y2 = y
                # update here so a run touching the bottom edge still counts
                if black > run:
                    run = black
            else:
                black = 0
        if run > V_THRESH:
            vlines.append((x,y1,x,y2))
    return vlines
 
def get_cols(vlines):
    """Get top-left and bottom-right coordinates for each column from a list of vertical lines"""
    cols = []
    for i in range(1, len(vlines)):
        if vlines[i][0] - vlines[i-1][0] > 1:
            cols.append((vlines[i-1][0],vlines[i-1][1],vlines[i][2],vlines[i][3]))
    return cols
 
def get_rows(hlines):
    """Get top-left and bottom-right coordinates for each row from a list of horizontal lines"""
    rows = []
    for i in range(1, len(hlines)):
        if hlines[i][1] - hlines[i-1][3] > 1:
            rows.append((hlines[i-1][0],hlines[i-1][1],hlines[i][2],hlines[i][3]))
    return rows          
 
def get_cells(rows, cols):
    """Get top-left and bottom-right coordinates for each cell using row and column coordinates"""
    cells = {}
    for i, row in enumerate(rows):
        cells.setdefault(i, {})
        for j, col in enumerate(cols):
            x1 = col[0]
            y1 = row[1]
            x2 = col[2]
            y2 = row[3]
            cells[i][j] = (x1,y1,x2,y2)
    return cells
 
def ocr_cell(im, cells, x, y):
    """Return OCRed text from this cell"""
    fbase = "working/%d-%d" % (x, y)
    ftif = "%s.tif" % fbase
    ftxt = "%s.txt" % fbase
    cmd = "tesseract %s %s" % (ftif, fbase)
    # extract cell from whole image, grayscale (1-color channel), monochrome
    region = im.crop(cells[x][y])
    region = ImageOps.grayscale(region)
    region = region.point(lambda p: p > 200 and 255)
    # determine background color (most used color)
    histo = region.histogram()
    if histo[0] > histo[255]: bgcolor = 0
    else: bgcolor = 255
    # trim borders by finding top-left and bottom-right bg pixels
    pix = region.load()
    x1,y1 = 0,0
    x2,y2 = region.size
    x2,y2 = x2-1,y2-1
    while pix[x1,y1] != bgcolor:
        x1 += 1
        y1 += 1
    while pix[x2,y2] != bgcolor:
        x2 -= 1
        y2 -= 1
    # save as TIFF and extract text with Tesseract OCR
    trimmed = region.crop((x1,y1,x2,y2))
    trimmed.save(ftif, "TIFF")
    subprocess.call([cmd], shell=True, stderr=subprocess.PIPE)
    with open(ftxt) as f:
        lines = [l.strip() for l in f.readlines()]
    # OCR may produce no text at all for a blank cell
    return lines[0] if lines else ""
 
def get_image_data(filename):
    """Extract textual data[rows][cols] from spreadsheet-like image file"""    
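    # normalize to RGB so the (0,0,0) comparisons work for grayscale or paletted PNGs too
    im = Image.open(filename).convert("RGB")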
    pix = im.load()
    width, height = im.size
    hlines = get_hlines(pix, width, height)
    sys.stderr.write("%s: hlines: %d\n" % (filename, len(hlines)))
    vlines = get_vlines(pix, width, height)
    sys.stderr.write("%s: vlines: %d\n" % (filename, len(vlines)))
    rows = get_rows(hlines)
    sys.stderr.write("%s: rows: %d\n" % (filename, len(rows)))
    cols = get_cols(vlines)
    sys.stderr.write("%s: cols: %d\n" % (filename, len(cols)))
    cells = get_cells(rows, cols)
 
    data = []
    for row in range(len(rows)):
        data.append([ocr_cell(im,cells, row, col) for col in range(len(cols))]) 
    return data
 
def split_pdf(filename):
    """Split PDF into PNG pages, return filenames"""
    prefix = filename[:-4]
    cmd = "convert -density 600 %s working/%s-%%d.png" % (filename, prefix)
    subprocess.call([cmd], shell=True)
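    pages = glob.glob(os.path.join('working', '%s-*.png' % prefix))
    # glob order is arbitrary; sort numerically so page 10 follows page 9
    return sorted(pages, key=lambda f: int(f[:-4].rsplit('-', 1)[1]))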
 
def extract_pdf(filename):
    """Extract table data from pdf"""
    pngfiles = split_pdf(filename)
    sys.stderr.write("Pages: %d\n" % len(pngfiles))
    # extract table data from each page
    data = []
    for pngfile in pngfiles:
        pngdata = get_image_data(pngfile)
        for d in pngdata:
            data.append(d)
        # remove temp files for this page
        os.system("rm working/*.tif")
        os.system("rm working/*.txt")
    # remove split pages
    os.system("rm working/*")   
    return data
 
if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: ctocr.py FILENAME")
        sys.exit(1)
    # split target pdf into pages
    filename = sys.argv[1]
    data = extract_pdf(filename)
    for row in data:
        print("\t".join(row))

Anyhow, I think it is kinda fun. Since the OCR is not actually magic, some post-processing may be necessary. In particular, I’ve noticed “o” (the letter) in place of “0” (the number) sometimes, extra whitespace or oddly split words, and occasional wrong letters. But overall, the accuracy is still fantastic.
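
For the letter-o-instead-of-zero and stray whitespace problems, a tiny cleanup pass over each cell might be enough. This is a hypothetical sketch, not part of the script above: it assumes a cell counts as “numeric” if it contains at least one digit and otherwise only digits, the letter o, and common number punctuation.

```python
import re

# characters allowed in a cell we are willing to treat as a number
NUMERIC_LIKE = re.compile(r"[\doO.,\-+ ]+")


def clean_cell(text):
    """Collapse whitespace and fix o/O misread as 0 in numeric-looking cells."""
    text = " ".join(text.split())
    if NUMERIC_LIKE.fullmatch(text) and re.search(r"\d", text):
        text = text.replace("o", "0").replace("O", "0")
    return text


print(clean_cell("1,0o0.5O"))
print(clean_cell("  total   cost "))
```

Requiring at least one real digit keeps words made only of o’s from being rewritten, but anything cleverer (wrong letters, split words) would need rules specific to your data.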

The usual caveats apply: use at your own risk, etc.

Posted in Programming | Leave a comment

Huh, Bitcoin = Pretty interesting

Read an interesting article on Ars Technica this morning. Looks like Bitcoin had already made the rounds earlier this summer, but I guess I missed it.

Bitcoin is the first legitimate crypto-currency, an idea first suggested in 1998. It is unique in several ways:

First of all, it is (mostly) anonymous, just like cash. Mostly – because, like cash, it is not anonymous under conditions of physical surveillance or if either party is coerced.

Second, it eliminates the need for 3rd party payment processors like Paypal and even credit cards. In a traditional online transaction, the payment processor holds the secret account numbers for both parties and conducts the transaction. Under the bitcoin scheme, all transactions are published freely using public key cryptography to conceal the identities of both parties. This allows the economy to incorporate the transfer of money without needing an intermediate payment processor.

Also interesting is that the system is designed to be inflation-proof. Unlike a traditional national currency, bitcoin is controlled by an algorithm. There’s no central authority that can decide to increase the money supply and cause inflation. Instead, there is a fixed supply of 21M bitcoins which will be distributed at a geometrically decreasing rate. Each bitcoin can be subdivided, so as trading in single bitcoins becomes impractical, people can trade in millibitcoins and microbitcoins.
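
The 21M figure falls out of a geometric series, which is easy to check with a few lines of arithmetic. The specifics here are not from the article summary above, so treat them as assumptions: a reward that starts at 50 coins per block, halves every 210,000 blocks, and is tracked in whole units of one hundred-millionth of a coin.

```python
# Assumed parameters (not stated in the post): initial reward, halving
# interval, and the smallest indivisible unit of a coin.
UNITS_PER_COIN = 10**8
BLOCKS_PER_HALVING = 210000
INITIAL_REWARD_COINS = 50


def total_supply_units():
    """Sum the block rewards over every halving period until the reward hits zero."""
    reward = INITIAL_REWARD_COINS * UNITS_PER_COIN
    total = 0
    while reward > 0:
        total += BLOCKS_PER_HALVING * reward
        reward //= 2  # integer halving, so fractions of a unit are dropped
    return total


print(total_supply_units() / UNITS_PER_COIN)
```

Without the rounding, the series 50 + 25 + 12.5 + ... sums to 100, and 100 x 210,000 is exactly 21 million; the integer halving makes the real total land just under that.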

As a P2P network, the system relies on creating consensus between nodes and can be subverted if someone can muster enough computing resources to control more than half the network. In the age of massive botnets, that’s not unfeasible. Proponents argue that there’s no economic incentive, since subverting the network would ruin the saboteur’s own bitcoin investment. However, there still seems to be a risk from someone stealing the network just for laughs.

Another danger is that, at least anecdotally, bitcoin is being used to buy/sell illegal goods and services or for money laundering. That doesn’t bode well for its long-term viability. To be really useful, it needs some mainstream acceptance. A list of sites that accept the currency looks mildly promising.

Anyway.. it seems like quite an interesting system – and very sci-fi. The system even comes with an anonymous inventor who designed the protocol and published the original paper under a pseudonym.

I’m not buying bitcoins just yet. But it would be neat to see something like this catch on.

Posted in Uncategorized | Leave a comment

A Little Job Scraper

Oftentimes you reach a point in a project where it is handy to have some real data. So today I wrote a little program to grab one page worth of Want Ads from the venerable Craigslist.

Having served its intended purpose, it seemed fun to tweak the program to keep track of new job postings on Craigslist. So.. here’s that..

This program just reads the pages you specify and scans for any URLs it hasn’t seen before. If you run it via cron, say, once a day, it will give you the new postings for that day. Each new URL is recorded, so it doesn’t notify you twice about the same job.

In Python:

import urllib2, time
from BeautifulSoup import BeautifulSoup
 
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
 
import socket
socket.setdefaulttimeout(5)
 
# pages to monitor
categories = [
    "http://knoxville.craigslist.org/sof/",
    "http://knoxville.craigslist.org/eng"
]
 
# data file for visited url list
dat = ".cl.exclude"
 
# build list of urls already visited
exclude = []
try:
    for line in open(dat).readlines():
        exclude.append(line[:-1])
except:
    pass
 
 
# get unseen urls from each category page
urls = []
for category in categories:
    try:
        page = urllib2.urlopen(category)
        soup = BeautifulSoup(page)
        for a in soup.findAll('a'):
            # must be a url
            if not a.has_key('href'): continue
            # must match current category (to exclude help pages/etc)
            if a['href'].find(category) == -1: continue
            # ok, keep this url
            urls.append(a['href'])
    except Exception, e:
        raise e
 
# visit each url to get the title and content
for url in urls:
    # skip if already seen
    if url in exclude: continue
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        title = soup.find("title").string
        body = soup.find("div", {"id": "userbody"}).string
        # do something interesting here, like email the list to yourself
        print url, title
    except Exception, e:
        raise e
    # scrape slowly
    time.sleep(10)
 
# write list of all urls from this time
# note: there is no need to remember ALL the old urls since
# the urls are unique and we aren't dealing with pagination 
# it is safe to forget urls that are past the first page of results
fout = open(dat,'w')
for url in urls:
    fout.write(url+"\n")
fout.close()

Obviously, scraping is potentially rude. This is pretty lightweight, since it only checks URLs it hasn’t seen before and waits 10 seconds between visits. Nevertheless, use at your own risk.
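
The script above is Python 2. The part that actually matters, remembering which URLs were already seen, is independent of the scraping and could be sketched in Python 3 roughly like this (the function names are made up for the example; the file format is the same one-URL-per-line list).

```python
import os


def load_seen(path):
    """Read the previously recorded URLs, one per line; missing file means none."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def split_new(urls, seen):
    """Return URLs not seen before, keeping order and dropping repeats."""
    fresh = []
    for url in urls:
        if url not in seen and url not in fresh:
            fresh.append(url)
    return fresh


def save_seen(path, urls):
    """Overwrite the record with the URLs found on this run."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
```

Using a set for the seen list keeps the membership test cheap, though with one page of listings per category it hardly matters.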

The best way to use this is probably tweaking it to email you about new jobs. I’ve omitted that code for two reasons:

  1. It is pretty well documented elsewhere
  2. Email originating from a home server will probably be rejected anyway

Posted in Uncategorized | Leave a comment

Yay Dash C

I think our internet is rate-limited. That’s annoying because I don’t do any of the stuff that maybe deserves it (looking at you BitTorrent!). I haven’t exactly quantified the problem yet, but the main symptom is a very reasonable rate of 300K or so dropping to 3K-5K after the first 1-2 MB. Since many webpages are in the 1-2 MB range (or substantially smaller), it isn’t a big deal for regular browsing, but video becomes basically unwatchable. I’m not sure if the rate-limiting is on specific types of files (video) or everything.. or maybe I’m just imagining the whole thing.

Either way – Dialup is so 1999. Right?!

Thankfully, there’s youtube-dl, which downloads YouTube videos for offline viewing. Unfortunately, the rate-limiting is still problematic. After a couple of MB, the rate drops and the download effectively stops (and doesn’t appear to recover if you leave it running for a while). Youtube-dl has a “-c” option (just like wget) which tries to continue your previous download instead of starting over.

A totally garbage solution that works: just restart the download every 10 seconds until it’s done. You get the good rate for a few seconds and restart every time the rate drops. This works.. but doing it by hand is annoying (or unfeasible for a big file). A better solution is to have a script that runs youtube-dl automatically for 10 seconds, kills it, restarts it, and repeats until the file is completely downloaded.

So it would be nice to have a way to run a program for a certain number of seconds. People much smarter than me have already figured this out in the form of a bash script:

http://www.bashcookbook.com/bashinfo/source/bash-4.0/examples/scripts/timeout3

You can use it like this:

timeout 10 youtube-dl -c "url_of_youtube_video"

So that works, now just wrap it up in a loop. 10 tries is probably enough to get a video. I know there are smarter ways to check for completion, but I’m pretty lazy and this is good enough:

for i in {1..10}
do
  timeout 10 youtube-dl -c "url_of_youtube_video"
done

Not exactly as good as just watching videos in the browser, but it resolves my frustration anyway.
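
One of the smarter ways to check for completion would be to look at the exit status instead of blindly looping 10 times. Here is a hedged sketch of the same kill-and-restart trick in Python; it assumes the downloader exits with status 0 once the file is complete, and the function name and defaults are invented for the example.

```python
import subprocess


def run_with_restarts(cmd, seconds=10, max_tries=10):
    """Run cmd, killing it after `seconds`, until it exits cleanly.

    Returns the attempt number that succeeded, or None if every try failed.
    """
    for attempt in range(1, max_tries + 1):
        try:
            # subprocess.run kills the child itself when the timeout expires
            result = subprocess.run(cmd, timeout=seconds)
        except subprocess.TimeoutExpired:
            continue  # the next attempt resumes where this one stopped (-c)
        if result.returncode == 0:
            return attempt
    return None
```

Usage would be something like run_with_restarts(["youtube-dl", "-c", "url_of_youtube_video"]). The loop stops as soon as one run finishes on its own, so a short video doesn’t get pointlessly restarted.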

Posted in Uncategorized | Leave a comment

The Overton Window

I love finding out that some fluttering thought has a proper name.

Reasonable people should agree that simply having two sides to an issue doesn’t make them equally correct. If you disagree, just take any issue you feel strongly about, consider the polar opposite, and decide if you would agree to the 50-50 compromise. You would? Okay, well move the other viewpoint one step towards the extreme. Would you still agree? Certainly not – 50-50 became 40-60 – the former “compromise” now favors your opponent.

Suppose I argue that a triangle has 5 sides.

You say 3.

Should we compromise on 4?

What if I say 10? Is the number of sides of a triangle even up for debate?

That’s the essence of the Overton Window – the range of beliefs that reasonable people can hold on a topic.

The difficulty lies in the Argument to Moderation, a fallacy that, given two extremes, the truth necessarily lies in the middle. Proponents of a particular viewpoint can manipulate the Overton Window by adopting values more extreme than their actual beliefs. As a result, the apparent middle ground shifts, changing the whole debate.

Not exactly a revelation – people manipulate each other and the public opinion.

I was just intrigued that there’s a term for that particular phenomenon.

Posted in Uncategorized | Leave a comment

Web Font Picker

Google Web Fonts is kinda awesome. If you haven’t checked it out already – basically it gives you a ton of new font choices that still degrade gracefully for older browsers. All you have to do is add a stylesheet to your page and specify the ‘font-family’. It truly couldn’t be easier. Also.. yay, free!

One annoyance is the workflow. You have to look at the collection, edit your css and/or webpages, reload, repeat. (However, the fonts are available for download, if you use Photoshop/Gimp/etc to design your pages).

The following code lets you change fonts on the fly by adding a little dropdown box to the top left corner. When you double-click anything on the page, it will be styled with the chosen font. The code is a bit ugly since it’s just the first thing that came to mind. Nevertheless, I think it’s kinda neat for experimenting.

// list of fonts to try
var families = ['Yellowtail','Astigmatic','Leckerli One'];
// build the dropdown box
$('body').append($('<select id="fontpicker"></select>'));
for(var i=0; i<families.length; i++) {
    $('#fontpicker').append('<option value="'+families[i]+'">'+families[i]+'</option>');
}
$('#fontpicker').css({'position': 'absolute','top': '0px', 'left': '0px'});
 
// bind doubleclick on every element
$('*').live('dblclick', function() {
    var family = $('#fontpicker').val();
    // encode the family name so fonts with spaces (e.g. 'Leckerli One') make a valid URL
    var href = "http://fonts.googleapis.com/css?family="+encodeURIComponent(family)+"&v2";
    var stylesheet = "<link href='"+href+"' rel='stylesheet' type='text/css'>";
    $(this).css('font-family', family);
    // try not to load the same stylesheet twice
    var found = 0;
    $("head link[rel='stylesheet']").each(function() {
        if($(this).attr('href') == href) {
            found = 1;
        }
    });
    if(found == 0) {
        $('head').append(stylesheet);
    }
});

There’s one big problem still, which is that you need to specify WHICH fonts you want to make available to the switcher. Better than doing it one at a time, but not as good as pulling the complete list from Google. I’m not sure of a super great way to achieve that, but maybe something to consider.
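
One possible way to get the complete list: as far as I know, Google offers a Web Fonts Developer API that returns the catalogue as JSON if you give it an API key. The sketch below is from memory and untested against the live service, so the endpoint and the “items”/“family” field names should be checked against the current documentation; the function names are made up. It generates the “families” line for the script above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Google Fonts Developer API; requires your own key.
API_URL = "https://www.googleapis.com/webfonts/v1/webfonts"


def parse_families(payload):
    """Extract the family names from the API's JSON response text."""
    data = json.loads(payload)
    return [item["family"] for item in data.get("items", [])]


def fetch_families(api_key):
    """Download the font list (network access and a valid key needed)."""
    with urlopen(API_URL + "?" + urlencode({"key": api_key})) as resp:
        return parse_families(resp.read().decode("utf-8"))


def as_js_array(families):
    """Render the names as the JavaScript 'families' declaration."""
    return "var families = " + json.dumps(families) + ";"


sample = '{"items": [{"family": "Yellowtail"}, {"family": "Leckerli One"}]}'
print(as_js_array(parse_families(sample)))
```

Generating the list offline and pasting it in avoids shipping an API key inside the page, at the cost of the list going stale.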

Posted in Programming | Leave a comment

Half-baked Objects and 10% ORM

I’ve used Object Relational Mapping (ORM) libraries on a few projects in the past. Without getting into the many, many details, ORM bridges the gap between data storage in a relational database and Object-Oriented Programming. Simply, instead of writing SQL queries, you let the ORM library write them for you. It’s great when it works out, but like all code generators, there are some potential downsides:

  • One more library to learn
  • May generate inefficient SQL (or more efficient, in some cases)
  • If there’s a problem, you may be taking a deep dive into the code to figure it out

Whether it’s worthwhile is simply a matter of getting more out of it than you put in. As an alternative, I’ve started using a technique to build Objects on-the-fly from multi-table joins. This doesn’t handle every case (not even close!), but it does handle the cases I need.

So suppose you’ve got a webpage with Users and Posts and Comments. Each Post can have multiple Comments, and a User can “Like” a comment. A normalized version looks something like this:

CREATE TABLE users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)
);
CREATE TABLE posts (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)
);
CREATE TABLE comments (
  id INT AUTO_INCREMENT PRIMARY KEY,
  post_id INT,
  content VARCHAR(50)
);
CREATE TABLE liked_comments (
  user_id INT,
  comment_id INT
);

Now on this webpage, you want to show all of a user’s Liked Comments. So you probably have a view template that loops over the comments, showing the comment text and a link back to the Post, something like this:

<?php foreach($comments as $comment): ?>
  <div class="comment">
    <p><?php echo($comment->content); ?></p>
    <p>On <a href="<?php echo($comment->post->link()); ?>"><?php echo($comment->post->name); ?></a></p>
  </div>
<?php endforeach ?>

Now the question is, where should the post permalink come from? I can think of at least 3 reasonable answers:

// 1. from a method on the comment
<a href="<?php echo($comment->post_link()); ?>"><?php echo($comment->post_name()); ?></a>
 
// 2. from a method on the post
<a href="<?php echo($comment->post->link()); ?>"><?php echo($comment->post->name); ?></a>
 
// 3. from the template, using properties of the comment
<a href="/post/<?php echo($comment->post_id); ?>"><?php echo($comment->post_name); ?></a>

I would argue that the 2nd option is the best. In the 1st option, the Comment class needs methods to handle displaying a post, which seems unnatural and leads to duplication. In the 3rd option, the View is building the URL, which is a pain if you ever want to change it later, since you’d need to update all your views. The best thing is to let the Post know how to build its own permalink. The method might look like this:

// in Post class
public function link() {
  return "/post/" . $this->id;
}

So how to build a list of Comments, each with a nested Post object? Here’s one possibility:

SELECT
  comments.content AS content,
  posts.id AS post_id,
  posts.name AS post_name
FROM liked_comments
JOIN comments ON liked_comments.comment_id = comments.id
JOIN posts ON comments.post_id = posts.id
WHERE user_id = 1

So that’s fine, let’s say you instantiate a Comment for each row. Something like this:

<?php
$comments = array();
$rs = mysql_query($sql);
while($row = mysql_fetch_assoc($rs)) {
  $comments[] = new Comment($row);
}
?>

So that creates a list of comments for our View. All that’s missing is to instantiate a nested Post for each Comment. This can be done in the Comment constructor:

public function __construct($args=NULL) {
  if($args && is_array($args)) {
    if(array_key_exists('post_id', $args) && array_key_exists('post_name', $args)) {
      $this->post = new Post();
      $this->post->id = $args['post_id'];
      $this->post->name = $args['post_name'];
    }
  }
  // other stuff..
}

So when we instantiate a Comment and provide the appropriate keys (post_id and post_name), it will instantiate a Post for us. It’s not really a proper Post, but more of a half-baked object. It doesn’t have an author, content or other things you might expect in a Post; instead, it has just the two keys to know how to display its permalink.

Now this works fine, but having a bunch of hacked-up constructors isn’t very nice and we’re still requiring the Comment class to know something about the structure of Posts. A better alternative is to make a super class with a more generic constructor that can be used by any class to instantiate any other class (or classes) based only on the column names. Here is the more generic version I am currently using:

// in a base class
function __construct($row, $params=NULL)
{
    foreach($row as $k=>$v) {
      $this->$k = $v;
    }
    $klass_map = NULL;
    if( $params ) {
        if(array_key_exists('klass_map', $params)) {
            $klass_map = $params['klass_map'];
        } 
    }
    $vars = get_object_vars($this);
    foreach($vars as $k=>$v)
    {
        $split = strpos($k, '_');
        if( $split === FALSE ) {
            continue;
        } else {
            $prefix = substr($k, 0, $split);
            $postfix = substr($k, $split+1);
            if( ! isset($this->{$prefix}) ) {
                if( $klass_map && array_key_exists($prefix, $klass_map) ) {
                    $this->{$prefix} = new $klass_map[$prefix];
                } else {
                    $this->{$prefix} = new stdClass;
                }
            }
            $this->{$prefix}->{$postfix} = $v;
            unset($this->$k);
        }
    }
    //echo('<pre>');
    //exit(print_r($this));
}

Well that looks a little more complicated. Basically, it just looks to see if there is an underscore in each property name, and if there is, it tries to instantiate that property as a class. A mapping tells it which prefixes go with which classes. For example:

$this->post_id becomes $this->post->id
$this->post_name becomes $this->post->name
$this->user_id becomes $this->user->id
$this->content just stays the same (no underscore)

So how to use that constructor? Something like this:

<?php
$params = array(
  'klass_map' => array(
    'post' => 'Post', // post_ prefix maps to Post class
   ),
);
$comments = array();
$rs = mysql_query($sql);
while($row = mysql_fetch_assoc($rs)) {
  $comments[] = new Comment($row, $params);
}
?>

The key observation is that $this->post is not a generic stdClass, but an instance of Post that has been created with only the properties we know we’re gonna need.

There are some obvious downsides here:

First, using magic constructors can make things unnecessarily complicated and may cause conflicts with libraries that do their own magic. Adding/removing (unsetting) properties seems particularly hazardous.

Second, you have to write your SQL carefully so you get the column names and mappings you need. In particular, column names like “modified_on” would not behave as expected: the constructor would split it into a “modified” object with an “on” property. It should be easy to tweak the generic constructor to be a bit more robust.
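
One such tweak would be to split only on prefixes that appear in the class map and leave every other column alone. Here is that idea sketched in Python rather than PHP, just to show the logic; nest_row and the little Post class are stand-ins invented for the example, not the code I’m running.

```python
from types import SimpleNamespace


class Post:
    """Stand-in for the real Post class; only knows how to build its permalink."""

    def link(self):
        return "/post/%d" % self.id


def nest_row(row, class_map):
    """Turn a flat row dict into an object, nesting only mapped prefixes."""
    obj = SimpleNamespace()
    for key, value in row.items():
        prefix, sep, rest = key.partition("_")
        if not sep or prefix not in class_map:
            # no underscore, or an unmapped prefix like "modified": keep flat
            setattr(obj, key, value)
            continue
        child = getattr(obj, prefix, None)
        if child is None:
            child = class_map[prefix]()
            setattr(obj, prefix, child)
        setattr(child, rest, value)
    return obj


row = {"content": "Nice post", "post_id": 7, "post_name": "Hello", "modified_on": "2011-08-01"}
comment = nest_row(row, {"post": Post})
print(comment.post.link(), comment.modified_on)
```

The trade-off is that the stdClass fallback goes away: anything you want nested has to be listed in the map explicitly, which is probably a fair price for not mangling ordinary column names.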

Also, this really only handles the case of these nested 1:1 mappings. I think you could extend the idea, which is fairly useful by itself, but I would bet it gets complicated quickly as you head towards real ORM territory.

Despite the shortcomings, I’m finding this to be a convenient way to construct objects on-the-fly at the early prototyping stages of a project when I’m constantly renaming things and moving code around.

Posted in Programming | Leave a comment

Worse Is Better

Some interesting essays from computer history: Worse Is Better. The original essay considers the success of C against the arguably superior Lisps, which failed to gain widespread popularity. It seems hard to predict whether a particular product will succeed using Worse Is Better – lots of times, worse just sucks – but it’s useful in retrospect to see why a product wins.

Posted in Random Links | Leave a comment