Extract Tabular Data From PDFs with OCR


PDF is the ideal format for things you don't want anybody to read.

Kidding... sort of. I am a bit biased against PDFs. Though I reluctantly admit they are useful in a few situations, mostly they're just annoying. For a recent project, I wanted to extract a bunch of data from PDF documents (several hundred pages). All the data was nicely arranged in table format, as if it had been exported from Excel or something. Why the original Excel documents were not made available remains a mystery. Unfortunately, Select All - Copy - Paste completely mangled the text, but happily, it was possible to wrangle the data from the PDFs via OCR and some Python scripting.

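The script below works like this: it splits the PDF into one PNG image per page (using ImageMagick's convert), scans each page for long runs of black pixels to find the table's horizontal and vertical lines, turns those lines into row and column boundaries, crops out every cell, and runs Tesseract OCR on each cell. The recognized text is printed as tab-separated rows.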

Optical Character Recognition is pretty amazing stuff, but it isn't always perfect. To get the best possible results, it helps to use the cleanest input you can. In my initial experiments, I found that performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the OCR software collapsed every run of whitespace into a single space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like "^", on each cell boundary - something the OCR would still recognize and that I could use later to split the resulting strings.
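
To make the marker idea concrete: if every cell boundary were stamped with "^" before OCR, splitting the recognized text back into cells would be simple. A minimal sketch, assuming Python 3 and a marker that never occurs in the data itself; split_cells and the sample row are made up for illustration and are not part of the script.

```python
def split_cells(ocr_line, marker="^"):
    """Split one OCRed table row into cell strings on the boundary marker.

    Hypothetical helper: assumes the marker was drawn on every cell
    boundary before OCR and never appears in the cell text itself.
    """
    inner = ocr_line.strip()
    # Drop a single marker at each outer table border, keep the inner ones.
    if inner.startswith(marker):
        inner = inner[len(marker):]
    if inner.endswith(marker):
        inner = inner[:-len(marker)]
    # Normalize whitespace inside each cell; empty cells stay as "".
    return [" ".join(cell.split()) for cell in inner.split(marker)]


row = "^ Acme Corp ^ 12 Main St ^ 1,204.50 ^"
print(split_cells(row))  # ['Acme Corp', '12 Main St', '1,204.50']
```

The weak spot is the assumption itself: if the OCR misreads or drops a single marker, every cell after it in that row shifts by one.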

Instead, I decided to OCR each cell individually. While slower, this seemed cleaner, more flexible, and easier to debug.
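
Going cell by cell means knowing where the cells are. One way to find them is to look for rows and columns that contain a long unbroken run of dark pixels, then pair up neighbouring lines into boxes. Here is that idea on a tiny made-up bitmap rather than a real page. It is a simplified sketch, assuming Python 3 and one-pixel-wide borders; longest_run, find_lines and cell_boxes are illustrative names.

```python
def longest_run(values):
    """Length of the longest unbroken run of truthy values."""
    best = current = 0
    for v in values:
        current = current + 1 if v else 0
        best = max(best, current)
    return best


def find_lines(grid, min_run):
    """Return (row indices, column indices) holding a dark run >= min_run."""
    height, width = len(grid), len(grid[0])
    rows = [y for y in range(height) if longest_run(grid[y]) >= min_run]
    cols = [x for x in range(width)
            if longest_run(grid[y][x] for y in range(height)) >= min_run]
    return rows, cols


def cell_boxes(rows, cols):
    """Pair up neighbouring lines into (x1, y1, x2, y2) boxes, row by row."""
    return [[(cols[j], rows[i], cols[j + 1], rows[i + 1])
             for j in range(len(cols) - 1)]
            for i in range(len(rows) - 1)]


# 1 = dark pixel. One table row with two cells: borders on rows 0 and 4
# and on columns 0, 3 and 6. The stray 1 at row 2, column 2 stands in for text.
grid = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
rows, cols = find_lines(grid, min_run=5)
print(rows, cols)              # [0, 4] [0, 3, 6]
print(cell_boxes(rows, cols))  # [[(0, 0, 3, 4), (3, 0, 6, 4)]]
```

Real borders are usually several pixels thick, so adjacent line indices would have to be merged before pairing them up; the sketch skips that step.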

So here's the code. There are a few dependencies: the Python Imaging Library (PIL), ImageMagick (for the convert command), and Tesseract OCR.

It is slightly tuned to the particular files I was interested in (for example, it expects the cell borders to be solid black). It is also pretty slow - so if you need to process a massive number of pages, this won't work for you. Also, it expects to operate in the directory you run it from and it expects there to be a subdirectory called "working" for temporary files. I suppose I should make the script do that automatically.. lazy, I guess..

from PIL import Image, ImageOps
import subprocess, sys, os, glob

# minimum run of adjacent pixels to call something a line
H_THRESH = 300
V_THRESH = 300

def get_hlines(pix, w, h):
 """Get start/end pixels of lines containing horizontal runs of at least THRESH black pix"""
 hlines = []
 for y in range(h):
     x1, x2 = (None, None)
     black = 0
     run = 0
     for x in range(w):
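         if pix[x,y] == (0,0,0):
             black = black + 1
             if x1 is None: x1 = x
             x2 = x
         else:
             # run of black pixels ended; keep the longest one seen so far
             if black > run:
                 run = black
             black = 0
     if run > H_THRESH:
         hlines.append((x1, y, x2, y))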
 return hlines

def get_vlines(pix, w, h):
 """Get start/end pixels of lines containing vertical runs of at least THRESH black pix"""
 vlines = []
 for x in range(w):
     y1, y2 = (None,None)
     black = 0
     run = 0
     for y in range(h):
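         if pix[x,y] == (0,0,0):
             black = black + 1
             if y1 is None: y1 = y
             y2 = y
         else:
             # run of black pixels ended; keep the longest one seen so far
             if black > run:
                 run = black
             black = 0
     if run > V_THRESH:
         vlines.append((x, y1, x, y2))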
 return vlines

def get_cols(vlines):
 """Get top-left and bottom-right coordinates for each column from a list of vertical lines"""
 cols = []
 for i in range(1, len(vlines)):
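     # lines more than 1px apart bound a column; adjacent ones are one thick border
     if vlines[i][0] - vlines[i-1][0] > 1:
         cols.append((vlines[i-1][0], vlines[i-1][1], vlines[i][2], vlines[i][3]))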
 return cols

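def get_rows(hlines):
 """Get top-left and bottom-right coordinates for each row from a list of horizontal lines"""
 rows = []
 for i in range(1, len(hlines)):
     # lines more than 1px apart bound a row; adjacent ones are one thick border
     if hlines[i][1] - hlines[i-1][3] > 1:
         rows.append((hlines[i-1][0], hlines[i-1][1], hlines[i][2], hlines[i][3]))
 return rows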

def get_cells(rows, cols):
 """Get top-left and bottom-right coordinates for each cell using row and column coordinates"""
 cells = {}
 for i, row in enumerate(rows):
     cells.setdefault(i, {})
     for j, col in enumerate(cols):
         x1 = col[0]
         y1 = row[1]
         x2 = col[2]
         y2 = row[3]
         cells[i][j] = (x1,y1,x2,y2)
 return cells

def ocr_cell(im, cells, x, y):
 """Return OCRed text from this cell"""
 fbase = "working/%d-%d" % (x, y)
 ftif = "%s.tif" % fbase
 ftxt = "%s.txt" % fbase
 cmd = "tesseract %s %s" % (ftif, fbase)
 # extract cell from whole image, grayscale (1-color channel), monochrome
 region = im.crop(cells[x][y])
 region = ImageOps.grayscale(region)
 region = region.point(lambda p: p > 200 and 255)
 # determine background color (most used color)
 histo = region.histogram()
 if histo[0] > histo[255]: bgcolor = 0
 else: bgcolor = 255
 # trim borders by finding top-left and bottom-right bg pixels
 pix = region.load()
 x1,y1 = 0,0
 x2,y2 = region.size
 x2,y2 = x2-1,y2-1
 while pix[x1,y1] != bgcolor:
     x1 += 1
     y1 += 1
 while pix[x2,y2] != bgcolor:
     x2 -= 1
     y2 -= 1
 # save as TIFF and extract text with Tesseract OCR
 trimmed = region.crop((x1,y1,x2,y2))
 trimmed.save(ftif, "TIFF")
 subprocess.call([cmd], shell=True, stderr=subprocess.PIPE)
 with open(ftxt) as f:
     lines = [l.strip() for l in f]
 # Tesseract may produce no text at all for an empty cell
 return lines[0] if lines else ""

def get_image_data(filename):
 """Extract textual data[rows][cols] from spreadsheet-like image file"""    
 im = Image.open(filename)
 pix = im.load()
 width, height = im.size
 hlines = get_hlines(pix, width, height)
 sys.stderr.write("%s: hlines: %d\n" % (filename, len(hlines)))
 vlines = get_vlines(pix, width, height)
 sys.stderr.write("%s: vlines: %d\n" % (filename, len(vlines)))
 rows = get_rows(hlines)
 sys.stderr.write("%s: rows: %d\n" % (filename, len(rows)))
 cols = get_cols(vlines)
 sys.stderr.write("%s: cols: %d\n" % (filename, len(cols)))
 cells = get_cells(rows, cols)

 data = []
 for row in range(len(rows)):
     data.append([ocr_cell(im,cells, row, col) for col in range(len(cols))]) 
 return data

def split_pdf(filename):
 """Split PDF into PNG pages, return filenames"""
 prefix = filename[:-4]
 cmd = "convert -density 600 %s working/%s-%%d.png" % (filename, prefix)
 subprocess.call([cmd], shell=True)
 return [f for f in glob.glob(os.path.join('working', '%s*' % prefix))]

def extract_pdf(filename):
 """Extract table data from pdf"""
 pngfiles = split_pdf(filename)
 sys.stderr.write("Pages: %d\n" % len(pngfiles))
 # extract table data from each page
 data = []
 for pngfile in pngfiles:
     pngdata = get_image_data(pngfile)
     for d in pngdata:
         data.append(d)
     # remove temp files for this page
     os.system("rm working/*.tif")
     os.system("rm working/*.txt")
 # remove split pages
 os.system("rm working/*")   
 return data

if __name__ == '__main__':
 if len(sys.argv) != 2:
     print("Usage: ctocr.py FILENAME")
     sys.exit(1)
 # split target pdf into pages
 filename = sys.argv[1]
 data = extract_pdf(filename)
 for row in data:
     print("\t".join(row))
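
As for the "working" subdirectory and the rm calls, both could be handled with a few lines of standard-library code. This is only a sketch, assuming Python 3; ensure_workdir and clear_workdir are hypothetical helper names, not part of the script above.

```python
import glob
import os


def ensure_workdir(path="working"):
    """Create the temp directory if it is missing and return its path."""
    os.makedirs(path, exist_ok=True)  # exist_ok needs Python 3.2+
    return path


def clear_workdir(path="working", pattern="*"):
    """Delete files matching pattern in path; return how many were removed."""
    removed = 0
    for name in glob.glob(os.path.join(path, pattern)):
        if os.path.isfile(name):
            os.remove(name)
            removed += 1
    return removed
```

Calling ensure_workdir() at the start of extract_pdf, and clear_workdir(pattern="*.tif") where the script currently runs rm, would remove the manual setup step and avoid shelling out for cleanup.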

Anyhow, I think it is kinda fun. Since the OCR is not actually magic, some post-processing may be necessary. In particular, I've noticed "o" (the letter) in place of "0" (the number) sometimes, extra whitespace or oddly split words, and occasional wrong letters. But overall, the accuracy is still fantastic.
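
Some of that cleanup can be automated. A rough sketch, assuming Python 3 and assuming that any token made up only of digits, number punctuation, and the letters "o"/"O" was meant to be a number; clean_cell is a made-up helper, and the rule will need tuning for real data.

```python
import re

# A token that looks numeric apart from stray o/O: only digits, o/O and . , -
# with at least one real digit, so plain words are left alone.
NUMERICISH = re.compile(r"^(?=.*\d)[\doO.,\-]+$")


def clean_cell(text):
    """Normalize whitespace and fix letter-o-for-zero in number-like tokens."""
    fixed = []
    for tok in text.split():  # split() also collapses runs of whitespace
        if NUMERICISH.match(tok):
            tok = tok.replace("o", "0").replace("O", "0")
        fixed.append(tok)
    return " ".join(fixed)


print(clean_cell("  1o4.5o   units "))  # 104.50 units
print(clean_cell("Room 2o1"))           # Room 201
print(clean_cell("too  cool"))          # too cool
```

Oddly split words and wrong letters are harder to fix mechanically; this only covers the whitespace and o-versus-0 cases.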

The usual caveats apply: use at your own risk, etc.