Automatically Save HTML of every page you visit

(2011)

For the last couple of weeks, I've been thinking about the best way to capture the HTML of every webpage I visit. Sure, you can always write a screen scraper or bot, but I guess I wanted something a little more organic.

The right answer to this problem is probably a caching proxy. Alternatively, tcpdump or some cleverness with copying temporary files out of the browser cache might also work. However, I think there's a strong case for using the browser directly: first, you get nice cleaned-up HTML, and second, you get JavaScript execution (handy if there is Ajax content on the page or if you want to use jQuery to pre-process the HTML).

Basically, you want this:

In Firefox you've got Greasemonkey and user scripts. These work in Chrome too, but it seems like the cross-domain restriction may be problematic. I didn't investigate much further after reading that there might be a problem. Happily, if you write a proper full-on Chrome extension, you can specify exceptions to the cross-domain rules.

So, here is the script I pieced together this morning. It's a Chrome extension that grabs the source of every page you load (using jQuery's DOM methods) and then POSTs it to your local web server. My example below is pretty minimal, just enough to demonstrate that it works. Maybe someday I'll package it as a real extension, make it configurable and release it, but, you know, probably not.

Use at your own risk, and all the usual disclaimers apply. Also, you should probably lock down the permissions and matches attributes: restrict permissions to your local server, and restrict matches to the pages you're interested in.
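
Another way to narrow things down, without editing the manifest each time, is a guard at the top of the content script. The sketch below is an assumption layered on top of the extension, not part of it: `ALLOWED_HOSTS` and `shouldCapture` are hypothetical names, and the host list is only an example.

```javascript
// Hypothetical allow list: only capture pages served from these hosts.
var ALLOWED_HOSTS = ['localhost', 'example.com'];

// Returns true when `hostname` is an allowed host or a subdomain of one.
function shouldCapture(hostname, allowedHosts) {
    for (var i = 0; i < allowedHosts.length; i++) {
        var allowed = allowedHosts[i];
        var suffix = '.' + allowed;
        if (hostname === allowed ||
            hostname.slice(-suffix.length) === suffix) {
            return true;
        }
    }
    return false;
}

// In contentscript.js you might then replace the bare captureHTML() call with:
//   if (shouldCapture(window.location.hostname, ALLOWED_HOSTS)) { captureHTML(); }
```

The manifest's matches list is still the first line of defence, since it decides whether the script is injected at all; a check like this only decides what gets sent once the script is already running.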

manifest.json

{
 "name": "Capture HTML and POST to local server",
 "version": "0.0.1",
 "description": "Capture HTML and POST to local server",
 "permissions": [
   "http://*/*"
 ],
 "content_scripts": [
   {
     "matches": ["http://*/*"],
     "js" : ["jquery.min.js","contentscript.js"],
     "run_at": "document_end"
   }
 ],
 "background_page": "background.html"
}

contentscript.js

function captureHTML() {
   var html = '<html>' + $('html').html() + '</html>';
   chrome.extension.sendRequest({html: html}, function(response) {
       alert(response.result);
   });
}
captureHTML();
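
One thing to be aware of: `$('html').html()` returns only the inner markup, so the doctype and any attributes on the `<html>` element (such as `lang`) are lost when the string is rebuilt. A possible alternative, sketched below on the assumption that you want those preserved, uses the standard DOM properties `document.doctype` and `document.documentElement.outerHTML`. `doctypeToString` and `serializeDocument` are hypothetical helpers, not part of the extension above.

```javascript
// Rebuild a doctype string from a DocumentType node (or return '' if none).
function doctypeToString(doctype) {
    if (!doctype) {
        return '';
    }
    var s = '<!DOCTYPE ' + doctype.name;
    if (doctype.publicId) {
        s += ' PUBLIC "' + doctype.publicId + '"';
    } else if (doctype.systemId) {
        s += ' SYSTEM';
    }
    if (doctype.systemId) {
        s += ' "' + doctype.systemId + '"';
    }
    return s + '>\n';
}

// Serialize the whole page: doctype plus the <html> element with its attributes.
function serializeDocument(doc) {
    return doctypeToString(doc.doctype) + doc.documentElement.outerHTML;
}

// In the content script you could then use:
//   var html = serializeDocument(document);
```

Either way, what you capture is the browser's current DOM serialized back to markup, not the exact bytes the server sent.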

background.html

<html>
<head>
<script type="text/javascript" src="jquery.min.js"></script>
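<script type="text/javascript">

   // Receive the page source from the content script and forward it to the local server.
   chrome.extension.onRequest.addListener(
       function(request, sender, sendResponse) {
           var html = request.html;
           var url = 'http://localhost/recv.php';
           var data = {html: html};
           // The POST happens here, in the background page, where the manifest
           // permissions allow cross-domain requests.
           $.post(url, data, function(result) {
               sendResponse({result: result});
           });
   });

</script>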
</head>
</html>

recv.php

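<?php
// Read the posted page source; default to an empty string if the field is missing.
$html = isset($_POST['html']) ? $_POST['html'] : '';
// Reply with the byte length so the extension can confirm the POST arrived.
echo strlen($html);
// Write the captured HTML to the PHP error log (fine for a demo).
error_log($html);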

Also, you will need to download a copy of the latest minified jQuery and save it into the extension folder as jquery.min.js. The PHP receiver needs to go somewhere on your local server; be sure the URL in background.html matches its path.
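
If you don't have PHP on your local machine, the receiver can be anything that accepts a form-encoded POST. As a rough sketch, here is an approximate equivalent of recv.php for Node.js. This is not what the setup above uses; `extractHtml` and `createReceiver` are hypothetical names, and port 8080 is an arbitrary choice.

```javascript
var http = require('http');

// jQuery's $.post sends the data object as application/x-www-form-urlencoded,
// so the body looks like "html=%3Chtml%3E...". Pull out the html field.
function extractHtml(body) {
    var value = new URLSearchParams(body).get('html');
    return value === null ? '' : value;
}

// Build (but do not start) a server that mirrors recv.php: it replies with
// the byte length of the posted HTML and logs the HTML itself to stderr.
function createReceiver() {
    return http.createServer(function (req, res) {
        var chunks = [];
        req.on('data', function (chunk) { chunks.push(chunk); });
        req.on('end', function () {
            var html = extractHtml(Buffer.concat(chunks).toString('utf8'));
            console.error(html);
            res.writeHead(200, {'Content-Type': 'text/plain'});
            res.end(String(Buffer.byteLength(html, 'utf8')));
        });
    });
}

// To run it: createReceiver().listen(8080);
// then point the url in background.html at http://localhost:8080/
```

Since background.html posts to http://localhost/recv.php, you would need to change that url to wherever this server is listening.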

So it seems to work. I think it's kinda fun. If you know a better way to do this, please let me know.

Resources: