I experimented with "agentic coding" a lot over the holidays. This was mostly vibe-coding (I didn't even skim 99% of the generated code), but I think similar conclusions can be drawn across the automation spectrum. I was surprised by just how good the results were.

The first experiment was to code a Scrabble solver.

Why Scrabble? Well, I used to play a LOT of Yahoo! Literati back in high school. This was Yahoo's version of Scrabble, which pre-dated the wildly popular Words with Friends by a decade. Skip ahead to the summer of 2003 (2004?), when I read a paper called The World's Fastest Scrabble Program (Appel and Jacobson, 1988), which described efficient data structures and methods for computing the highest-scoring words in Scrabble. I had just learned Java programming, and it took me three days to grok the paper and implement the code. Being able to "solve" a Scrabble board felt extremely cool to me at the time.
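If I remember it right, the heart of that paper is storing the lexicon as a DAWG (essentially a compressed trie), so the solver can abandon a line of search the moment the letters placed so far aren't the prefix of any word. Here's a rough Python sketch of that prefix-pruning idea, using a plain trie and ignoring the board entirely. It's an illustration, not the paper's actual algorithm:

```python
class TrieNode:
    """Node in a prefix tree; a DAWG is roughly this with shared suffixes."""
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if the path to this node spells a word


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True
    return root


def extend(node, rack, prefix, results):
    """Depth-first search over the rack, only following valid prefixes."""
    if node.is_word:
        results.append(prefix)
    for i, letter in enumerate(rack):
        child = node.children.get(letter)
        if child:  # prune as soon as no word starts with this prefix
            extend(child, rack[:i] + rack[i + 1:], prefix + letter, results)


# Tiny demo lexicon and rack -- not a real dictionary.
root = build_trie(["cat", "cart", "car", "rat", "tar"])
found = []
extend(root, "crat", "", found)
print(sorted(set(found)))  # ['car', 'cart', 'cat', 'rat', 'tar']
```

The real solver also anchors the search on tiles already on the board and scores cross-words as it goes, but the prefix pruning is what makes it fast.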

MEANWHILE, back in the present...

I wanted to see what Claude could do with Scrabble.

I did watch and approve each step, but didn't do much else beyond suggesting a broad outline. In the future, I may try one-shotting it for comparison.

My exact prompts were longer and more boring, but the process was basically this:

  1. Ask Claude to research the rules of Scrabble
  2. Ask Claude to fetch and summarize the research paper
  3. Ask Claude to choose a text representation of the board (to allow saving/loading)
  4. Ask Claude to generate unit tests
  5. Ask Claude to write the actual code

It took 20-30 minutes and cost about $4.00 to develop a working solution in Python. I kept an eye on the code generation and provided one suggestion for fixing a unit test. Otherwise, it was hands-off.

It worked great and I was pretty amazed.

I thought I'd been paying attention and keeping up with what's happening in this space. But experiencing end-to-end code generation FEELS DIFFERENT from just reading about it.

What did I learn?

Claude barely needs you

I was very surprised by just how little guidance was needed to get good results. To give one example, I wanted to save/load the game state in a text file. This requires representing the board as a 15x15 grid, along with the current score, each player's rack of letters, and so on. It's not difficult, exactly, but it's tedious, and there are lots of little decisions, trade-offs, and edge cases to consider. There's no single "correct" way to design a save file.

My instinct here was to pause, write a complete specification, and then provide some examples for Claude to follow. This kind of "few-shot" prompting can be smart, especially if the details matter! But here, I quickly realized that I just didn't care. I wanted a file format that was human readable and "good enough". Claude could generate the specification itself, and I simply needed to approve or refine it.
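To give a sense of those little decisions, here is one plausible shape such a format could take: the 15x15 grid first, then simple key/value lines for scores and racks. The layout and field names below are made up for illustration; this is not the spec Claude produced.

```python
EMPTY = "."  # placeholder character for an empty square

def save_game(path, board, scores, racks):
    """board: 15x15 list of single letters or None; scores/racks: dicts keyed by player name."""
    with open(path, "w") as f:
        for row in board:
            f.write("".join(cell or EMPTY for cell in row) + "\n")
        for player, score in scores.items():
            f.write(f"score {player} {score}\n")
        for player, rack in racks.items():
            f.write(f"rack {player} {rack}\n")

def load_game(path):
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    board = [[c if c != EMPTY else None for c in line] for line in lines[:15]]
    scores, racks = {}, {}
    for line in lines[15:]:  # remaining lines are "kind player value" records
        kind, player, value = line.split(" ", 2)
        if kind == "score":
            scores[player] = int(value)
        elif kind == "rack":
            racks[player] = value
    return board, scores, racks
```

Something in this spirit stays human readable and easy to diff, which was really all I wanted.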

The "delegate everything" approach is a different way of thinking after a few decades of writing code.

Claude doesn't always know when to stop researching

When I asked Claude to learn the rules of Scrabble, it wanted to fetch 10 webpages. I looked at each URL and knew that it had the complete rules after the second page, but it wanted to continue fetching all 10. The same happened with the research paper. I asked Claude to access and summarize the PDF, and Claude immediately found the original work, but then planned to keep looking. This makes sense. After all, what does it mean to conduct "enough" research? It's not a problem exactly, but the behavior struck me. It's something I'd like to explore further, because fetching unneeded resources inflates the context window, increases token usage (cost), and puts unnecessary load on the servers hosting those pages.

Watching Claude write a Scrabble solver with minimal supervision was both fun and totally alien. And now I have a lot more questions to follow up on:

  1. Would one-shotting have worked? Or was my supervision helpful?
  2. What algorithm would Claude choose if I hadn't asked for one specific research paper?
  3. Was the use of buffers (aka notebooks, scratchpads) an improvement or a distraction?
  4. Were my instructions good? What would make them better?
  5. How well does the behavior scale to larger codebases?