Earlier this week, I wrote up 10 random mini-projects that I'd recently tried with Claude Code. Most of the results were good, but for some reason, the ray tracer was very meh.

Claude SHOULD be good at this. Ray-tracing is the perfect task for an LLM-based coding assistant. The basic ideas have been around for decades, and it's a staple of every computer graphics course. There are tons of books, articles and working implementations to draw inspiration from. It should be easy.

PNG showing a red sphere on green background

That bothered me, so I tried again today.

The new batch of results was much better!

Nothing fancy. But this new image on the right was enough to convince me that "yup, this works fine".

Second experiment

What went wrong the first time? My hypothesis was that zero-shotting a fully functional ray-tracer was just too hard. Maybe Claude needed more hand-holding?

I decided to try three different approaches:

1. By the book

I skimmed through the early chapters of Ray Tracing in One Weekend, doing my best to understand the main ideas, and described them to Claude in my own words. Of course, Claude has read the whole internet, so it already knows everything that I could possibly tell it. Nevertheless, my theory was that leading it step by step would improve the results.
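
To give a sense of what those early chapters cover: the heart of it is ray-sphere intersection, which boils down to solving a quadratic for the hit distance along the ray. Here's a minimal sketch in my own words (names like hit_sphere are mine, not anything Claude produced):

    import math
    from dataclasses import dataclass

    @dataclass
    class Ray:
        origin: tuple      # (x, y, z)
        direction: tuple   # (x, y, z)

    def dot(a, b):
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

    def hit_sphere(center, radius, ray):
        """Return the distance t to the nearest hit, or None if the ray misses.

        Solves |origin + t*direction - center|^2 = radius^2, a quadratic in t.
        """
        oc = tuple(o - c for o, c in zip(ray.origin, center))
        a = dot(ray.direction, ray.direction)
        b = 2.0 * dot(oc, ray.direction)
        c = dot(oc, oc) - radius * radius
        discriminant = b * b - 4 * a * c
        if discriminant < 0:
            return None                      # the ray misses the sphere entirely
        t = (-b - math.sqrt(discriminant)) / (2 * a)
        return t if t > 0 else None          # only count hits in front of the origin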

2. Another zero-shot

Maybe there was just something wrong with my first try. I wrote a new, very short prompt: roughly two sentences that basically said "make a ray tracer from scratch in python". The idea here was: what's the simplest thing that could possibly work?

3. Zero-shot with research

Maybe the research step makes this more like one-shot? Same idea as before, but with a slightly longer prompt. First, I asked for a "full-featured" ray tracer, whatever that means. Second, I explicitly requested that it begin with web research and summarize the results in a text file before doing the planning step. The idea here was that bringing in the right keywords could "prime" the context and conjure up better results.

In all three cases, I also asked for complete test coverage and a test-driven workflow.

The Results

Here are the (sorta) pretty pictures. Each program's output is shown left, middle and right.

The first task was drawing spheres. Very simple stuff. Note that I didn't specify exactly the same scene to each program, so they're not expected to be identical (more on that later).

Three renderings of spheres

Not bad, right?

Clearly it "works". These would get a passing grade in most college computer graphics classes.

I also tried a more complicated scene, known as the "Cornell Box". That middle one, with the minimal prompt, is actually not bad.

Three renderings of boxes

Many of the visual differences are due to the code. But some of the differences are also related to the specification of the scene itself. I didn't think to write a scene loader, so each received a slightly different input, based on how Claude decided that a Cornell Box should look.

Perfect? No, but viable. As a non-expert, I think it counts as ray-tracing.

Which was best?

I would say #2. Interestingly, the simplest prompt got the best overall results.

Why? I'm not sure! This is my guess..

The "by the book" attempt may have introduced errors. In trying to explain the book's content, I may have made mistakes or missed key information. In a similar way, the "add research" attempt may have just confused things. Doing web searches and requesting a "full-featured" ray-tracer, may have shifted the focus or introduced bugs. Claude "knows" about ray-tracing already, from reading the internet. Perhaps extra information was just a hinderance? Definitely something I want to explore more.

What went wrong originally?

A week later, I had already forgotten which specific prompts I had used. Luckily, a neat program called claude-code-log can "recover" all the conversations with Claude Code from log files and make them easily navigable as HTML pages. The raw logs are available in the ~/.claude folder as JSON files, along with a lot of other interesting stuff.
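
If you want to skip the tool and poke at the raw logs yourself, something like the sketch below works. It's only a rough sketch; I'm assuming the transcripts sit somewhere under ~/.claude as newline-delimited JSON (.jsonl files), which may differ across versions:

    import json
    from pathlib import Path

    # Assumption: Claude Code keeps session transcripts as .jsonl files
    # somewhere under ~/.claude; the exact layout may vary by version.
    log_dir = Path.home() / ".claude"

    for path in sorted(log_dir.rglob("*.jsonl")):
        print(f"== {path}")
        with open(path) as f:
            for line in f:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue                # skip anything that isn't JSON
                # Just explore: print the top-level keys of each record.
                print("  ", sorted(entry.keys()))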

Inspecting my original attempt, I didn't see anything too different, but I think I made three mistakes. Combined, these led to a poor result.

1. Lack of unit tests

My first attempt didn't have unit tests. There are references in the planning phase to performing some tests, but no actual tests were written or executed. I think that adding tests in my later attempts made a big difference by ensuring that all the math was correct (or at least closer to correct). Having tests pushes Claude to verify correctness and handle most edge cases.
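
To be concrete, the tests I have in mind are small and boring, but they pin the math down. A pytest-style sketch, written against a hypothetical hit_sphere helper like the one above:

    import pytest

    # Hypothetical module holding the Ray and hit_sphere sketch from earlier.
    from tracer import Ray, hit_sphere

    def test_ray_hits_sphere_head_on():
        ray = Ray(origin=(0, 0, 0), direction=(0, 0, -1))
        t = hit_sphere(center=(0, 0, -5), radius=1.0, ray=ray)
        assert t == pytest.approx(4.0)       # near surface of the sphere is at z = -4

    def test_ray_misses_sphere():
        ray = Ray(origin=(0, 0, 0), direction=(0, 1, 0))
        assert hit_sphere(center=(0, 0, -5), radius=1.0, ray=ray) is None

    def test_sphere_behind_camera_is_ignored():
        ray = Ray(origin=(0, 0, 0), direction=(0, 0, -1))
        assert hit_sphere(center=(0, 0, 5), radius=1.0, ray=ray) is None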

2. My original task description was incorrect

I had written "..render the scene by casting rays from the light source". That description is just wrong. Rays are cast from the CAMERA, not the light source. Garbage in, garbage out. Reviewing the logs, Claude actually spotted this mistake and sought clarification:

"Casting rays from the light source" - this is unusual. I need to understand:
(gives 3 options)

I clarified that I had meant "camera", not "light source". Problem solved? Probably..

My gut feeling from using LLMs is that this simple correction was adequate. Given sufficient context, LLMs rarely get hung up on a single word. But it's possible that this, or something else in my original task description, was sufficiently wrong that it led to poor results.
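
To make the correction concrete: in a basic ray tracer, the outer loop fires one primary ray per pixel from the camera through a point on the image plane, and the lights only come into play later, during shading. A rough sketch of that loop (my own illustration, reusing the Ray class from the earlier snippet):

    def primary_rays(width, height, viewport_w=2.0, viewport_h=1.5, focal_len=1.0):
        """Yield one ray per pixel; rays start at the CAMERA, not the light."""
        camera = (0.0, 0.0, 0.0)
        for py in range(height):
            for px in range(width):
                # Map the pixel to a point on the viewport, centered on the z axis.
                x = (px + 0.5) / width * viewport_w - viewport_w / 2
                y = viewport_h / 2 - (py + 0.5) / height * viewport_h
                direction = (x, y, -focal_len)   # left un-normalized for brevity
                yield px, py, Ray(origin=camera, direction=direction)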

3. Other differences?

Three other quick ideas for why my first experiment went poorly:

  • maybe I had accidentally used a weaker model
  • maybe the scene data format I had specified was a mistake
  • maybe some unknown error happened and I just didn't notice

So Ray Tracing does work?

I would say "yes". That's a relief because it SHOULD work, based on what I know about LLMs. I must have done something wrong in my first attempt that caused it to work poorly.

But.. those scenes above aren't perfect. Not even close. I hope to spend some more time digging into individual features and finding a way to approach my experiments more scientifically.

Followups

I am already planning a 3rd experiment in my head. Maybe sometime in the next 2-3 weeks. A few things I want to improve or investigate are:

How do results compare with IDENTICAL scenes?

In my three attempts here, each program is rendering a slightly different scene. I want to retry this with careful controls in place. No context pollution, exact same instructions for unit testing and outputs. I also need to add a scene loader, so that each program receives EXACTLY the same inputs.
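
The scene loader doesn't need to be fancy; even a tiny JSON format would do the job. A hypothetical sketch (the format and field names here are made up, not anything I've actually specified yet):

    import json

    def load_scene(path):
        """Load a made-up JSON scene format: one camera, one light, some spheres."""
        with open(path) as f:
            data = json.load(f)
        return {
            "camera": tuple(data["camera"]["position"]),
            "light": tuple(data["light"]["position"]),
            "spheres": [
                {
                    "center": tuple(s["center"]),
                    "radius": float(s["radius"]),
                    "color": tuple(s["color"]),
                }
                for s in data["spheres"]
            ],
        }

    # Example scene.json:
    # {
    #   "camera": {"position": [0, 1, 3]},
    #   "light":  {"position": [0, 5, 0]},
    #   "spheres": [
    #     {"center": [0, 1, 0], "radius": 1.0, "color": [0.9, 0.1, 0.1]}
    #   ]
    # }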

How well does each prompt handle different features?

Beyond the most basic rendering, there are a LOT of extras that a ray-tracer can implement to improve realism, things like shadows, textures, reflections, different light sources, and a zillion more. I wasn't keeping track of which features landed in each of these programs. My instructions were just "make a ray-tracer". To make any comparison between outputs, the renderers really need to have the same features.
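
For a sense of scale, a single feature like hard shadows is only a few lines once the intersection code exists. A sketch, reusing the hypothetical helpers from the earlier snippets:

    import math

    def in_shadow(point, light_pos, spheres, eps=1e-4):
        """Hard-shadow test: is any sphere between this surface point and the light?

        Reuses the dot, Ray and hit_sphere helpers sketched earlier; eps nudges the
        ray origin off the surface so the point doesn't shadow itself.
        """
        to_light = tuple(l - p for l, p in zip(light_pos, point))
        dist = math.sqrt(dot(to_light, to_light))
        direction = tuple(c / dist for c in to_light)
        origin = tuple(p + eps * d for p, d in zip(point, direction))
        ray = Ray(origin=origin, direction=direction)
        for s in spheres:
            t = hit_sphere(s["center"], s["radius"], ray)
            if t is not None and t < dist:   # a blocker sits closer than the light
                return True
        return False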

Isn't this just memorized code?

Sorta? I mean, yes, "fuzzy memorizing" is one thing that LLMs do. But to what extent is it using EXACTLY or CLOSELY memorized code? I'd like to compare the generated code to existing tutorial code, to github code, and to other agent-generated code. Is it identical? Or just similar? Seems tricky. Basic ray-tracing code will always look KINDA similar. Within the limits of normal and reasonable code, there are only so many ways to do it. Still, it would be interesting to investigate the extent to which different prompts result in different code.
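
One crude starting point would be diff-based similarity scores between the generated files and a reference implementation. A sketch using Python's standard difflib (the file paths are placeholders):

    import difflib
    from pathlib import Path

    def similarity(path_a, path_b):
        """Rough line-based similarity in [0, 1] between two source files."""
        a = Path(path_a).read_text().splitlines()
        b = Path(path_b).read_text().splitlines()
        return difflib.SequenceMatcher(None, a, b).ratio()

    # Placeholder paths: compare each generated tracer against a known tutorial port.
    generated = ["attempt1/tracer.py", "attempt2/tracer.py", "attempt3/tracer.py"]
    reference = "reference/one_weekend_port.py"

    for g in generated:
        print(f"{g}: {similarity(g, reference):.2f}")

This would only catch near-verbatim copying, not fuzzier forms of memorization, but it would at least put a number on "identical vs. just similar".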