Believe it or not, sometimes Claude Code makes mistakes! Just kidding. It really does do "the right thing" about 95% of the time, at least in my projects. That said, it's still worth considering some areas where agentic coding tools struggle.

Many of the things listed can be solved quickly with a bit of additional prompting or with better context engineering.

But I think these problems and fixes may be non-obvious, especially to non-coders. It's hard to solve problems you don't know about. I'm also just really interested in the "out of the box" behavior of these systems.

Installing the Wrong Version of a Package

Generally, package managers will install the latest stable version of a software library. However, in some cases, the agent may decide to install an old version. Normally it doesn't really matter. But it can be a problem when two versions have major incompatible differences, or when a specific version has serious security flaws.

This mostly happens for two reasons. First, LLMs have a knowledge cut-off date. Anything released after that date, the LLM doesn't automatically know about. The cut-off alone won't stop a package manager from installing a recent version, but it does mean the LLM may not recognize problems introduced by newer releases. Second, the training data or context may contain sample code or documentation that references multiple package versions. Especially with software that has been around for decades, or software that recently underwent a major change, the LLM may choose the old version, simply because that version appears much more frequently in the training data.

Specifically asking for "the latest version" or "version 1.3", or using a Search Agent to pull in the most recent information, can all help with this, but it doesn't necessarily happen by default.
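
A simple guardrail, if the exact version matters, is to have the agent (or a test) verify what actually got installed. Here's a minimal sketch using the standard library; "requests" and the expected 2.x range are just placeholder assumptions:

from importlib.metadata import version

installed = version("requests")  # e.g. "2.31.0"
assert installed.startswith("2."), f"Unexpected requests version: {installed}"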

Starting Network-Accessible Servers

If you ask an LLM/agent to provide code for an HTTP server, it's very likely that you'll get something like this:

if __name__ == "__main__":
    import uvicorn
    uvicorn.run("main:app", host="0.0.0.0", port=8000)

That "host" value matters a LOT.

If the host is "127.0.0.1", the server is only accessible from your own computer. If the host is "0.0.0.0", that server is available to ANYONE on the local network. This is a huge difference.

On a home or private network, it may not be a huge deal. A typical wifi router's firewall is gonna offer some protection, assuming that you can trust all the devices on your network (a big if!), and assuming that you're not doing port-forwarding or anything weird.

But you absolutely wouldn't want to run that server while sitting at the local coffee shop, or at a co-working space, or anywhere with a shared network.

Whether the LLM generates safe or dangerous code just depends. Both variants exist in abundance in the training data. Ironically, this LLM-generated code is likely to be suggested at the worst possible time, during the early development phases of software, when bugs are common and security is incomplete.

This can largely be addressed with additional prompting ("never run servers on 0.0.0.0, always use 127.0.0.1") or by only running servers inside Docker or a VM.
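
For reference, the localhost-only version of the earlier snippet just changes the host value:

if __name__ == "__main__":
    import uvicorn
    # Bind to the loopback interface; only this machine can reach the server.
    uvicorn.run("main:app", host="127.0.0.1", port=8000)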

Removing Defunct Files and Code

I've run into many different variations of this. Claude will create something and then forget to delete it.

This happens a lot with testing files. Often the agent will create a temporary file, such as a quick script to check network connectivity. Then once the task is complete, the file never gets deleted. This makes some sense and is pretty analogous to human programmer behavior. There's no obvious time to delete accumulated helper scripts. After all, you might need them later! For agents, these scripts can be wasteful or confusing. If they get continually updated, but never actually used, they consume unnecessary tokens and time. Alternatively, if the scripts are never updated, they pollute the context or confuse any human programmers looking at the project.

A similar thing happens with unused functions, variables and imports. Claude often does a good job managing these, but it's not 100%. Pretty frequently, I will spot long-lived chunks of code that aren't actually used.

Another related issue is incomplete refactors. Claude will get almost everything right, but then miss a particular file, or leave a dangling variable. I think this happens especially with tricky naming conventions or pluralizations. Refactoring is mostly text processing, and sometimes a variable slips through the cracks. Claude may update one file, but fail to make corresponding changes in another file that isn't currently in the context.
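
As a made-up illustration (the names are hypothetical), a rename that only half-lands can leave residue like this:

# After renaming max_retries to retry_limit, the old variable lingers unused.
retry_limit = 3
max_retries = 3  # dead: nothing reads this anymore, but it survived the refactor

def connect(attempts: int = retry_limit):
    ...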

Preserving Reverse Compatibility

This is similar, but stranger. I've noticed many cases where an agent specifically notices that code is no longer used, but rather than removing it, the agent keeps it around for "reverse compatibility". Here's an actual example, after renaming two variables:

# Direction of data flow
LOCAL_TO_REMOTE = "local->remote"
REMOTE_TO_LOCAL = "remote->local"

# Backwards compatibility aliases
CLIENT_TO_SERVER = "local->remote"
SERVER_TO_CLIENT = "remote->local"

The "backwards compatibility aliases" were not actually used anywhere else in the project anymore. I've seen this happen in a few different projects now. Claude doesn't automatically know whether it's working on a greenfield or legacy code, and it decides that maintaining reverse compatibility is important. I'm curious whether this behavior naturally occurs from the training data, or if it was specifically trained into the model.

Floating Point and Use of Numeric Types

I'm thinking specifically of float versus int, and how that propagates through the code. Claude does a good job of carrying type information. But it doesn't always understand when a float or an int is preferable, and it doesn't always compare ints and floats safely.

For those unfamiliar, this is a sorta classic example of why float math feels surprising:

x = 0.1 + 0.2
y = 0.3

print(x)        # 0.30000000000000004
print(y)        # 0.3
print(x == y)   # False

Sometimes this kinda thing gets noticed and fixed automatically by unit tests. But not always. It's a potential source of bugs for LLM-generated code.
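
The usual remedy is to compare with a tolerance rather than ==, for example with math.isclose from the standard library:

import math

x = 0.1 + 0.2
y = 0.3

print(math.isclose(x, y))   # True (uses a relative tolerance by default)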

Estimating Time

I've seen this one pretty widely noted. Especially with batch tasks, Claude will sometimes predict that running the code will take 10-20 minutes, when in reality it often takes just a few seconds. Another version of this is when "plan mode" predicts that a TODO list will take 2-3 days to complete, but it actually finishes in just a few minutes. Presumably these guesses have been learned from human estimates that appear in the training data.

What Helps?

Most of these are relatively minor problems for programmers with a couple of years of experience. However, lots of non-coders are experimenting too. I suspect a few of these could be more serious for those folks. Not knowing the difference between 0.0.0.0 and 127.0.0.1, or the difference between integer and floating point, could actually cause real trouble.

I'm slowly building out my personal lists of "gotchas" to avoid. One thing that helps is just periodically asking Claude to "eliminate redundant code, look for opportunities to refactor, apply software engineering best practices". That can fix a lot of the minor problems. More formally, many folks either place explicit instructions in CLAUDE.md or use Skills to introduce a set of best practices for the agent.
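
As a rough sketch (the wording here is illustrative, not a canonical template), a CLAUDE.md excerpt covering the gotchas above might look like:

# Project conventions
- Install the latest stable version of packages unless a version is pinned.
- Bind development servers to 127.0.0.1, never 0.0.0.0.
- Delete temporary helper scripts once the task is finished.
- After a refactor, remove old aliases, dead variables, and unused imports.
- Compare floats with a tolerance (math.isclose), not ==.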


Footnotes

  1. I'm primarily using Claude Code right now, but my sense is the same behavior applies to most agents, since they're generally powered by the same handful of flagship LLMs.