Buildly Blog

When Not to Use Coding Agents (Honest Advice From the People Building Them)

2025-11-04 · Camille Fontaine

[Image: Boundary visualization showing scope limits for autonomous agents]

We build autonomous coding agents for a living. We've spent the past two years thinking about where they work well and where they fail, and the honest answer is: the boundary matters as much as the capability. An engineering team that pushes agents into the wrong work category will get bad results, draw wrong conclusions about agent reliability, and give up before finding the workflow that actually helps them.

So here's our genuine thinking on where you should not use autonomous coding agents. This isn't a disclaimer — it's practical guidance from watching what happens when teams apply them incorrectly.

Greenfield Architecture and System Design

The worst use case we've seen for autonomous agents is greenfield architecture decisions. A team is building a new service, the design space is open, and someone has the idea that the agent can figure out the right structure. It can't, at least not usefully.

The Style Graph works because there are established patterns to learn from. On a new service with no history, there are no patterns, only design decisions. Should this service be event-driven or request-response? Should state live in the service or in a shared database? Where are the natural seams for future decomposition? These questions require someone who understands the business domain, the team's operational capabilities, and the system's likely evolution. An agent will generate a plausible answer to these questions, but plausible is the wrong bar for architecture decisions. You need a reasoned answer.

The same applies to major refactors where the destination architecture hasn't been established. If you're migrating from a monolith to services and haven't decided where the seams are, an agent will produce a refactor that reflects some pattern it learned from the codebase — which may or may not align with where you're actually trying to go. The refactor work that agents do well is execution within a defined architecture, not figuring out what the architecture should be.

Security-Critical Logic

We do not recommend autonomous agents for security-critical code paths: authentication, authorization, cryptographic operations, input validation for sensitive data, and anything involving financial transaction integrity.

This isn't about whether agents can write syntactically correct security code — they often can. The problem is subtler. Security vulnerabilities frequently live in the gap between "code that passes tests" and "code that handles adversarial inputs correctly." That gap requires a specific mode of thinking: assuming the input is crafted by someone trying to break your system, and reasoning through every path accordingly. That kind of reasoning is not what language models do well under normal generation conditions.
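To make that gap concrete, here is a minimal sketch in Python. It is our own illustration rather than code from any real system: the upload directory `/srv/uploads` and both function names are hypothetical. The naive version passes any test that only feeds it friendly filenames. The checked version assumes the filename was written by an attacker.

```python
from pathlib import Path

# Hypothetical upload directory, used only for illustration.
UPLOAD_ROOT = Path("/srv/uploads")


def resolve_naive(filename: str) -> Path:
    """Passes a happy-path test such as resolve_naive("report.pdf")."""
    return UPLOAD_ROOT / filename


def resolve_checked(filename: str) -> Path:
    """Treats the filename as attacker-controlled input.

    Resolves the joined path and rejects anything that ends up outside
    UPLOAD_ROOT, which covers "../" sequences and absolute paths.
    Requires Python 3.9+ for Path.is_relative_to.
    """
    root = UPLOAD_ROOT.resolve()
    candidate = (UPLOAD_ROOT / filename).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes upload root: {filename!r}")
    return candidate
```

Both functions return the same result for `"report.pdf"`, so a test suite written without an attacker in mind cannot tell them apart. Only the second one rejects `"../../etc/passwd"`. Even the checked version is just one layer and not a complete defense, which is the reason this category needs expert review.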

We're not saying agents can't assist with security code. Code review assistance, explaining what a pattern does, and suggesting standard library usage are all reasonable uses. But generating novel security logic that ends up in production without expert human review is something we actively discourage, regardless of which tool is doing the generation.

Anything Requiring Deep Business Domain Knowledge

This is the category that most often catches teams off guard. The agent can read your codebase and understand code patterns. It cannot understand why your business works the way it does.

Consider a financial platform where account balances must never be read and written in separate transactions — a constraint that comes from regulatory requirements, not from anything visible in the code structure. The code might have comments about it, and the Style Graph will pick those up. But a novel code path that seems unrelated to the constraint might still violate it in ways that neither the agent nor a code reviewer without domain context would notice.
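As a rough sketch of what that constraint looks like in code, here is a hypothetical example using Python's standard `sqlite3` module. The `accounts` table, the account ID, and the function names are all assumptions made for illustration, and a real platform would use its own database and transaction API. The two withdraw functions look nearly identical in a diff, but only one keeps the read and the write in a single transaction.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 in autocommit mode, so transactions
    # only exist where we issue BEGIN/COMMIT explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO accounts VALUES ('acct-1', 100)")
    return conn


def withdraw_unsafe(conn: sqlite3.Connection, account_id: str, amount: int) -> int:
    # The read and the write below are separate autocommit statements.
    # Another writer could change the balance between them.
    (balance,) = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    if balance < amount:
        raise ValueError("insufficient funds")
    conn.execute(
        "UPDATE accounts SET balance = ? WHERE id = ?",
        (balance - amount, account_id),
    )
    return balance - amount


def withdraw_safe(conn: sqlite3.Connection, account_id: str, amount: int) -> int:
    # BEGIN IMMEDIATE takes the write lock before the read, so the read
    # and the write happen inside one transaction.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = ? WHERE id = ?",
            (balance - amount, account_id),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return balance - amount
```

In a single-threaded test both functions produce the same balances, which is exactly why the violation is easy to miss: nothing in the test suite fails, and the rule itself lives in a regulation rather than in the code.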

The test is simple: if the correctness of the task requires understanding something about your business that isn't expressible in the codebase itself, that task requires a human who has that understanding, not an agent that can only read the code.

Exploratory Prototyping

Agents are built for tasks with clear definitions of done. The definition might live in a ticket, in a test suite, in an existing pattern — but it exists. Exploratory prototyping is different: the definition of done is "we learned something" or "we found an approach that feels right." The value is in the process of thinking, not in the output.

We've seen teams use agents for prototyping and then be surprised when the resulting code isn't educational. The agent produced something that works, but nobody on the team understands why it works or how to extend it. That's the opposite of what prototyping is for.

If you're exploring a problem space, write the code yourself. The cognitive load of implementation is part of how you develop intuition for the problem. Use agents for the scaffolding and boilerplate once you've made the exploration decisions, not as a substitute for the exploration itself.

Where the Line Actually Is

The pattern across all these cases: agents work well when there's a learnable prior for the correct behavior (the Style Graph can infer what good looks like) and when the correctness criteria are expressible in terms the agent can check (does it compile, do tests pass, does it match the existing pattern). They work poorly when correctness requires judgment that isn't derivable from prior code — business domain reasoning, adversarial thinking, architectural vision.

The practical question to ask before assigning a task to an agent: is there a senior engineer who could complete this task correctly by reading the codebase and the ticket, without needing any context that isn't in those two sources? If yes, an agent is likely a reasonable candidate. If that engineer would need a conversation with a product manager, a discussion with a domain expert, or access to regulatory documentation that isn't in the codebase, the task isn't right for autonomous generation.
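If it helps to make the question mechanical, the check can be written down as a tiny triage helper. This is only a sketch of the heuristic above and not a Buildly feature: the `TaskContext` fields and the `is_agent_candidate` function are names we made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContext:
    """Hypothetical checklist: what a senior engineer would need beyond
    the codebase and the ticket to complete the task correctly."""

    needs_product_conversation: bool = False
    needs_domain_expert: bool = False
    needs_regulatory_docs_outside_repo: bool = False


def is_agent_candidate(task: TaskContext) -> bool:
    # A task is a reasonable candidate only if nothing outside the
    # codebase and the ticket is required to get it right.
    return not (
        task.needs_product_conversation
        or task.needs_domain_expert
        or task.needs_regulatory_docs_outside_repo
    )
```

A pattern-following change with a clear ticket would be `is_agent_candidate(TaskContext())`, which returns `True`. Setting any one of the flags returns `False`. A `True` result marks a task as a candidate and does not guarantee the agent will do it well.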

The teams that get the most out of Buildly are the ones that have thought clearly about this boundary and enforced it. They don't push agents into domain-complex work hoping for the best, and they don't avoid agents for pattern-following work out of excessive caution. They've drawn the line and they apply it consistently. That discipline — knowing what to delegate and what to keep — is most of what makes agent-assisted engineering work in practice.