Buildly Blog

2025 in Review: What We Got Right and Wrong About AI Coding Agents

2025-12-16 · Camille Fontaine


In January 2025, there was a particular flavor of optimism in the AI coding space that felt like it was about two years ahead of reality. Demos showed agents autonomously building full features end-to-end. Conference talks predicted the imminent obsolescence of boilerplate-focused engineering roles. Investors were putting money into visions of software that nearly writes itself.

It's December now. The honest picture is more interesting than either the optimists or the skeptics predicted: the things that worked and the things that failed were not the ones either camp expected, and the gap between what demo'd well and what shipped well was significant.

What We Got Right: The PR Boundary

The decision to build Buildly around PR-first operation — agents always produce pull requests, never direct commits — was the right call, and it's been more important than we initially understood.

We made this choice in early 2024 partly for pragmatic reasons: getting teams to adopt autonomous coding agents required some assurance that a misfire wouldn't produce a production incident. But what we've seen over 2025 is that the PR boundary isn't just a safety mechanism — it's a communication interface. The PR is where the agent explains what it did and why, flags its uncertainty, and asks for human decision on the things it couldn't resolve. Teams that engage with that interface thoughtfully get significantly better outcomes than teams that treat PRs as review formalities.

The PR-first design also turned out to be the right model for building trust incrementally. Teams typically start by reviewing every agent PR in detail, approving or rejecting with comments, and building up a feedback loop as they go. After 3-4 sprint cycles, most teams report that they've settled into a review mode that's faster but no less careful: they've learned what to trust and what to scrutinize. That calibration process requires the PR as a legible artifact. It wouldn't work with direct commit access.

What We Got Wrong: Ticket Quality Assumptions

We significantly underestimated how much variance exists in backlog ticket quality across engineering teams, and how much that variance affects agent performance.

Our early product thinking assumed that the agent's job was to handle the implementation side of the gap between ticket and code. What we discovered is that a substantial portion of that gap is on the ticket side: incomplete acceptance criteria, ambiguous requirements, and missing context about what "done" means in this specific codebase.

Human engineers fill that gap through clarifying questions, Slack messages, and implicit understanding built over time working with the same team. Agents can ask clarifying questions, but the interaction overhead of doing so for every ambiguous ticket negates a lot of the throughput benefit.

The solution we've been building toward — better ticket parsing that estimates ambiguity and flags high-risk interpretations before generation — is working, but it took us longer than expected to understand the problem clearly enough to build the right solution. We spent part of mid-2025 building tooling to catch agent errors; we should have spent more of it building tooling to make ticket quality visible.
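To make the idea of "estimating ambiguity" concrete, here is a minimal sketch of what a pre-generation ticket check could look like. It is not Buildly's implementation: the function name `assess_ticket`, the `TicketReport` type, the list of vague terms, and the 30-word threshold are all hypothetical, chosen only to illustrate the shape of the approach. A real system would presumably rely on much richer signals than keyword heuristics.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical heuristics for illustration only; none of these names,
# terms, or thresholds come from Buildly's actual ticket parser.
VAGUE_TERMS = ("should probably", "etc", "as needed", "somehow", "tbd", "and so on")
MIN_BODY_WORDS = 30
TOTAL_CHECKS = 4


@dataclass
class TicketReport:
    score: float  # 0.0 = no flags raised, 1.0 = every check raised a flag
    flags: List[str] = field(default_factory=list)


def assess_ticket(title: str, body: str) -> TicketReport:
    """Return a rough ambiguity score and the reasons behind it."""
    flags: List[str] = []
    text = f"{title}\n{body}".lower()

    # Check 1: does the ticket say what "done" means?
    if "acceptance criteria" not in text:
        flags.append("no acceptance criteria section")

    # Check 2: is there enough description to work from?
    if len(body.split()) < MIN_BODY_WORDS:
        flags.append("very short description")

    # Check 3: wording that usually hides an undecided requirement.
    vague = [t for t in VAGUE_TERMS if re.search(rf"\b{re.escape(t)}\b", text)]
    if vague:
        flags.append("vague wording: " + ", ".join(vague))

    # Check 4: does the ticket point at a file path or a backticked identifier?
    if not re.search(r"[\w/]+\.\w{1,4}\b|`[^`]+`", body):
        flags.append("no file, module, or identifier referenced")

    return TicketReport(score=len(flags) / TOTAL_CHECKS, flags=flags)


if __name__ == "__main__":
    report = assess_ticket("Fix login", "Login is broken somehow, fix it as needed.")
    print(report.score, report.flags)
```

The point of returning the flags alongside the score is that the output is meant for the person triaging the backlog, not just for the agent: a ticket that trips several checks can be sent back for clarification before any generation happens.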

What the Industry Got Wrong: The "Full Stack Agent" Demo Problem

We watched a lot of agent demos in 2025, and most of them showed something that looks impressive but doesn't reflect how production development actually works. An agent takes a vague prompt, builds a complete functioning feature, tests pass, done.

The problem isn't that this is impossible. The problem is that it requires a level of isolation — no existing codebase to be consistent with, no business domain constraints, no style conventions to follow, no team members whose concurrent changes might conflict — that doesn't exist in real engineering. Demo conditions are almost exactly the wrong conditions to evaluate production-readiness.

Teams that watched these demos and expected production agents to perform at demo levels were disappointed. Teams that evaluated agents against real tasks in real codebases got a more accurate and usually positive picture, but only if they had the patience to go through the onboarding and calibration process rather than expecting immediate results.

We're not claiming our demos are perfectly representative either; every demo involves some amount of favorable condition selection. But there's a difference between choosing favorable demo conditions and presenting those conditions as representative of general production performance. That conflation caused real teams to misallocate their AI tooling budgets in 2025, and it set back adoption in organizations that tried agents at the wrong maturity stage and drew the wrong conclusions.

What Surprised Us: The Review Behavior Shift

One thing we didn't anticipate: how quickly engineering teams shifted their mental model from "agent assists individual engineers" to "agent is a team member that senior engineers manage."

Early in Buildly's history, we assumed the primary interaction pattern would be an engineer pointing the agent at their own tickets and reviewing the output. What we saw instead, especially on teams with more than six or seven backend engineers, is that a lead engineer takes on a management-like role for the agent queue. They triage which tickets are suitable for agent work, review the agent's output across multiple tasks, maintain the feedback loop, and make judgment calls about where a task's scope ends.

This is a better model than we designed for, and it's changed how we think about our product. The user interface for managing an agent queue — prioritization, confidence visibility, feedback mechanisms — is a different product than the interface for an individual engineer using an agent assistant. We've been rebuilding around this model in the second half of 2025, and it's clarified a lot of product decisions that were previously ambiguous.

Where the Technology Actually Is

After a year of watching agents work in real codebases: the reliability ceiling for pattern-following work in well-structured codebases is higher than most skeptics claimed and lower than most optimists predicted. For the specific class of well-scoped, well-precedented tasks in codebases with good test coverage, agents are genuinely useful at production-reliability levels. For anything outside that class, they require substantially more human oversight than the demos suggest.

The honest state of the field in late 2025 is that autonomous coding agents are production-ready for a specific and valuable subset of engineering work — not the dramatic "end of boilerplate engineering" story, not the dismissive "it's just autocomplete" story. Something more specific: a tool that multiplies engineering output on the pattern-following work that makes up 40-60% of most backend sprints, freeing senior engineers to apply their judgment where it actually matters.

That's enough to matter. It's also specific enough that teams who understand the real capability boundary get the value, and teams who don't, don't. Making that distinction visible is most of what we're focused on heading into 2026.