Buildly Blog

The False Positive Problem in Automated Code Review

2025-08-07 · Buildly Engineering


There's a number we track internally that doesn't appear on any marketing page: the rate at which engineers who review Buildly PRs find a real bug in the generated code. Not a style issue. Not a missed edge case they wanted handled differently. An actual bug — something that would fail in production or produce incorrect behavior.

We call this the false positive rate, borrowing loosely from signal detection terminology and treating a passed check as the "positive." A false positive in our context is a PR that passed our internal checks, looked correct, and still contained a defect that a human reviewer had to catch. We want this number as low as possible, because every such PR erodes trust, and trust is harder to rebuild than it is to maintain.

Why False Positive Rate Matters More Than Throughput

Agent throughput is easy to measure and easy to talk about. An agent that opens 50 PRs per week looks impressive. But if 8 of those 50 (16%) contain bugs subtle enough to pass initial review, the engineering team is doing 8 additional rounds of debugging on code they didn't write and don't fully understand yet. That can easily cost more than writing the code themselves.

The economics shift dramatically at different false positive rates. At 5%, engineers mentally model the review as "assume correct, scan for edge cases." At 15%, they mentally model it as "assume broken somewhere, find it." These two review behaviors have different time costs and different emotional costs. The second one is exhausting, and teams that experience it long enough simply stop using the tool.
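The arithmetic behind these percentages is simple, and a minimal sketch makes the definition concrete. The function name and its argument names are illustrative rather than part of any Buildly API; the 8-in-50 figures come from the throughput example above.

```python
def false_positive_rate(prs_with_escaped_defect: int, prs_reviewed: int) -> float:
    """Fraction of reviewed PRs that passed pre-submission checks
    but still contained a real bug found by a human reviewer."""
    if prs_reviewed <= 0:
        raise ValueError("prs_reviewed must be positive")
    if not 0 <= prs_with_escaped_defect <= prs_reviewed:
        raise ValueError("defect count must be between 0 and prs_reviewed")
    return prs_with_escaped_defect / prs_reviewed


# The throughput example: 8 buggy PRs out of 50 opened in a week.
rate = false_positive_rate(8, 50)
print(f"{rate:.0%}")  # 16%
```

At 16%, that hypothetical team sits just past the 15% mark where review tends to flip into "assume broken somewhere" mode.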

This is why we don't optimize primarily for generation speed or PR volume. We optimize for the rate at which generated code is genuinely correct on first review. Everything else follows from that.

Where Our False Positives Come From

We've categorized the bugs in agent-generated PRs that made it past our pre-submission checks. The breakdown isn't what you might expect.

The largest category — roughly 40% of defects we've observed — involves correct code that's wrong for the specific codebase. The agent generates a valid implementation of the requested logic, but it doesn't account for an invariant or constraint that exists in this particular codebase, not in the general problem domain. An off-by-one that the existing code always compensates for. A null check that's unnecessary given the upstream validation that already happens. The Style Graph catches many of these, but not all — invariants that aren't expressed in type signatures or comments are difficult to infer.

The second category, about 30%, is state management errors. The agent correctly handles the happy path but misses a state transition that the existing code relied on. These are particularly hard to catch in review because they often only manifest under specific timing or load conditions.

The remaining 30% is a mix: missing error handling in edge cases, incorrect assumptions about the API contracts of dependencies, and occasionally outright hallucinated method signatures that slipped through because, in a dynamically typed language, a call to a nonexistent method only fails when that line actually runs.

What We've Built to Reduce Each Category

For the "correct but wrong for this codebase" category, the main defense is Style Graph depth. The richer our semantic model of the codebase's invariants, the more context the agent has when generating code. We've added explicit invariant annotation: when we observe a pattern consistently enough across a codebase (always check X before calling Y, never return Z without setting W), we encode it as a codebase-specific constraint rather than leaving it as implicit context.
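To make "always check X before calling Y" concrete, here is a rough sketch of how such a constraint could be encoded and checked against generated Python code. It is not the Style Graph's actual representation: `CallOrderConstraint`, `violations`, and the `validate_order` / `charge_back` names are all hypothetical, and ordering calls by line number within a function is a deliberately crude stand-in for real control-flow analysis.

```python
import ast
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CallOrderConstraint:
    """Hypothetical encoding of 'always call `check` before `target`'."""
    check: str
    target: str


def _call_name(node: ast.Call) -> Optional[str]:
    """Best-effort name of the called function or method."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def violations(source: str, constraint: CallOrderConstraint) -> List[Tuple[str, int]]:
    """Return (function name, line) for each call to `target` that has no
    earlier call to `check` in the same function (ordered by position only)."""
    found = []
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        calls = sorted(
            ((c.lineno, c.col_offset, _call_name(c))
             for c in ast.walk(fn) if isinstance(c, ast.Call)),
            key=lambda item: (item[0], item[1]),
        )
        checked = False
        for lineno, _, name in calls:
            if name == constraint.check:
                checked = True
            elif name == constraint.target and not checked:
                found.append((fn.name, lineno))
    return found


generated = '''
def refund(order):
    charge_back(order)

def safe_refund(order):
    validate_order(order)
    charge_back(order)
'''

rule = CallOrderConstraint(check="validate_order", target="charge_back")
print(violations(generated, rule))  # [('refund', 3)]
```

A check like this only helps once the constraint has been written down, which is the point of promoting an observed pattern from implicit context to an explicit rule.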

For state management errors, we run static analysis on generated code against the existing state machine patterns we've identified. If the agent's code introduces a new code path that doesn't account for all states the existing code expects, that's flagged before the PR is opened. This catches a significant fraction of state transition bugs before review.
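A simplified version of that idea can be sketched in a few lines. The sketch below assumes the known states live in a Python `Enum` (the `OrderState` class and the `next_action` function are invented for illustration) and merely checks which enum members the generated code mentions. A production analysis would need to reason about actual control flow, so treat this as an illustration of the flagging step only.

```python
import ast
from enum import Enum
from typing import List, Set, Type


class OrderState(Enum):
    """Hypothetical state set inferred from the existing codebase."""
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


def states_referenced(source: str, enum_name: str) -> Set[str]:
    """Names of enum members mentioned as `EnumName.MEMBER` in the source."""
    return {
        node.attr
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Attribute)
        and isinstance(node.value, ast.Name)
        and node.value.id == enum_name
    }


def missing_states(source: str, enum_cls: Type[Enum]) -> List[str]:
    """States the existing code expects but the new code never mentions."""
    expected = {member.name for member in enum_cls}
    return sorted(expected - states_referenced(source, enum_cls.__name__))


generated = '''
def next_action(state):
    if state is OrderState.PENDING:
        return "await_payment"
    if state is OrderState.PAID:
        return "ship"
    return "noop"
'''

print(missing_states(generated, OrderState))  # ['CANCELLED', 'SHIPPED']
```

A non-empty result would not prove a bug, since the fallthrough branch may be intentional, but it is the kind of signal worth raising before the PR is opened.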

For the miscellaneous category, our biggest lever has been stricter confidence gating. When the agent's internal confidence on a particular code segment falls below a threshold — usually because the relevant context was ambiguous or the pattern is novel in this codebase — we add an inline comment in the PR flagging that section explicitly. Rather than presenting uncertain code as confident code, we surface the uncertainty so the reviewer knows where to focus.
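The gating step itself can be sketched as a filter over scored code segments. Everything specific here is an assumption for illustration: the `Segment` fields, the 0.7 threshold, and the comment wording are not taken from Buildly's pipeline, and how the confidence score is produced is out of scope.

```python
from dataclasses import dataclass
from typing import Dict, List

CONFIDENCE_THRESHOLD = 0.7  # assumed value; the real threshold is not stated


@dataclass(frozen=True)
class Segment:
    """A span of generated code with a (hypothetical) confidence score."""
    path: str
    start_line: int
    end_line: int
    confidence: float
    reason: str


def review_comments(segments: List[Segment],
                    threshold: float = CONFIDENCE_THRESHOLD) -> List[Dict[str, object]]:
    """Build one inline-comment payload per segment below the threshold."""
    return [
        {
            "path": s.path,
            "line": s.start_line,
            "body": (
                f"Low confidence ({s.confidence:.2f}): {s.reason}. "
                f"Please review lines {s.start_line}-{s.end_line} closely."
            ),
        }
        for s in segments
        if s.confidence < threshold
    ]


segments = [
    Segment("billing/refund.py", 10, 24, 0.93, "common pattern in this codebase"),
    Segment("billing/refund.py", 25, 41, 0.55, "novel retry logic with few prior examples"),
]

for comment in review_comments(segments):
    print(comment["path"], comment["line"], comment["body"])
```

Only the second segment produces a comment, so the reviewer's attention is directed to the retry logic rather than spread evenly across the diff.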

The Reviewer Behavior We're Trying to Enable

There's a specific review mode we want engineers to be in when they look at a Buildly PR: "trust but verify at the boundaries." That means trusting the core logic in familiar patterns, and focusing attention on the parts that are novel or that interact with complex codebase-specific behavior.

This is a real cognitive skill, and it's different from reviewing human-written code. When a colleague writes a PR, you have a model of how they think, what they miss, and where they get creative. You know Sarah always forgets to handle the empty list case, and you know David's implementations are conservative but verbose. You adjust your review accordingly.

With agent-generated code, you're building that model from scratch, and it's statistical rather than individual. In aggregate, the agent's output is reliable for certain patterns and less reliable for others. We try to make that distinction explicit in our PR output: which parts of the generated code are high-confidence (pattern appears frequently in the codebase's history, invariants are well-modeled) and which are lower-confidence (novel approach, limited prior examples, cross-service interaction).

The Honest Limit of Automated Checks

We're not going to claim that our pre-submission checks catch everything — they don't, and the categories above make clear why. Some bugs are only visible to someone who understands the business logic, the deployment environment, or the downstream dependencies in ways that static analysis and style matching can't fully capture.

What we can do is make sure that the bugs that do make it through are the hard ones: the ones that require genuine domain knowledge to catch, not the ones that a slightly more careful agent should have caught. If the defects behind our false positive rate are mostly domain-knowledge defects, that's a sign the automated checks are working. If they are mostly pattern-matching failures or obvious structural errors, that's a sign something in the pipeline needs tuning.
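One way to operationalize that diagnostic is to label each escaped defect and look at which label, if any, accounts for most of them. The sketch below is an assumption-laden illustration: the category labels, the "more than half" reading of "dominated," and the sample data are all made up for the example.

```python
from collections import Counter
from typing import Iterable, Optional


def dominant_category(defect_categories: Iterable[str]) -> Optional[str]:
    """Return the category behind more than half of escaped defects,
    or None if no single category holds a majority."""
    counts = Counter(defect_categories)
    total = sum(counts.values())
    if total == 0:
        return None
    category, n = counts.most_common(1)[0]
    return category if n / total > 0.5 else None


# Made-up labels for ten escaped defects, for illustration only.
escaped = (
    ["domain_knowledge"] * 6
    + ["pattern_matching"] * 3
    + ["structural"] * 1
)

print(dominant_category(escaped))  # domain_knowledge
```

Under this reading, a result of `domain_knowledge` suggests the checks are doing their job, while `pattern_matching` or `structural` would point at the pipeline.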

The practical target we've been working toward: a false positive rate low enough that a senior engineer can spend 15–20 minutes reviewing a typical Buildly PR and be genuinely confident in what they approve. Not infinitely cautious, not resigned to re-reviewing every line — confident. That's the trust level that makes agent-assisted development sustainable, and it's the number we care about more than anything else in our evaluation suite.