Buildly Blog

PR-First Agent Design: Why We Open Pull Requests Instead of Deploying

2025-04-11 · Buildly Engineering

When we were designing the Buildly agent's write model, the question wasn't whether to require human review. It was how to make human review as frictionless as possible without removing it. PR-first — the constraint that every agent write action produces a pull request and nothing else — was the answer we settled on, and it's shaped everything else we've built since.

This post explains the reasoning behind that choice: why it's not just a safety feature, why it changes the nature of human-agent collaboration, and what it costs.

The failure mode we were trying to avoid

Early in building Buildly, we experimented with a more permissive write model. The agent could commit to feature branches directly and, with appropriate configuration, could even merge to a staging branch without opening a PR. The idea was speed: fewer steps between ticket and deployed code.

What we saw in practice was trust erosion, not speed gains. When code appears in the codebase without a review event attached to it, engineers respond in one of two ways. They either stop reviewing entirely — treating the agent as autonomous and assuming its output is correct — or they start doing more review, not less, because they feel less confident about what's in their codebase.

Neither outcome is good. The first trades quality assurance for throughput. The second defeats the point of automation. The PR-as-review-event is not overhead. It's the mechanism by which engineers maintain calibrated awareness of what's being added to the codebase. Remove it, and that awareness degrades.

Why the PR is the product, not a delivery mechanism

This reframe took us a while to articulate clearly, but it matters: in the Buildly model, the pull request is not how we deliver code changes. The pull request is what Buildly produces. The output of an agent task is a reviewable PR, not merged code. Merging is something a human chooses to do after reviewing the PR.

That framing changes how you think about the agent's success criteria. The question isn't "did the agent produce working code?" The question is "did the agent produce a PR that an engineer can review quickly, understand fully, and make a confident merge decision on?" Those are related but distinct bars. Working code that arrives in a 2,000-line PR with no description is worse than working code in a focused 80-line PR with a clear description of what was changed and why.
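One way to picture this is as a write surface with exactly one operation. The sketch below is illustrative only and does not describe Buildly's internals: `AgentWriter`, `PullRequest`, `merge`, the `agent/<ticket>` branch naming, and the `is_human` flag are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    ticket_id: str
    branch: str
    title: str
    description: str
    merged_by: Optional[str] = None  # stays None until a human merges


class AgentWriter:
    """Hypothetical agent write surface: opening a PR is the only write."""

    def open_pull_request(self, ticket_id: str, title: str, description: str) -> PullRequest:
        # Assumed branch naming convention, for illustration only.
        return PullRequest(
            ticket_id=ticket_id,
            branch=f"agent/{ticket_id}",
            title=title,
            description=description,
        )


def merge(pr: PullRequest, reviewer: str, is_human: bool) -> PullRequest:
    """Merging lives outside the agent's surface and requires a human."""
    if not is_human:
        raise PermissionError("only a human reviewer may merge")
    pr.merged_by = reviewer
    return pr
```

The point of the shape is that there is no `commit` or `deploy` method to misuse: the agent's task ends when `open_pull_request` returns, and everything after that is a human decision.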

Buildly agents write PR descriptions the same way they write code: against your codebase's patterns. If your team consistently writes PR descriptions with a Summary + Testing Notes format, the agent will follow that format. If your team uses Linear issue links in PR descriptions, the agent links the originating ticket. The PR as a communication artifact is as important to us as the code changes themselves.
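As a rough illustration of the Summary + Testing Notes format mentioned above, a description template might be rendered like this. The `DescriptionInput` type, the `render_description` function, and the `Ticket:` line are assumptions made for the example, not Buildly's actual template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DescriptionInput:
    summary: str
    testing_notes: str
    ticket_url: Optional[str] = None  # hypothetical link to the originating ticket


def render_description(d: DescriptionInput) -> str:
    """Render a PR body in a Summary + Testing Notes layout."""
    sections = [
        "## Summary",
        d.summary.strip(),
        "",
        "## Testing Notes",
        d.testing_notes.strip(),
    ]
    # Only add the ticket line when the team's convention includes one.
    if d.ticket_url:
        sections += ["", f"Ticket: {d.ticket_url}"]
    return "\n".join(sections)
```

In practice the section names would come from whatever pattern the team's existing PRs follow; the fixed headings here just stand in for that.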

Scope containment: one ticket, one branch, one PR

The PR-first model enforces a scope discipline that's valuable independent of the review requirement. Each agent task maps to exactly one pull request. The agent doesn't consolidate multiple tickets into a single PR. It doesn't generate a PR that spans multiple layers of the stack if the ticket only touches one.

This constraint exists because scope creep in PRs is one of the most consistent sources of review friction, regardless of who wrote the code. A PR that touches three unrelated modules asks the reviewer to load three separate contexts, so both review time and the surface area for something to go wrong grow with every module added. An agent that could theoretically "complete" a ticket more efficiently by making related changes it noticed along the way would be producing worse output, not better.

The tradeoff we're accepting here is that an agent might open a PR that a thoughtful human engineer would have bundled with adjacent changes. We think that's the right call. The reviewer can always suggest follow-up tickets. Bundled scope is much harder to un-bundle.
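A scope check along these lines could run before a PR is opened. This is a minimal sketch under stated assumptions: `ProposedPR`, `scope_violations`, and the idea of expressing a ticket's declared scope as path prefixes are all hypothetical, chosen to make the one-ticket, one-PR rule concrete.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProposedPR:
    ticket_ids: Tuple[str, ...]
    branch: str
    changed_paths: Tuple[str, ...]


def scope_violations(pr: ProposedPR, allowed_prefixes: Tuple[str, ...]) -> List[str]:
    """Return human-readable reasons the PR breaks scope (empty if none)."""
    problems: List[str] = []
    # One ticket per PR: no consolidating several tickets into one change.
    if len(pr.ticket_ids) != 1:
        problems.append(f"expected exactly one ticket, got {len(pr.ticket_ids)}")
    # Every changed file must sit under a path the ticket declares in scope.
    out_of_scope = [p for p in pr.changed_paths if not p.startswith(allowed_prefixes)]
    if out_of_scope:
        problems.append("files outside declared scope: " + ", ".join(out_of_scope))
    return problems
```

Returning reasons rather than a boolean means an out-of-scope change can be turned into a suggested follow-up ticket instead of being silently dropped.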

How PR-first changes the review experience

Something interesting happens when engineers shift from reviewing each other's code to reviewing agent-generated code. The nature of the review changes. When a colleague writes code, the reviewer is evaluating their judgment, their understanding of the requirements, their style choices. There's a social dimension — pointing out a style deviation feels different from pointing out a logical error.

When an agent writes code, that social dimension disappears. The reviewer is purely evaluating the output. This turns out to reduce friction on some types of feedback. Engineers who might soften a style comment to a colleague will just add it to an agent's PR without hesitation. This produces cleaner feedback cycles and more honest review. The agent doesn't have feelings to manage.

It also changes what reviewers pay attention to. When you know the code was written by following patterns extracted from your codebase, you stop checking whether it follows your conventions — you already know it does. Your attention shifts to whether the implementation logic is correct and whether the approach is the right one for the ticket. That's a higher-value review mode.

What happens when the agent is uncertain

Not every ticket has a clean answer. Sometimes the requirements are genuinely ambiguous. Sometimes the relevant module has multiple implementation patterns and it's not clear which the agent should follow. Sometimes the ticket implies a change that would require touching something outside the declared scope.

In these cases, PR-first gives us a natural escape hatch: the agent opens a draft PR with a description that surfaces the uncertainty. "I've implemented the ticket as described, but note that the existing handler for similar requests uses approach A, while a different handler in the same module uses approach B. I went with A — let me know if you prefer B and I'll update." That's a better outcome than the agent making a silent choice, and it's a better outcome than the agent not completing the task at all.

The draft PR pattern is something we lean on deliberately. For tasks where the agent has high confidence, it opens a PR ready for review. For tasks where the agent is working with partial information or has made a judgment call, it opens a draft and flags the uncertainty in the description. Engineers reviewing draft PRs know to read more carefully. Engineers reviewing ready-for-review PRs know the agent's confidence threshold has been met.
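The draft-versus-ready decision can be sketched as a small function. Everything here is an assumption for illustration: the `AgentResult` shape, the numeric confidence score, the `0.8` threshold, and the function names are hypothetical and not taken from Buildly.

```python
from dataclasses import dataclass, field
from typing import List

CONFIDENCE_THRESHOLD = 0.8  # hypothetical value, for illustration only


@dataclass
class AgentResult:
    confidence: float
    # Judgment calls or ambiguities the agent wants the reviewer to see.
    open_questions: List[str] = field(default_factory=list)


def pr_state(result: AgentResult, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Open as a draft if anything is uncertain, otherwise ready for review."""
    if result.open_questions or result.confidence < threshold:
        return "draft"
    return "ready"


def uncertainty_notes(result: AgentResult) -> str:
    """Render open questions as a description section (empty if none)."""
    if not result.open_questions:
        return ""
    lines = ["Open questions:"] + [f"- {q}" for q in result.open_questions]
    return "\n".join(lines)
```

Note that any open question forces a draft regardless of the score, which matches the idea that a judgment call should always be surfaced rather than hidden behind a high confidence number.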

The cost: throughput ceiling on truly automated pipelines

We should be honest about what PR-first gives up. If your goal is a fully automated pipeline where code is written, tested, and deployed without human involvement, Buildly's model is not the right fit. The PR-first constraint means there is always a human review step between agent output and merged code. That's a throughput ceiling for organizations trying to run completely lights-out code generation.

We made this tradeoff deliberately and we'd make it the same way again. The teams we work with are not trying to remove humans from the software development process. They're trying to make their engineers more effective. Eliminating review doesn't make engineers more effective — it removes them from the loop on their own codebase, which has compounding negative effects on code quality, on engineers' understanding of their own systems, and on the ability to catch subtle errors that automated tests don't cover.

The right question for most teams isn't "can we automate the review step?" It's "can we make the review step fast enough that it doesn't become the bottleneck?" A well-scoped, well-described PR for a boilerplate ticket should take 15–20 minutes to review, not two hours. If the PR-first model is slowing your team down, the problem is PR quality or scope discipline, not the requirement to review.

What we're still refining

The PR-first model is settled. The details of what makes a PR maximally reviewable are still evolving. How much context should the description include? Should the agent inline test coverage notes in the description, or let the CI report speak for itself? How do we handle PRs that are blocked on upstream changes and can't be cleanly rebased?

These are the kinds of questions we're actively working through. The core design — that Buildly's write operation is a PR, not a commit — is not something we're revisiting. It's the constraint that makes everything else in the product trustworthy.