Rivebrook Financial is a mid-market payments platform based in Chicago with 40 engineers in total, including a 12-person backend team responsible for the merchant data infrastructure that everything else in the product depends on. Their engineering lead reached out to us with a message we have seen many variants of: "We have 12 backend engineers and we're spending most of our sprint capacity maintaining CRUD operations for our merchant data model. Our roadmap is stalled. What would you actually do with this?"
We spent 30 minutes on a call going through their backlog structure, their repository layout, and their current sprint composition. Then we put together a specific proposal rather than a generic one. Six weeks later, the composition of their sprints looked meaningfully different.
This is an account of what happened — what we tried, what worked, what didn't, and what the numbers looked like at the end of the engagement. Rivebrook gave us permission to share specifics, though we've described the company in general terms rather than naming their product lines.
What the Backlog Actually Contained
Before we started, we asked the engineering lead to pull the past 3 months of completed tickets and categorize them roughly. The result was illuminating but unsurprising: about 58% of backend engineer hours over that period went to what they called "merchant data work" — endpoints for creating, reading, updating, and deleting merchant records, their associated payment methods, their transaction history queries, and the migration scripts that kept the schema in sync with product changes.
This wasn't a failure of prioritization. Merchant data is genuinely important; it's the core model everything else in Rivebrook's platform depends on. But the work itself was almost entirely pattern-following. Every new merchant attribute meant a new migration, new validation logic, a new endpoint handler or two, updated serializers, and documentation updates. The same structural pattern, executed correctly but without any interesting engineering judgment required.
The remaining 42% of their capacity was where the interesting work lived: the fraud detection logic, the reconciliation system, the integration with external payment networks. This was the work the engineers actually wanted to be doing, and the work that created competitive differentiation for Rivebrook's payments product.
The First Two Weeks: Onboarding and Calibration
We connected Buildly to their repository and spent the first two weeks in onboarding mode. Rivebrook's codebase had roughly 140,000 lines of Python and TypeScript, a fairly consistent internal style (they'd been disciplined about code review standards since founding), and a well-maintained test suite. Good conditions for the Style Graph to work with.
During the onboarding window, we ran the calibration exercise described in our onboarding documentation: we asked the agent to regenerate 25 historical merchant data PRs from Rivebrook's commit history, then compared the output to what the engineers had actually written. The first pass was acceptable but not impressive. The agent missed a few validation conventions specific to their payment domain and used a slightly different error type than their codebase convention called for. Both issues were minor and fixable by editing the Style Graph's pattern weights.
After one calibration pass with those adjustments, the regenerated PRs were noticeably closer to the team's actual style. We considered this sufficient to start running live tasks.
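One rough way to put a number on "closer to the team's actual style" is line-level textual similarity between a regenerated change and the one the engineers wrote. The sketch below uses Python's standard `difflib` for that. It is an illustration only and not how Buildly scores calibration; the `similarity` helper, the sample snippets, and the `MerchantValidationError` name are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(regenerated: str, original: str) -> float:
    """Return a 0..1 line-level similarity ratio between two code snippets."""
    return SequenceMatcher(
        None, regenerated.splitlines(), original.splitlines()
    ).ratio()


# Hypothetical example: the regenerated version raises a different error type
# than the (made-up) codebase convention, so one line of three differs.
original = (
    "def check_amount(x):\n"
    "    if x < 0:\n"
    "        raise MerchantValidationError('negative amount')\n"
)
regenerated = (
    "def check_amount(x):\n"
    "    if x < 0:\n"
    "        raise ValueError('negative amount')\n"
)

score = similarity(regenerated, original)
print(f"similarity: {score:.2f}")
```

A score like this only captures surface overlap. It says nothing about whether a validation convention was followed, so it would complement engineer review rather than replace it.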
Weeks Three and Four: Production Tasks on Low-Risk Tickets
We started with the lowest-risk class of tickets in Rivebrook's backlog: new merchant attribute additions that followed a fully established pattern. Add a field to the merchant model, write the migration, add validation, add the endpoint handler, update the serializer, add tests.
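For readers who haven't lived with this kind of ticket, here is a minimal sketch of what one attribute addition touches. Everything in it is hypothetical: the `Merchant` fields, the `support_email` attribute, the SQL text, and the function names are ours for illustration and are not Rivebrook's code or schema.

```python
import re
from dataclasses import asdict, dataclass
from typing import Optional


# Model: hypothetical merchant record with one newly added optional field.
@dataclass
class Merchant:
    merchant_id: str
    legal_name: str
    support_email: Optional[str] = None  # the new attribute from the ticket


# Migration: shown as plain SQL text; a real project would use its own tooling.
ADD_SUPPORT_EMAIL_MIGRATION = (
    "ALTER TABLE merchants ADD COLUMN support_email VARCHAR(254) NULL;"
)

_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


# Validation: accept None (the field is optional) or a plausibly formed address.
def validate_support_email(value: Optional[str]) -> Optional[str]:
    if value is None:
        return None
    if not _EMAIL_RE.match(value):
        raise ValueError(f"invalid support_email: {value!r}")
    return value.lower()


# Serializer: turn the record into a plain dict for the API response.
def serialize(merchant: Merchant) -> dict:
    return asdict(merchant)


# Endpoint handler body: validate the payload, apply it, return the record.
def update_support_email(merchant: Merchant, payload: dict) -> dict:
    merchant.support_email = validate_support_email(payload.get("support_email"))
    return serialize(merchant)


# Tests would follow the same shape as this usage.
merchant = Merchant(merchant_id="m_1", legal_name="Acme LLC")
result = update_support_email(merchant, {"support_email": "Help@Example.com"})
print(result)
```

None of the pieces is difficult on its own. The cost comes from repeating all of them, consistently, for every new attribute.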
The first live task opened a PR in about 90 minutes. Rivebrook's engineering lead reviewed it and approved it with two minor comments. One was a style preference we hadn't captured (they liked a specific format for migration file docstrings); the other was a genuine catch (the agent had missed a backward compatibility constraint in the serializer that was specific to Rivebrook's payment processor integration). Both were fixed in a follow-up commit.
Over weeks three and four, the agent ran 34 tasks of this type. Engineers reviewed and approved 29 of them with minimal changes. Four required substantive rework due to edge cases the agent hadn't anticipated — mostly around Rivebrook's transaction history query structure, which turned out to be more complex than it appeared from the backlog ticket descriptions due to some legacy schema decisions made early in the product's history. One was closed without merging because the underlying ticket had been incorrectly specified.
The docstring format preference and a handful of other observations from reviews got folded back into the Style Graph. By week four, the approval rate without substantive rework was sitting around 88%.
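The outcome counts for weeks three and four can be tallied in a few lines. Taken across both weeks, 29 of 34 works out to about 85%. The roughly 88% figure describes where the rate stood by week four, which would be consistent with the rate rising as review feedback was folded in. The variable names below are ours, for illustration.

```python
# Outcome counts for the 34 low-risk tasks run in weeks three and four.
tasks_run = 34
outcomes = {
    "approved_minimal_changes": 29,
    "substantive_rework": 4,
    "closed_unmerged": 1,
}

# The three outcome buckets should account for every task.
assert sum(outcomes.values()) == tasks_run

# Share of each outcome as a percentage of all tasks, to one decimal place.
shares = {
    name: round(100 * count / tasks_run, 1) for name, count in outcomes.items()
}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```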
Weeks Five and Six: Expanding Scope
With low-risk ticket performance established, we expanded to slightly more complex work: multi-step schema migrations, endpoint deprecations, and a handful of integration updates within Rivebrook's payment network connectors. These tasks required the agent to touch more files, make more cross-service inferences, and handle requirements that were less clearly specified in the tickets.
Performance dropped initially — the agent opened draft PRs with uncertainty flags more often, and the review time per PR went up. This was expected. Rivebrook's backend engineers adapted: they got faster at resolving the uncertainty flags, and several of them started writing more specific ticket descriptions after noticing that precision in the ticket correlated with precision in the PR.
By the end of week six, the team was running 15–20 agent tasks per sprint alongside their regular work. The sprint composition had shifted: merchant data maintenance work, the CRUD-heavy category that had consumed 58% of backend hours, was down to about 17%. The roughly 41 percentage points of recovered capacity went into the fraud detection and reconciliation work that had been sitting in the backlog for two sprints.
What the Numbers Mean (and What They Don't)
The "70% reduction in CRUD work" figure comes from Rivebrook's own sprint tracking: the hours their 12-person backend team logged against the merchant data maintenance category dropped from roughly 58% of total sprint hours to around 17% over six weeks. That is a drop of about 41 percentage points, or roughly 70% relative to the starting share. It's a meaningful shift in how a team spends its time.
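For anyone who wants to check the headline number, the arithmetic is short. The sketch below uses only the two shares reported by Rivebrook's sprint tracking; the variable names are ours.

```python
# Share of backend sprint hours logged against merchant data maintenance.
share_before = 58.0  # percent, before the engagement
share_after = 17.0   # percent, at the end of week six

# Absolute change, in percentage points of total sprint hours.
point_drop = share_before - share_after

# Relative change, measured against the starting share.
relative_reduction = point_drop / share_before

print(f"drop: {point_drop:.0f} percentage points")
print(f"relative reduction: {relative_reduction:.1%}")
```

The distinction matters when quoting the figure: 41 points describes how much of the total sprint was freed up, while 70% describes how much the category itself shrank.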
We're not claiming this engagement proved that Buildly is universally applicable across fintech development. The conditions at Rivebrook were favorable: a consistent codebase style that had been enforced for years, good test coverage, mostly well-maintained tickets in a Linear board with clear acceptance criteria, and an engineering lead who was engaged and gave quality feedback during onboarding. Teams with inconsistent style or minimal test coverage will see slower improvement curves, because the Style Graph needs something clean to learn from.
The other thing worth noting: Rivebrook's engineers spent more time doing code review in the second half of the engagement, not less. That's the expected model: as agent output goes up, review load goes up proportionally. The value came from engineers redirecting their judgment from writing boilerplate to reviewing it. If the team had started treating agent PRs as rubber stamps, the quality signal would have degraded quickly. They didn't, and that discipline is a significant part of why the numbers held.
What We'd Do Differently
Two things in retrospect. First, we should have pushed harder on ticket quality at the start. The four tasks that required substantive rework in weeks three and four were all traceable to vague or incomplete ticket descriptions, an issue in Rivebrook's backlog that the team already knew about. We could have flagged this during onboarding by running the calibration exercise against poorly specified historical tickets rather than well-specified ones, which would have surfaced the ambiguity problem earlier.
Second, the Style Graph calibration took longer than it needed to because we interleaved it with the live task rollout. In subsequent engagements, we run a more intensive calibration pass before going live instead of mixing calibration and production work. The data from Rivebrook's onboarding informed that process change, which we've carried into later work.