Buildly Blog

Why 60% of Your Sprint Is Boilerplate (And How to Stop Writing It)

2025-03-04 · Camille Fontaine

The sprint review ends. Someone asks what shipped. You list the features — two, maybe three. Then someone asks what else the team worked on. You list the other twelve tickets: database migrations, pagination endpoints, new fields on three API responses, a CSV export handler, a webhook consumer. All necessary. None of them features.

We've sat in that room. We built Buildly because we got tired of sitting in that room.

Where the 60% number comes from

Before we started building, we spent several months reading through backlogs — backlogs that teams shared with us, backlogs from our own history, and patterns from the engineering teams we knew well. The breakdown that kept showing up: roughly 55–65% of tickets in a given sprint fell into what we'd call mechanical work. Not trivial work — these tasks required real engineers to execute them correctly. But the work was fundamentally pattern-driven. Given enough context about how the codebase was structured, you could predict what the implementation would look like before you wrote a line.

The mechanical bucket breaks down predictably across codebases:

CRUD endpoints — new resource types, new fields, pagination, filtering. The same shape, every time.
Data migration scripts — backfill a column, rename a field, split a table. Necessary, tedious, completely pattern-matching.
Integration plumbing — consuming a new webhook, connecting to a third-party API, writing the adapter layer between your system and theirs.
Test stub generation — writing unit tests for code whose behavior is already determined by the implementation.

The rest, roughly 40% of the sprint, is the work that requires architectural reasoning, domain expertise, or novel problem-solving. It is what senior engineers were actually hired to do.
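One way to check this ratio against your own backlog is to tag each ticket and count. The sketch below assumes a hypothetical labeling scheme: the label names and the sample sprint are illustrative, not data from the backlogs described above.

```python
from collections import Counter

# Hypothetical labels for the four mechanical categories listed above.
MECHANICAL_LABELS = {"crud", "migration", "integration", "test_stub"}


def mechanical_share(ticket_labels):
    """Return the fraction of tickets whose label is in MECHANICAL_LABELS."""
    counts = Counter(ticket_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mechanical = sum(n for label, n in counts.items() if label in MECHANICAL_LABELS)
    return mechanical / total


# Illustrative sprint of 20 tickets, not real data.
sample_sprint = (
    ["crud"] * 5
    + ["migration"] * 3
    + ["integration"] * 2
    + ["test_stub"] * 2
    + ["architecture"] * 4
    + ["feature_design"] * 4
)

share = mechanical_share(sample_sprint)
print(f"mechanical share: {share:.0%}")
```

Counting tickets is a crude proxy, since tickets vary in size. Weighting by estimate or by hours logged would likely give a more honest number.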

Why this problem is structural, not a staffing issue

The reflex answer is to hire more engineers. That doesn't fix this. A larger team has a larger backlog. The ratio of mechanical-to-creative work stays roughly constant because it's driven by product growth, not team size. Every new feature your product ships generates downstream mechanical work: new API surface, new data shapes, new integration points. The more successful your product, the more this tax grows.

The second reflex answer is better tooling — autocomplete, code generation assistants, faster scaffolding. These help at the keystroke level. They don't change what's in the sprint. Your engineers still need to read the ticket, understand the context, write the implementation, write the tests, open the PR, respond to review feedback. The cognitive tax of context-switching between mechanical and creative work isn't solved by faster typing.

What actually changes the ratio is removing the mechanical work from the queue entirely, not making it faster to execute.

The cost that doesn't show up in velocity metrics

There's a real cost beyond the obvious throughput hit. Senior engineers doing mechanical work aren't just slower; they're in the wrong cognitive mode. Deep problem-solving requires sustained focus and context accumulation. Writing a pagination endpoint takes 2–3 hours of execution time, and it also requires loading the relevant codebase context, which costs another 30–45 minutes for someone who was just solving a different problem. Across 8 mechanical tickets per sprint, that is 4–6 hours spent reloading context on top of 16–24 hours of execution: 20–30 hours in total, or days of effective deep-work capacity lost from your most expensive engineers.
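To make that arithmetic easy to rerun with your own numbers, here is a minimal sketch. The function name and the 8-hour working day are assumptions for illustration. The per-ticket figures are the ones quoted above.

```python
def mechanical_cost_hours(tickets, exec_hours=(2.0, 3.0), context_minutes=(30.0, 45.0)):
    """Return (low, high) hour ranges for execution, context loading, and their sum."""
    execution = tuple(tickets * h for h in exec_hours)
    context = tuple(tickets * m / 60 for m in context_minutes)
    total = tuple(e + c for e, c in zip(execution, context))
    return {"execution": execution, "context": context, "total": total}


cost = mechanical_cost_hours(8)

# Assumption: an 8-hour working day, used only to convert hours to days.
HOURS_PER_DAY = 8
days = tuple(h / HOURS_PER_DAY for h in cost["total"])

print(cost["context"])  # hours spent reloading context: (4.0, 6.0)
print(cost["total"])    # execution plus context: (20.0, 30.0)
print(days)             # working days: (2.5, 3.75)
```

The context-loading share looks small next to execution time, but it is the part that velocity metrics rarely capture.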

We're not saying mechanical work is beneath senior engineers. We're saying that the opportunity cost is large and consistently underestimated. When a senior engineer spends Tuesday on a migration script, the architectural review that needed their attention on Wednesday gets pushed. That push compounds.

What autonomous agents can actually fix — and what they can't

The case for autonomous coding agents is straightforward when you frame it correctly: agents are good at executing well-defined patterns in a known context. They're poor at exercising judgment when the problem space is ambiguous or the stakes of getting it wrong are high.

That maps directly onto the 60/40 split. The 60% is where agents should operate. The 40% — greenfield architecture, security-sensitive logic, domain-specific product decisions — is where humans need to stay in the loop, and where we'd actively warn against over-relying on automated output.

The specific mechanics matter too. An agent that generates code and pushes directly to main is a liability. An agent that reads a ticket, writes to your codebase's established patterns, and opens a pull request for human review is a force multiplier. The PR-first model isn't just a safety feature; it's what makes the output useful. Engineers reviewing a finished draft often catch errors they would have missed while writing the code themselves, because reviewing puts you in a different cognitive mode than authoring.

A concrete example of the ratio shift

Consider a growing payments platform — a type of codebase we know well from our own backgrounds. At any given time, their backend team has a mix of tickets: adding new payment method types to the API, writing migration scripts for schema changes, building webhook consumers for new processor integrations, and — separately — architectural work like rethinking their transaction retry logic or evaluating a new event streaming system.

The first three categories are mechanical. Not simple — they require someone who understands the existing payment processing domain well enough to not break things. But they're pattern-driven. The retry logic work is not pattern-driven. It requires reasoning about distributed systems failure modes and the specific trade-offs of their infrastructure.

If agents handle the first three categories, three senior engineers who would have spent the sprint on boilerplate are now available to work on retry logic. That's not a theoretical improvement — that's a concrete change to what ships, and what doesn't.

The prerequisite: context quality determines output quality

One thing we learned quickly building Buildly: generic code generation is not the hard problem. The hard problem is generating code that fits a specific codebase. Two teams can both need a pagination endpoint. One uses cursor-based pagination returning ISO 8601 timestamps. One uses page-number offsets with total count metadata. Both have internal naming conventions, both have specific error handling patterns, both have test structures that their CI expects.
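As a rough illustration of how different "the same" endpoint can be, the sketch below implements both pagination styles over an in-memory list. Everything here is hypothetical (the function names, the field names, the sample rows), and a real endpoint would query a database rather than slice a list.

```python
from datetime import datetime, timezone

# Illustrative rows; a real endpoint would read these from a database.
ROWS = [
    {"id": i, "created_at": datetime(2025, 1, i, tzinfo=timezone.utc)}
    for i in range(1, 8)
]


def serialize(row):
    """Render a row with an ISO 8601 timestamp."""
    return {"id": row["id"], "created_at": row["created_at"].isoformat()}


def cursor_page(rows, after_id=None, limit=3):
    """Cursor style: return rows after a given id, plus the next cursor if any."""
    remaining = [r for r in rows if after_id is None or r["id"] > after_id]
    page = remaining[:limit]
    has_more = len(remaining) > limit
    return {
        "items": [serialize(r) for r in page],
        "next_cursor": page[-1]["id"] if has_more else None,
    }


def offset_page(rows, page=1, per_page=3):
    """Page-number style: return one page plus total count metadata."""
    start = (page - 1) * per_page
    return {
        "items": [serialize(r) for r in rows[start:start + per_page]],
        "page": page,
        "per_page": per_page,
        "total": len(rows),
    }


first_cursor_page = cursor_page(ROWS)
last_offset_page = offset_page(ROWS, page=3)
```

Both functions return a page of the same data, but the response shapes, the parameters, and the tests you would write for them have almost nothing in common.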

An agent that doesn't understand those specifics produces output that engineers immediately reject and rewrite from scratch — which is worse than not having the agent at all, because you've added a review step to work that still needed to be done manually. This is why the quality of the context layer is the actual variable that determines whether autonomous coding is useful or just an expensive linter.

The Style Graph we built for Buildly is our attempt to make codebase context machine-readable in a way that's specific enough to be useful. Not just "this is a Python project using Django" — but "this module uses repository-pattern abstractions with these specific naming conventions, this is how pagination is consistently implemented across the three existing paginated endpoints, these are the test fixtures that exist and how they're structured."
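To give a feel for what "machine-readable codebase context" could mean, here is a purely hypothetical sketch. It is not the Style Graph's actual schema: the `ModuleContext` class, its fields, and the sample values are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleContext:
    """Hypothetical record of one module's conventions (not Buildly's real schema)."""
    path: str
    abstraction: str
    naming: dict[str, str]
    pagination_style: str
    paginated_endpoints: list[str] = field(default_factory=list)
    test_fixtures: list[str] = field(default_factory=list)


def describe(ctx: ModuleContext) -> str:
    """Render the conventions as plain text an agent could be given as context."""
    lines = [
        f"Module {ctx.path} uses {ctx.abstraction} abstractions.",
        f"Pagination: {ctx.pagination_style}, as in {', '.join(ctx.paginated_endpoints)}.",
    ]
    lines += [f"Naming: {kind} -> {pattern}" for kind, pattern in ctx.naming.items()]
    lines.append(f"Available test fixtures: {', '.join(ctx.test_fixtures)}")
    return "\n".join(lines)


# Invented example values for a made-up "orders" module.
orders = ModuleContext(
    path="orders/",
    abstraction="repository-pattern",
    naming={"repository class": "<Resource>Repository", "test file": "test_<resource>.py"},
    pagination_style="cursor-based",
    paginated_endpoints=["/orders", "/refunds", "/payouts"],
    test_fixtures=["db_session", "sample_order"],
)
summary = describe(orders)
print(summary)
```

The point of the sketch is the level of detail: conventions are recorded per module and tied to concrete existing examples, rather than summarized as a language and framework.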

Without that level of specificity, agents produce code that looks right and fits wrong. With it, the PR that opens is something a senior engineer can review in 20 minutes instead of rewriting from scratch.

What changes when the ratio shifts

We've watched teams adjust to having mechanical work handled by agents, and the changes aren't always what you'd expect. The obvious change is throughput — more tickets closed per sprint. The less obvious change is how engineers spend their review time. When PRs are generated rather than written by hand, reviewers shift from "is this implementation correct?" to "is this the right approach?" Those are different questions. The second question is harder and more valuable.

The sprint review conversation changes too. Instead of listing twelve mechanical tickets alongside three features, you're describing the features and the architectural decisions behind them. The mechanical work still happened — it's in the merged PRs — but it's not the story of the sprint anymore.

That's a different engineering culture. Not better by default — there are real questions about skill atrophy, about what engineers learn from writing mechanical code, about whether reviewing agent-generated code builds the same judgment as writing it yourself. Those questions deserve honest answers. We'll get to them in a future post.

For now: if 60% of your sprint is pattern work, and you're not asking whether that work needs a human author or just a human reviewer, you're leaving capacity on the table that you'll eventually wish you had.