Buildly Blog

Ticket Parsing and Ambiguity Tolerance: How Agents Handle Vague Requirements

2025-06-06 · Buildly Engineering

Nobody writes perfect tickets. This is not a complaint — it's a structural reality. A ticket is written before implementation, usually by someone who isn't doing the implementation, often before the full scope is understood. Tickets are approximations. They describe intent, not specification.

An autonomous coding agent that can only work from perfect tickets is not useful. A real backlog is full of tickets like: "Add export functionality to the reports module" (export to what format? all reports or specific ones? what's the file naming convention?). An agent that stalls on these tickets creates a new bottleneck instead of removing one. An agent that powers through and silently makes wrong assumptions creates a different problem — PRs that are technically complete but wrong, which engineers have to rewrite instead of review.

Getting ambiguity handling right is one of the harder engineering problems in building Buildly. Here's how we think about it.

The three categories of ambiguity

Not all ambiguity is equal. We distinguish between three types, each of which warrants a different response from the agent.

Resolvable ambiguity

Some ambiguity disappears when you look at the codebase. "Add pagination to the subscriptions endpoint" doesn't specify the pagination style, but if the Style Graph shows that every other paginated endpoint in the codebase uses cursor-based pagination with a specific response shape, then following that convention is a reasonable inference. The agent can proceed and document the inference in the PR description, where the reviewer can confirm or correct it.

This is the most common category. A lot of ticket ambiguity is not actually ambiguity when you have strong codebase context. The team has made these decisions before. The Style Graph captures those decisions. The agent inherits them.
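A minimal sketch of this kind of inference, assuming the conventions found in the codebase have already been reduced to simple labels. The function name `infer_convention`, the `min_share` threshold, and the sample labels are illustrative assumptions, not a description of how the Style Graph works internally.

```python
from collections import Counter
from typing import Optional


def infer_convention(observed: list[str], min_share: float = 0.8) -> Optional[str]:
    """Return the dominant convention, or None if no style clearly dominates.

    `observed` holds one label per existing example in the codebase,
    e.g. the pagination style of each paginated endpoint (hypothetical input).
    """
    if not observed:
        return None
    style, count = Counter(observed).most_common(1)[0]
    # Only treat the style as an inherited team decision if it is near-universal.
    return style if count / len(observed) >= min_share else None


# Hypothetical survey of existing paginated endpoints.
existing = ["cursor"] * 9 + ["offset"]
print(infer_convention(existing))              # cursor
print(infer_convention(["cursor", "offset"]))  # None: no clear convention
```

When the function returns `None`, the ambiguity is no longer resolvable from context alone, and one of the other two categories applies.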

Ambiguity that requires clarification before starting

Some ambiguity genuinely cannot be resolved from codebase context and has a meaningful impact on the implementation. "Add rate limiting to the public API" — rate limiting per IP? Per API key? Per user account? Those are three different implementations with different data models, different middleware concerns, and different operational implications. Guessing wrong and implementing one when the team meant another wastes the review cycle and potentially creates rework.

For this category, the agent comments on the ticket directly with a specific, minimal clarifying question. A generic "please provide more details about this ticket" is useless. A useful question looks like this: "Should rate limiting be applied per IP address or per API key? The current auth model is key-based, which suggests per-key limiting, but IP-based limiting is common for public endpoints. Which is preferred?" That question gives the ticket author enough context to answer quickly.
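One way to keep questions this specific is to require the same three ingredients every time: the decision to be made, the concrete options, and the relevant evidence from the codebase. The sketch below is an assumption about how such a question could be structured; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClarifyingQuestion:
    decision: str        # what needs deciding, e.g. "rate limiting"
    options: list[str]   # the concrete alternatives
    codebase_hint: str   # what the existing code suggests, and why it isn't conclusive

    def render(self) -> str:
        # A question with fewer than two options is not a real choice.
        if len(self.options) < 2:
            raise ValueError("a clarifying question needs at least two options")
        choices = ", ".join(self.options[:-1]) + " or " + self.options[-1]
        return (
            f"Should {self.decision} be applied {choices}? "
            f"{self.codebase_hint} Which is preferred?"
        )


question = ClarifyingQuestion(
    decision="rate limiting",
    options=["per IP address", "per API key"],
    codebase_hint=(
        "The current auth model is key-based, which suggests per-key limiting, "
        "but IP-based limiting is common for public endpoints."
    ),
)
print(question.render())
```

Forcing the options to be enumerated is what rules out the "please provide more details" style of comment.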

Ambiguity that requires human judgment before proceeding

Some tickets imply scope or consequences that require a senior engineer's judgment before anyone, human or agent, starts implementing. "Migrate user sessions to Redis" is a ticket with infrastructure implications, rollback complexity, and operational risk that warrant explicit architectural review before implementation begins. The agent shouldn't attempt this, and it shouldn't block on a clarifying question either. It should flag that the ticket needs human architecture review before the task is assigned.

This category is the hardest to classify automatically. The signals we look for are changes to infrastructure, security primitives, data models with backfill implications, or cross-service interfaces. When those signals are present, the agent adds a comment noting that it's pausing pending human review of scope and approach.
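As a deliberately crude illustration, the four signal families can be approximated with keyword matching. This is a sketch under stated assumptions: the keyword lists and the `risk_signals` function are hypothetical, and substring matching like this would produce false positives and misses that a reasoning-based classifier is meant to avoid.

```python
# Hypothetical keyword lists, one per signal family named in the text.
RISK_SIGNALS: dict[str, tuple[str, ...]] = {
    "infrastructure": ("redis", "load balancer", "deploy"),
    "security primitives": ("password", "encryption", "token"),
    "data model backfill": ("backfill", "schema", "migration"),
    "cross-service interface": ("grpc", "webhook", "message bus"),
}


def risk_signals(ticket_text: str) -> list[str]:
    """Return the signal families whose keywords appear in the ticket text."""
    text = ticket_text.lower()
    return [
        family
        for family, keywords in RISK_SIGNALS.items()
        if any(keyword in text for keyword in keywords)
    ]


print(risk_signals("Migrate user sessions to Redis"))               # ['infrastructure']
print(risk_signals("Add pagination to the subscriptions endpoint"))  # []
```

Any non-empty result would route the ticket to human review rather than to implementation or a clarifying question.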

How the classification works in practice

The classification pass runs before any code is touched. It's a lightweight reasoning step: given the ticket's intent, the codebase context, and any prior comments or linked tickets, how much can be inferred, what is genuinely unknown, and does the unknown matter?

The "does it matter" question is the key one. A lot of ticket ambiguity is genuinely unimportant — it doesn't affect the implementation, or the most obvious interpretation is correct. An agent that asks clarifying questions on every non-fully-specified ticket creates friction, not leverage. The threshold for asking a question should be: if I implement the most reasonable interpretation and I'm wrong, would a reviewer have to reject the entire PR rather than just request minor changes? If the answer is yes, ask first.
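The decision order described above can be written down as a small triage function. This is a sketch, not the production classifier: the `Assessment` fields stand in for the outputs of the reasoning step, and all names here are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "open a PR"
    PROCEED_WITH_INFERENCE = "open a PR and document the inferences"
    ASK = "comment with a specific clarifying question"
    FLAG = "pause for human architecture review"


@dataclass
class Assessment:
    has_unknowns: bool            # is anything left unspecified by the ticket?
    resolved_by_codebase: bool    # do existing conventions answer it?
    wrong_guess_rejects_pr: bool  # would a wrong guess sink the entire PR?
    high_risk: bool               # infrastructure, security, backfill, cross-service


def triage(a: Assessment) -> Action:
    if a.high_risk:
        return Action.FLAG
    if not a.has_unknowns:
        return Action.PROCEED
    if a.resolved_by_codebase:
        return Action.PROCEED_WITH_INFERENCE
    # The "does it matter" threshold: ask only if a wrong guess means a full rejection.
    if a.wrong_guess_rejects_pr:
        return Action.ASK
    return Action.PROCEED_WITH_INFERENCE


# "Add rate limiting to the public API": unknown, not resolvable, costly if wrong.
print(triage(Assessment(True, False, True, False)))  # Action.ASK
```

Note that the last branch proceeds rather than asks: an unknown that would only lead to minor review comments is not worth interrupting the ticket author for.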

We've tuned this threshold over time. Early versions of the classifier were too conservative: they asked questions that the Style Graph could have answered, or flagged tickets as ambiguous when the right answer was obvious from codebase context. That created friction and eroded the team's trust in the agent's capability. The current version asks questions roughly 12–18% of the time, flags for human review roughly 5% of the time, and proceeds with an inference for the remaining 77–83%. Those ratios vary by team and backlog quality, but they give a sense of scale.

Handling contradictions

A specific sub-class of ambiguity is contradictory requirements. For example, a ticket says "add field X to the response", but field X conflicts with an existing uniqueness constraint or violates an invariant documented in another ticket's comments from six months ago. The agent has read those comments too: they're in the codebase history that feeds the Style Graph.

When the agent identifies a contradiction between the ticket and the existing codebase or established patterns, it surfaces it explicitly. Not as a blocker — it can still open a draft PR with the implementation that fulfills the literal ticket requirements — but the PR description calls out the contradiction. "This implements field X as specified, but note that the UserProfile model currently enforces uniqueness on the combination of [fields]. Adding X without also updating that constraint may cause unexpected behavior in [affected endpoint]. Flagging for consideration before merge."

This is where having the codebase context deeply embedded in the agent matters. A generic code generation tool wouldn't know about the uniqueness constraint. An agent with a well-built Style Graph does.

What teams need to do differently

Working with Buildly does create some feedback pressure on how teams write tickets. Not dramatic changes — but a few things matter.

The most important: tickets that will be picked up by agents benefit from specifying the expected behavior, not just the implementation intent. "Add filtering" is less useful than "Add filtering by status field, accepting values of [active, paused, cancelled], ignoring invalid values rather than returning 400." The agent can work from the first, but will need to make more inferences and will likely ask a clarifying question. The second produces a clean, fast task execution.
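To make the difference concrete, here is roughly what the second ticket pins down, as a sketch. Everything beyond the ticket text is an assumption and is marked as such: the comma-separated input format, case-insensitive matching, and returning unfiltered results when no valid value remains. Those are exactly the kind of leftover inferences the agent would document in the PR description.

```python
from typing import Optional

# From the ticket: the only accepted status values.
ALLOWED_STATUSES = {"active", "paused", "cancelled"}


def parse_status_filter(raw: Optional[str]) -> set[str]:
    """Parse a status filter, ignoring invalid values instead of raising (per the ticket).

    Assumption: values arrive as a comma-separated string and match case-insensitively.
    """
    if not raw:
        return set()
    requested = {part.strip().lower() for part in raw.split(",")}
    return requested & ALLOWED_STATUSES


def filter_by_status(items: list[dict], raw: Optional[str]) -> list[dict]:
    wanted = parse_status_filter(raw)
    if not wanted:
        # Assumption: with no valid filter value, return everything unfiltered.
        return list(items)
    return [item for item in items if item["status"] in wanted]


subscriptions = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "paused"},
    {"id": 3, "status": "cancelled"},
]
print(filter_by_status(subscriptions, "active,bogus"))  # [{'id': 1, 'status': 'active'}]
```

With only "Add filtering" to go on, none of `ALLOWED_STATUSES`, the target field, or the invalid-value behavior could be written without guessing.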

The second: linking related tickets matters. If a ticket depends on a previous architectural decision captured in a closed ticket, that link helps the agent inherit the context. Closed tickets become knowledge, not just resolved work items. Teams that link well give the agent a richer information environment to work from.

We're not trying to tell engineering teams they need to write better tickets before they can use Buildly — that's a prerequisite that would prevent most teams from getting started. The agent is designed to handle real-world ticket quality, which is imperfect. But teams that iterate toward more specific tickets tend to see better agent output over time, which creates a useful feedback loop.

The goal: fewer surprises in both directions

What we're trying to avoid is the agent creating surprise work in either direction. A surprise where the agent gets stuck and blocks a ticket that should have been handled — that's a failure mode. A surprise where the agent proceeds silently, makes a wrong guess, and generates a PR the reviewer has to reject entirely — that's a different failure mode, and arguably worse because it consumes review time.

The right outcome is predictable behavior: clear tickets get fast PRs, ambiguous-but-resolvable tickets get PRs with surfaced inferences, genuinely ambiguous tickets get specific clarifying questions, and high-risk tickets get appropriate flags. Predictable behavior is what lets teams calibrate their workflow around the agent rather than being surprised by what it does or doesn't do.