Buildly Blog

Onboarding a New Codebase: What the First 48 Hours Look Like

Buildly Engineering

The first thing a Buildly agent does when it connects to a new repository is not generate code. It reads. For somewhere between 4 and 36 hours depending on codebase size, it does nothing but analyze: parsing commit history, traversing import graphs, cataloging naming patterns, identifying structural conventions, and building what we call the Style Graph — a semantic representation of how this specific team writes code.

This front-loaded analysis phase is what separates useful code generation from noise. A model that hasn't done this work will produce syntactically valid code that reads like it was written by someone who skimmed the README. A model that has done this work produces code that looks like it was written by a new engineer who spent a week reading the codebase before touching anything.

What the Style Graph Captures

The Style Graph is not a static analysis output. It's richer than what you'd get from running ESLint or a type checker, and it captures a different class of information.

At the structural level, it records how modules are organized, how services communicate, where error handling is centralized versus distributed, and how the team has chosen to separate concerns. These aren't rules in any linter — they're patterns that emerge from hundreds of design decisions accumulated over the codebase's history.

At the naming level, it records how entities are named across different contexts: what a "user" is called in the database layer versus the API layer versus the UI layer, whether the team uses camelCase or snake_case in different environments, how methods are named when they retrieve data versus when they transform it. Naming is a surprisingly strong signal for "does this code look like it belongs here."
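To make the naming signal concrete, here is a minimal sketch of tallying naming conventions per layer. The regexes, the `classify` and `conventions_by_layer` helpers, and the sample identifiers are all illustrative assumptions, not Buildly's actual classifier.

```python
import re
from collections import Counter

# Illustrative patterns only; a real classifier would handle many more cases.
SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")
CAMEL = re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$")
PASCAL = re.compile(r"^([A-Z][a-z0-9]+)+$")


def classify(identifier: str) -> str:
    """Return the naming convention an identifier appears to follow."""
    if SNAKE.match(identifier):
        return "snake_case"
    if CAMEL.match(identifier):
        return "camelCase"
    if PASCAL.match(identifier):
        return "PascalCase"
    return "other"  # single lowercase words, constants, and so on


def conventions_by_layer(identifiers: dict[str, list[str]]) -> dict[str, Counter]:
    """Count naming conventions separately for each layer of the codebase."""
    return {
        layer: Counter(classify(name) for name in names)
        for layer, names in identifiers.items()
    }


# Hypothetical identifiers for the same entity in two layers.
sample = {
    "database": ["user_account", "created_at", "account_id"],
    "api": ["userAccount", "createdAt", "accountId"],
}
profile = conventions_by_layer(sample)
```

A per-layer tally like `profile` is one simple way to express "snake_case in the database layer, camelCase in the API layer" as data that generated code can be checked against.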

At the pattern level, it records recurring code templates: how the team writes REST endpoint handlers, how they structure database queries, how they handle async operations, what error types they define and where. When an agent needs to write a new endpoint handler, it generates against these templates rather than against a generic model of "what a REST endpoint handler looks like."

What Happens in the First 24 Hours

The first phase of onboarding is raw ingestion. We pull the full commit history (bounded at 18 months by default, configurable), the current file tree, and all dependency manifests. This is mostly I/O — reading, not processing.
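As a rough illustration of a bounded history pull, the sketch below builds a `git log` invocation limited to a configurable number of months. The function names and the choice to collect only commit hashes are assumptions made for the example; the post does not describe how the ingestion is actually implemented.

```python
import subprocess

DEFAULT_HISTORY_MONTHS = 18  # the default window named in the post


def history_command(months: int = DEFAULT_HISTORY_MONTHS) -> list[str]:
    """Build a git command listing commit hashes from the last `months` months."""
    if months <= 0:
        raise ValueError("months must be positive")
    # git accepts relative dates such as "18 months ago" for --since.
    return ["git", "log", f"--since={months} months ago", "--pretty=format:%H"]


def list_commits(repo_path: str, months: int = DEFAULT_HISTORY_MONTHS) -> list[str]:
    """Run the command inside a repository and return one hash per commit."""
    result = subprocess.run(
        history_command(months),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `list_commits(".", months=6)` inside a checkout would narrow the window to six months.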

The second phase is structural analysis: building the import graph, identifying service boundaries, mapping which modules are leaves versus which are internal packages versus which are shared utilities. For a mid-market SaaS backend with 150,000 lines of code, this typically takes 2–4 hours.
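A toy version of this step, for Python sources only, might look like the following. It uses the standard `ast` module to extract imports, and the role labels (`leaf`, `shared`, `internal`) and their thresholds are assumptions chosen for illustration rather than Buildly's definitions.

```python
import ast
from collections import defaultdict


def imported_modules(source: str) -> set[str]:
    """Return the module names imported by a Python source string."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


def build_import_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each module to the in-repo modules it imports (external deps dropped)."""
    known = set(sources)
    return {
        name: (imported_modules(src) & known) - {name}
        for name, src in sources.items()
    }


def classify_modules(graph: dict[str, set[str]]) -> dict[str, str]:
    """Assumed roles: 'shared' if imported by 2+ modules, 'leaf' if it imports
    nothing in-repo, otherwise 'internal'."""
    fan_in: dict[str, int] = defaultdict(int)
    for deps in graph.values():
        for dep in deps:
            fan_in[dep] += 1
    roles = {}
    for module, deps in graph.items():
        if fan_in[module] >= 2:
            roles[module] = "shared"
        elif not deps:
            roles[module] = "leaf"
        else:
            roles[module] = "internal"
    return roles


# Hypothetical five-module repository.
sources = {
    "config": "TIMEOUT = 30\n",
    "utils": "import json\n",
    "billing": "import utils\nimport config\n",
    "users": "from utils import slugify\n",
    "api": "import billing\nimport users\n",
}
graph = build_import_graph(sources)
roles = classify_modules(graph)
```

Here `utils` comes out as shared because two modules import it, while `config` is a leaf. A production analysis would also need to resolve relative imports and packages, which this sketch ignores.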

The third phase is pattern extraction. This is where the compute lives. We run classifiers over the code corpus to identify recurring patterns, cluster similar implementations, and detect where the team's conventions are consistent versus where they're inconsistent (indicating either evolution over time or different authors with different preferences). For the same 150,000-line codebase, this phase typically runs 8–16 hours.
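The consistency check can be approximated with something far simpler than the classifiers and clustering described above. The sketch below assumes each code site has already been labeled with a variant name, and scores each category by the share of its dominant variant. The labels, the 0.8 threshold, and the function name are all hypothetical.

```python
from collections import Counter


def convention_consistency(
    observations: dict[str, list[str]], threshold: float = 0.8
) -> dict[str, dict]:
    """For each pattern category, report the dominant variant and its share."""
    report = {}
    for category, variants in observations.items():
        if not variants:
            continue  # nothing observed, nothing to score
        dominant, hits = Counter(variants).most_common(1)[0]
        share = hits / len(variants)
        report[category] = {
            "dominant": dominant,
            "share": share,
            "consistent": share >= threshold,
        }
    return report


# Hypothetical labels, as if produced by an upstream pattern classifier.
observations = {
    "error_handling": ["raise_custom"] * 6 + ["return_result"] * 4,
    "async_style": ["async_await"] * 9 + ["callbacks"] * 1,
}
consistency = convention_consistency(observations)
```

In this sample, error handling splits 60/40 between two approaches and is flagged as inconsistent, while the async style is treated as a settled convention.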

By the end of hour 24, we have a Style Graph that's roughly 70–80% of its eventual quality. Adequate for low-risk tasks in well-patterned modules; not yet ready for complex cross-service work.

The Second 24 Hours: Validation and Calibration

The second day of onboarding is devoted to validation. We don't rely solely on the Style Graph's internal consistency; we also test it against historical PR data.

The approach: take 20–30 closed PRs from the past 3 months, strip out the code changes, and ask the agent to regenerate them from the ticket description and the pre-change codebase state. Then compare the agent's output to what the human engineer actually committed. We're not expecting exact matches — we're looking for semantic similarity in approach, structural similarity in pattern, and absence of obvious style violations.
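As a simplified stand-in for that comparison, the sketch below scores a regenerated change against the human commit by line-level textual similarity using the standard `difflib` module. Textual overlap is only a crude proxy for the semantic and structural similarity described above, and the function names and sample snippets are invented for the example.

```python
import difflib
from statistics import mean


def line_similarity(agent_code: str, human_code: str) -> float:
    """Similarity in [0, 1] between two snippets, compared line by line."""
    agent_lines = [ln.strip() for ln in agent_code.splitlines() if ln.strip()]
    human_lines = [ln.strip() for ln in human_code.splitlines() if ln.strip()]
    return difflib.SequenceMatcher(None, agent_lines, human_lines).ratio()


def replay_score(pairs: list[tuple[str, str]]) -> float:
    """Average similarity over (agent_output, human_commit) pairs."""
    return mean(line_similarity(agent, human) for agent, human in pairs)


# Hypothetical regenerated change versus what the engineer committed.
agent = "def get_user(id):\n    return db.find(id)\n"
human = "def get_user(id):\n    user = db.find(id)\n    return user\n"
score = line_similarity(agent, human)
```

A low score on its own says little; it becomes useful when tracked across the full set of replayed PRs and broken down by module or pattern.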

This calibration step tells us several things. Where the agent's output consistently diverges from human output in the same structural direction, that tells us the Style Graph has captured a real pattern but the agent is over-applying it. Where the output diverges randomly, that tells us the relevant pattern wasn't captured well. Both cases produce targeted adjustments to the Style Graph's weights.
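One way to picture the distinction between systematic and random divergence is shown below. It assumes each replayed PR yields a signed deviation on some structural metric (for example, the agent's count of extracted helper functions minus the human's). The metric, the thresholds, and the labels returned are all hypothetical.

```python
def diagnose(
    deviations: list[float], tolerance: float = 0.5, agreement: float = 0.8
) -> str:
    """Classify divergence across replayed PRs for one pattern.

    'aligned'      : most deviations are within tolerance
    'over-applied' : significant deviations mostly share one sign
    'not captured' : significant deviations point in mixed directions
    """
    significant = [d for d in deviations if abs(d) > tolerance]
    if not significant or len(significant) < len(deviations) / 2:
        return "aligned"
    positive = sum(1 for d in significant if d > 0)
    same_direction = max(positive, len(significant) - positive) / len(significant)
    return "over-applied" if same_direction >= agreement else "not captured"
```

For instance, `diagnose([2, 3, 1, 2, 2])` returns `"over-applied"` because every deviation points the same way, while `diagnose([2, -3, 1, -2, 2, -1])` returns `"not captured"`.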

By end of hour 48, the Style Graph has gone through at least one calibration pass. At that point, we consider the codebase "ready for production tasks" — not fully calibrated, but ready for the agent to start generating real PRs with appropriate confidence gating on novel patterns.

What We Surface to the Engineering Team

After the 48-hour onboarding window, we produce a Style Graph summary report. This is not a wall of technical output — it's a document meant to be read by the engineering lead or a senior engineer who'll be reviewing agent PRs.

The report covers: the patterns the agent has high confidence in (it will generate reliably consistent code for these), the patterns with detectable inconsistency in the codebase (likely spots where the team has different habits or where the codebase has evolved), and the domains where the agent has limited training examples and will require more review attention initially.
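The three sections of the report can be thought of as a partition of the extracted patterns. The sketch below shows one possible bucketing rule. The `PatternStats` fields, the thresholds, and the sample patterns are assumptions for illustration, not the actual report schema.

```python
from dataclasses import dataclass, field


@dataclass
class PatternStats:
    name: str
    examples: int       # how many instances were found in the codebase
    consistency: float  # share of instances following the dominant variant


@dataclass
class SummaryReport:
    high_confidence: list[str] = field(default_factory=list)
    inconsistent: list[str] = field(default_factory=list)
    limited_examples: list[str] = field(default_factory=list)


def build_report(
    patterns: list[PatternStats],
    min_examples: int = 10,        # hypothetical threshold
    min_consistency: float = 0.8,  # hypothetical threshold
) -> SummaryReport:
    """Sort each pattern into exactly one of the three report sections."""
    report = SummaryReport()
    for p in patterns:
        if p.examples < min_examples:
            report.limited_examples.append(p.name)
        elif p.consistency < min_consistency:
            report.inconsistent.append(p.name)
        else:
            report.high_confidence.append(p.name)
    return report


# Hypothetical patterns extracted from a backend codebase.
report = build_report([
    PatternStats("rest_endpoint_handler", examples=120, consistency=0.95),
    PatternStats("api_error_handling", examples=80, consistency=0.55),
    PatternStats("webhook_retry", examples=3, consistency=1.0),
])
```

Checking the example count first means a pattern seen only a handful of times is never reported as high confidence, even if those few instances agree with each other.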

The inconsistency section is often the most valuable part of the report for engineering teams, independent of the agent work. A backend team we onboarded earlier this year discovered through the Style Graph summary that two different error handling approaches had crept into their API layer — one used by the original engineers, one adopted by later hires who came from a different background. Neither was wrong, but the inconsistency was making the codebase harder to navigate. The summary triggered a brief cleanup sprint that had nothing to do with agent usage.

When the Style Graph Falls Short

We're not going to claim the Style Graph captures everything. It doesn't, and it's worth being explicit about the categories it misses.

Business domain logic is the main gap. The Style Graph knows how the team writes code; it doesn't know why certain constraints exist in the domain. If there's a rule that "payment amounts must always be stored as integers representing cents, never as floats" — that's a business invariant, not a code pattern. We won't infer it from code structure alone. It needs to be explicitly documented somewhere in the codebase (a comment, a type alias with a name that signals intent, a validation function with a clear docstring) for the Style Graph to pick it up.
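For the payment example, making the invariant explicit might look like the sketch below: a type alias whose name signals intent plus a validation function with a clear docstring. The names `Cents` and `to_cents` are hypothetical, and this is one of several reasonable ways to encode the rule.

```python
from typing import NewType

# A named type makes the unit visible wherever an amount is passed around.
Cents = NewType("Cents", int)


def to_cents(amount: object) -> Cents:
    """Validate a payment amount.

    Invariant: payment amounts are always stored as integers representing
    cents, never as floats.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise TypeError(
            f"payment amounts must be integer cents, got {type(amount).__name__}"
        )
    return Cents(amount)
```

With this in place, `to_cents(1999)` succeeds and `to_cents(19.99)` raises `TypeError`, so the rule is stated in the codebase itself instead of living only in team knowledge.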

Recent architectural decisions are another gap. If the team made a significant structural decision in the past two weeks, the commit history is too thin for the agent to have high confidence in that pattern yet. The calibration weights for new patterns increase with repeated examples — a single example gets low weight.
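The post does not give the weighting formula, so the following is only an assumed shape: a saturating curve in which a single example earns a low weight and repeated examples push the weight toward 1. The `half_point` parameter is hypothetical.

```python
def pattern_weight(examples: int, half_point: int = 5) -> float:
    """Weight in [0, 1) that grows with the number of observed examples.

    `half_point` is the example count at which the weight reaches 0.5.
    """
    if examples < 0:
        raise ValueError("examples must be non-negative")
    return examples / (examples + half_point)
```

Under these assumptions a pattern seen once gets a weight of about 0.17, five examples give 0.5, and twenty give 0.8.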

The implication: teams that want to get maximum value from Buildly early in the onboarding process should make their domain invariants explicit in code (types, constants, validation functions with clear names) and give the agent time to accumulate examples of new architectural patterns before relying on it heavily in those areas. This is good engineering practice regardless of whether you're using autonomous agents — code that makes its invariants explicit is easier for humans to read too.
