How We Evaluated and Adopted Agentic Coding Across Our Engineering Org

The Starting Point

There's a version of this article I could write that would feel more impressive: visionary leadership, AI transformation, 10× productivity, the future of software engineering. I'm not going to write that version.

The version I'm going to write is the one that actually happened — a structured multi-vendor evaluation, a cost analysis that surprised us, eleven validated workflows now in production, and an org-wide rollout that took longer than expected because real adoption across 50+ engineers is slow and requires patience.

By late 2025, every engineer on my team was using some form of AI coding assistance. Some were on Claude, some Copilot, a few experimenting with Cursor. No consistency, no shared practices, no way to understand what we were actually getting for the money. The problem with ad-hoc AI adoption isn't that the tools don't work — some of them work very well. The problem is that you can't build team-level workflows on individual tool choices. The high-value use cases require everyone working from the same tooling and the same playbook.

So we decided to do it properly.

Defining the Target

Before we could evaluate anything, we had to define what we were evaluating. "AI coding tools" meant different things to different engineers on the team.

We landed on a working definition: agentic coding tools are tools that don't just complete your current line — they reason about your intent, read context across files, and take multi-step actions autonomously. The shift from autocomplete to agent is the meaningful one.

We explicitly excluded single-file autocomplete from this evaluation. That ship had sailed — most engineers were already using something. We were evaluating the next layer.

The Evaluation: Claude vs. Codex vs. Gemini

We ran structured evaluations of Claude (Anthropic), Codex (OpenAI), and Gemini (Google) against our actual codebase and actual work patterns. Not generic benchmarks. Tasks from our real backlog.

We weighted heavily for performance on legacy code — because that's what we're working with. The legacy frontend is a React monolith. The payments platform has years of accumulated complexity. Any tool that looks great on greenfield projects but falls apart in the presence of real-world debt wasn't useful to us.

What we found:

Claude performed best on the tasks that mattered most — specifically, multi-file reasoning and coherent execution of complex plans across a legacy codebase. Its output quality on our actual code was consistently higher than alternatives. It was also the most reliable at flagging uncertainty rather than hallucinating confidently.

Codex was strong on self-contained tasks but struggled with the cross-file context requirements of legacy work. It produced more plausible-but-wrong suggestions in complex situations — the kind of output that passes a quick review and fails in production.

Gemini was capable but the tooling integration wasn't at the level we needed for the workflows we'd planned. Gemini alignment is still on the org-wide roadmap — we don't want to exclude Google entirely given our broader Google tooling footprint, but the timing wasn't right.

We selected Claude as the primary agentic coding tool for the team.

That said, the evaluation gave us a clearer picture of what each tool is actually for. The community shorthand that matches our experience: Claude wants to get something working and done — it's collaborative, proposes a plan, and asks before executing. Codex is the stubborn workhorse — fire-and-forget, runs autonomously in a cloud sandbox, and is genuinely better at CI-style batch tasks and terminal automation. Gemini is the fast generalist — the most generous free tier of the three, the widest context window, and the right choice for exploratory work when you're not sure what you're looking for yet.

We didn't abandon the others. We route by task type: Claude for multi-file refactoring and anything requiring architectural reasoning, Codex for background CI tasks and anything we want to run async, Gemini when an engineer wants to explore a large codebase without consuming Claude's capacity. The $125/month Claude seat cost is justified by the high-value use cases. Gemini's free tier covers the exploratory work that would otherwise burn through that budget.

The Cost Math

The cost analysis changed how I thought about this.

The question we were trying to answer: compared to using the API directly, what does a Claude seat actually cost?

Claude seat: $125/month per engineer.

To get equivalent usage through the API — covering the volume of context reads, multi-turn agentic sessions, and code generation an active engineer does — you're looking at $1,500+ per month in API costs.

That's a 12× price gap in the seat's favor ($1,500 ÷ $125), before you even count productivity gains.

The way I explained it to leadership: the seat pricing is Anthropic subsidizing agentic workflows because they're betting on long-term adoption. At API rates, most of the high-value use cases would be economically marginal. At seat pricing, they're clearly worth it. This math required no assumptions about productivity improvements — it was just a cost comparison between two ways to access the same capability. The productivity gains are real, but they're upside on top of a baseline that already makes sense.
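The comparison above is plain arithmetic, so it can be written down as a short sketch. The two prices come from the article; the headcount of 50 with every engineer on a seat is an assumption for illustration, and the function names are hypothetical.

```python
# Figures quoted in the article.
SEAT_COST_PER_MONTH = 125         # USD per engineer
API_EQUIVALENT_PER_MONTH = 1_500  # USD per engineer (the article's lower bound)


def cost_ratio(api_cost: float, seat_cost: float) -> float:
    """How many times more the API route costs than a seat."""
    return api_cost / seat_cost


def monthly_gap(engineers: int, api_cost: float, seat_cost: float) -> float:
    """Total monthly difference between API-equivalent spend and seat spend."""
    return engineers * (api_cost - seat_cost)


ratio = cost_ratio(API_EQUIVALENT_PER_MONTH, SEAT_COST_PER_MONTH)
# Assumption: 50 engineers, all on seats.
gap_for_50 = monthly_gap(50, API_EQUIVALENT_PER_MONTH, SEAT_COST_PER_MONTH)
print(f"{ratio:.0f}x, ${gap_for_50:,.0f} per month across 50 engineers")
```

Because $1,500 is a lower bound, both numbers are floors, not point estimates.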

The Pilot: 13 Engineers, 6 Weeks

We ran a structured 6-week pilot with 13 volunteers from across the team — a deliberate mix of seniority (2 seniors, 4 mid-level, 4 junior, 3 leads), function (frontend, backend, infra), and skepticism level. We specifically recruited two engineers who had told us AI tools were hype.

What we measured:

Task completion time on standardized ticket types (bug fix, feature slice, refactor)
PR review cycles — did AI-assisted PRs require more or fewer revision rounds?
Self-reported confidence on unfamiliar codebases
Bugs introduced per 1,000 lines of AI-assisted code vs. baseline
Developer satisfaction (weekly 5-question survey)

What we didn't measure: lines of code written — a useless metric that punishes refactoring and rewards bloat. We had to explicitly tell people this wasn't the goal.

Week 1 was chaos. Everyone had a different setup, different prompting style, different mental model. Output quality varied wildly — not because the tools were inconsistent, but because the engineers were.

The engineers who got the most out of agentic tools shared one trait: they treated the tool like a junior engineer who needed clear context, not a magic box they typed vague requests into. The ones who struggled were trying to offload thinking rather than amplify it.

We ran a 90-minute session at the end of week 2 where pilot members shared their most useful prompts, their worst outputs, and their workflow changes. That session was more valuable than any vendor demo we sat through.

How We Actually Use Claude Code Day-to-Day

Eleven workflows is the right-sized list for team-level adoption, but it understates how differently engineers use the tools once they're fluent. A few practices changed the baseline before any specific workflow comes into play.

Plan Before You Build

Plan Mode (read-only exploration, no file edits) is mandatory before touching more than one or two files. The discipline sounds simple and takes real enforcement to stick. What it produces: a shared mental model of what's changing and why, before any code is written.

Our standard 4-phase loop: Explore (Plan Mode — Claude reads the relevant files, no changes) → Plan (Claude produces a specific implementation plan, engineer reviews and edits it directly) → Implement (engineer approves, Claude executes against the plan) → Commit and PR (Claude writes the commit message and opens the PR with the diff as context). The value of this loop is that bad assumptions surface in the Plan phase, not after two hours of wrong-direction implementation.

Context Hygiene

Context quality degrades past roughly 100,000–120,000 tokens. This is the threshold where Claude Code starts dropping instructions, making inconsistent decisions, and producing output that is clearly worse than what it gave earlier in the session. We call it context rot, and it's real.

The practice: monitor context usage, run /compact (directed, not blind) before hitting 70% capacity, and run /clear before switching to a different task. The key word is directed: an instruction such as "/compact focus on the API changes and the list of modified files" gives Claude a coherent compressed summary. Blind compaction produces a summary that loses the specific constraints that were keeping the session on track.
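The rule of thumb can be expressed as a small decision helper. This is a sketch under assumptions: the function name, the return labels, and the 200,000-token window in the usage lines are hypothetical, and the 70% figure is treated as the latest point at which to compact.

```python
def context_action(tokens_used: int, window_tokens: int,
                   switching_task: bool = False,
                   compact_at: float = 0.70) -> str:
    """Suggest what to do with the current session.

    Returns "clear" when moving to an unrelated task, "compact" once usage
    reaches the compaction threshold, and "continue" otherwise.
    """
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    if switching_task:
        return "clear"
    if tokens_used / window_tokens >= compact_at:
        return "compact"
    return "continue"


# Hypothetical 200,000-token window.
print(context_action(60_000, 200_000))
print(context_action(150_000, 200_000))
print(context_action(20_000, 200_000, switching_task=True))
```

The helper only decides when to act; the focus instruction passed to the compaction step still has to be written by the engineer.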

On the LLM Wiki / Graphify workflow we describe below — the 71.5× token reduction is partly explained by this. A well-structured wiki dramatically extends the useful life of a session by keeping the context lean before the actual work begins.

Verification as a Quality Multiplier

Giving Claude a way to verify its own work improves output quality 2–3× — this is a documented pattern from Anthropic's own usage data, and it matches what we see. The implication: don't just ask Claude to implement. Ask it to implement, write tests that verify the implementation, run the tests, and fix failures before reporting done.

For the UI-side work, we use the Chrome extension to let Claude open the browser, test the interface, and iterate until it passes its own visual check. For backend work, the test suite serves the same role. The principle is the same: the agent performs dramatically better when it can observe the result of its own actions rather than generating output into a void.

The Validated Workflows

We didn't try to document everything AI can theoretically do. We validated eleven specific workflows that are now in production use, tested enough to trust, and teachable to new engineers.

Workflow 1: AI-Assisted RCA Analysis

Impact: Incident diagnosis 45 min → 8 min

The most operationally significant workflow we've built. The full pipeline:

5xx spike detected → agent searches codebase to identify the probable source → generates targeted database queries → queries run against MongoDB via MCP to confirm or rule out each hypothesis → confirmed cause → agent produces fix → PR opened for engineer review → /autofix-pr watches PR for CI failures and pushes corrections.

The step that makes this work is the MCP MongoDB connection. Without it, the agent could reason about code but couldn't verify whether the hypothesis matched actual production data. With it, the gap between "here's what I think is wrong" and "here's what the data shows" closes in the same session.

In practice: on-call engineer gets paged, pastes the error pattern and stack trace into the agent, runs the RCA skill. The agent searches the relevant modules (guided by the LLM Wiki), surfaces two or three candidate root causes ranked by likelihood, and emits a set of MongoDB queries for each hypothesis. Engineer reviews the query set — typically two minutes — approves the queries, the Mongo MCP tool executes them, the agent reads the results and narrows to a confirmed cause. Fix and PR follow.

The part we're careful about: the agent proposes the fix; a human reviews and merges. We haven't automated the merge. The RCA workflow compresses time-to-diagnosis and time-to-fix-proposal; it doesn't remove engineer judgment from the loop. Our most recent incident: diagnosis went from 45 minutes to under 8.

Once the PR is open, /autofix-pr keeps a Claude session watching it for CI failures or reviewer follow-ups and pushes corrections automatically. The on-call engineer can disengage between opening the PR and the final review and merge.

Workflow 2: Scheduled Monitoring Routines

Impact: Proactive risk surface — catches CVEs, regressions, and compliance gaps before they become incidents

Beyond on-demand workflows, we run a set of scheduled Claude Code tasks that execute autonomously on a recurring cadence and post results to Slack or open GitHub issues when thresholds are exceeded.

Current scheduled routines:

Daily: Dependency vulnerability scan across both monorepos. Any new critical CVEs open a GitHub issue tagged by package and assigned to the owning team. This replaced a manual weekly check that was being skipped under sprint pressure.

Weekly: Stale branch cleanup report — branches older than 30 days with no activity, listed with last committer. Stale PRs older than 14 days with no reviewer activity, tagged for follow-up. Documentation staleness check — files referenced in CLAUDE.md that haven't been updated in 90+ days get flagged.

Nightly: Test coverage regression detection — if coverage on any package drops below the threshold, the owning team gets a Slack notification before the next standup. Performance benchmark regression — key API endpoints checked against baseline, outliers flagged before engineers start their morning.

Per-PR: License compliance check for any new dependency added. Catches copyleft licenses before they get merged into production code.

The implementation uses Claude Code's Routines feature — cloud-hosted scheduled tasks that run on Anthropic's infrastructure with full repo access. Each routine is a markdown file describing what to check and what to do when a threshold is exceeded. No custom infra required.

The shift this enabled: these checks were all happening manually and inconsistently before. Moving them to scheduled agents didn't change what we check — it changed whether we check consistently, and whether the findings surface before problems compound.
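As an illustration of the nightly coverage check described above, here is a minimal sketch of the threshold comparison. The package names, the coverage numbers, and the 80% threshold are hypothetical, and the message format is an assumption, not the routine we actually run.

```python
from typing import Dict, List


def coverage_regressions(coverage_by_package: Dict[str, float],
                         threshold: float) -> List[str]:
    """Return the packages whose coverage percentage is below the threshold."""
    return sorted(name for name, pct in coverage_by_package.items()
                  if pct < threshold)


def slack_message(packages: List[str], threshold: float) -> str:
    """Build the notification text, or an empty string when nothing regressed."""
    if not packages:
        return ""
    return f"Coverage below {threshold:.0f}%: " + ", ".join(packages)


# Hypothetical nightly numbers.
nightly = {"payments-core": 81.5, "checkout-ui": 74.2, "shared-utils": 90.0}
flagged = coverage_regressions(nightly, threshold=80.0)
print(slack_message(flagged, 80.0))
```

An empty message means the routine stays silent, which keeps the channel quiet on healthy nights.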

Workflow 3: AI-Assisted Code Review

Impact: ~28% reduction in PR cycle time; mechanical defects caught before human review

Code review is one of the highest-leverage places to deploy an agent because the context requirements are bounded and the failure modes are well-understood.

The mechanics: Claude Code Review dispatches four parallel review agents against every PR diff simultaneously. Each agent reads the full diff, the surrounding file context, and our tuned review rubric. Findings are confidence-scored — only issues at ≥80% confidence surface to the engineer, which keeps the signal-to-noise ratio high enough that developers actually read the output instead of dismissing it. Results post as inline comments directly on the affected lines in the PR, not as a separate report, so reviewers see the issue in context.
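The confidence gate is the part of this that is easiest to show in code. The sketch below assumes findings arrive as simple records with a confidence between 0 and 1; the Finding type and its field names are hypothetical, not the tool's actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    message: str
    confidence: float  # 0.0 to 1.0


def surfaced(findings: List[Finding], min_confidence: float = 0.80) -> List[Finding]:
    """Keep findings at or above the threshold, ordered by file and line."""
    kept = [f for f in findings if f.confidence >= min_confidence]
    return sorted(kept, key=lambda f: (f.path, f.line))


# Hypothetical findings from the parallel review agents.
raw = [
    Finding("src/pay.ts", 42, "possible null dereference", 0.91),
    Finding("src/pay.ts", 10, "naming nit", 0.35),
    Finding("src/cart.ts", 7, "unhandled promise rejection", 0.80),
]
for finding in surfaced(raw):
    print(f"{finding.path}:{finding.line} {finding.message}")
```

Sorting by file and line mirrors how inline PR comments are read, top to bottom within each file.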

We maintain a REVIEW.md file alongside CLAUDE.md that specifies what to flag, what to ignore, and what backward-compatibility rules apply to our shared packages. The review triggers automatically on every push via the GitHub Actions integration. @claude review is available for manual re-runs when a specific re-check is needed.

What it doesn't replace: architectural judgment, product decisions, the knowledge a senior engineer carries about why a particular pattern exists. What it does replace: the cycle of a reviewer catching a missing null check, leaving a comment, the author fixing it, requesting re-review. Those mechanical catches happen before the human reviewer sees the PR. The human review conversation starts at a higher level.

PR cycle time dropped most on the teams that adopted this earliest. The effect was strongest on junior engineers — not because their code got worse, but because they got faster feedback on mechanical issues and arrived at human review with cleaner PRs.

Workflow 4: Playwright 4-Agent E2E Testing

Impact: Automated QA on every PR; test maintenance burden eliminated

A four-agent Playwright pipeline for end-to-end coverage across our customer-facing apps. Each agent has a distinct role:

Planner Agent — explores the application like a QA engineer, navigating flows and identifying user paths and edge cases, producing a markdown test plan. Feed it a URL and a scope ("the checkout flow") and it finds the paths worth testing.

Generator Agent — converts the plan into actual Playwright tests. Critically, it interacts with the live app during generation to verify every selector works before writing the test. Generated tests don't fail on first run because of stale locators.

Healer Agent — when a test breaks after a UI change, it replays the failing steps against the current UI, identifies what changed, and patches the locator or wait condition. Test maintenance stops being a manual burden.

Maintenance Agent — shows proposed changes as diffs and requires human approval before anything is applied. No silent rewrites. Test coverage changes are always human-visible.

The full pipeline runs via GitHub Actions on every PR. Claude Code + Playwright MCP acts as an automated QA engineer: clicks buttons, fills forms with adversarial inputs, resizes the viewport for mobile, and posts a detailed QA report as a PR comment with screenshots of failures.

Locator rules encoded in CLAUDE.md: data-testid over role over text over CSS. No sleep or waitForTimeout — use waitForSelector or waitForResponse. No shared state between tests. No production data. Without these rules, the agent defaults to brittle CSS selectors and sleep(2000) patterns that break on any UI change.
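Rules like these can also be checked mechanically before a generated test is accepted. The following is a sketch of such a check, assuming the tests are plain source text; the pattern list and function name are hypothetical, and it covers only the fixed-wait rule.

```python
import re
from typing import List, Tuple

# Patterns for fixed waits that the rules above forbid.
BANNED = {
    "fixed wait (waitForTimeout)": re.compile(r"\bwaitForTimeout\s*\("),
    "fixed wait (sleep)": re.compile(r"\bsleep\s*\("),
}


def lint_test_source(source: str) -> List[Tuple[int, str]]:
    """Return (line number, rule label) for every banned call in the source."""
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in BANNED.items():
            if pattern.search(line):
                problems.append((lineno, label))
    return problems


# Hypothetical generated test body.
sample = """await page.click('[data-testid="pay"]');
await page.waitForTimeout(2000);
await page.waitForResponse('**/api/charge');
"""
print(lint_test_source(sample))
```

A text scan like this is deliberately crude: it will also flag the banned calls inside comments, which is usually acceptable for a pre-merge gate.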

Workflow 5: Batch Large-Scale Refactoring

Impact: 200+ file migrations completed in hours instead of weeks; tech debt elimination at sprint speed

For large-scale changes that touch many files with a consistent pattern — migrating an import, renaming a method across a codebase, applying a new convention org-wide — we use /batch instead of a single session.

/batch plans the work, splits it across multiple background agents each working in an isolated git worktree, and coordinates the results. Each agent handles a slice of the file set independently. When all slices complete, the results are consolidated into a single diff for review.
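How /batch partitions the work internally is not something we can see, so the sketch below only illustrates the idea of giving each agent a disjoint slice of the file set. The round-robin strategy, the function name, and the file names are all assumptions.

```python
from typing import List


def split_into_slices(files: List[str], agents: int) -> List[List[str]]:
    """Distribute files round-robin across the agents; empty slices are dropped."""
    if agents < 1:
        raise ValueError("agents must be at least 1")
    ordered = sorted(files)
    slices = [ordered[i::agents] for i in range(agents)]
    return [s for s in slices if s]


# Hypothetical 200-file migration split across 8 agents.
files = [f"src/mod_{i:03d}.ts" for i in range(200)]
slices = split_into_slices(files, 8)
print(len(slices), [len(s) for s in slices])
```

The property that matters is that slices never overlap, so two agents cannot edit the same file in different worktrees.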

We've used this on two migrations: the ESLint rule rollout across the backend monorepo (touching 200+ files to enforce the new import convention) and the TypeScript strict-mode enablement pass on the legacy frontend app. The alternative — a single Claude session touching 200 files sequentially — would hit context limits before finishing and produce increasingly inconsistent changes. Batch agents are stateless relative to each other, which means the 200th file gets the same quality treatment as the 5th.

The constraint: batch works cleanly for changes with a clear, consistent pattern. It doesn't work for changes that require understanding the broader codebase state across files simultaneously. The LLM Wiki (Workflow 7) is the right prerequisite for that kind of work.

Workflow 6: Multi-Agent Parallel Feature Development

Impact: 4 features shipped in parallel instead of sequentially; lead engineers unblocked from single-thread waiting

For work that can be decomposed into independent pieces, we run multiple Claude Code instances simultaneously in isolated git worktrees — each working on a different feature or task, with zero risk of stepping on each other's local edits.

The setup: git worktree add .worktrees/feature-a feature-branch-a creates an isolated workspace on its own branch. A second git worktree add .worktrees/feature-b feature-branch-b creates another. Each gets its own Claude Code session in a separate terminal tab. The main branch is untouched until each feature is ready to merge.
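For more than two or three features, the commands are worth generating. This sketch only builds the argument lists and runs nothing; the helper name and the feature mapping are hypothetical, and it assumes the branches already exist.

```python
from typing import Dict, List


def worktree_commands(features: Dict[str, str],
                      root: str = ".worktrees") -> List[List[str]]:
    """Build one `git worktree add <path> <branch>` command per feature."""
    return [["git", "worktree", "add", f"{root}/{name}", branch]
            for name, branch in features.items()]


# Hypothetical features mapped to their existing branches.
commands = worktree_commands({
    "feature-a": "feature-branch-a",
    "feature-b": "feature-branch-b",
})
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) from inside the repository; keeping command construction separate from execution makes the helper easy to review.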

In practice, this means a lead engineer can queue four features across four terminal sessions, then context-switch between them as each agent reaches a decision point that requires human input — rather than waiting on one agent to finish before the next starts. Some runs get abandoned because the approach doesn't pan out; that's expected and cheap in this model. The parallel structure means one dead-end doesn't block three other features.

The rule we enforce: each worktree session gets only the CLAUDE.md rules and context relevant to its task. Overloading a session with the full org context creates the same context rot problem as an individual over-extended session.

Workflow 7: LLM Wiki / Graphify

Impact: 71.5× fewer tokens per agentic session; faster onboarding, better AI output quality on legacy code

Before running any agentic task on a legacy codebase, we point the agent at a human-curated wiki layer describing the codebase's module structure — instead of at the raw source code.

The wiki contains: a module map (what each package/directory does in one sentence), a data flow diagram of the critical request paths, a list of known footguns and workarounds with their reasons, migration status for each module (legacy / in-progress / migrated), and ownership per module. It's 300–500 lines of structured markdown, updated whenever an engineer discovers something new about how the codebase actually works.

The result: 71.5× fewer tokens consumed by the agent when orienting itself before a task. Instead of spending the first 80,000 tokens reading through source files trying to understand the module structure, the agent reads the wiki (3,000 tokens), understands the landscape, and spends the remaining context budget on the actual work.

The Graphify part refers to generating a visual dependency graph from the wiki — we pipe the module relationships through a graph tool to produce an SVG dependency map that both engineers and agents can reference. When an agent is about to touch a module, it checks the graph to understand what it's connected to and which changes will have transitive effects.
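The transitive-effects lookup is a reverse traversal of the dependency graph. Here is a minimal sketch using a breadth-first search over inverted edges; the module names and the dictionary layout are hypothetical and do not reflect the format of our graph tool.

```python
from collections import deque
from typing import Dict, List, Set


def affected_modules(depends_on: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every module that directly or transitively depends on `changed`."""
    # Invert the edges: for each module, who depends on it?
    dependents: Dict[str, Set[str]] = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(changed)  # a cycle could lead back to the starting module
    return seen


# Hypothetical module map: each module lists what it depends on.
graph = {
    "checkout-ui": ["payments-api"],
    "payments-api": ["ledger", "gateway-client"],
    "reports": ["ledger"],
    "ledger": [],
    "gateway-client": [],
}
print(sorted(affected_modules(graph, "ledger")))
```

The visited set means cycles in the module map terminate instead of looping forever.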

This workflow is now a prerequisite for all agentic work on the legacy frontend app and the payments platform. Starting a session without the wiki is like giving the agent a map — starting with it is like giving it GPS.

Workflow 8: Subagent-Driven Codebase Research

Impact: New engineers productive on unfamiliar modules in hours instead of days; senior engineers off archaeology work

When an engineer needs to understand a complex, unfamiliar part of the codebase — before making a change, debugging an incident, or onboarding to a module — we spawn a dedicated research subagent instead of using the main session.

The pattern: "Use a subagent to trace how the payment flow handles failed transactions — follow it from the API route through the service layer to the database and the third-party gateway. Map every file involved and document the error handling at each step." The subagent gets its own context window, reads as many files as needed, and returns a structured summary. The main session receives the summary without having consumed thousands of tokens on file reads.

The reason this matters: the main session context is finite and precious. Using it for discovery work means less capacity for the actual implementation. A subagent's file reads don't pollute the main window. The engineer can ask five subagents to research five different parts of the codebase in parallel, then synthesize the findings into a plan in the main session with the full context budget intact.

We use this particularly heavily on the legacy PHP backend → NestJS migration work, where understanding the implicit dependencies in legacy code before touching it is often more than half the job.

Workflow 9: Figma MCP Design-to-Code

Impact: Design iteration cycle days → minutes; designer changes ship without re-implementation

The Figma MCP integration allows an agent to read directly from a Figma design file and produce production-ready component code. No screenshot → manual implementation cycle.

The setup: engineer adds the Figma MCP server to their Claude Code config, which gives the agent authenticated read access to any Figma file in the org. When a designer finishes a spec, the engineer connects the agent to the file URL, describes which component to build and which design tokens to use, and the agent reads the design properties — colors, spacing, typography, layer hierarchy — and produces the implementation.

We've shipped this on two production projects: the Roles page and the B2B home page. What actually saves time isn't the first-pass generation — it's iteration speed. When the designer updates the spec, re-running the agent against the updated Figma file takes minutes. Previously, design changes meant a manual back-and-forth that stretched over days. The workflow removes the translation labor, not the engineering judgment — accessibility, interactive states, performance, and edge cases all still require human review before merge.

One constraint we enforce in CLAUDE.md: the agent targets design tokens from our shared package, not hardcoded hex values. Without this rule, the agent uses whatever color it reads from Figma, which defeats the design system.
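A rule like this is easy to back with a mechanical check on the generated component. The sketch below is a heuristic scan for hex color literals; the regular expression and the sample component are assumptions for illustration and will produce occasional false positives (for example, an anchor such as "#fade").

```python
import re
from typing import List, Tuple

# Matches #RGB, #RGBA, #RRGGBB, and #RRGGBBAA color literals.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b")


def hardcoded_colors(source: str) -> List[Tuple[int, str]]:
    """Return (line number, literal) for every hex color found in the source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOR.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical generated component.
sample = """const Card = styled.div`
  color: #1A2B3C;
  background: var(--color-surface);
  border-color: #fff;
`;"""
print(hardcoded_colors(sample))
```

Running it in CI on agent-generated files catches the cases where the CLAUDE.md rule was ignored.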

Workflow 10: AI-Assisted Merge Conflict Resolution

Impact: Multi-file conflicts resolved consistently in one pass; semantic errors from sequential resolution eliminated

Merge conflict resolution feels straightforward but isn't — especially when conflicts span multiple files with different intents behind each change. Human resolution is sequential: you look at one conflict, resolve it, move to the next. When a conflict in one file has implications for a conflict in another, you miss the connection.

Agents hold all conflicting changes in context at once and reason about consistency across the full set. The workflow: with the merge still in its conflicted state, run git diff (which shows a combined diff for every unmerged file) to extract all conflict markers, paste the output to the agent, and write one sentence describing the intent of each branch: what was each engineer trying to accomplish? The agent reads all conflicts together, proposes resolutions that are internally consistent, and explains the reasoning for each. Engineer reviews and accepts or overrides.

The sentences matter more than the diff. "Branch A was adding rate limiting to the payment handler, Branch B was refactoring the same handler for the NestJS migration" gives the agent what it needs to judge which changes are load-bearing and which can yield. Without that context, the agent resolves conflicts syntactically but not semantically.
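To pair each conflict with its intent sentence, it helps to pull the two sides apart first. This is a sketch of a parser for standard Git conflict markers in a working-tree file (not the combined-diff output); the function name and the sample file content are hypothetical.

```python
from typing import List, Tuple


def parse_conflicts(text: str) -> List[Tuple[str, str]]:
    """Return (ours, theirs) text for each conflict block in a file."""
    conflicts = []
    ours: List[str] = []
    theirs: List[str] = []
    state = None  # None, "ours", "base", or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            state, ours, theirs = "ours", [], []
        elif line.startswith("||||||| ") and state == "ours":
            state = "base"  # diff3-style base section, skipped here
        elif line.startswith("=======") and state in ("ours", "base"):
            state = "theirs"
        elif line.startswith(">>>>>>> ") and state == "theirs":
            conflicts.append(("\n".join(ours), "\n".join(theirs)))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts


# Hypothetical conflicted handler.
sample = """def charge(amount):
<<<<<<< HEAD
    limiter.check(amount)
    return gateway.charge(amount)
=======
    return gateway_service.charge(amount)
>>>>>>> nestjs-migration
"""
for ours_text, theirs_text in parse_conflicts(sample):
    print("OURS:\n" + ours_text)
    print("THEIRS:\n" + theirs_text)
```

Lines outside conflict blocks are ignored, so the output contains only the contested hunks that need an intent sentence.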

We run this on a dedicated worktree — so the resolution work is isolated and we can discard it cleanly if the agent's proposal needs significant overrides.

Workflow 11: AI-Assisted Cherry-Pick Branch Split

Impact: Combined-work branches deployed independently; migration and bug-fix work decoupled without re-implementation

The scenario: an engineer has a branch with combined work — two features developed together, or a feature plus a bug fix touching the same files. We need to deploy them separately. The manual approach is re-implementing one set of changes from scratch on a fresh branch, which is tedious and error-prone.

The agent workflow:

git diff main...feature-branch > full.diff

Paste the full diff to the agent with two one-paragraph descriptions: what Feature A is trying to accomplish, and what Feature B is trying to accomplish. The agent analyzes every changed file and produces two clean patch sets — one for each intent. Each patch set contains only the changes relevant to that feature, with shared-file changes split correctly between them. Apply each to a fresh branch with git apply, verify with the test suite, and open separate PRs.

This comes up more often than expected during active migration periods. An engineer migrating a payment handler to NestJS shouldn't hold up a bug fix in the same handler just because they touched the same file. The cherry-pick split workflow eliminates that coupling.

The constraint: the agent's split is a proposal, not a final answer. Before opening PRs, the engineer runs the test suite on each split branch and verifies the application behavior independently. The agent is reliable at the mechanical split; human review catches cases where the two intents genuinely share a dependency that can't be separated cleanly.
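One cheap check before running the test suite is confirming that the two patch sets together cover every file in the original diff. The sketch below assumes unified diffs with standard "diff --git" headers and paths without spaces; the helper names and the sample patches are hypothetical.

```python
import re
from typing import Dict, List, Set

DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def files_in_patch(patch: str) -> Set[str]:
    """Collect the post-change path from each file header in a unified diff."""
    return {match.group(2) for match in DIFF_HEADER.finditer(patch)}


def split_report(full: str, patch_a: str, patch_b: str) -> Dict[str, List[str]]:
    """Compare the files touched by the two patch sets against the full diff."""
    everything = files_in_patch(full)
    a, b = files_in_patch(patch_a), files_in_patch(patch_b)
    return {
        "missing": sorted(everything - (a | b)),     # dropped by the split
        "unexpected": sorted((a | b) - everything),  # not in the original diff
        "shared": sorted(a & b),                     # split within one file
    }


# Hypothetical headers only; real patches would carry hunks as well.
full = ("diff --git a/src/handler.ts b/src/handler.ts\n"
        "diff --git a/src/limits.ts b/src/limits.ts\n"
        "diff --git a/src/module.ts b/src/module.ts\n")
patch_a = ("diff --git a/src/handler.ts b/src/handler.ts\n"
           "diff --git a/src/limits.ts b/src/limits.ts\n")
patch_b = "diff --git a/src/handler.ts b/src/handler.ts\n"
print(split_report(full, patch_a, patch_b))
```

Files listed under "shared" are the ones to review most closely, since that is where the two intents were separated hunk by hunk.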

Rolling Out to 50+

After the pilot, we had enough data to make a decision. Data wasn't the hard part of rollout. Culture was.

The skeptics on the full 50-person team had two real objections — not the ones they said out loud:

"This will make junior engineers worse." The fear that if juniors outsource thinking to AI, they'll stop developing judgment. This is legitimate. Our answer: AI tools don't replace the code review process, they add a step before it. A junior engineer submitting AI-generated code they can't explain is a code review failure, not an AI failure.

"My job is being automated." Unspoken but real. We addressed it directly in an all-hands: the engineers who will be most valuable in an agentic world are the ones who deeply understand what they're building. AI amplifies clear thinking. It exposes muddled thinking. If you know your domain, these tools make you faster. If you don't, they make the gap more visible.

The rollout:

Weeks 1–2: Mandatory 1-hour onboarding session per team (not optional, not async-only)
Weeks 3–4: Leads ran one live "pair prompting" session with their team — debugging or feature slicing with the tool on screen
Month 2: Retrospective by team, shared across the org

We standardized on Claude Code for terminal/agentic workflows and Cursor for in-editor assistance, but we didn't mandate which to use for any given task. The point was fluency, not uniformity.

CLAUDE.md Standardization

One of the highest-leverage moves in the rollout was treating CLAUDE.md as a team artifact, not a personal one. CLAUDE.md is the project-level file Claude reads at the start of every session — your persistent instructions for architecture, conventions, forbidden patterns, and how to run tests. Plan Mode output quality is directly proportional to how good your CLAUDE.md is.

We built a shared template and required every project to start from it. What goes in: tech stack and repo structure, exact build/test/lint commands, hard rules (no pushing secrets, no rm -rf, preferred patterns). What stays out: style rules (that's what a linter is for), stale code snippets, and anything that reads like a generic "write clean code" instruction. Target length is under 200 lines — past that, Claude becomes less reliable at following the rules buried at the bottom.

When Claude makes a mistake that's worth preventing in future sessions, we don't fix the CLAUDE.md manually. We tell Claude: "Update the rules so we never do this again." This creates a self-correcting feedback loop over time.

Security Hooks at Scale

With 50+ engineers on the same toolchain, the risk isn't malice — it's accidents. Claude Code runs with full user-level privileges. Before rolling out org-wide, we deployed PreToolUse hooks that block a specific set of commands regardless of what Claude decides in the moment: rm -rf /, force-push to main, sudo, curl-pipe-to-bash, and reading .env files via bash. Hooks fire even in bypass-permissions mode, which makes them the actual enforcement layer, not just a soft request.
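A hook of this kind is a small script. The sketch below assumes, based on our understanding of the hook contract, that the tool call arrives as JSON with tool_name and tool_input.command fields and that exit code 2 blocks the call; verify both against the current Claude Code documentation. The rule names and patterns are illustrative, not our production list.

```python
import re
import sys
from typing import Optional

# (rule name, pattern) pairs for commands that should never run.
SIMPLE_RULES = [
    ("rm -rf on root", re.compile(r"\brm\s+-(?:rf|fr)\s+/(?:\s|$)")),
    ("sudo", re.compile(r"\bsudo\b")),
    ("curl piped to shell", re.compile(r"\bcurl\b[^|]*\|\s*(?:ba)?sh\b")),
    ("reading .env", re.compile(r"\b(?:cat|less|more|head|tail)\b[^|;&]*\.env\b")),
]


def _force_push_to_main(command: str) -> bool:
    """True for a git push that carries a force flag and names main."""
    return (re.search(r"\bgit\s+push\b", command) is not None
            and re.search(r"(?:^|\s)(?:--force|-f)(?:\s|$)", command) is not None
            and re.search(r"\bmain\b", command) is not None)


def blocked_rule(command: str) -> Optional[str]:
    """Return the name of the first rule the command violates, or None."""
    for name, pattern in SIMPLE_RULES:
        if pattern.search(command):
            return name
    if _force_push_to_main(command):
        return "force-push to main"
    return None


def handle(payload: dict) -> int:
    """Return the exit code for one hook invocation (assumed: 2 blocks)."""
    if payload.get("tool_name") != "Bash":
        return 0
    command = payload.get("tool_input", {}).get("command", "")
    rule = blocked_rule(command)
    if rule is not None:
        print(f"Blocked by policy: {rule}", file=sys.stderr)
        return 2
    return 0

# In the real hook script, the entry point would be:
#   import json; sys.exit(handle(json.load(sys.stdin)))
```

Pattern matching on command strings is a backstop, not a sandbox: it catches accidents, which is the risk described above, and should sit alongside restricted permissions.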

We also set up deny rules in managed settings to block reading secrets and production credentials. The rule of thumb we use: treat Claude like a highly capable contractor who has never worked in this environment before. Give it minimum necessary permissions, verify before the session what those permissions are, and audit what it actually does.

What Changed (And What Didn't)

Six months later:

PR cycle time dropped ~28% on average — mostly on the backend team where boilerplate was heaviest. Frontend saw less improvement because our component patterns were already tight.

Onboarding time for new engineers on unfamiliar modules dropped significantly. Being able to ask "explain this file's role in the payment flow" and get a coherent answer in context is genuinely useful for someone new to a codebase they didn't build.

Senior engineers spend less time on "spelunking" — navigating legacy code to understand context before making a change. That time shifted to design review and code review, where senior judgment matters most.

What didn't change:

Architecture decisions still take the same amount of time. You can't prompt your way to a good system design. The tools are fast at the leaf nodes; the hard work is in the structure.

Code review thoroughness stayed constant. If anything, reviewers got more rigorous because AI-generated code can be confidently incorrect in ways that human-written code rarely is.

The Hard Parts Nobody Talks About

Context management is a skill. The difference between a 200-line useful output and a 200-line hallucinated mess is whether you gave the tool the right context upfront, and whether the session was still within its effective operating range when you asked. Past 100K–120K tokens, Claude starts dropping instructions and making inconsistent decisions — not failing noisily, just quietly getting worse. Engineers who don't know about this threshold blame the model. The model isn't the problem; the session management is. We now teach context hygiene explicitly in onboarding alongside prompt writing.

The tools expose your documentation debt. When you ask an agent to understand a module and it gives you garbage, it's usually because the module has no comments, no clear interface, and was written by someone who left two years ago. The AI didn't fail — your documentation did. We used the rollout as a forcing function to start documenting the worst offenders.

Prompting varies by seniority in unexpected ways. Junior engineers were often better at narrow, well-defined tasks. Senior engineers were better at open-ended exploratory use. But senior engineers also had the highest variance — the ones who over-trusted the output produced some of our worst AI-assisted PRs.

Vendor lock-in anxiety is real. At least three leads asked "what happens if this tool gets worse or the pricing changes?" We don't have a great answer. We chose tools with API-level access where possible, but it's a legitimate concern we're watching.

Where We Are

The engineers who were skeptical in month one are now the loudest advocates — specifically because they pushed back early, ran real experiments, and built genuine fluency rather than just vibing with the tool.

The org-wide rollout is still in progress. The blocker isn't skepticism — the results speak clearly enough that interest is high. The current blocker is Gemini team alignment: several other engineering teams have Google tooling dependencies and want to understand how Gemini fits before committing to a Claude-primary stack org-wide. This is a reasonable concern, not resistance.

The one thing we'd change if starting over: begin the documentation cleanup earlier. The AI tooling exposed gaps we'd been ignoring for years. That cleanup took longer than the rollout itself.

If you're an engineering lead starting this process: the evaluation is worth doing properly, the cost math is more favorable than it looks at first, and the eleven workflows above are solid starting points regardless of which vendor you choose. The specific tooling matters less than having validated, documented, team-level workflows that everyone understands and trusts.

The direction we're moving toward is deliberately multi-vendor: Claude Code for complex, quality-critical work; Codex for async CI tasks and batch operations we want to run in the background; Gemini for exploratory work and budget-conscious engineers who want 1M context without a seat cost. The three tools are cross-compatible enough — shared MCP servers, a universal AGENTS.md format, the same SKILL.md convention — that running all three doesn't mean three different knowledge bases. It means the right tool for the right task, without locking the team into a single vendor's pricing and reliability trajectory.
