The Architecture Tax: AI Coding Assistants Demand Better Planning

Something peculiar happened when software development teams started delegating code generation to AI assistants. The traditional burden of implementation, that painstaking process of translating designs into working software, began shifting elsewhere. But it did not disappear. Instead, it transformed into something altogether different: an intensified requirement for architectural rigour that many teams were unprepared to provide.

In early 2025, a randomised controlled trial conducted by METR examined how AI tools affect the productivity of experienced open-source developers. Sixteen developers with moderate AI experience completed 246 tasks in mature projects on which they had an average of five years of prior experience. Each task was randomly assigned to allow or disallow usage of early 2025 AI tools. The finding shocked the industry: developers using AI tools took 19% longer to complete tasks than those working without them. Before starting, developers had forecast that AI would reduce their completion time by 24%. Even after finishing the study, participants still believed AI had made them faster, despite the data proving otherwise.

This perception gap reveals something fundamental about the current state of AI-assisted development. The tools are genuinely powerful, but their power comes with hidden costs that manifest as architectural drift, context exhaustion, and what practitioners have come to call the “zig-zag problem”: the iterative back-and-forth that emerges when teams dive into implementation without sufficient upfront specification.

The Great Delegation

The scale of AI adoption in software development has been nothing short of revolutionary. By March 2025, Y Combinator reported that 25% of startups in its Winter 2025 batch had codebases that were 95% AI-generated. These were not weekend projects built by hobbyists. These were venture-backed companies building production systems, with the cohort growing 10% per week in aggregate, making it the fastest-growing batch in YC history.

As CEO Garry Tan explained, the implications were profound: teams no longer needed fifty or a hundred engineers. They did not have to raise as much capital. The money went further. Companies like Red Barn Robotics developed AI-driven agricultural robots securing millions in contracts. Deepnight built military-grade night vision software for the US Army. Delve launched with over 100 customers and a multi-million pound run rate, all with remarkably lean teams.

Jared Friedman, YC's managing partner, emphasised a crucial point about these companies: “It's not like we funded a bunch of non-technical founders. Every one of these people is highly technical, completely capable of building their own products from scratch. A year ago, they would have built their product from scratch, but now 95% of it is built by an AI.”

Yet beneath these success stories lurked a more complicated reality. Pete Hodgson, writing about AI coding assistants in May 2025, captured the core problem with devastating clarity: “The state of the art with coding agents in 2025 is that every time you start a new chat session, your agent is reset to the same knowledge as a brand new hire, one who has carefully read through all the onboarding material and is good at searching through the codebase for context.”

This “brand new hire” phenomenon explains why architectural planning has become so critical. Traditional developers build mental models of codebases over months and years. They internalise team conventions, understand why certain patterns exist, and recognise the historical context behind architectural decisions. AI assistants possess none of this institutional memory. They approach each session with technical competence but zero contextual awareness.

The burden that has shifted is not the mechanical act of writing code. It is the responsibility for ensuring that generated code fits coherently within existing systems, adheres to established patterns, and serves long-term maintainability rather than short-term convenience.

Context Windows and the Memory Problem

To understand why architectural planning matters more with AI assistants, you must first understand how these systems process information. Every AI model operates within what engineers call a context window: the total amount of text it can consider simultaneously. By late 2025, leading models routinely supported 200,000 tokens or more, with some reaching one million tokens. Google's Gemini models offered input windows of over a million tokens, enough to analyse entire books or multi-file repositories in a single session.

But raw capacity tells only part of the story. Timothy Biondollo, writing about the fundamental limitations of AI coding assistants, articulated what he calls the Principle of Compounding Contextual Error: “If an AI interaction does not resolve the problem quickly, the likelihood of successful resolution drops with each additional interaction.”

The mechanics are straightforward but devastating. As you pile on error messages, stack traces, and correction prompts, you fill the context window with what amounts to garbage data. The model is reading a history full of its own mistakes, which biases it toward repeating them. A long, winding debugging session is often counterproductive. Instead of fixing the bug, you are frequently better off resetting the context and starting fresh with a refined prompt.

This dynamic fundamentally changes how teams must approach complex development tasks. With human developers, extended debugging sessions can be productive because humans learn from their mistakes within a session. They build understanding incrementally. AI assistants do the opposite: their performance degrades as sessions extend because their context becomes polluted with failed attempts.

The practical implication is that teams cannot rely on AI assistants to self-correct through iteration. The tools lack the metacognitive capacity to recognise when they are heading down unproductive paths. They will cheerfully continue generating variations of flawed solutions until the context window fills with a history of failures, at which point the quality of suggestions deteriorates further.

Predictions from industry analysts suggest that one million or more tokens will become standard for flagship models in 2025 and 2026, with ten million token contexts emerging in specialised models by 2027. True “infinite context” solutions may arrive in production by 2028. Yet even with these expansions, the fundamental challenge remains: more tokens do not eliminate the problem of context pollution. They merely delay its onset.

The Specification Renaissance

This context limitation has driven a renaissance in software specification practices. What the industry has come to call spec-driven development represents one of 2025's most significant methodological shifts, though it lacks the visibility of trendier terms like vibe coding.

Thoughtworks describes spec-driven development as a paradigm that uses well-crafted software requirement specifications as prompts for AI coding agents to generate executable code. The approach explicitly separates requirements analysis from implementation, formalising requirements into structured documents before any code generation begins.

GitHub released Spec Kit, an open-source toolkit that provides templates and workflows for this approach. The framework structures development through four distinct phases: Specify, Plan, Tasks, and Implement. Each phase produces specific artifacts that carry forward to subsequent stages.

In the Specify phase, developers capture user journeys and desired outcomes. As the Spec Kit documentation emphasises, this is not about technical stacks or application design. It focuses on experiences and what success looks like: who will use the system, what problem it solves, how users will interact with it, and what outcomes matter. This specification becomes a living artifact that evolves as teams learn more about users and their needs.
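
A minimal Specify-phase artifact need not be long. The sketch below is a hypothetical example (the feature, file name, and criteria are invented for illustration); a real specification would go into considerably more depth.

```
# spec: shared reading lists (hypothetical example)

## Users and problem
Readers want to share curated article lists with colleagues without
forcing every recipient to create an account.

## User journey
1. A reader assembles a list of saved articles.
2. They generate a share link and send it to a colleague.
3. The colleague opens the link and reads the list without signing in.

## Success criteria
- A list can be shared in three clicks or fewer.
- Shared links require no authentication to view.
- List owners can revoke a link at any time.

## Explicitly out of scope here
- Technology choices, data models and API design (these belong in the plan).
```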

The Plan phase gets technical. Developers encode their desired stack, architecture, and constraints. If an organisation standardises on certain technologies, this is where those requirements become explicit. The plan captures compliance requirements, performance targets, and security policies that will guide implementation.

The Tasks phase breaks specifications into focused, reviewable work units. Each task solves a specific piece of the puzzle and enables isolated testing and validation. Rather than asking an AI to generate an entire feature, developers decompose work into atomic units that can be independently verified.

Only in the Implement phase do AI agents begin generating code, now guided by clear specifications and plans rather than vague prompts. The approach transforms fuzzy intent into unambiguous instructions that language models can reliably execute.

Planning Artifacts That Actually Work

Not all specification documents prove equally effective at guiding AI assistants. Through extensive experimentation, the industry has converged on several artifact types that demonstrably reduce architectural drift.

The spec.md file has emerged as foundational. Addy Osmani, Chrome engineering lead at Google, recommends creating a comprehensive specification document containing requirements, architecture decisions, data models, and testing strategy. This document forms the basis for development, providing complete context before any code generation begins. Osmani describes the approach as doing “waterfall in fifteen minutes” through collaborative specification refinement with the AI before any code generation occurs.
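
Stripped to its headings, such a document is simply structured markdown. The skeleton below is one plausible shape rather than a prescribed template.

```
# spec.md (illustrative skeleton)

## Requirements
Functional and non-functional requirements, each stated as a testable claim.

## Architecture decisions
Chosen patterns, component boundaries, and the reasoning behind them.

## Data models
Entities, fields, relationships, and validation rules.

## Testing strategy
What gets unit-tested, what gets integration-tested, and what "done" means.
```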

Tasks.md serves a complementary function, breaking work into incremental, testable steps with validation criteria. Rather than jumping straight into code, this process establishes intent first. The AI assistant then uses these documents as context for generation, ensuring each piece of work connects coherently to the larger whole.
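
A tasks.md in this style might decompose the work roughly as follows; the tasks and validation criteria are hypothetical and continue the earlier example.

```
# tasks.md (hypothetical example)

- [ ] 1. Add share_links table and migration
      Validation: migration applies and rolls back cleanly on a fresh database.
- [ ] 2. Implement link-generation endpoint
      Validation: unit tests cover expiry, revocation, and duplicate requests.
- [ ] 3. Build read-only shared-list view
      Validation: page renders without an authenticated session.
```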

Plan.md captures the technical approach: a short overview of the goal, the main steps or phases required to achieve it, and any dependencies, risks, or considerations to keep in mind. This document bridges the gap between what the system should do and how it should be built.
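
A plan.md following that shape might read like this hedged sketch, again with invented details:

```
# plan.md (hypothetical example)

## Goal
Ship shareable reading lists behind a feature flag.

## Approach
1. Extend the existing Postgres schema rather than adding a new data store.
2. Reuse the current token-signing library for share links.
3. Serve the shared view from the existing web app, not a new service.

## Constraints and risks
- Must satisfy the team's existing data-retention policy.
- Link tokens must not leak user identifiers.
- Risk: revocation semantics interact badly with CDN caching.
```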

Perhaps most critically, the CLAUDE.md file (or its equivalent for other AI tools) has become what practitioners call the agent's constitution: its primary source of truth for how a specific repository works. HumanLayer, a company building tooling for AI development workflows, recommends keeping the file under sixty lines; the broader consensus is that anything under three hundred lines works, and shorter is better still.

The rationale for brevity is counterintuitive but essential. Since CLAUDE.md content gets injected into every single session, bloated files consume precious context window space that should be reserved for task-specific information. The document should contain universally applicable information: core application features, technology stacks, and project notes that should never be forgotten. Anthropic's own guidance emphasises preferring pointers to copies: rather than including code snippets that will become outdated, include file and line references that point the assistant to authoritative context.
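
Following that guidance, a CLAUDE.md can stay well under the recommended limits by pointing rather than copying. Everything in the sketch below, including the file paths, is illustrative.

```
# CLAUDE.md (illustrative, deliberately short)

## What this repository is
A Django monolith serving the customer-facing API; background jobs run via Celery.

## Conventions that must never be violated
- All database access goes through the repository classes in core/repositories/.
- Every new endpoint needs an entry in docs/api-changelog.md.

## Pointers, not copies
- Architecture overview: docs/architecture.md
- Error-handling pattern: core/errors.py (see the module docstring)
- Test and lint commands: make test, make lint
```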

Architecture Decision Records in the AI Era

A particularly interesting development involves the application of Architecture Decision Records to AI-assisted development. Doug Todd has demonstrated transformative results using ADRs with Claude and Claude Code, showing how these documents provide exactly the kind of structured context that AI assistants need.

ADRs provide enough structure to ensure key points are addressed, but express that structure in natural language, which is perfect for large language model consumption. They capture not just what was decided, but why, recording the context, options considered, and reasoning behind architectural choices.

Chris Swan, writing about this approach, notes that ADRs might currently be an elite team practice, but they are becoming part of a boilerplate approach to working with AI coding assistants. This becomes increasingly important as teams shift to agent swarm approaches, where they are effectively managing teams of AI workers, exactly the sort of environment that ADRs were originally created for.

The transformation begins when teams stop thinking of ADRs as documentation and start treating them as executable specifications for both human and AI behaviour. Every ADR includes structured metadata and clear instructions that AI assistants can parse and apply immediately. Accepted decisions become mandatory requirements. Proposed decisions become considerations. Deprecated and superseded decisions trigger active avoidance patterns.
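
Treated this way, an ADR reads less like a memo and more like an instruction the assistant can act on. The example below is hedged: the metadata fields and the decision itself are invented for illustration rather than drawn from any particular team's template.

```
---
adr: 0007
title: Use event sourcing for order state
status: accepted          # accepted = mandatory for new code
date: 2025-04-02
supersedes: adr-0003
---

## Context
Order state transitions were being lost under concurrent updates.

## Decision
All order state changes are recorded as immutable events; current state is a projection.

## Consequences
- New order features must append events, never mutate rows directly.
- Guidance for AI assistants: when generating order-handling code, do not write
  UPDATE statements against the orders table; emit events through the event store.
```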

Dave Patten describes using AI agents to enforce architectural standards, noting that LLMs and autonomous agents are being embedded in modern pipelines to enforce architectural principles. The goal is not perfection but catching drift early before it becomes systemic.

ADR rot poses a continuing challenge. It does not happen overnight. At first, everything looks healthy: the repository is clean, decisions feel current, and engineers actually reference ADRs. Then reality sets in. Teams ship features, refactor services, migrate infrastructure, and retire old systems. If no one tends the ADR log, it quietly drifts out of sync with the system. Once that happens, engineers stop trusting it. The AI assistant, fed outdated context, produces code that reflects decisions the team has already moved past.

The Zig-Zag Problem

Without these planning artifacts, teams inevitably encounter what practitioners call the zig-zag problem: iterative back-and-forth that wastes cycles and produces inconsistent results. One developer who leaned heavily on AI generation for a rushed project described the outcome as “an inconsistent mess, duplicate logic, mismatched method names, no coherent architecture.” He realised he had been “building, building, building” without stepping back to see what the AI had woven together. The fix required painful refactoring.

The zig-zag emerges from a fundamental mismatch between how humans and AI assistants approach problem-solving. Human developers naturally maintain mental models that constrain their solutions. They remember what they tried before, understand why certain approaches failed, and build incrementally toward coherent systems.

AI assistants lack this continuity. Each response optimises for the immediate prompt without consideration of the larger trajectory. Ask for a feature and you will get a feature, but that feature may duplicate existing functionality, violate established patterns, or introduce dependencies that conflict with architectural principles.

Qodo's research on AI code quality found that only about a third of developers verify AI-generated code more quickly than they could write it themselves; the remaining two-thirds spend longer on verification. Roughly a fifth face overruns of 50 to 100 percent or more, and approximately 11 percent report verification taking far longer still, with code mismatches requiring deep rework. Verification, not generation, has become the bottleneck.

The solution lies in constraining the problem space before engaging AI assistance. Hodgson identifies three key strategies: constrain the problem by being more directive in prompts and specifying exact approaches; provide missing context by expanding prompts with specific details about team conventions and technical choices; and enable tool-based context discovery through integrations that give AI access to schemas, documentation, and requirements.
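
Concretely, the difference shows up in the prompt itself. The before-and-after sketch below is hypothetical; the endpoint, helpers, and file paths are invented to illustrate the level of constraint involved.

```
Vague: "Add caching to the product search endpoint."

Constrained: "Add caching to GET /api/products/search using the existing Redis
helper in lib/cache.ts; do not introduce a new caching library. Follow the
cache-aside pattern used in lib/reviews-cache.ts, key on the normalised query
string, expire entries after 300 seconds, and add tests alongside the existing
search endpoint tests."
```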

Structuring Handoffs Between Planning and Implementation

The transition from planning to implementation represents a critical handoff that many teams execute poorly. GitHub's Spec Kit documentation emphasises that specifications should include everything a developer, or an AI agent, needs to know to start building: the problem, the approach, required components, validation criteria, and a checklist for handoff. By following a standard, the transition from planning to doing becomes clear and predictable.

This handoff structure differs fundamentally from traditional agile workflows. In conventional development, a user story might contain just enough information for a human developer to ask clarifying questions and fill in gaps through conversation. AI assistants cannot engage in this kind of collaborative refinement. They interpret prompts literally and generate solutions based on whatever context they possess.

The Thoughtworks analysis of spec-driven development emphasises that AI coding agents receive finalised specifications along with predefined constraints via rules files or agent configuration documents. The workflow centres on context engineering: carefully curating information for agent-LLM interaction, including real-time documentation integration through protocols that give assistants access to external knowledge sources.

Critically, this approach does not represent a return to waterfall methodology. Spec-driven development creates shorter feedback cycles than traditional waterfall's excessively long ones. The specification phase might take minutes rather than weeks. The key difference is that it happens before implementation rather than alongside it.

Microsoft's approach to agentic AI explicitly addresses handoff friction. Their tools bridge the gap between design and development, eliminating time-consuming handoff processes. Designers iterate in their preferred tools whilst developers focus on business logic and functionality, with the agent handling implementation details. Teams now receive notifications that issues are detected, analysed, fixed, and documented, all without human intervention. The agent creates issues with complete details so teams can review what happened and consider longer-term solutions during regular working hours.

The practical workflow involves marking progress and requiring the AI agent to update task tracking documents with checkmarks or completion notes. This gives visibility into what is done and what remains. Reviews happen after each phase: before moving to the next set of tasks, teams review code changes, run tests, and confirm correctness.
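
In practice, that tracking can be as simple as updating the same tasks file as work lands. A hedged sketch, continuing the earlier hypothetical example:

```
# tasks.md after the first review gate (hypothetical)

- [x] 1. Add share_links table and migration (merged; migration tested both ways)
- [x] 2. Implement link-generation endpoint (revocation edge case noted in spec)
- [ ] 3. Build read-only shared-list view (blocked on design sign-off)
```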

The Self-Correction Illusion

Perhaps the most dangerous misconception about AI coding assistants is that they can self-correct through iteration. The METR study's finding that developers take 19% longer with AI tools, despite perceiving themselves as faster, points to a fundamental misunderstanding of how these tools operate.

The problem intensifies in extended sessions. When auto-compacting messages appear during a long coding session, quality drops: responses become vaguer, and what was once a capable coding partner grows noticeably less effective. The degradation occurs because compaction loses information; each pass discards detail, so long sessions feel like they degrade over time because you are, in effect, watching the AI forget.

Instead of attempting marathon sessions where you expect the AI to learn and improve, effective workflows embrace a different approach: stop trying to do everything in one session. For projects spanning multiple sessions, implementing comprehensive logging and documentation serves as external memory. Documentation becomes the only bridge between sessions, requiring teams to write down everything needed to resume work effectively whilst minimising prose to conserve context window space.

Anthropic's September 2025 announcement of new context management capabilities represented a systematic approach to this problem. The introduction of context editing and memory tools reduced token consumption by 84 percent in testing, and in a 100-turn web search evaluation, context editing enabled agents to complete workflows that would otherwise have failed due to context exhaustion.

The recommended practice involves dividing and conquering with sub-agents: modularising large objectives and delegating API research, security review, or feature planning to specialised sub-agents. Each sub-agent gets its own context window, preventing any single session from approaching limits. Telling the assistant to use sub-agents to verify details or investigate particular questions, especially early in a conversation or task, tends to preserve context availability without much downside in terms of lost efficiency.

Extended thinking modes also help. Anthropic recommends using specific phrases to trigger additional computation time: “think” triggers basic extended thinking, whilst “think hard,” “think harder,” and “ultrathink” map to increasing levels of thinking budget. These modes give the model additional time to evaluate alternatives more thoroughly, reducing the need for iterative correction.

Practical Limits of AI Self-Correction

Understanding the practical boundaries of AI self-correction helps teams design appropriate workflows. Several patterns consistently cause problems.

Open solution spaces present the first major limitation. When problems have multiple valid solutions, it is extremely unlikely that an AI will choose the right one without explicit guidance. The AI assistant makes design decisions at the level of a fairly junior engineer and lacks the experience to challenge requirements or suggest alternatives.

Implicit knowledge creates another barrier. The AI has no awareness of your team's conventions, preferred libraries, business context, or historical decisions. Every session starts fresh, requiring explicit provision of context that human team members carry implicitly. Anthropic's own research emphasises that Claude is already smart enough. Intelligence is not the bottleneck; context is. Every organisation has its own workflows, standards, and knowledge systems, and the assistant does not inherently know any of these.

Compound errors represent a third limitation. Once an AI starts down a wrong path, subsequent suggestions build on that flawed foundation. Without human intervention to recognise and redirect, entire implementation approaches can go astray.

The solution is not more iteration but better specification. Teams seeing meaningful results treat context as an engineering surface, determining what should be visible to the agent, when, and in what form. More information is not always better. AI can be more effective when further abstracted from the underlying system because the solution space becomes wider, allowing better leverage of generative and creative capabilities.

The Rules File Ecosystem

The tooling ecosystem has evolved to support these context management requirements. Cursor, one of the most popular AI coding environments, has developed an elaborate rules system. Large language models do not retain memory between completions, so rules provide persistent, reusable context at the prompt level. When applied, rule contents are included at the start of the model context, giving the AI consistent guidance for generating code.

The system distinguishes between project rules, stored in the .cursor/rules directory and version-controlled with the codebase, and global rules that apply across all projects. Project rules encode domain-specific knowledge, standardise patterns, and automate project workflows. They can be scoped using path patterns, invoked manually, or included based on relevance.

The legacy .cursorrules file has been deprecated in favour of individual .mdc files inside the .cursor/rules/ directory. This change provides better organisation, easier updates, and more focused rule management. Each rule lives in its own file with the .mdc (Markdown Components) extension, allowing for both metadata in frontmatter and rule content in the body.
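
A project rule in this format pairs frontmatter metadata with the rule body. The field names below follow Cursor's documentation; the rule content itself is invented for illustration.

```
---
description: Error-handling conventions for API route handlers
globs: src/api/**/*.ts
alwaysApply: false
---

- Wrap all external service calls in the shared withRetry helper.
- Never return raw exception messages to clients; map failures to the ApiError
  types defined in src/api/errors.ts.
- Log failures with the request ID supplied by the tracing middleware.
```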

The critical insight for 2025 is setting up what practitioners call the quartet: Model Context Protocol servers, rules files, memories, and auto-run configurations at the start of projects. This reduces token usage by only activating relevant rules when needed, giving the language model more mental space to focus on specific tasks rather than remembering irrelevant guidelines.

Skills represent another evolution: organised folders of instructions, scripts, and resources that AI assistants can dynamically discover and load. These function as professional knowledge packs that raise the quality and consistency of outputs across entire organisations.
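
Concretely, such a pack is often little more than a folder whose entry file tells the assistant when to load it and what to do. The sketch below is one plausible shape, with an invented skill rather than any published example.

```
---
name: release-notes
description: Drafts customer-facing release notes from merged pull requests
---

## When to use this skill
The user asks for release notes, a changelog entry, or a launch announcement.

## Instructions
1. Collect merged pull request titles since the last tagged release.
2. Group them into Features, Fixes, and Internal changes.
3. Rewrite each item in plain customer language; drop internal-only changes.
4. Follow the wording template bundled with this skill in templates/release-notes.md.
```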

Code Quality and the Verification Burden

The shift in architectural burden comes with significant implications for code quality. A landmark Veracode study in 2025 analysed over 100 large language models across 80 coding tasks and found that 45 percent of AI-generated code introduces security vulnerabilities. These were not minor bugs but critical flaws, including those in the OWASP Top 10.

In March 2025, a vibe-coded payment gateway approved over 1.5 million pounds in fraudulent transactions due to inadequate input validation. The AI had copied insecure patterns from its training data, creating a vulnerability that human developers would have caught during review.

Technical debt compounds the problem. Over 40 percent of junior developers admitted to deploying AI-generated code they did not fully understand. AI-generated code tends to include 2.4 times more abstraction layers than human developers would implement for equivalent tasks, leading to unnecessary complexity. Forrester forecast an incoming technical debt tsunami over the following two years due to advanced AI coding agents.

The verification burden has shifted substantially. Where implementation was once the bottleneck, review now consumes disproportionate resources. Code review times ballooned by approximately 91 percent in teams with high AI usage. The human approval loop became the chokepoint.

Teams with strong code review processes experience quality improvements when using AI tools, whilst those without see quality decline. This amplification effect makes thoughtful implementation essential. The solution involves treating AI-generated code as untrusted by default. Every piece of generated code should pass through the same quality gates as human-written code: automated testing, security scanning, code review, and architectural assessment.

The Team Structure Question

These dynamics have implications for how development teams should be structured. The concern that senior developers will spend their time training AI instead of training junior developers is real and significant. Some organisations report that senior developers became more adept at leveraging AI whilst spending less time mentoring, potentially creating future talent gaps.

Effective teams structure practices to preserve learning opportunities. Pair programming sessions include AI as a third participant rather than a replacement for human pairing. Code review processes use AI-generated code as teaching opportunities. Architectural discussions explicitly evaluate AI suggestions against alternatives.

Research on pair programming shows that two sets of eyes catch mistakes early, with studies finding pair-programmed code has up to 15 percent fewer defects. A meta-analysis found pairs typically consider more design alternatives than programmers working alone, arrive at simpler and more maintainable designs, and catch design defects earlier. Teams are adapting this practice: one developer interacts with the AI whilst another reviews the generated code and guides the conversation, creating three-way collaboration that preserves learning benefits.

The skill set required for effective AI collaboration differs from traditional development. Where implementation expertise once dominated, context engineering has become equally important. The most effective developers of 2025 are still those who write great code, but they increasingly augment that skill by mastering the art of providing persistent, high-quality context.

Surveying the Transformed Landscape

The architectural planning burden that has shifted to human developers represents a permanent change in how software gets built. AI assistants will continue improving, context windows will expand, and tooling will mature. But the fundamental requirement for clear specifications, structured context, and human oversight will remain.

Microsoft's chief product officer for AI, Aparna Chennapragada, sees 2026 as a new era for alliances between technology and people. If recent years were about AI answering questions and reasoning through problems, the next wave will be about true collaboration. The future is not about replacing humans but about amplifying them. GitHub's chief product officer, Mario Rodriguez, predicts repository intelligence: AI that understands not just lines of code but the relationships and history behind them.

A survey of over 700 CIOs forecasts that by 2030 none of the IT workload will be performed by humans alone: roughly 75 percent is expected to be human-AI collaboration and the remaining 25 percent fully autonomous AI tasks. Software engineering will be less about writing code and more about orchestrating intelligent systems. Engineers who adapt to these changes, embracing AI collaboration, focusing on design thinking, and staying curious about emerging technologies, will thrive.

The teams succeeding at this transition share common characteristics. They invest in planning artifacts before implementation begins. They maintain clear specifications that constrain AI behaviour. They structure reviews and handoffs deliberately. They recognise that AI assistants are powerful but require constant guidance.

The zig-zagging that emerges from insufficient upfront specification is not a bug in the AI but a feature of how these tools operate. They excel at generating solutions within well-defined problem spaces. They struggle when asked to infer constraints that have not been made explicit.

The architecture tax is real, and teams that refuse to pay it will find themselves trapped in cycles of generation and revision that consume more time than traditional development ever did. But teams that embrace the new planning requirements, that treat specification as engineering rather than documentation, will discover capabilities that fundamentally change what small groups of developers can accomplish.

The future of software development is not about choosing between human expertise and AI capability. It is about recognising that AI amplifies whatever approach teams bring to it. Disciplined teams with clear architectures get better results. Teams that rely on iteration and improvisation get the zig-zag.

The planning burden has shifted. The question is whether teams will rise to meet it.


References and Sources

  1. METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” (July 2025)
  2. Y Combinator, reported in TechCrunch, “A Quarter of Startups in YC's Current Cohort Have Codebases That Are Almost Entirely AI-Generated” (March 2025)
  3. Pete Hodgson, “Why Your AI Coding Assistant Keeps Doing It Wrong, and How To Fix It” (May 2025)
  4. Thoughtworks, “Spec-driven development: Unpacking one of 2025's key new AI-assisted engineering practices” (2025)
  5. GitHub Blog, “Spec-driven development with AI: Get started with a new open source toolkit” (2025)
  6. Addy Osmani, “My LLM coding workflow going into 2026” (December 2025)
  7. Timothy Biondollo, “How I Solved the Biggest Problem with AI Coding Assistants” (Medium, 2025)
  8. HumanLayer Blog, “Writing a good CLAUDE.md” (2025)
  9. Chris Swan's Weblog, “Using Architecture Decision Records (ADRs) with AI coding assistants” (July 2025)
  10. Dave Patten, “Using AI Agents to Enforce Architectural Standards” (Medium, 2025)
  11. Qodo, “State of AI code quality in 2025” (2025)
  12. Veracode, AI Code Security Study (2025)
  13. Anthropic, “Claude Code: Best practices for agentic coding” (2025)
  14. Anthropic, “Effective context engineering for AI agents” (2025)
  15. Cursor Documentation, “Rules for AI” (2025)
  16. MIT Technology Review, “From vibe coding to context engineering: 2025 in software development” (November 2025)
  17. Microsoft, “What's next in AI: 7 trends to watch in 2026” (2025)
  18. IT Brief, “CIOs forecast all IT work will involve AI-human collaboration by 2030” (2025)
  19. Stack Overflow, “2025 Developer Survey” (2025)
  20. Red Hat Developer, “How spec-driven development improves AI coding quality” (October 2025)

Tim Green

UK-based Systems Theorist & Independent Technology Writer

Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.

His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.

ORCID: 0009-0002-0156-9795
Email: tim@smarterarticles.co.uk
