
The promise

Point an agent at a ticket. Come back to a working merged PR.
That’s the pitch every engineering team is hearing right now. Tools like Claude Code, Cursor, Windsurf, and GitHub Copilot keep getting better at generating code. The demos are impressive. The benchmarks keep climbing. And your timeline is full of people showing off AI-written features shipping to production.

But here’s what actually happens

You point an agent at a ticket. It reads the codebase, picks a direction, and starts writing code. Twenty minutes later, you’ve got a PR with 400 lines of changes. It looks reasonable. The syntax is correct. The logic seems plausible.

Then you start looking closer. The code doesn’t match your team’s conventions. It makes assumptions about how a service behaves that haven’t been true since the last refactor. It passes the tests that exist but misses an edge case your test suite doesn’t cover. The build succeeds locally but fails in CI because of a dependency your dev environment handles differently. Issues and unknowns everywhere. The agent generated code. It didn’t ship a feature or a fix.

Most organizations are stuck comparing benchmarks at this point. Which model scores 2% higher on SWE-bench? Which agent completes more tasks? These comparisons miss the point entirely.

The real bottleneck

The bottleneck isn’t the agent’s or the model’s ability to write code. It’s your ability to give it quality input and verify the output. Without clear intent and verification, AI-generated code is text. Text that might work. Text that might break production. You won’t know until a human reviews every line, runs manual tests, and crosses their fingers during deployment. That’s not autonomy. That’s a more sophisticated autocomplete. The question isn’t “how smart is the model?” The question is: can your infrastructure tell the model whether it got the answer right?

Software 2.0: a new programming paradigm

Andrej Karpathy recently articulated a distinction that reframes how to think about AI’s impact on software:
“Software 1.0 easily automates what you can specify. Software 2.0 easily automates what you can verify.”
In Software 1.0, you write explicit algorithms by hand. If you can specify the rules, you can automate the task. The key question is: is the algorithm fixed and can you specify it?

Software 2.0 works differently. You specify objectives and search through the space of possible solutions. If you can verify whether a solution is correct, you can optimize for it. The key question becomes: is the task verifiable?

Karpathy defines the requirements for this to work. The environment has to be resettable (you can start a new attempt), efficient (you can make many attempts), and rewardable (there’s an automated process to evaluate each attempt).

This is why AI progress follows a “jagged frontier.” Tasks that are highly verifiable (math, coding, formal logic) advance the fastest. Tasks that are hard to verify (creative writing, strategic planning, design taste) progress more slowly. It’s not about difficulty. It’s about whether you can automate the feedback loop.
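The three requirements — resettable, efficient, rewardable — can be sketched in a few lines. This is an illustrative toy, not a real agent harness: the `Environment`, `reward`, and `search` names are invented here, and the hidden target stands in for any automatically checkable outcome.

```python
import random

# Toy illustration of a verifiable environment in Karpathy's sense.
# All names (Environment, reward, search) are invented for this sketch.

class Environment:
    def __init__(self, target):
        self.target = target              # stands in for "the correct outcome"

    def reset(self):
        """Resettable: every attempt starts from a clean state."""
        pass                              # nothing to tear down in this toy

    def reward(self, candidate):
        """Rewardable: an automated, unambiguous score for an attempt."""
        return 1.0 if candidate == self.target else 0.0

def search(env, attempts=10000):
    """Efficient: attempts are cheap, so we can search instead of hand-craft."""
    for _ in range(attempts):
        env.reset()
        candidate = random.randint(0, 99)
        if env.reward(candidate) == 1.0:
            return candidate
    return None
```

With all three properties in place, blind search finds the answer; replace `reward` with a slow human review and the loop stalls, which is the whole point of the jagged frontier.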

The asymmetry of verification

Jason Wei, researcher at Meta (formerly OpenAI), formalized a related concept: many tasks are far easier to verify than to solve. Think about Sudoku. Solving a puzzle takes time and concentration. Verifying a completed grid takes seconds: check each row, column, and box.

This asymmetry exists on a spectrum. On one end, tasks like Sudoku, math proofs, and code tests are easy to verify, because there’s a clear right answer. In the middle, tasks like arithmetic and data processing are symmetric, with roughly equal effort to solve and check. On the far end, tasks like writing essays, forming hypotheses, and creative work are hard to verify, because quality is subjective and slow to evaluate.

Wei identifies five properties that make a task verifiable:
  1. Objective truth: there’s a clear definition of correctness
  2. Fast to verify: checking takes seconds, not hours
  3. Scalable: you can verify many solutions in parallel
  4. Low noise: verification results correlate tightly with actual quality
  5. Continuous reward: you can rank solutions on a gradient, not pass/fail
The ease of training AI to solve a task is proportional to how verifiable the task is. If a task is possible to solve and easy to verify, AI will solve it.
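The Sudoku example makes the asymmetry concrete. Here is a sketch of the cheap side of it, the verifier (solving is the expensive part this deliberately skips):

```python
# Verifying a completed Sudoku grid: one pass over rows, columns, and boxes.
# Solving the same puzzle requires search; checking it takes milliseconds.

def is_valid_sudoku(grid):
    """Return True if every row, column, and 3x3 box holds digits 1-9 once."""
    expected = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3) for bc in range(0, 9, 3)
    ]
    # A set with a duplicate or a missing digit has fewer than 9 elements,
    # so comparing against the expected set catches both failure modes.
    return all(unit == expected for unit in rows + cols + boxes)
```

Twenty-seven set comparisons, done. That gap between solve cost and check cost is exactly what makes a task trainable.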

Code is highly verifiable

Here’s the good news: software development is one of the most verifiable domains that exists. You can run tests. You can check types. You can lint for style violations. You can deploy to a preview environment and confirm the feature works. You can measure performance before and after a change. You can scan for security vulnerabilities automatically. Software engineering has spent decades building verification infrastructure across eight distinct areas: testing, documentation, code quality, build systems, dev environments, observability, security, and standards. This accumulated infrastructure makes code one of the most favorable domains for AI agents. That’s why coding agents are the most advanced AI agents in the world right now. Not because code is simple (it isn’t). Because code is verifiable, when your infrastructure supports it.

Why humans cope but agents crash

But most codebases don’t fully provide these pillars. Human engineers are resilient. They work around incomplete infrastructure every day. Your team has 60% test coverage? A senior developer fills the gaps with intuition and careful review. Documentation is outdated? Someone asks a colleague or digs through Slack history. The build is flaky? Hit retry and hope it passes. No staging environment? Test in production and fix forward. These workarounds are inefficient, but humans adapt. They carry institutional knowledge, pattern recognition, and contextual judgment. They know which code paths are risky. They know which tests to trust and which to ignore. AI agents can’t do any of this. They have no institutional knowledge. No intuition. No context beyond what you explicitly provide.

What breaks agents

When verification infrastructure is missing, every gap becomes a wall:
  • No tests → the agent can’t validate whether its changes work
  • No specs or documentation → the agent makes wrong assumptions about how the system behaves
  • Flaky builds → the agent can’t distinguish its own bugs from infrastructure problems
  • No observability → the agent can’t infer the result of an actual deployment
  • No preview environments → the agent ships blind
Most organizations have partial infrastructure across these areas. Humans cope with 50-60% coverage. Agents need systematic coverage to succeed.

The result: “AI slop”

Without verification, you get code that looks plausible but subtly degrades your codebase over time. Eno Reyes, CEO at Factory (an AI coding agent company), calls this the inevitable outcome when agents operate without sufficient validation criteria. The code compiles. It might even pass the few tests that exist. But it introduces inconsistencies, ignores conventions, misses edge cases, and accumulates technical debt faster than your team can review it. That’s not a model problem. That’s an infrastructure problem. And it has an infrastructure solution.

Spec-driven development: lock intent before implementation

The first part of the solution is about input quality. Before your agent writes a single line of code, it needs to know what “correct” looks like. This is the principle behind spec-driven development: lock intent before implementation.

Traditional vs. spec-driven workflow

The traditional approach with AI agents looks like this:

Prompt agent → Generate code → Hope it works

This is unreliable, hard to debug, and doesn’t scale. The spec-driven approach inverts the process:

Write specs + tests → Generate code → Validate → Iterate

This is reliable, debuggable, and scales to complex tasks. Research from Ion Stoica, Matei Zaharia, and others at Berkeley confirms this pattern: when you combine specifications with validation loops, AI-generated code quality improves dramatically.

The three-step process

Spec-driven development follows three steps:
  1. Define specs. Write tests, define types, and describe expected behavior before any code generation happens. This includes unit or e2e tests for the new feature, integration contracts with existing systems, and type definitions for new interfaces.
  2. Generate solutions. The agent produces multiple candidate implementations. With clear specs, it can explore the solution space efficiently. If the first attempt fails validation, it tries a different approach, informed by the specific failure rather than guessing randomly.
  3. Validate and select. Run the test suite. Check type safety. Verify linter compliance. Pick the implementation that passes all checks. If none pass, the agent gets specific feedback about what failed and iterates.
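The loop can be shown end-to-end with a toy example. Here the spec is a table of input/output pairs, and the two candidate functions stand in for agent-generated implementations (all names are illustrative):

```python
# Step 1: define specs as executable checks (inputs, expected output).
SPEC = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]

# Step 2: candidate implementations, as an agent might propose them.
def candidate_a(x, y): return x * y        # plausible-looking but wrong
def candidate_b(x, y): return x + y        # correct

def validate(impl):
    """Step 3a: run every spec; return specific failures, not a bare no."""
    return [
        f"{impl.__name__}{args} = {impl(*args)}, expected {want}"
        for args, want in SPEC if impl(*args) != want
    ]

def select(candidates):
    """Step 3b: mechanically pick the first implementation passing all checks."""
    for impl in candidates:
        if not validate(impl):
            return impl
    return None
```

The point is the shape of step 3: failures come back as specific, actionable messages, so the agent’s retry is informed rather than random.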

Why this works

Specs lock intent, so there’s no ambiguity about what the code should do. Changes become reviewable, because scope is transparent and diff-able. Success criteria are clear: tests pass or they don’t. Debugging becomes tractable, because you know what should happen and can pinpoint where it diverges. With strong verification, you can search for solutions instead of crafting them by hand. Engineering problems become search problems. And search is what AI does best. Tools like OpenSpec are emerging to support this workflow. OpenSpec generates structured proposals, task breakdowns, and specifications from a single prompt. You review and approve the spec before implementation begins. The agent then works from an approved plan rather than improvising from a vague description.

The 8 pillars of verification

Spec-driven development handles the input side. But you also need infrastructure to verify the output. This is where the 8 pillars come in. Getting your codebase agent-ready requires systematic coverage across eight areas. Gaps in any pillar limit agent autonomy. Not every organization needs perfection in all eight, but each gap represents a ceiling on what agents can accomplish without human intervention.

1. Testing: the foundation

Tests are the most direct form of verification. A failing test is an unambiguous signal. A passing test suite is evidence (not proof) of correctness.
What agents need: High coverage so that changes are validated against real expectations. Fast execution so agents can iterate in tight loops. Deterministic results, because flaky tests are worse than no tests. They create noise the agent can’t interpret.
Without it: The agent generates plausible-looking code that breaks in production. You’re back to manual review for every change. As Eno Reyes puts it, “a slop test is better than no test.” An imperfect AI-generated test that passes when changes are correct and fails when they aren’t is still a useful signal. Other agents will notice these tests, follow the patterns, and the coverage compounds over time.
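A sketch of what even a “slop test” buys you. `slugify` is a made-up example function, and the test is imperfect (it pins two cases, not the full contract) but deterministic, so its failure is a signal an agent can act on:

```python
import unittest

# A "slop test" sketch: slugify is a stand-in function, and the test
# covers only two behaviors. Imperfect, but deterministic: a change
# that breaks the pinned contract fails loudly and unambiguously.

def slugify(title):
    return "-".join(title.lower().split())

class TestSlugify(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_deterministic(self):
        # Same input, same output, every run: noise-free signal for agents.
        self.assertEqual(slugify("A  B  C"), slugify("A  B  C"))
```

Run it with `python -m unittest`. Other agents can copy the pattern, and coverage compounds from there.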

2. Documentation and specs: the context layer

Documentation tells the agent why code is structured a certain way, not just what it does. API specs, architecture decision records, and integration guides provide the context that agents lack by default.
What agents need: Up-to-date documentation that describes system behavior, integration points, and known limitations. Edge cases and gotchas that aren’t captured in tests. Historical context about why certain decisions were made.
Without it: The agent makes wrong assumptions. It breaks contracts between services. The codebase becomes harder to maintain because the agent can’t understand existing patterns. This is the one pillar that’s entirely a team responsibility. No platform can generate your documentation for you. But agents themselves can help: generating API docs from code, updating READMEs, and maintaining architecture diagrams are all verifiable tasks where agents already perform well.

3. Code quality: the standards enforcer

Linters, formatters, type checkers, static analysis. These tools encode your team’s conventions into automated checks.
What agents need: Strict enforcement where linter failure means the build fails, not a warning to ignore. Automated gates that are requirements, not suggestions. Clear error messages that explain what violated which rule.
Without it: The agent generates inconsistent code. Each PR introduces slightly different patterns. The codebase becomes unpredictable and technical debt accumulates. The key distinction: if a human can ignore a linter warning, so can an agent. Gates must be gates.
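The “gates must be gates” rule reduces to one mechanical property: any failed check means a nonzero exit code. A minimal sketch, with the check names standing in for real tools:

```python
# Gates, not suggestions: any failing check fails the whole run.
# The check names below are placeholders for real tools (linter,
# type checker, test runner); in CI you would sys.exit(exit_code).

def run_gates(checks):
    """Return 0 only if every gate passes; name each failure otherwise."""
    failed = [name for name, ok in checks.items() if not ok]
    for name in failed:
        print(f"GATE FAILED: {name}")   # clear message: what broke, by name
    return 1 if failed else 0

exit_code = run_gates({"lint": True, "typecheck": True, "tests": False})
# exit_code is 1: one hard signal, nothing a human (or an agent) can ignore
```

There is no warning level here by design: a check either blocks the build or it isn’t a gate.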

4. Build systems: the reproducibility layer

Builds must be deterministic. Same commit, same result. Every time.
What agents need: Reproducible builds where the same commit produces the same environment. Fast failure where errors surface immediately with clear messages. No mystery failures, because “works on my machine” is invisible to agents.
Without it: The agent can’t distinguish its bugs from infrastructure problems. It wastes cycles retrying builds that fail for reasons unrelated to its code. Phantom errors consume token budget and developer patience.

5. Dev environments: the experimentation space

Preview environments give agents a safe, production-like space to validate changes before they touch production.
What agents need: Production parity where the test environment matches production, including data. Fast provisioning measured in minutes, not hours. Isolation so each experiment runs independently without interference.
Without it: The agent has no way to confirm changes before production deployment. It ships blind or requires a human to manually verify every change, defeating the purpose of automation. This is where the economics of verification shift dramatically. Instead of sharing a single staging environment (with all the coordination overhead that implies), each agent can get its own isolated environment. Test freely, break things, iterate, all without blocking anyone else.

6. Observability: the feedback signal

When something goes wrong after deployment, observability tools provide the signal to understand what happened.
What agents need: Structured logs in JSON format, not unstructured text dumps. Clear metrics with baselines, error rates, and resource utilization. Actionable traces and performance profiles that connect errors back to specific code paths.
Without it: The agent can’t debug failures. It can’t determine whether a change introduced a regression. Post-deployment verification becomes impossible without a human reading logs manually.
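Structured logs are the difference between a signal and a text dump. A minimal sketch using Python’s standard `logging` module with a hand-rolled JSON formatter (the `checkout` logger name is illustrative):

```python
import json
import logging

# Each log record becomes one JSON object: machine-parseable, so an
# agent can filter by level or field instead of regex-scraping prose.

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")       # illustrative logger name
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("payment declined")
# emits a line like: {"level": "ERROR", "logger": "checkout", "message": "payment declined"}
```

A production setup would add timestamps, trace IDs, and request context, but the principle is the same: one event, one parseable object.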

7. Security: the safety gate

AI-generated code must meet the same security standards as human-written code. Without automated scanning, agents can introduce vulnerabilities faster than your security team can review them.
What agents need: Automated scanning that runs on every change, every time. Dependency audits for every package on a recurring basis. Clear reports that specify what’s wrong and how to fix it. Secrets management that prevents hardcoded credentials.
Without it: The agent introduces vulnerabilities. Issues accumulate undetected. Breach risk increases with every unscanned PR.

8. Standards: the consistency framework

Coding conventions, architectural patterns, naming rules. Consistency makes codebases predictable, which helps agents understand and extend existing code correctly.
What agents need: Documented conventions that are written down, not tribal knowledge. Tooling enforcement through automation, not code review comments. Pattern libraries with examples of how things should be done.
Without it: The agent generates code that works but doesn’t fit. Every PR introduces a slightly different style. The codebase becomes unpredictable and harder for both humans and agents to maintain.

Where does Upsun fit?

Upsun doesn’t cover all eight pillars. Documentation and some aspects of standards are application-level concerns your team owns. But Upsun provides infrastructure for the pillars that matter most to deployment and verification. When an agent pushes to a branch, Upsun spins up a complete clone of production, including your database with optional sanitization. Tests run against real data and real service configurations. If the linter fails, the build fails. If type checks don’t pass, deployment stops. The agent gets immediate, unambiguous feedback. This turns your platform into a verification loop. The agent writes code, pushes to a branch, and the platform answers the question: does this work? If not, here’s exactly why.

The self-assessment

Take a moment and think about which pillar is your biggest gap right now. That’s where your agents will struggle most. For a more thorough audit, you can score your organization across all eight pillars using the 80-criteria verification infrastructure checklist. It covers ten specific criteria per pillar and gives you a concrete picture of where to invest.

The flywheel

Here’s where things get interesting. Once you start investing in verification infrastructure, a compounding loop kicks in. Better infrastructure enables agents to work more autonomously. More autonomy means agents can take on tasks like generating tests, writing documentation, adding type annotations, and improving linter rules. Those improvements make the infrastructure better. Which enables even more autonomy. Tests. Specs. Types. Coverage. The loop accelerates.

An imperfect AI-generated test is still a test. Other agents will notice it, follow its patterns, and extend it. A junior developer who couldn’t use agents effectively yesterday can use them today because the verification gates catch mistakes automatically. One opinionated engineer who invests in writing strict linter rules and comprehensive specs scales their impact across the entire team, and across every agent the team uses.

This is the new DevX loop. As Eno Reyes puts it, the investment in verification infrastructure is where the real 5-7x velocity gains come from. Not 1.5x. Not 2x. The organizations that build this feedback loop now will reach a development velocity that others can’t match.

What becomes possible

With strong verification in place, you move beyond single-task code generation to autonomous SDLC processes:
  • Large task decomposition. Migration and modernization projects broken into verifiable subtasks. Each piece gets validated independently before integration. Verification through per-subtask tests, integration tests, and specs.
  • Task parallelization. Multiple agents working in parallel on independent tasks. Isolated preview environments prevent conflicts. Isolated test suites eliminate cross-dependencies. Large migrations complete an order of magnitude faster.
  • Automated code review. Context-aware reviews for every PR that catch bugs, style issues, and architectural problems. Verification through linters, tests, and security scans.
  • QA and test generation. Agents that generate test scenarios, validate edge cases, and improve coverage as the code evolves. Tests must pass and meet coverage metrics to ship.
  • Incident response. An error appears in production. The agent pulls observability data, proposes a fix, deploys to a preview environment, verifies the fix works, and opens a PR, before a human wakes up. Verification through error logs, monitoring metrics, and tests.
  • Documentation automation. Keep docs in sync with code. Generate API documentation, update READMEs, and maintain architecture diagrams. Verification through doc builds, working examples, and valid links.
Each of these processes relies on your verification infrastructure. The better your specs, tests, and validation, the more reliably they run.

The path forward

If you want AI agents that ship code, not generate text, the investment is in verification infrastructure. Here’s the sequence:
  1. Assess. Measure your verification infrastructure across all eight pillars. Score each one honestly. Identify the gaps.
  2. Improve. Systematically address those gaps through automated fixes. Use agents themselves to write tests, add type annotations, tighten linter rules, and generate documentation. Start where the leverage is highest: testing and build systems.
  3. Deploy. Roll out AI agents that can use your specifications and verification gates. Match the level of supervision to the strength of each pillar. Strong verification means less oversight required.
  4. Iterate. The flywheel spins. Continuous feedback improves both your specs and your agents. Each improvement compounds.
The limit isn’t the AI. It’s your infrastructure. Build the verification layer, and the autonomy follows.
Ready to build the verification infrastructure your AI agents need? Upsun provides preview environments, reproducible builds, and integrated observability out of the box. Create a free account and deploy your first project in minutes.

Last modified on April 14, 2026