The promise
Point an agent at a ticket. Come back to a working merged PR.
But here’s what actually happens
You point an agent at a ticket. It reads the codebase, picks a direction, and starts writing code. Twenty minutes later, you’ve got a PR with 400 lines of changes. It looks reasonable. The syntax is correct. The logic seems plausible.

Then you start looking closer. The code doesn’t match your team’s conventions. It makes assumptions about how a service behaves that haven’t been true since the last refactor. It passes the tests that exist but misses an edge case your test suite doesn’t cover. The build succeeds locally but fails in CI because of a dependency your dev environment handles differently. Issues and unknowns everywhere. The agent generated code. It didn’t ship a feature or a fix.

Most organizations are stuck comparing benchmarks at this point. Which model scores 2% higher on SWE-bench? Which agent completes more tasks? These comparisons miss the point entirely.

The real bottleneck
The bottleneck isn’t the agent’s or the model’s ability to write code. It’s your ability to give it quality input and verify the output.

Without clear intent and verification, AI-generated code is text. Text that might work. Text that might break production. You won’t know until a human reviews every line, runs manual tests, and crosses their fingers during deployment. That’s not autonomy. That’s a more sophisticated autocomplete.

The question isn’t “how smart is the model?” The question is: can your infrastructure tell the model whether it got the answer right?

Software 2.0: a new programming paradigm
Andrej Karpathy recently articulated a distinction that reframes how to think about AI’s impact on software: “Software 1.0 easily automates what you can specify. Software 2.0 easily automates what you can verify.”

In Software 1.0, you write explicit algorithms by hand. If you can specify the rules, you can automate the task. The key question is: is the algorithm fixed, and can you specify it?

Software 2.0 works differently. You specify objectives and search through the space of possible solutions. If you can verify whether a solution is correct, you can optimize for it. The key question becomes: is the task verifiable?

Karpathy defines the requirements for this to work. The environment has to be resettable (you can start a new attempt), efficient (you can make many attempts), and rewardable (there’s an automated process to evaluate each attempt).

This is why AI progress follows a “jagged frontier.” Tasks that are highly verifiable (math, coding, formal logic) advance the fastest. Tasks that are hard to verify (creative writing, strategic planning, design taste) progress more slowly. It’s not about difficulty. It’s about whether you can automate the feedback loop.
The asymmetry of verification
Jason Wei, researcher at Meta (formerly OpenAI), formalized a related concept: many tasks are far easier to verify than to solve. Think about Sudoku. Solving a puzzle takes time and concentration. Verifying a completed grid takes seconds: check each row, column, and box.

This asymmetry exists on a spectrum. On one end, tasks like Sudoku, math proofs, and code tests are easy to verify, because there’s a clear right answer. In the middle, tasks like arithmetic and data processing are symmetric, with roughly equal effort to solve and check. On the far end, tasks like writing essays, forming hypotheses, and creative work are hard to verify, because quality is subjective and slow to evaluate.

Wei identifies five properties that make a task verifiable:

- Objective truth: there’s a clear definition of correctness
- Fast to verify: checking takes seconds, not hours
- Scalable: you can verify many solutions in parallel
- Low noise: verification results correlate tightly with actual quality
- Continuous reward: you can rank solutions on a gradient, not pass/fail
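These properties suggest what a verifier should return: a graded score, not just a pass/fail verdict. A minimal sketch in Python (the checks and the sorted-list task are illustrative, not from the source):

```python
from typing import Callable, Iterable, List

def reward(solution: List[int], checks: Iterable[Callable[[List[int]], bool]]) -> float:
    """Continuous reward: the fraction of checks passed, so candidates can be ranked."""
    results = [check(solution) for check in checks]
    return sum(results) / len(results)

# Two objective, fast, low-noise checks for a toy "sorted, de-duplicated list" task.
checks = [
    lambda s: s == sorted(s),          # output is ordered
    lambda s: len(s) == len(set(s)),   # no duplicates
]

reward([1, 2, 3], checks)  # 1.0: passes both checks
reward([3, 1, 1], checks)  # 0.0: fails both
```

Because the reward is a gradient rather than a verdict, an agent can tell that a candidate scoring 0.5 is closer than one scoring 0.0 and iterate in that direction.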
Code is highly verifiable
Here’s the good news: software development is one of the most verifiable domains in existence. You can run tests. You can check types. You can lint for style violations. You can deploy to a preview environment and confirm the feature works. You can measure performance before and after a change. You can scan for security vulnerabilities automatically.

Software engineering has spent decades building verification infrastructure across eight distinct areas: testing, documentation, code quality, build systems, dev environments, observability, security, and standards. This accumulated infrastructure makes code one of the most favorable domains for AI agents.

That’s why coding agents are the most advanced AI agents in the world right now. Not because code is simple (it isn’t). Because code is verifiable, when your infrastructure supports it.

Why humans cope but agents crash
But most codebases don’t fully provide this infrastructure. Human engineers are resilient. They work around incomplete infrastructure every day. Your team has 60% test coverage? A senior developer fills the gaps with intuition and careful review. Documentation is outdated? Someone asks a colleague or digs through Slack history. The build is flaky? Hit retry and hope it passes. No staging environment? Test in production and fix forward.

These workarounds are inefficient, but humans adapt. They carry institutional knowledge, pattern recognition, and contextual judgment. They know which code paths are risky. They know which tests to trust and which to ignore.

AI agents can’t do any of this. They have no institutional knowledge. No intuition. No context beyond what you explicitly provide.

What breaks agents
When verification infrastructure is missing, every gap becomes a wall:

- No tests → the agent can’t validate whether its changes work
- No specs or documentation → the agent makes wrong assumptions about how the system behaves
- Flaky builds → the agent can’t distinguish its own bugs from infrastructure problems
- No observability → the agent can’t infer the result of an actual deployment
- No preview environments → the agent ships blind
The result: “AI slop”
Without verification, you get code that looks plausible but subtly degrades your codebase over time. Eno Reyes, CEO at Factory (an AI coding agent company), calls this the inevitable outcome when agents operate without sufficient validation criteria.

The code compiles. It might even pass the few tests that exist. But it introduces inconsistencies, ignores conventions, misses edge cases, and accumulates technical debt faster than your team can review it.

That’s not a model problem. That’s an infrastructure problem. And it has an infrastructure solution.

Spec-driven development: lock intent before implementation
The first part of the solution is about input quality. Before your agent writes a single line of code, it needs to know what “correct” looks like. This is the principle behind spec-driven development: lock intent before implementation.

Traditional vs. spec-driven workflow
The traditional approach with AI agents looks like this:

Prompt agent → Generate code → Hope it works

This is unreliable, hard to debug, and doesn’t scale. The spec-driven approach inverts the process:

Write specs + tests → Generate code → Validate → Iterate

This is reliable, debuggable, and scales to complex tasks. Research from Ion Stoica, Matei Zaharia, and others at Berkeley confirms this pattern: when you combine specifications with validation loops, AI-generated code quality improves dramatically.

The three-step process
Spec-driven development follows three steps:

1. Define specs. Write tests, define types, and describe expected behavior before any code generation happens. This includes unit or e2e tests for the new feature, integration contracts with existing systems, and type definitions for new interfaces.
2. Generate solutions. The agent produces multiple candidate implementations. With clear specs, it can explore the solution space efficiently. If the first attempt fails validation, it tries a different approach, informed by the specific failure rather than guessing randomly.
3. Validate and select. Run the test suite. Check type safety. Verify linter compliance. Pick the implementation that passes all checks. If none pass, the agent gets specific feedback about what failed and iterates.

Why this works
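The three-step process above can be sketched as a search loop. A hedged illustration in Python: the spec is a list of executable checks, and the candidate slugify implementations stand in for agent-generated attempts (all names here are hypothetical):

```python
from typing import Callable, Iterable, List, Optional

Impl = Callable[[str], str]
Check = Callable[[Impl], bool]

def validate(impl: Impl, spec: Iterable[Check]) -> List[str]:
    """Run every check in the spec; return the names of the ones that failed."""
    return [check.__name__ for check in spec if not check(impl)]

def select(candidates: Iterable[Impl], spec: List[Check]) -> Optional[Impl]:
    """Step 3: pick the first candidate that passes the whole spec."""
    for impl in candidates:
        if not validate(impl, spec):
            return impl
    return None  # nothing passed; feed the failure names back and iterate

# Step 1: the spec locks intent before any implementation exists.
def lowercases(f: Impl) -> bool: return f("Hello") == "hello"
def replaces_spaces(f: Impl) -> bool: return f("a b") == "a-b"
spec = [lowercases, replaces_spaces]

# Step 2: candidate implementations (hand-written here, agent-generated in practice).
candidates = [
    lambda s: s.lower(),                    # fails replaces_spaces
    lambda s: s.lower().replace(" ", "-"),  # passes the full spec
]

best = select(candidates, spec)
```

The failure list is the key design choice: a rejected candidate comes back with the name of the check it broke, so the next attempt is informed rather than random.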
Specs lock intent, so there’s no ambiguity about what the code should do. Changes become reviewable, because scope is transparent and diff-able. Success criteria are clear: tests pass or they don’t. Debugging becomes tractable, because you know what should happen and can pinpoint where it diverges.

With strong verification, you can search for solutions instead of crafting them by hand. Engineering problems become search problems. And search is what AI does best.

Tools like OpenSpec are emerging to support this workflow. OpenSpec generates structured proposals, task breakdowns, and specifications from a single prompt. You review and approve the spec before implementation begins. The agent then works from an approved plan rather than improvising from a vague description.

The 8 pillars of verification
Spec-driven development handles the input side. But you also need infrastructure to verify the output. This is where the 8 pillars come in.

Getting your codebase agent-ready requires systematic coverage across eight areas. Gaps in any pillar limit agent autonomy. Not every organization needs perfection in all eight, but each gap represents a ceiling on what agents can accomplish without human intervention.

1. Testing: the foundation
Tests are the most direct form of verification. A failing test is an unambiguous signal. A passing test suite is evidence (not proof) of correctness.

What agents need: High coverage so that changes are validated against real expectations. Fast execution so agents can iterate in tight loops. Deterministic results, because flaky tests are worse than no tests. They create noise the agent can’t interpret.
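Determinism is mostly a matter of controlling inputs. A small illustrative example (the discount function is hypothetical): injecting a seeded RNG makes the test repeatable for an agent, where an unseeded one would be flaky:

```python
import random
import unittest

def sample_discount(rng: random.Random) -> float:
    """Pick a promotional discount; the RNG is injected so callers control determinism."""
    return round(rng.uniform(0.0, 0.3), 2)

class TestSampleDiscount(unittest.TestCase):
    def test_repeatable(self):
        # Same seed, same result, on every run and in every environment.
        a = sample_discount(random.Random(42))
        b = sample_discount(random.Random(42))
        self.assertEqual(a, b)
        self.assertTrue(0.0 <= a <= 0.3)

if __name__ == "__main__":
    unittest.main()
```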
2. Documentation and specs: the context layer
Documentation tells the agent why code is structured a certain way, not just what it does. API specs, architecture decision records, and integration guides provide the context that agents lack by default.

What agents need: Up-to-date documentation that describes system behavior, integration points, and known limitations. Edge cases and gotchas that aren’t captured in tests. Historical context about why certain decisions were made.
3. Code quality: the standards enforcer
Linters, formatters, type checkers, static analysis. These tools encode your team’s conventions into automated checks.

What agents need: Strict enforcement where linter failure means the build fails, not a warning to ignore. Automated gates that are requirements, not suggestions. Clear error messages that explain what violated which rule.
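As an illustration, a CI job can make these checks hard gates. A hedged sketch using GitHub Actions, with ruff and mypy as example tools and `src/` as a hypothetical source path (your stack and layout may differ):

```yaml
# Any nonzero exit code fails the build: lint and type errors are gates, not warnings.
jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ruff mypy
      - run: ruff check .        # style and lint violations fail the job
      - run: mypy --strict src/  # type errors stop the pipeline here
```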
4. Build systems: the reproducibility layer
Builds must be deterministic. Same commit, same result. Every time.

What agents need: Reproducible builds where the same commit produces the same environment. Fast failure where errors surface immediately with clear messages. No mystery failures, because “works on my machine” is invisible to agents.
5. Dev environments: the experimentation space
Preview environments give agents a safe, production-like space to validate changes before they touch production.

What agents need: Production parity where the test environment matches production, including data. Fast provisioning measured in minutes, not hours. Isolation so each experiment runs independently without interference.
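A preview environment is only useful if the agent can ask it a yes/no question. A hedged sketch of a smoke check against a per-branch URL; the `PREVIEW_URL` variable and the `/health` endpoint contract are assumptions for illustration, not platform specifics:

```python
import json
import os
from urllib.request import urlopen

def healthy(payload: dict) -> bool:
    """Minimal health contract: the service says it is up and reports a version."""
    return payload.get("status") == "ok" and "version" in payload

def check_preview() -> bool:
    # PREVIEW_URL would be injected per branch by the platform (hypothetical name).
    url = os.environ["PREVIEW_URL"] + "/health"
    with urlopen(url) as resp:
        return json.load(resp) and healthy(json.load(resp))
```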
6. Observability: the feedback signal
When something goes wrong after deployment, observability tools provide the signal to understand what happened.

What agents need: Structured logs in JSON format, not unstructured text dumps. Clear metrics with baselines, error rates, and resource utilization. Actionable traces and performance profiles that connect errors back to specific code paths.
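Structured logging is straightforward to retrofit. A minimal sketch using Python’s standard library (the field names and the `checkout` logger are illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """One JSON object per line: parseable by pipelines and agents, unlike free text."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment captured")  # {"level": "INFO", "logger": "checkout", "message": "payment captured"}
```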
7. Security: the safety gate
AI-generated code must meet the same security standards as human-written code. Without automated scanning, agents can introduce vulnerabilities faster than your security team can review them.

What agents need: Automated scanning that runs on every change, every time. Dependency audits for every package on a recurring basis. Clear reports that specify what’s wrong and how to fix it. Secrets management that prevents hardcoded credentials.
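As a toy illustration of the “clear reports” point, a scanner can return what matched rather than a bare pass/fail. The patterns below are deliberately simplistic; real scanners such as gitleaks or trufflehog cover far more cases:

```python
import re
from typing import List

# Illustrative patterns only: an AWS-style access key id and a generic inline secret.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"(?i)(api[_-]?key|secret)\s*=\s*['\"][^'\"]{8,}"),
]

def find_secrets(source: str) -> List[str]:
    """Return every suspicious match so the gate can fail with specifics, not a verdict."""
    return [m.group(0) for pattern in SECRET_PATTERNS for m in pattern.finditer(source)]

find_secrets('api_key = "abcdef123456"')  # flags the hardcoded credential
find_secrets("timeout = 30")              # [] -> nothing suspicious
```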
8. Standards: the consistency framework
Coding conventions, architectural patterns, naming rules. Consistency makes codebases predictable, which helps agents understand and extend existing code correctly.

What agents need: Documented conventions that are written down, not tribal knowledge. Tooling enforcement through automation, not code review comments. Pattern libraries with examples of how things should be done.
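Conventions become enforceable once they are executable. A toy example: a check that flags function names breaking a snake_case rule, the kind of rule a custom lint plugin would encode (the rule and names here are illustrative):

```python
import ast
from typing import List

def find_bad_names(source: str) -> List[str]:
    """Flag function definitions whose names contain uppercase letters (not snake_case)."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name != node.name.lower()
    ]

code = "def getUser():\n    pass\n\ndef get_user():\n    pass\n"
find_bad_names(code)  # ["getUser"]
```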
Where does Upsun fit?
Upsun doesn’t cover all eight pillars. Documentation and some aspects of standards are application-level concerns your team owns. But Upsun provides infrastructure for the pillars that matter most to deployment and verification.

When an agent pushes to a branch, Upsun spins up a complete clone of production, including your database with optional sanitization. Tests run against real data and real service configurations. If the linter fails, the build fails. If type checks don’t pass, deployment stops. The agent gets immediate, unambiguous feedback.

This turns your platform into a verification loop. The agent writes code, pushes to a branch, and the platform answers the question: does this work? If not, here’s exactly why.

The self-assessment
Take a moment and think about which pillar is your biggest gap right now. That’s where your agents will struggle most.

For a more thorough audit, you can score your organization across all eight pillars using the 80-criteria verification infrastructure checklist. It covers ten specific criteria per pillar and gives you a concrete picture of where to invest.

The flywheel
Here’s where things get interesting. Once you start investing in verification infrastructure, a compounding loop kicks in.

Better infrastructure enables agents to work more autonomously. More autonomy means agents can take on tasks like generating tests, writing documentation, adding type annotations, and improving linter rules. Those improvements make the infrastructure better. Which enables even more autonomy. Tests. Specs. Types. Coverage. The loop accelerates.

An imperfect AI-generated test is still a test. Other agents will notice it, follow its patterns, and extend it. A junior developer who couldn’t use agents effectively yesterday can use them today because the verification gates catch mistakes automatically. One opinionated engineer who invests in writing strict linter rules and comprehensive specs scales their impact across the entire team, and across every agent the team uses.

This is the new DevX loop. As Eno Reyes puts it, the investment in verification infrastructure is where the real 5-7x velocity gains come from. Not 1.5x. Not 2x. The organizations that build this feedback loop now will reach a development velocity that others can’t match.

What becomes possible
With strong verification in place, you move beyond single-task code generation to autonomous SDLC processes:

Large task decomposition. Migration and modernization projects broken into verifiable subtasks. Each piece gets validated independently before integration. Verification through per-subtask tests, integration tests, and specs.

Task parallelization. Multiple agents working in parallel on independent tasks. Isolated preview environments prevent conflicts. Isolated test suites eliminate cross-dependencies. Large migrations complete an order of magnitude faster.

Automated code review. Context-aware reviews for every PR that catch bugs, style issues, and architectural problems. Verification through linters, tests, and security scans.

QA and test generation. Agents that generate test scenarios, validate edge cases, and improve coverage as the code evolves. Tests must pass and meet coverage metrics to ship.

Incident response. An error appears in production. The agent pulls observability data, proposes a fix, deploys to a preview environment, verifies the fix works, and opens a PR, before a human wakes up. Verification through error logs, monitoring metrics, and tests.

Documentation automation. Keep docs in sync with code. Generate API documentation, update READMEs, and maintain architecture diagrams. Verification through doc builds, working examples, and valid links.

Each of these processes relies on your verification infrastructure. The better your specs, tests, and validation, the more reliably they run.

The path forward
If you want AI agents that ship code, not generate text, the investment is in verification infrastructure. Here’s the sequence:

1. Assess. Measure your verification infrastructure across all eight pillars. Score each one honestly. Identify the gaps.
2. Improve. Systematically address those gaps through automated fixes. Use agents themselves to write tests, add type annotations, tighten linter rules, and generate documentation. Start where the leverage is highest: testing and build systems.
3. Deploy. Roll out AI agents that can use your specifications and verification gates. Match the level of supervision to the strength of each pillar. Strong verification means less oversight required.
4. Iterate. The flywheel spins. Continuous feedback improves both your specs and your agents. Each improvement compounds.

The limit isn’t the AI. It’s your infrastructure. Build the verification layer, and the autonomy follows.
References
- Karpathy, A. (2025). “Verifiability.” karpathy.bearblog.dev/verifiability
- Wei, J. (2025). “Asymmetry of verification and verifier’s law.” jasonwei.net/blog
- Reyes, E. (2025). “Verification Infrastructure for AI Agents.” Factory AI talk.
- OpenSpec. github.com/openspec/openspec
- Moigneu, G. (2026). “The 8 Pillars of Verification Infrastructure Checklist.” gist.github.com