The question flipped this week. Not “which AI model should we use?” but “how do we coordinate twenty agents running inside our stack?” Three open-source projects landed with three very different answers. OpenAI shipped GPT-5.4 with native computer use and a 1M-token context window. Anthropic sued the Pentagon after being labeled a supply chain risk. Karpathy dropped autoresearch for overnight autonomous ML experiments. And a Pragmatic Engineer survey of 900+ engineers put Claude Code at the top of the AI coding tool rankings.

Highlight of the week

The agent coordination era has arrived

AI agents can write code. That part is settled: Claude Code hit $1B ARR in six months, and GitHub Copilot crossed 15M paid seats. The problem now is what happens when you have five, ten, twenty of them running at once. Three open-source projects that landed this week show just how little consensus there is on the answer.

Symphony, from OpenAI, is a long-running daemon written in Elixir on OTP. It polls a Linear issue tracker, claims eligible tasks, spawns isolated Codex agents in hermetic workspaces, and delivers verified pull requests with CI status, walkthrough videos, and complexity analysis. What’s interesting is what it’s not: Symphony is not a multi-agent framework. Each issue gets exactly one agent. The “orchestration” is about managing many concurrent single-agent runs, not coordinating agents on a shared task. OpenAI calls this discipline harness engineering, which basically means: build the infrastructure, constraints, and feedback loops around the agent so it can’t go off the rails.
Not to be confused with my favorite development framework: Symfony.
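Harness engineering is easier to see in code than in prose. Here is a minimal sketch of a Symphony-style dispatch loop, with every name hypothetical (the real daemon is Elixir on OTP talking to Linear): poll the tracker, claim a task so nothing else grabs it, and hand each claimed issue to exactly one agent.

```python
def poll_issues(tracker):
    """Return issues eligible for automation (hypothetical predicate)."""
    return [i for i in tracker if i["status"] == "ready" and i["assignee"] is None]

def claim(issue):
    """Mark the issue as claimed so no second agent picks it up."""
    if issue["assignee"] is None:
        issue["assignee"] = "symphony"
        return True
    return False

def run_harness(tracker, spawn_agent):
    """One agent per issue: the harness manages many concurrent
    single-agent runs; it never puts two agents on a shared task."""
    results = []
    for issue in poll_issues(tracker):
        if claim(issue):
            results.append(spawn_agent(issue))  # isolated workspace per issue
    return results
```

The point of the shape is the constraint: the agent never chooses its own work, and the harness owns claiming, isolation, and delivery.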
Paperclip AI goes the other direction entirely. It models the whole company as the coordination abstraction: org charts, reporting lines, budgets, cascading goals from company mission down to individual tasks. The tagline: “If OpenClaw is an employee, Paperclip is the company.” Agents execute wherever they run (Claude Code, Codex, Bash) and phone home via a heartbeat system. 4.3K+ stars and 23 contributors already. Atomic task checkout prevents double-work, monthly per-agent budgets enforce cost controls, and an immutable ticket system traces every conversation and tool call. The human is “the board,” not a line-level supervisor.

Agentlytics, from Fatih Kadir Akin (creator of prompts.chat), tackles a different gap: nobody can see across all their agents at once. It’s a zero-config analytics dashboard (npx agentlytics) that reads local session data from 16 AI coding editors and shows unified metrics on token spending, model usage, and editor efficiency. Everything runs locally on SQLite. It’s early (82 stars, solo maintainer), but the problem it addresses only gets worse as you add more agents.

I keep coming back to the framing from background-agents.com, published by Ona (formerly Gitpod). Their metaphor is “the self-driving codebase”: developers shift from driving every line of code to supervising. The site draws a sharp line between foreground agents (interactive Copilot, Claude Code sessions) and background agents, the cloud-native, event-driven kind that run autonomously on triggers like CVE disclosures or CI failures and produce pull requests you wake up to. Ona identifies three pillars for background agent infrastructure: isolated compute, event routing, and governance. Symphony covers isolation and governance but can only poll. Paperclip covers governance and event routing but delegates compute. Agentlytics covers none of the three but tells you what’s actually happening. The practical lesson: think in layers, not monoliths.
The protocol stack is converging. MCP (Anthropic, now under the Linux Foundation) handles agent-to-tool connections with 97M+ monthly SDK downloads. A2A (Google, 150+ supporting organizations) handles agent-to-agent discovery. These are complementary. Build on them rather than writing custom integrations.

Models and research

GPT-5.4 launched March 5. It’s the first mainline OpenAI model that combines frontier coding capabilities (from GPT-5.3 Codex) with a 1M-token context window and native computer use. On OSWorld-Verified, it hits a 75.0% success rate, beating the human baseline of 72.4%. The new “tool search” feature lets the model receive a lightweight list of available tools and look up definitions on demand, which cuts token cost when you have a lot of tools registered. Three variants: base, Thinking, and Pro. Input pricing starts at ~$2.50/M tokens, output at ~$10/M.

The sheer volume of model releases is hard to keep up with. A startup-focused tracker counts over 255 releases in Q1 2026 alone: Gemini 3.1 Pro, Claude Opus 4.6, Claude Sonnet 4.6, Qwen 3.5, GLM-5, and DeepSeek V4 (1T parameters, expected imminently). 1M+ token context windows are table stakes now.

Karpathy’s autoresearch (15.8K stars) is the open-source release I’m most excited about this week. It’s 630 lines of Python that let an agent autonomously run ML experiments overnight on a single GPU. The loop is simple: modify train.py, train for exactly 5 minutes, check if the result improved, keep or revert, repeat. You get ~100 experiments while you sleep. Shopify CEO Tobi Lutke tried it overnight on an internal model and got a 19% validation improvement from 37 experiments. An Apple Silicon port (374 stars) runs the same loop on MLX, no CUDA required.
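The autoresearch loop is worth internalizing because it is so small. Here is a toy sketch of the keep-or-revert cycle; the real tool edits train.py and runs a 5-minute training job, while in this sketch evaluate, propose, and revert are hypothetical stand-ins.

```python
def autoresearch_loop(evaluate, propose, revert, budget=100):
    """Greedy hill-climb over code edits: keep a change only if the
    validation metric improves, otherwise roll it back. Mirrors the
    modify -> short train -> compare -> keep/revert cycle."""
    best = evaluate()          # baseline score before any edits
    history = [best]
    for _ in range(budget):
        propose()              # agent edits the training script
        score = evaluate()     # short, fixed-budget training run
        if score > best:
            best = score       # keep the edit
        else:
            revert()           # restore the last good version
        history.append(best)
    return best, history
```

Everything interesting lives in propose(); the loop itself just guarantees the score is monotone, which is why ~100 unattended iterations are safe to run overnight.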

Coding agents and dev tools

A Pragmatic Engineer survey of 900+ engineers puts Claude Code at #1 for AI coding tools, with 71% of regular agent users relying on it. Adoption is highest at small companies (75%) and among senior engineers: Director-level respondents name it twice as often as junior ranks. 95% of respondents use AI tools weekly, and 70% use 2-4 tools simultaneously, so the “pick one tool” era seems genuinely over. Cursor keeps growing (~35% in nine months), and OpenAI’s Codex is picking up fast.

Simon Willison wrote something I found worth sitting with. In “Perhaps not boring technology after all”, he expected coding agents to push everyone toward mainstream stacks with abundant training data. Instead, modern agents adapt well to niche and proprietary tools by consulting existing codebase examples, iterating through tests, and reading docs within their large context windows. The “choose boring technology” fear hasn’t materialized.

Engram (911 stars) tackles persistent memory for coding agents: a single Go binary with zero dependencies, using SQLite + FTS5 across 8+ editors via MCP. 13 built-in tools including mem_save, mem_search, and session management. Overstory (867 stars) does multi-agent orchestration through git worktrees. It spawns worker agents in isolated worktrees via tmux, coordinates them through a custom SQLite messaging system, and uses a 4-tier conflict resolution system for merging results. It supports Claude Code, Gemini CLI, and GitHub Copilot runtimes, and its README honestly warns about the compounding risks of multi-agent swarms, a candor you don’t see enough of.
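Overstory’s worktree trick generalizes: git worktrees give each agent a full checkout on its own branch, so parallel edits can’t stomp on each other until an explicit merge step. A sketch of how such a spawn step might build the git invocation; the directory layout and branch naming here are my guesses, not Overstory’s.

```python
import subprocess
from pathlib import Path

def worktree_cmd(repo, agent, base="main"):
    """Build the `git worktree add` invocation that gives one agent
    an isolated checkout and branch (layout is illustrative)."""
    path = Path(repo) / ".worktrees" / agent
    branch = f"agent/{agent}"
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]

def spawn(repo, agents):
    """One isolated worktree per agent; each runs against its own
    branch, so concurrent file edits never collide in the same tree."""
    for agent in agents:
        subprocess.run(worktree_cmd(repo, agent), check=True)
```

Cleanup (`git worktree remove`) and the merge policy are where the real complexity lives, which is exactly what Overstory’s 4-tier conflict resolution is for.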

Industry and business

Anthropic sued the Department of Defense after being designated a “supply chain risk,” a label historically reserved for companies tied to foreign adversaries. The dispute started when Anthropic refused to let the Pentagon use Claude without safeguards against mass surveillance and fully autonomous weapons. The Trump administration canceled all federal contracts and blacklisted the company. OpenAI signed its own DoD deal, reportedly without the same limitations. ChatGPT uninstalls surged 295% afterward, which is worth noting regardless of where you land on this. Bruce Schneier argues the real problem is inadequate democratic governance, not corporate ethics: “We should not rest on our laurels, thinking that either is doing so in the public’s interest.”

Gartner [forecasts $2.5T in global AI spending for 2026](https://www.gartner.com/en/newsroom/press-releases/2026-1-15-gartner-says-worldwide-ai-spending-will-total-2-point-5-trillion-dollars-in-2026), a 44% jump YoY. AI infrastructure alone adds $401B. But Gartner also says AI is in the “Trough of Disillusionment” throughout 2026, meaning it’ll mostly be sold by incumbent software providers rather than as moonshot projects. Projection: $3.33T by 2027.

OpenAI and Oracle have reportedly structured a deal worth up to $300B in cloud capacity over time. That is an absurd number. The US Supreme Court declined to hear the Thaler/DABUS case, so purely AI-generated works without a human author remain not copyrightable under US law.

Interesting GitHub repositories

Ennyn - Local development proxy in Go that replaces port juggling with hostname-based routing. Each service gets a hostname like myapp.localhost (resolves to 127.0.0.1 per RFC 6761, no /etc/hosts edits). Generates a local CA with wildcard *.localhost certs for automatic HTTPS, integrates process management, and ships as a single zero-dependency binary for macOS, Linux, and Windows. Apache 2.0.

autoresearch (15.8K stars) - Karpathy’s overnight autonomous ML experiment runner. 630 lines of Python, single GPU, ~100 experiments while you sleep. The program.md instruction file is the key interface: the clearer your research direction, the better the agent navigates the search space.

autoresearch-mlx (374 stars) - Apple Silicon port of autoresearch using MLX instead of PyTorch. An M4 Max achieved a 19% improvement in validation metrics through autonomous optimization overnight.

Engram (911 stars) - Persistent memory for coding agents. Single Go binary, SQLite + FTS5, 13 MCP tools, supports 8+ editors. Topic-key workflow ensures memories update rather than duplicate.

Nenya - Self-hosted AI memory layer backed by Postgres and pgvector. Claude remembers things about you, so does ChatGPT, so does Cursor, but none of them talk to each other and you own none of it. Nenya creates a unified memory backend accessible via MCP and REST API, with automatic entity extraction, semantic search, and direct import from Claude and ChatGPT memory exports. Apache 2.0.

Overstory (867 stars) - Multi-agent orchestration via git worktrees + tmux + SQLite mail system. Deploys Scout, Builder, Reviewer, and Merger agents in parallel. Honest about the risks of swarm workflows.

jCodeMunch-MCP (937 stars) - Token-efficient MCP server for codebase exploration. Uses tree-sitter AST parsing across 13+ languages to retrieve only the exact symbols agents need, claiming up to 99% token savings versus brute-force file reading.
Uncodixfy (1.2K stars) - A ruleset that prevents AI from generating stereotypical “GPT UI” (floating cards, oversized rounded corners, gradient dashboards). Works as negative constraints rather than design instruction. Available as a SKILL.md for Claude Code and Codex.

OpenReview (815 stars) - Self-hosted AI code review bot from Vercel Labs. Claude-powered, sandboxed execution, inline PR suggestions, can commit fixes directly.

Open Terminal (1K stars) - From the Open WebUI team. A lightweight REST API sandbox where agents can execute commands, manage files, and run code. Docker or bare metal, multi-user isolation.

claude-replay (403 stars) - Converts Claude Code and Cursor session transcripts into interactive, self-contained HTML replays with playback controls, bookmarks, and automatic secret redaction.

Arbor (234 stars) - Native Rust desktop app for agentic coding workflows built with GPUI. Manages repos, worktrees, embedded terminals, file diffs, and real-time agent detection in a three-pane layout.

One-liners: WebReel (632 stars, scripted browser demo recording) | OpusDelta (174 stars, AI emotion visualization in 3D geometry) | RepoCheck (148 stars, Python/PyTorch reproducibility auditor) | AnythingLLM (self-hosted RAG platform, continues gaining adoption).
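Ennyn’s core idea, hostname-based routing, fits in a few lines. A sketch of the lookup step (the routing-table shape is my assumption, not Ennyn’s actual config format): strip the port from the Host header, peel off the .localhost suffix, and map the remaining name to a backend port.

```python
def route(host_header, routes, default_port=None):
    """Map a Host header like 'myapp.localhost:443' to a backend port.
    Because *.localhost already resolves to 127.0.0.1 per RFC 6761,
    the routing table is the only configuration needed."""
    host = host_header.split(":")[0].lower()   # drop any explicit port
    name = host.removesuffix(".localhost")     # 'myapp.localhost' -> 'myapp'
    return routes.get(name, default_port)
```

A real proxy would then open a connection to 127.0.0.1 on the returned port and stream bytes both ways; the per-hostname TLS certs are what make the HTTPS side transparent.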

Last modified on April 14, 2026