Claude Mythos Preview landed this week with 93.9% on SWE-bench and thousands of discovered zero-day vulnerabilities, but Anthropic isn’t letting anyone use it. Meta doubled down on inference infrastructure with a $21 billion CoreWeave expansion, bringing total commitments to $35 billion. OpenClaw shipped native Codex integration and Active Memory. Mario Zechner made the case for minimal agent harnesses at Tessel, and Nate B Jones mapped five durable verticals where builders can survive the AI middleware collapse. The theme: the tools keep getting more powerful, but the hard questions are shifting from “can AI do this?” to “who gets to use it, who pays, and who’s liable?”
Highlight of the week
Claude Mythos Preview: Anthropic’s most powerful model exists, and you can’t have it. Anthropic announced Claude Mythos Preview on April 7, a frontier model that scores 93.9% on SWE-bench Verified (up from Opus 4.6’s 80.8%) and represents a genuine step change in code understanding. But the headline isn’t the benchmark. It’s what Anthropic did with the model before announcing it.

Over several weeks, Anthropic used Mythos to scan major operating systems, browsers, and widely used software for vulnerabilities. The result: thousands of previously unknown zero-days, including a 27-year-old bug in OpenBSD and a 16-year-old flaw in FFmpeg’s H.264 codec. Through Project Glasswing, they’re coordinating responsible disclosure with a consortium of tech companies rather than releasing the model publicly.

This is unprecedented. An AI lab discovers that its model is too good at finding vulnerabilities, so instead of shipping it, it uses the model to fix the internet first. The risk report is worth reading. Anthropic is explicit: “We do not plan to make Claude Mythos Preview generally available.” Whether you read this as responsible stewardship or competitive gatekeeping depends on your priors. But the practical implication is clear: the gap between what frontier labs can do internally and what they release publicly is widening. Coding agents capable of autonomous vulnerability research at scale change the security landscape in ways we haven’t fully processed yet.

Models and research
Google’s Gemma 4 continues to dominate the open model conversation. Released under Apache 2.0 the previous week, Gemma 4 has now crossed 2 million downloads and is settling into production stacks. The 31B dense model’s numbers bear repeating: LiveCodeBench jumped from 29.1% to 80.0%, AIME math from 20.8% to 89.2%. Google announced this week that Gemma 4 can run fully offline on Android via AICore, and the edge models (E2B and E4B) are already running on Raspberry Pi hardware. The 256K context window, native vision and audio, and 140+ language support make this the most practically useful open model family released to date.

GPT-5.4 is now a month old, and the picture is clearer. OpenAI’s March 5 release brought a 1-million-token context window, native computer use, and tool search. On OSWorld-V, it scored 75%, slightly above the human baseline of 72.4%. A month in, the most interesting development is how GPT-5.4 and Codex are converging: Codex now supports the full 1M context experimentally, and sub-agents use readable path-based addresses. The plugin system has matured into a first-class workflow where Codex syncs product-scoped plugins at startup.

Yann LeCun’s AMI raises $1.03B and the debate over LLM limits heats up. LeCun’s Paris-based startup closed the largest European seed round ever at a $3.5B valuation, backed by Nvidia, Toyota, Samsung, Bezos Expeditions, and Temasek. The thesis: intelligence requires world models, not just language. AMI’s March 2026 paper demonstrates a 15-million-parameter model trained on a single GPU in hours that learns physics from video, using JEPA (Joint Embedding Predictive Architecture) with a new regularizer called CIG that prevents representation collapse. Two French-language pieces this week offered contrasting perspectives. Grand Angle Nova laid out LeCun’s vision enthusiastically: AI needs to move from generating to simulating, from predicting the next token to understanding causality. The world model is 48x faster at planning physical actions than generative approaches and uses 200x fewer tokens.

The IA Clash podcast offered a sharper counterpoint, noting that LeCun’s theoretical arguments against LLMs keep getting contradicted by empirical results. His claim that autoregressive generation diverges exponentially assumes token errors are independent, which they demonstrably are not. LLMs keep getting longer and more accurate simultaneously. Francois Chollet’s critique of LeCun’s data bandwidth argument is devastating: blind people are equally intelligent despite processing far less sensory data, because intelligence depends on the complexity of the environment, not raw input bandwidth.

My assessment: LeCun may be right that world models unlock capabilities LLMs can’t reach, but his specific arguments for why LLMs are fundamentally limited have been wrong repeatedly. The $1.03B bet will be fascinating to watch play out.

AlphaEvolve is quietly significant. Google DeepMind’s coding agent pairs Gemini with evolutionary algorithms to optimize real production code. It’s been running inside Google for over a year, recovering 0.7% of global compute (a meaningful number at Google’s scale) and improving a core Gemini training kernel by 23%. The AlphaEvolve Service API is now available through early access on Google Cloud.

Coding agents and dev tools
Mario Zechner’s talk at Tessel is the best critique of coding agent bloat I’ve seen this year. The libGDX creator and builder of Pi (the minimal agent that powers OpenClaw) spoke at Tessel about why he built his own coding harness after trying Claude Code, OpenCode, Codex CLI, and Amp. His observations land because they’re specific. On Claude Code: “It does so many things that you actually probably ever use like 5% of what it offers.” On OpenCode’s session compaction pruning all tool results before the last 40K tokens: “What does this do to your prompt cache? Lost.” On LSP feedback mid-edit: injecting compiler errors after each individual edit (before the agent is done editing) causes models to panic and abandon their plan. The fix is obvious but most harnesses get it wrong: lint only when the agent thinks it’s done.

His sharpest insight comes from TerminalBench results. Terminus, which gives the model nothing but a tmux session and keystrokes, performs near the top of the leaderboard. Pi, with just four tools (read, write, edit, bash), sits right behind it. This suggests that most of the features in mainstream coding agents (sub-agents, MCP, plan mode, background bash, built-in todos) aren’t necessary for raw model performance. Pi’s 600-line TUI and shortest system prompt of any major agent are a deliberate bet that extensibility beats features.

Two theses from the talk worth internalizing: nobody knows what the ideal coding agent looks like yet, and we need agents that can self-modify so we can experiment faster. The community extensions prove the point. Pi Annotate lets you annotate a website and feed visual feedback to the agent. File Switch opens modified files inline. Pi Messenger creates a chat room between multiple Pi agents. None of it is built-in. All of it took hours, not weeks.

He also raised a problem that’s getting worse: AI-generated open source spam. His solution is a human verification gate. Write a short issue introducing yourself (anything longer than one screen is probably “clanker slop”), get added to a contributor file, then you can submit PRs. Mitchell Hashimoto built this into a tool called Vouch. Crude, but it works.

OpenClaw 2026.4.10 ships native Codex and Active Memory. The April 11 release bundles a Codex provider (socodex/gpt-* models use Codex-managed auth and threads while openai/gpt-* stays on the normal provider path), an Active Memory plugin that runs a memory sub-agent before each reply to pull in relevant context automatically, local MLX speech for Talk Mode on macOS, and a batch of security hardening across browser navigation, sandbox behavior, and plugin install scanning. The exec-policy CLI adds show, preset, and set subcommands for managing tool approval configurations.
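Zechner’s four-tool thesis is concrete enough to sketch in code. Here is a minimal loop in the spirit of Pi, with deferred linting as he describes; all names, the action format, and the tool handlers are invented for illustration, not Pi’s actual implementation:

```python
import subprocess
from pathlib import Path

# Four tools in the spirit of Pi's minimal harness. Handlers are illustrative.
def tool_read(path):
    return Path(path).read_text()

def tool_write(path, text):
    Path(path).write_text(text)
    return f"wrote {path}"

def tool_edit(path, old, new):
    src = Path(path).read_text()
    if old not in src:
        return "edit failed: pattern not found"
    Path(path).write_text(src.replace(old, new, 1))
    return f"edited {path}"

def tool_bash(cmd):
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return r.stdout + r.stderr

TOOLS = {"read": tool_read, "write": tool_write, "edit": tool_edit, "bash": tool_bash}

def run_agent(model, task, lint=None):
    """Drive the loop. The model callable returns either
    ("tool", name, kwargs) or ("done", summary).

    Lint output is injected only after the model says it is done, never
    mid-edit, so the model is not interrupted halfway through a plan.
    """
    history = [("user", task)]
    while True:
        action = model(history)
        if action[0] == "done":
            problems = lint() if lint else ""
            if problems:
                history.append(("lint", problems))
                continue
            return action[1]
        _, name, kwargs = action
        history.append(("tool_result", TOOLS[name](**kwargs)))
```

The point of the sketch is how little machinery a working loop needs: a tool table, a history list, and one rule about when feedback is allowed in.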
OpenHarness launches as an open-source agent harness from HKU. OpenHarness is a Python implementation of the Claude Code architecture pattern: 43 tools, skill loading from markdown files, multi-agent coordination, and a permission system with path-level rules. What makes it interesting is compatibility: it works with Claude, OpenAI, Copilot, Codex, Moonshot/Kimi, and GLM providers. The bundled personal agent “ohmo” connects to Slack, Telegram, Discord, and Feishu, running on your existing Claude Code or Codex subscription. At v0.1.6, it’s early, but the auto-compaction feature (preserving task state across context compression for multi-day sessions) addresses a real pain point.
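Path-level permission rules of the kind OpenHarness describes are easy to model. A minimal sketch, assuming nothing about OpenHarness’s actual rule syntax (the rule format and `is_allowed` helper below are invented):

```python
from fnmatch import fnmatch

# Ordered (verdict, glob) rules: first match wins, deny by default.
# Note: fnmatch's "*" crosses path separators, so these are simple globs,
# not full gitignore semantics. Rule format is invented for illustration.
RULES = [
    ("deny",  "**/.env"),
    ("deny",  "secrets/**"),
    ("allow", "src/**"),
    ("allow", "*.md"),
]

def is_allowed(path, rules=RULES, default=False):
    """Return True if the first matching rule allows the path."""
    for verdict, pattern in rules:
        if fnmatch(path, pattern):
            return verdict == "allow"
    return default
```

First-match-wins ordering is the design choice that matters here: it lets a narrow deny (`secrets/**`) sit above a broad allow without the two fighting.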
The Codex cross-provider plugin is worth noting. A Codex plugin for Claude Code lets you run Codex as a second-opinion agent inside a Claude Code session. Cross-provider code review without switching terminals. The walls between coding agent ecosystems are getting thinner.
Web development and frameworks
Browsers are becoming AI workspaces, whether developers want it or not. Samsung announced it’s expanding its browser with agentic AI across devices: natural-language history search (“that smartwatch I looked at last week”) and multi-tab summarization. Chrome ships with built-in Gemini 3 and “Auto Browse.” Edge has Copilot agent modes. Brave has Leo AI. Opera has Aria/Neon agents. For web developers, this creates two practical concerns. First, security: when browser copilots can “see” everything on the page, connecting them to internal repos or admin consoles becomes a data leak vector. Teams need to revisit what AI-enhanced browsers can access. Second, discoverability: AI search engines and browser copilots are increasingly the first touchpoint for users, which means dev portals and documentation need to be structured for AI consumption from the start.

The edge keeps getting more capable. Between Gemma 4 running offline on phones, Cloudflare Workers getting memory-efficient image processing (the sip library), and Google’s TurboQuant KV-cache compression making long-context inference cheaper, the boundary between “edge” and “cloud” AI keeps blurring. For frontend developers building AI features, the question is no longer if you can run inference locally, but when it makes sense to.

Industry and business
Meta’s $21 billion CoreWeave expansion is an inference bet. CNBC reported that Meta committed an additional $21B to CoreWeave on April 9, bringing total contracts to $35B through 2032. The deal runs on Nvidia Rubin systems and focuses specifically on inference, not training. Meta’s capex guidance for 2026 sits at $115-135B, nearly double 2025. The strategic signal: hyperscaler-independent GPU clouds can win very large, long-term contracts, and inference is where the compute demand is shifting.

The AI app builder middleware trap is playing out in real time. Nate B Jones published a sharp analysis of why companies like Lovable ($330M raised, $6.6B valuation, 100K new projects per day) are functionally thin wrappers around base models with fragile moats. His framework identifies five durable verticals that AI can’t commoditize: trust (Stripe, Shopify), context (Notion, Salesforce, Snowflake), distribution (Google, Apple, Amazon), taste (human editorial judgment and orchestration quality), and liability (regulated industries selling accountability). The key insight: if a better model makes your product obsolete, change your positioning now.

Keith Rabois reinforced this on Lenny’s Podcast, noting that the number one consumer of AI tokens at two major companies he advises is the CMO, not engineering. His take on PMs: “The idea of a PM makes no sense in the future. The skill is more like being a CEO now, which is what are we building and why?” Shopify hasn’t allowed PowerPoint product presentations for two years; every pitch must be a working demo. Design and code are merging, and the premium shifts to business acumen over technical execution.

OpenAI’s advertising ambitions raise questions. Marketing analysts cite internal projections of $2.5B in ad revenue for 2026 with long-range targets of $100B annually. If ChatGPT becomes an advertising platform, the incentive alignment between model helpfulness and ad revenue deserves scrutiny.
AI policy is fragmenting and consolidating simultaneously. The White House released a National Policy Framework for AI pushing federal preemption, while Colorado’s SB24-205 (high-risk AI systems in employment, housing, healthcare) remains on track for June 30. California has multiple sector-specific AI bills in committee. The FTC reversed a consent order against Rytr, signaling lighter federal enforcement. For builders: high-risk domains will face stricter obligations regardless of federal direction.

Interesting GitHub repositories
- HKUDS/OpenHarness - Open-source Python agent harness with 43 tools, skill loading, multi-agent coordination, and multi-provider support (Claude, OpenAI, Copilot, Codex, Kimi, GLM). The “ohmo” personal agent connects to chat platforms and runs on existing subscriptions. Worth watching for researchers who want to understand how production agents work under the hood.
- 3dsvg - Turns any 2D SVG into an interactive 3D component in the browser. Pick material presets (glass, chrome, holographic), add animations, export as 4K PNG or 60fps MP4 using FFmpeg WASM locally. Outputs clean React Three Fiber JSX.
- debug-agent - A Claude Code/Cursor skill that changes the debugging approach entirely. Instead of guessing at fixes from static code, it injects lightweight NDJSON logging, asks you to reproduce the bug, analyzes live runtime logs, and only writes a fix when it has evidence. Verify-then-fix instead of guess-and-pray.
- llmwiki - Implementation of Karpathy’s argument that RAG is broken because it retrieves the same raw chunks every time. Upload PDFs, connect Claude via MCP, and it synthesizes encyclopedia-style articles, flags contradictions, and updates cross-references across files. Persistent knowledge base instead of stateless retrieval.
- helixent - Lightweight TypeScript library built on Bun for React-style agent loops: reasoning, planning, executing. Minimal building blocks for custom agent loops without orchestration framework bloat.
- boxsh - Sandboxed shell designed for AI agents via concurrent JSON-line RPC. Linux namespace isolation traps every command so your agent can run builds and install packages without touching the host. Addresses a genuine security gap in agent tooling.
- hypatia - AI memory system built in Rust using SQLite FTS5 and DuckDB with a custom JSON query language. No external models, no vector embeddings. A contrarian bet that structured search beats fuzzy similarity for agent memory.
- liteparse_samples - Verification layer for AI document extraction. Every extracted fact includes an interactive link opening the source document with a bounding box around the exact text. Essential for legal, financial, and academic workflows where hallucinated citations break trust.
- sip - Memory-efficient image processing library built for Cloudflare Workers constraints. Resize and optimize on the edge without bouncing to a heavy backend.
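The verify-then-fix pattern behind debug-agent hinges on one mechanic: structured runtime logs the agent can query for evidence instead of guessing from static code. A minimal NDJSON probe sketch, with the `probe` decorator and field names invented for illustration (not debug-agent’s actual format):

```python
import functools
import io
import json
import time

def probe(log_stream):
    """Decorator that appends one NDJSON line per call: name, args, result, duration."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            result = fn(*args, **kwargs)
            log_stream.write(json.dumps({
                "fn": fn.__name__,
                "args": repr(args),
                "result": repr(result),
                "ms": round((time.perf_counter() - t0) * 1000, 3),
            }) + "\n")
            return result
        return inner
    return wrap

def read_probes(log_stream):
    """Parse the NDJSON log back into dicts (assumes an in-memory StringIO for the demo).
    This is the evidence an agent would analyze before proposing a fix."""
    return [json.loads(line) for line in log_stream.getvalue().splitlines()]
```

One line of JSON per event is the whole trick: it is trivially appendable while the buggy code runs, and trivially queryable afterward.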
Quick bits
- MCP crossed 97 million installs in March 2026, 4,750% growth in 16 months. Every major AI provider now ships MCP-compatible tooling. The protocol war is over.
- Google’s TurboQuant KV-cache compression (ICLR 2026) combines PolarQuant rotations with quantized Johnson-Lindenstrauss embeddings to shrink memory for long-context inference. Cheaper long-context is getting real.
- Meta laid off 200 more workers while planning $115-135B in AI capex. The “invest in infrastructure, cut headcount” pattern continues.
- OpenAI launched a Safety Fellowship for independent AI safety research, running September 2026 through February 2027.
- France accelerates digital sovereignty: the state is reducing extra-European technology dependencies across government agencies, context for European AI infrastructure decisions.
- Systems thinking for CLAUDE.md: stop writing your agent config like a README. Write it like a failure log. Where is the agent repeating mistakes? Where are you losing time re-explaining context? Where can you constrain upfront? Your CLAUDE.md should encode your failure modes, not your folder structure.
- GLM-5.1 from Zhipu AI is pitched for long-horizon autonomous engineering, supporting thousands of tool calls over hours-long traces. Open models competing on agent endurance, not just benchmarks.
- Microsoft’s Agent Governance Toolkit (open-source) simplifies compliance for autonomous agents under the EU AI Act, HIPAA, and SOC 2.
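The failure-log framing in the CLAUDE.md tip above is easier to see with a before/after. A hypothetical fragment (the project details are invented):

```markdown
<!-- README-style (what to avoid): -->
## Project structure
- src/ — application code
- tests/ — pytest suite

<!-- Failure-log style (what the tip suggests): -->
## Known failure modes
- You keep editing generated files under build/. Never touch build/; edit templates/ instead.
- Run tests with `pytest -x -q`; plain `pytest` floods the context with output.
- The ORM layer looks dead but is loaded via reflection. Do not delete "unused" models.
```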
Sources
- Claude Mythos Preview announcement
- Claude Mythos Preview - InfoQ
- Claude Mythos risk report
- Claude Mythos - NxCode analysis
- Project Glasswing
- Gemma 4 - Google DeepMind
- Gemma 4 blog - Google
- Gemma 4 on Android AICore
- Gemma 4 - Google Cloud
- GPT-5.4 introduction
- GPT-5.4 - DataCamp
- Codex subagents
- AMI Labs raises $1.03B - TechCrunch
- AMI Labs - MIT Technology Review
- IA Clash: Yann LeCun critique
- Grand Angle Nova: LeCun world model
- AlphaEvolve - Google DeepMind
- AlphaEvolve on Google Cloud
- Mario Zechner: Pi talk at Tessel
- Pi coding agent
- Armin Ronacher on Pi
- OpenClaw 2026.4.10 release
- OpenClaw changelog analysis
- OpenHarness - GitHub
- OpenHarness - Knightli analysis
- Codex plugin for Claude Code
- Meta CoreWeave $21B - CNBC
- Meta CoreWeave - Bloomberg
- CoreWeave investor announcement
- Nate B Jones: 5 safe places to build
- Keith Rabois on Lenny’s Podcast
- Cursor 3 launch
- MCP 97 million installs
- Samsung agentic browser
- OpenAI advertising projections
- White House AI policy framework
- Colorado AI Act update
- GitHub Trending Weekly #30
- Lovable Series B - TechCrunch
- France digital sovereignty