AI code review tools are everywhere now. GitHub has Copilot reviews, there’s CodeRabbit, Qodo, Greptile, and something new every other week. If you’re on GitHub with a standard stack, you’re spoiled for choice. We’re not on GitHub. We run self-hosted GitLab, track issues in Linear, and have the kind of internal setup that makes off-the-shelf integrations difficult. So I built our own review agent. It’s been running on 25 projects, has processed around 1000 merge requests (each with ~2 reviews on average), and costs about 50 cents per review. Here’s what I learned.

The starting point

Six months earlier, a colleague had already built a working AI review system for our GitLab. It ran as a manual CI job: Claude Code with access to the git diff, commit log, and a prompt. Under 200 lines of code. It worked, teams used it, and it proved that AI reviews were worth doing. The limitation was scope. It could see what changed, but not the surrounding code, the rest of the codebase, or any external context. That was a deliberate choice: it was simple and fast. But I wanted to see what would happen if the agent could do what good human reviewers do: search the codebase, read related files, check the issue tracker, and understand the broader context of a change. Sean Goedecke’s blog post on code reviews was part of the inspiration.

Why custom?

Our engineering stack includes our self-hosted GitLab, Linear for issue tracking, and plenty of other internal systems and tools. If you’re buying an external review tool, you need it to work with all of those, and most don’t. Building our own means the agent can obtain Linear issue details, apply review standards specific to each project, and use GitLab’s CLI, web searches and other tools to go beyond the local checkout. Each integration makes the reviews noticeably better, and each one would be difficult or impossible with an off-the-shelf tool.

Vibe coding a large Python project

I have mostly avoided Python. I've written scripts in it, but building a significant async service is not something I would normally have attempted. Recent model advances changed this. Claude Opus 4.5 arrived in late November 2025, and with it a level of coding ability that made working in an unfamiliar language feel productive rather than frustrating. I started this project in January. More recently, Opus 4.6 improved things further. The Claude Agent SDK (released September 2025) provided the agentic tooling to make the workflow practical.

In the first two weeks, the project grew to about 25K lines of application code and 22K lines of tests. It handles webhook events from GitLab and Linear, classifies requests, triggers CI pipelines, runs Claude agents with sandboxed tool access, and posts results back to GitLab comments and Linear agent sessions. There are broadly four agent task types (review, conversation, planning, and coding), each with different prompts, tool permissions, and security constraints.

I really like the phrase “vibe coding”, but I don’t mean that there was no oversight or judgement. I made the architectural and UX decisions, read what the AI thought and wrote, and debugged when things broke. But the barrier to working in Python dropped so far that the language choice became almost entirely irrelevant (I only picked it because of the Claude Agent SDK).

Some of the more complex features illustrate this well. Incremental reviews - where the agent remembers what it said on a previous review, sees only what changed since, resolves its own comments when issues are fixed, and handles force-push and rebase correctly - involve fiddly commit and diff tracking. It’s the kind of thing I’d normally spend days (or longer) getting right, which I couldn’t afford on an internal, experimental tool. With AI assistance, I focused on the design (what state to track, when to resolve, how to handle rebases) and the AI handled the implementation details. This part did take more than one attempt to get right, but each change took minutes instead of hours.

How the agent works

The review agent is triggered by a webhook. When you open or update a merge request, the agent picks it up automatically. By the time a human reviewer comes to look at the MR, a summary and inline comments are already there. Here’s what happens during a review:
  1. The system launches a review job (in a GitLab CI pipeline)
  2. It clones the repository, fetches the MR description and previous comments, and passes those to Claude in a prompt
  3. It searches the codebase for related code and patterns using its built-in tools
  4. It looks up the linked issue using the Linear MCP for context on intent and requirements
  5. It produces structured JSON output: a summary and inline comments with exact file and line references
Each inline comment has a severity level and optionally a code suggestion that the author can apply with one click in GitLab. This structure makes the reviews predictable and actionable.

A few other details worth mentioning:

Security. The review agent is read-only; file write and edit tools are blocked. All shell commands run inside a Bubblewrap sandbox with a network allowlist restricted to package registries and our GitLab instance. Credentials are automatically redacted from AI responses before they’re posted.

Fast classification. When someone mentions @agent in a comment, a fast classifier (Claude Haiku) categorizes the request: simple question, review request, coding task, etc. Simple questions get answered immediately by Haiku. Complex requests get routed to Opus via the CI pipeline. This keeps the cheap things cheap.

Incremental reviews. On subsequent pushes, the agent compares against its previous review. It knows what changed, avoids repeating itself, and resolves its own discussion threads when the author fixes an issue. This works even after force-push and rebase by tracking the relevant commit SHAs.
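The cheap-model-first routing is simple to express. This is a hedged sketch: the category names and the two routing targets are placeholders I invented to show the shape of the decision, not the real classifier's taxonomy.

```python
# Sketch of cheap-model-first request routing; category names and
# routing targets are illustrative assumptions.

# Categories the fast model can fully handle on its own
CHEAP_CATEGORIES = {"simple_question"}

def route(category: str) -> str:
    """Decide which model handles an @agent mention.

    Simple questions are answered inline by the fast model (Haiku);
    anything heavier is queued as a CI job for the strong model (Opus).
    """
    if category in CHEAP_CATEGORIES:
        return "haiku:inline"
    return "opus:ci_pipeline"
```

The classifier itself runs on Haiku too, so the fixed cost of every mention is one cheap call; only requests that genuinely need it pay for an Opus pipeline run.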

The numbers

Metric                      Value
Projects using the agent    25
Merge requests reviewed     1000+
Average reviews per MR      ~2
Cost per review             ~$0.50
Cost per MR                 ~$1.00
Model                       Claude Opus 4.6
Adoption was opt-in. Teams enable it with a single command that installs a webhook on their project. The agent also has an experimental coding mode: it can implement features from Linear issues, including writing detailed plans, creating branches, and opening merge requests. It has already written code to improve itself (when directed). I still need to gain more confidence in this workflow before it can be considered stable.

What I learned

Agents do a lot better with tools. When Claude can search the codebase, read related files, and check external context, it produces reviews that are qualitatively different from a prompt-with-a-diff. The agent finds issues that aren’t visible in the diff alone: inconsistencies with patterns used elsewhere, missing error handling that only matters in context, and changes that contradict the MR description or the linked issue.

Internal tools are an ideal project for vibe coding. The stakes are low: if something breaks, you fix it, and you’re very close to the users. Iteration is fast. And you end up with something that fits your workflow exactly, which no purchased tool will do. While building the review agent, I worked on several other internally focused projects including an AI gateway for colleagues and several smaller tools. The pattern holds: if it’s internal and you understand the requirements, AI-assisted development is remarkably productive.

Making reviews consistently useful is the hard problem. The agent produces accurate and insightful-looking reviews most of the time. But it still sometimes flags trivial issues, and very occasionally hallucinates (such as claiming “Go 1.25 does not exist”). Improving the signal-to-noise ratio is ongoing prompt engineering work, and it’s work I can focus on now that the infrastructure is in place.

The agent’s coordination system runs, of course, on Upsun, and for now the agent SDK tasks run on GitLab CI. In the process of building this I have learned a lot about what I would want from an autonomous agent system, and I’m hoping to explore that further.
Last modified on April 14, 2026