Efficient code analysis for LLMs

AI agents excel at understanding large codebases in depth. But sometimes you don’t need depth, and you don’t want to wait that long. We wanted a way to get a quick overview of a repository, so we made Whatsun. It’s an open-source tool that reads the structure and dependencies of a codebase of any size and produces a very concise summary. It can be used as a CLI or as a Go library.

Whatsun does not itself use AI, so it’s fast, predictable, and secure. That speed is exactly why it works well with AI: it handles the quick structural analysis upfront, saving the AI from slower and more expensive processing. This helps to power our AI-assisted configuration feature, which is available through the CLI’s upsun init command for local code, and which you can also try on the web for a public GitHub repository. Of course, we also used AI to help build Whatsun itself.

The problem

At Upsun we support applications written in many different languages and built with many different tools. These diverse applications can coexist in the same repository as part of a single Upsun project, and this complexity is why we use AI to help with configuration.

Our AI context for an Upsun project started with a system prompt, a file tree, and some documentation. As we evaluated the AI’s results, our approach to the context evolved, and we found we needed more precise information that could vary according to the type of project.

Another approach would be to ask the LLM to fetch the context it needs, giving it tools to read anything it likes. This can work beautifully in coding agents such as Claude Code, but it did not suit our needs: it would take much longer, cost too much (requiring large reasoning models), and expose too much unnecessary information.

Our AI prompt now includes Whatsun’s output, alongside conditional context that is retrieved automatically based on Whatsun’s findings, such as framework-specific documentation.

The digest

Whatsun produces a digest of a repository, in three parts:

A file tree that limits detail progressively, minimizing tokens. The tree respects .gitignore, which greatly improves performance as it avoids the need to traverse unnecessary directories (like node_modules).
Reports showing detected frameworks, build tools, and package managers. The reports are generated based on configurable rules, explained below.
Selected file contents from important files like README.md, AGENTS.md, or docker-compose.yml. This will also include files specific to certain findings: for example, compose.yaml will be included if Symfony is detected. The contents are limited to the first 2 KB.
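The 2 KB limit on file contents can be illustrated with a minimal Go sketch. This is not Whatsun’s actual code: readHead, demoFS, and maxContentBytes are hypothetical names, and the in-memory filesystem stands in for a real repository.

```go
package main

import (
	"fmt"
	"io"
	"io/fs"
	"strings"
	"testing/fstest"
)

// maxContentBytes mirrors the 2 KB limit described above (assumed to mean 2048 bytes).
const maxContentBytes = 2048

// readHead returns at most limit bytes from the start of the named file.
func readHead(fsys fs.FS, name string, limit int64) (string, error) {
	f, err := fsys.Open(name)
	if err != nil {
		return "", err
	}
	defer f.Close()
	// LimitReader stops after limit bytes, so large files are never fully read.
	b, err := io.ReadAll(io.LimitReader(f, limit))
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// demoFS builds a small in-memory repository for illustration only.
func demoFS() fs.FS {
	return fstest.MapFS{
		"README.md": {Data: []byte(strings.Repeat("x", 5000))},
		"AGENTS.md": {Data: []byte("short file")},
	}
}

func main() {
	fsys := demoFS()
	for _, name := range []string{"README.md", "AGENTS.md"} {
		head, err := readHead(fsys, name, maxContentBytes)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %d bytes kept\n", name, len(head))
	}
}
```

Cutting at a byte offset can split a multi-byte UTF-8 character, so a real implementation would probably trim back to the last complete character or line.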

The result is a succinct snapshot with the level of detail that we find most helpful. For example, below is the reports section for our demo project. It shows that the project contains a Flask backend managed by uv, and a frontend that uses React and Express, managed by bun.

reports:
    .:
        - result: bun
          ruleset: package_managers
          groups: [js]
    backend:
        - result: flask
          ruleset: frameworks
          groups: [python]
        - result: uv
          ruleset: package_managers
          groups: [python]
    frontend:
        - result: express
          metadata: {version: 4.21.2}
          ruleset: frameworks
          groups: [js]
        - result: reactjs
          metadata: {version: 18.3.1}
          ruleset: frameworks
          groups: [js]
        - result: bun
          ruleset: package_managers
          groups: [js]

Declarative rules

As we were exploring what this tool could do, we wanted it to be simple to configure, but we didn’t want to restrict its potential. We chose Common Expression Language (CEL), which lets you write rules as configuration. Each rule is evaluated against every directory in the codebase and may contribute a result. Here is an example rule:

django:
  when: fs.depExists("python", "django")
  then: django

The when clause is a CEL expression. In this case, in each directory, it invokes a dependency manager function to check whether django is required as a Python dependency.

Other web frameworks can be harder to detect. They may be composed of various packages that may or may not indicate use of the framework, such as Symfony’s flexible components and libraries. A ‘framework’ may even be used without any visible presence in the repository, in the case of static site generators. Or it may leave a file as a clue:

when: fs.fileExists("hugo.toml") || fs.fileExists("hugo.yaml") || fs.fileExists("hugo.json") || fs.fileExists(".hugo_build.lock")
then: hugo

The rules-based system makes this very flexible without extra Go code. In theory, other sets of rules for other kinds of analysis could be added in future or provided by library callers.
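To make the "evaluated on every directory" idea concrete, here is a minimal Go sketch. It deliberately does not use CEL: plain Go functions stand in for the when expressions, and the names rule, fileExists, evaluate, demoRules, and demoFS are all hypothetical rather than part of Whatsun’s API.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// rule is a hypothetical stand-in for a CEL rule: "when" decides whether
// the rule matches a directory, and "then" is the result it contributes.
type rule struct {
	when func(fsys fs.FS, dir string) bool
	then string
}

// fileExists reports whether dir directly contains a regular file called name.
func fileExists(fsys fs.FS, dir, name string) bool {
	info, err := fs.Stat(fsys, path.Join(dir, name))
	return err == nil && info.Mode().IsRegular()
}

// evaluate runs every rule against every directory and collects the
// results, keyed by directory path.
func evaluate(fsys fs.FS, rules []rule) (map[string][]string, error) {
	results := map[string][]string{}
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			return nil // rules are evaluated per directory, not per file
		}
		for _, r := range rules {
			if r.when(fsys, p) {
				results[p] = append(results[p], r.then)
			}
		}
		return nil
	})
	return results, err
}

// demoRules mirrors part of the Hugo rule shown above.
func demoRules() []rule {
	return []rule{{
		when: func(fsys fs.FS, dir string) bool {
			return fileExists(fsys, dir, "hugo.toml") || fileExists(fsys, dir, "hugo.yaml")
		},
		then: "hugo",
	}}
}

// demoFS is an in-memory repository used only for illustration.
func demoFS() fs.FS {
	return fstest.MapFS{
		"docs/hugo.toml": {Data: []byte("title = 'Docs'")},
		"backend/app.py": {Data: []byte("print('hi')")},
	}
}

func main() {
	results, err := evaluate(demoFS(), demoRules())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(results)
}
```

Expressing the condition in CEL rather than in Go, as Whatsun does, means a rule like this can live in configuration instead of compiled code.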

Multilingual dependency detection

Whatsun parses package manager manifests for nine languages to get an overview of dependencies: Go, JavaScript, Python, PHP, Ruby, Rust, Java, .NET, and Elixir. Some of these, such as JavaScript and Python, have several package managers each. Fortunately, adding new integrations is a particularly good use case for AI, both for research and for generating code and tests.
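As an illustration of manifest parsing, the sketch below looks up a dependency in a package.json file using only Go’s standard library. It is an assumption-laden example, not Whatsun’s implementation: packageJSON, depVersion, and demoFS are hypothetical names, and it reads only the declared version constraint rather than a resolved version from a lockfile.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// packageJSON holds the only fields this sketch reads from package.json.
type packageJSON struct {
	Dependencies    map[string]string `json:"dependencies"`
	DevDependencies map[string]string `json:"devDependencies"`
}

// depVersion returns the version constraint declared for the named
// dependency in dir/package.json, and whether it was found at all.
func depVersion(fsys fs.FS, dir, name string) (string, bool) {
	data, err := fs.ReadFile(fsys, path.Join(dir, "package.json"))
	if err != nil {
		return "", false // no manifest in this directory
	}
	var manifest packageJSON
	if err := json.Unmarshal(data, &manifest); err != nil {
		return "", false // unreadable manifest: treat as "not found"
	}
	if v, ok := manifest.Dependencies[name]; ok {
		return v, true
	}
	v, ok := manifest.DevDependencies[name]
	return v, ok
}

// demoFS is an in-memory repository used only for illustration.
func demoFS() fs.FS {
	manifest := `{"dependencies": {"react": "^18.3.1", "express": "^4.21.2"}}`
	return fstest.MapFS{
		"frontend/package.json": {Data: []byte(manifest)},
	}
}

func main() {
	fsys := demoFS()
	for _, dep := range []string{"react", "express", "django"} {
		if v, ok := depVersion(fsys, "frontend", dep); ok {
			fmt.Printf("%s: %s\n", dep, v)
		} else {
			fmt.Printf("%s: not found\n", dep)
		}
	}
}
```

Each additional language or package manager would need its own parser along these lines, which is why the number of manifest formats matters.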

Security and privacy

Upsun requires a security and privacy review of new features, and AI-powered features are no exception. A Git repository stores code and should not normally contain secrets or personal information, but it remains a possibility. Such data would be of no use for understanding the code, and it should not be sent to an AI vendor. Whatsun guards against this with multiple layers of protection.

Firstly, it respects developer intent: .aiignore and .aiexclude files can be used to declare what should not be analyzed. Unfortunately, this isn’t a common standard yet, but the former is supported by JetBrains’s Junie and the latter by Google’s Gemini. As mentioned above, Whatsun also reads and respects .gitignore files, for both privacy and efficiency.

Secondly, if a file’s content is included, Whatsun sanitizes it. It uses Gitleaks for secret detection, redacting API keys and credentials. It also redacts email addresses to remove personally identifiable information, detects and skips binary files, and strips comments to avoid leaking internal notes.
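Two of the sanitization steps above, email redaction and binary detection, can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions, not Whatsun’s code: the email pattern is deliberately loose, the NUL-byte check is just one common heuristic for spotting binary data, and the names emailPattern, redactEmails, and looksBinary are hypothetical. Secret detection with Gitleaks is not shown.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// emailPattern is a deliberately simple approximation of an email address.
// It is an assumption for this sketch, not a full address grammar.
var emailPattern = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// redactEmails replaces anything that looks like an email address
// with a fixed placeholder.
func redactEmails(s string) string {
	return emailPattern.ReplaceAllString(s, "[REDACTED_EMAIL]")
}

// looksBinary treats content as binary if a NUL byte appears within
// its first 8000 bytes. Text files rarely contain NUL bytes.
func looksBinary(b []byte) bool {
	if len(b) > 8000 {
		b = b[:8000]
	}
	return bytes.IndexByte(b, 0) >= 0
}

func main() {
	fmt.Println(redactEmails("Contact alice@example.com for help"))
	fmt.Println(looksBinary([]byte("plain text")))
	fmt.Println(looksBinary([]byte{0x89, 'P', 'N', 'G', 0x00}))
}
```

A regular expression like this will miss unusual addresses and may match strings that are not addresses, so it is best seen as one layer among several rather than a guarantee.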

Performance

Speed is a priority for Whatsun’s user experience, scalability, and ease of development. Whatsun caches the configured CEL rules at build time, so that they do not need to be recompiled. It traverses directories in parallel and then executes rules in parallel, caching directory contents between the two steps to avoid unnecessary stat() calls. It operates on Go’s io/fs virtual filesystem, meaning it can process a Git repository in the same way whether it was cloned in memory or on disk. The resulting digest is designed to minimize tokens, which helps in a few ways: a faster response, less context confusion, and lower costs.
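The combination of cached directory listings and parallel checks might look roughly like the Go sketch below. It simplifies the design described above: the walk here is sequential and only the checks run concurrently, and dirCache, buildCache, findDirsWith, and demoFS are hypothetical names, not Whatsun’s API. Because it takes an fs.FS, the same code works on an in-memory filesystem or on one backed by disk (for example, via os.DirFS).

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"sort"
	"sync"
	"testing/fstest"
)

// dirCache maps a directory path to the set of file names it directly contains.
type dirCache map[string]map[string]bool

// buildCache walks the filesystem once and records each directory's files,
// so that later checks need no further filesystem calls.
func buildCache(fsys fs.FS) (dirCache, error) {
	cache := dirCache{}
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			cache[p] = map[string]bool{}
			return nil
		}
		// WalkDir visits a directory before its files, so the parent entry exists.
		cache[path.Dir(p)][d.Name()] = true
		return nil
	})
	return cache, err
}

// findDirsWith checks every cached directory concurrently for a file name.
// The cache is only read here, so the goroutines can share it safely.
func findDirsWith(cache dirCache, name string) []string {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		found []string
	)
	for dir, names := range cache {
		wg.Add(1)
		go func(dir string, names map[string]bool) {
			defer wg.Done()
			if names[name] {
				mu.Lock()
				found = append(found, dir)
				mu.Unlock()
			}
		}(dir, names)
	}
	wg.Wait()
	sort.Strings(found) // goroutines finish in any order, so sort for stable output
	return found
}

// demoFS is an in-memory repository used only for illustration.
func demoFS() fs.FS {
	return fstest.MapFS{
		"package.json":           {Data: []byte("{}")},
		"backend/pyproject.toml": {Data: []byte("")},
		"frontend/package.json":  {Data: []byte("{}")},
	}
}

func main() {
	cache, err := buildCache(demoFS())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(findDirsWith(cache, "package.json"))
}
```

For a check this cheap, spawning a goroutine per directory costs more than it saves; parallelism pays off when each rule does real work, such as parsing a manifest.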

Caveats

Whatsun is built for the 80% case: surface-level understanding, not deep analysis. It does not provide the full-file context that an AI would need to edit code. And while the rules cover quite a few cases, they would require quite a bit of maintenance to become or remain comprehensive.

Try it

Whatsun is open-source and available on GitHub. If you are building an AI developer tool, you might find Whatsun useful for enhancing context. Or you can download the whatsun CLI, run it on your project, and see what it produces:

go install github.com/upsun/whatsun/cmd/whatsun@latest

We’d be glad to hear what you think, and contributions are very welcome.