
The challenge: Beyond generic AI responses
Generic AI assistants are impressive, but they fall short when you need domain-specific expertise. Ask ChatGPT to write a technical article about your company, and you’ll get something generic—accurate in structure perhaps, but lacking the nuance, brand voice, and technical depth your team needs. The problem isn’t the model’s capability, it’s the context. Large language models don’t inherently know about your company’s:
- Technical documentation and best practices
- Brand voice and messaging guidelines
- Content templates and formatting standards
- Product-specific features and terminology
- Team personas and target audiences
The solution: Context building, not prompt engineering
Our key insight was this: building better agents isn’t about writing better prompts, it’s about building better context. Traditional prompt engineering tries to cram everything into the user’s message, which runs into:
- Token limits: You hit model context windows quickly or run into context rot
- Cost: Every query pays for the same massive context
- Maintenance: Updating guidelines means updating every prompt
- Relevance: Most of that context isn’t relevant to the specific request
Context building splits this into two layers instead:
- Static Knowledge: Injected once into the agent’s instructions—brand voice, writing standards, personas
- Dynamic Retrieval: Retrieved on-demand using RAG (Retrieval-Augmented Generation)—relevant documentation, code examples, technical details
Architecture overview: The big picture
Before diving into implementation details, let’s understand the system architecture. The system has five main components:
- Document Ingestion: Daily cron job that syncs Git repositories, processes markdown files, and updates embeddings in ChromaDB
- ChromaDB Vector Database: Stores document embeddings for semantic search
- RAG Query Engine: Retrieves relevant content using vector similarity search
- Google ADK Agent: Orchestrates responses using static knowledge and dynamic retrieval
- User Interfaces: Web and CLI interfaces for interaction
Component 1: ChromaDB Vector Database
ChromaDB is an open-source embedding database designed for building AI applications with semantic search capabilities. Think of it as a specialized database that stores text alongside vector representations (embeddings) that capture semantic meaning.
Why vector databases?
Traditional keyword search fails with semantic queries:
- Query: “How do I cache data for better performance?”
- Keyword match: Looks for exact words like “cache”, “data”, “performance”
- Semantic match: Understands concepts like caching, Redis, performance optimization, CDNs
Vector databases solve this by:
- Converting text to high-dimensional vectors (embeddings) that capture meaning
- Finding similar content using vector similarity (cosine distance)
- Returning results ranked by semantic relevance
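To make this concrete, here’s a minimal sketch of semantic search with ChromaDB’s Python client (the collection name, documents, and storage path are illustrative, not the project’s actual data):

```python
import chromadb

# Persistent client keeps embeddings on disk between runs (path is illustrative)
client = chromadb.PersistentClient(path=".chroma")
collection = client.get_or_create_collection("docs")

# Add documents; ChromaDB embeds them with its default embedding function
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Redis can be used as a cache to speed up repeated reads.",
        "CDNs serve static assets from edge locations near the user.",
    ],
)

# A semantic query: no literal keyword overlap with the stored text is required
results = collection.query(
    query_texts=["How do I cache data for better performance?"],
    n_results=1,
)
print(results["documents"][0])  # -> the Redis caching chunk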
ChromaDB Setup on Upsun
We deploy ChromaDB as a separate application on Upsun. Here’s the configuration from our .upsun/config.yaml:
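The full file ships with the project; a simplified sketch of the ChromaDB application block might look like the following (the mount path and start command are assumptions based on the points below, so check Upsun’s docs for exact syntax):

```yaml
applications:
  chromadb:
    type: "python:3.12"        # ChromaDB requires Python 3.11+
    hooks:
      build: |
        pip install uv
        uv sync --frozen       # fail fast if dependencies drift from the lockfile
    mounts:
      ".chroma":               # persistent storage so embeddings survive deployments
        source: storage
        source_path: chroma
    web:
      commands:
        # Upsun injects $PORT at runtime
        start: "uv run chroma run --host 0.0.0.0 --port $PORT --path .chroma"
```

A few things to note about this setup: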
- Python 3.12: ChromaDB requires Python 3.11+
- UV package manager: Fast, modern Python dependency management
- Persistent storage: The mounts section ensures embeddings survive deployments
- Dynamic port: Upsun automatically assigns the $PORT variable
Document schema
Each document chunk in ChromaDB pairs its text and embedding with metadata such as the source repository, file path, content type, and last-updated date (see the sketch after this list). That metadata enables:
- Filtering: “Only search documentation, not blog posts”
- Attribution: “This content came from article X”
- Recency: “Prioritize recently updated content”
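As an illustration, a stored chunk might look like this (the field names are representative of the roles above, not the project’s exact schema):

```python
chunk = {
    "id": "platformsh-docs/redis.md#chunk-3",
    "document": "Redis can be configured as a cache with a maxmemory policy...",
    "metadata": {
        "source_repo": "platformsh-docs",  # filtering: docs vs. blog posts
        "path": "src/docs/redis.md",       # attribution: which article it came from
        "content_type": "documentation",
        "last_updated": "2025-01-15",      # recency: prioritize fresh content
    },
}
```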
Component 2: Document ingestion system
The ingestion system is responsible for keeping ChromaDB synchronized with the Git repositories that hold most of our technical content (documentation, the devcenter, blogs). It runs daily via an Upsun cron job and processes markdown files with intelligent chunking.
The ingestion flow
At a high level, each run pulls the latest changes from every repository, detects which markdown files changed, chunks them, creates embeddings, and upserts the results into ChromaDB.
SSH authentication for private repositories
One challenge was securely accessing private GitHub repositories. We solved this with SSH key authentication:
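A minimal sketch of the idea, assuming a deploy key stored on a mount (the key path and repository URL are hypothetical):

```python
import os
import subprocess

# Point Git at the deploy key via the standard GIT_SSH_COMMAND variable
env = os.environ.copy()
env["GIT_SSH_COMMAND"] = (
    "ssh -i /app/.ssh/deploy_key -o StrictHostKeyChecking=accept-new"
)

# Clone (or later pull) the private repository using that key
subprocess.run(
    ["git", "clone", "git@github.com:org/private-docs.git", "documents/private-docs"],
    env=env,
    check=True,
)
```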
Intelligent text chunking
Not all text chunks are created equal. We use a token-aware chunking strategy that balances context preservation with embedding model limits:
- Chunk size: 1500 tokens (~6000 characters)
  - Large enough to preserve context
  - Small enough for embedding model limits
  - Roughly 2-3 paragraphs of technical content
- Overlap: 200 tokens (~800 characters)
  - Ensures concepts split across chunks remain findable
  - Provides context continuity
  - Helps with concepts mentioned near chunk boundaries
- Token encoding: cl100k_base
  - Same tokenizer used by GPT-3.5 and GPT-4
  - Ensures accurate token counting
  - Prevents embedding model errors
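Here’s a minimal sketch of token-aware chunking with tiktoken; the project’s actual implementation may differ, but the sliding-window idea is the same:

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into token-based chunks with overlap between neighbours."""
    enc = tiktoken.get_encoding("cl100k_base")  # same tokenizer as GPT-3.5/GPT-4
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start : start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap  # step back 200 tokens for continuity
    return chunks
```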
Incremental updates
Efficiency matters. We only process files that have changed: a repo_state.json file tracks the last processed Git commit for each repository, so unchanged files are skipped.
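A sketch of that check, assuming repo_state.json maps repository names to the last processed commit hash (the exact file layout is an assumption):

```python
import json
import subprocess
from pathlib import Path

STATE_FILE = Path("repo_state.json")

def changed_files(repo_path: str, repo_name: str) -> list[str]:
    """Return markdown files changed since the last processed commit."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    head = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    last = state.get(repo_name)
    if last == head:
        return []  # nothing changed; skip this repo entirely
    if last is None:
        # first run: process every markdown file in the repo
        files = [str(p) for p in Path(repo_path).rglob("*.md")]
    else:
        diff = subprocess.run(
            ["git", "-C", repo_path, "diff", "--name-only", f"{last}..{head}"],
            capture_output=True, text=True, check=True,
        ).stdout
        files = [f for f in diff.splitlines() if f.endswith(".md")]
    state[repo_name] = head
    STATE_FILE.write_text(json.dumps(state))
    return files
```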
Component 3: RAG (Retrieval-Augmented Generation)
RAG is the secret sauce that makes our agent domain-aware. Instead of relying solely on the model’s training data, we dynamically retrieve relevant information from our vector database.
How RAG Works
Two-Stage RAG Implementation
Our RAG system uses a two-stage approach.
Stage 1: Keyword Generation. Instead of searching with the raw user query, we use OpenAI to generate optimal search keywords, for example:
- redis caching
- upsun cache configuration
- performance optimization
- cache eviction policies
Stage 2: Vector Retrieval. For each keyword, the system:
- Creates an embedding for the keyword using OpenAI’s text-embedding-3-small
- Computes cosine similarity against all document embeddings
- Returns the top N most similar chunks
- Ranks results by similarity score (distance)
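A condensed sketch of the two stages (the chat model choice and prompt wording are illustrative, not the project’s exact implementation):

```python
from openai import OpenAI

openai_client = OpenAI()

def generate_keywords(query: str) -> list[str]:
    """Stage 1: turn the raw user query into focused search keywords."""
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[
            {"role": "system", "content": "Return 3-5 short search keywords, one per line."},
            {"role": "user", "content": query},
        ],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def retrieve(collection, query: str, per_keyword: int = 2) -> list[str]:
    """Stage 2: embed each keyword and run a vector similarity search in ChromaDB."""
    chunks = []
    for keyword in generate_keywords(query):
        emb = openai_client.embeddings.create(
            model="text-embedding-3-small", input=keyword
        )
        result = collection.query(
            query_embeddings=[emb.data[0].embedding], n_results=per_keyword
        )
        chunks.extend(result["documents"][0])  # already ranked by similarity
    return chunks
```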
Formatting retrieved context
Raw chunks aren’t usable directly. We format them into structured markdown that the agent can understand:
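For example, a formatter along these lines (the input structure assumes chunks reshaped into dicts with their metadata; the exact layout is illustrative):

```python
def format_context(results: list[dict]) -> str:
    """Render retrieved chunks as markdown the agent can read and cite."""
    sections = []
    for r in results:
        meta = r["metadata"]
        # Headline each chunk with its source so the agent can attribute it
        sections.append(
            f"### Source: {meta['source_repo']} / {meta['path']}\n\n{r['document']}"
        )
    return "## Retrieved context\n\n" + "\n\n---\n\n".join(sections)
```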
RAG in action: An example
Let’s trace a real query through the system.
User Query: “Write an article about Redis caching on Upsun”
The system generates keywords, runs the similarity search, and retrieves the six most relevant chunks:
- Redis service configuration from platformsh-docs
- Caching strategies article from devcenter
- Redis performance tuning from blog post
- Cache eviction policies documentation
- Redis connection pooling example
- Real-world Redis implementation case study
Component 4: Static knowledge base
While RAG provides dynamic and relevant content, static knowledge ensures consistency and brand alignment. Our static knowledge lives in two directories:
/knowledge/ Directory
Contains foundational content injected into every agent interaction. The list of injected files can be customized per agent, since not every agent needs the same knowledge.
Here are some examples from our AI-Brain internal project:
- upsun_writing_guidelines.md: Tone, style, voice
- upsun_technical_context.md: Platform capabilities, terminology
- content_templates.md: Article structures, formats
- personas.md: Target audience personas
- icp.md: Ideal customer profile
- messaging_positioning.md: Value propositions, key messages
- talking_points.md: Common themes and narratives
/rules/ Directory
Contains operational rules and constraints:
- content-creator.md: Role definition and responsibilities
- format.md: Formatting requirements
- styleguide.md: Language conventions
- upsun.md: Additional rules and guidelines about the brand
Loading static knowledge
The agent loads this content once at initialization:
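A minimal sketch, assuming the files live under knowledge/ and rules/ as described above (the helper name and file selection are illustrative):

```python
from pathlib import Path

def load_static_knowledge(agent_files: list[str]) -> str:
    """Concatenate per-agent knowledge and rule files into one instruction string."""
    parts = []
    for name in agent_files:  # the file list can differ per agent
        for base in (Path("knowledge"), Path("rules")):
            path = base / name
            if path.exists():
                parts.append(f"# {name}\n\n{path.read_text()}")
    return "\n\n".join(parts)

STATIC_KNOWLEDGE = load_static_knowledge(
    ["upsun_writing_guidelines.md", "personas.md", "content-creator.md"]
)
```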
Component 5: Google ADK Agent
Now we bring everything together with Google’s Agent Development Kit (ADK), a framework for building production-ready AI agents.
Why did we choose Google ADK?
Google ADK stood out for several compelling reasons. First, it’s model-agnostic, meaning we can work with OpenAI, Anthropic, Google, or even local models without changing our core implementation. The framework excels at tool integration, making it trivial to add custom capabilities like our RAG research function. For complex workflows, ADK supports multi-agent orchestration, allowing multiple specialized agents to collaborate seamlessly. We also appreciated the deployment flexibility: the same agent code can be deployed via CLI, web UI, or API without modification. Finally, ADK handles conversation management automatically, taking care of state, history, and context tracking so we can focus on building great agent experiences rather than infrastructure.
Note that Google ADK limits Google’s web_search tool to Gemini models. Fortunately, many other tools, such as Tavily, can replace it and work with any model.
Agent structure
Here’s the core agent definition:
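The full definition is project-specific; a sketch using ADK’s LlmAgent with a LiteLLM-backed model might look like this (the agent name, model id, and tool wiring are assumptions):

```python
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm

def research_tool(query: str) -> str:
    """Let the agent fetch relevant documentation on demand (RAG)."""
    # `retrieve` and `collection` come from the RAG sketch earlier in this article
    return "\n\n".join(retrieve(collection, query))

root_agent = LlmAgent(
    name="content_creator",
    model=LiteLlm(model="anthropic/claude-sonnet-4-20250514"),  # swappable via LiteLLM
    instruction=STATIC_KNOWLEDGE,  # brand voice, templates, personas (loaded above)
    tools=[research_tool],         # the agent decides when to call it
)
```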
Based on our testing, Claude Sonnet consistently produces the best results, but your experience may vary greatly from one use case to another. Because we use LiteLLM as an abstraction, it is straightforward to use different models per agent, or even several models within a single agent so they can review and validate each other’s output.
The power of context building
Notice what we’ve achieved here. The agent has:
- Static Context (in instruction):
  - 50KB of brand guidelines, templates, personas
  - Loaded once, available for every query
  - Ensures consistency across all responses
- Dynamic Context (via research_tool):
  - Retrieves relevant docs on-demand
  - Only fetches what’s needed for each query
  - Provides fresh, accurate technical information
- Tool Access (via the tools parameter):
  - Can call research_tool when needed
  - Agent decides when to retrieve additional context
  - Autonomous research capability
Agent execution flow
When a user makes a request, here’s what happens: the request arrives with the static knowledge already in the agent’s instructions; the agent autonomously decides whether to call the research tool based on the user’s request and its available context; any retrieved chunks are formatted and added to the conversation; and the model generates the final response from both static and dynamic context.
Deploying on Upsun: Multi-application architecture
Here’s where Upsun’s multi-application support shines. We deploy our system as two independent applications that communicate internally.
Why multi-application?
- Separation of concerns: ChromaDB and the agent have different resource needs
- Independent scaling: Scale database separately from application
- Service isolation: ChromaDB failures don’t affect agent (and vice versa)
- Clear boundaries: Each service has its own configuration and storage
Upsun Configuration
Here’s our complete .upsun/config.yaml:
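The real file is longer; a trimmed sketch showing the two applications and how they relate (the relationship syntax, mount paths, and commands follow the points below but are assumptions, so verify against Upsun’s docs):

```yaml
applications:
  agent:
    type: "python:3.12"
    relationships:
      # app-to-app relationship; surfaced to the agent as CHROMA_HOST / CHROMA_PORT
      chromadb: "chromadb:http"
    mounts:
      "documents":              # cloned Git repositories
        source: storage
        source_path: documents
    hooks:
      build: |
        pip install uv
        uv sync --frozen        # fail if dependencies drift from the lockfile
    crons:
      ingest:
        spec: "0 3 * * *"       # daily at 3 AM UTC (low-traffic window)
        commands:
          start: "uv run --no-sync python ingest.py"
    web:
      commands:
        start: "uv run --no-sync adk web --port $PORT"

  chromadb:
    # see the ChromaDB application sketch earlier in this article
    type: "python:3.12"
```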
Key configuration insights
1. Application Relationships
- CHROMA_HOST: Internal hostname for ChromaDB
- CHROMA_PORT: Port number for ChromaDB
- No external network calls required
- Fast, secure inter-application communication
2. Persistent Mounts
- documents/: Cloned Git repositories
- .db/ and .chroma/: ChromaDB data and indexes
3. Daily Ingestion Cron
- Runs at 3 AM UTC (low-traffic window)
- Pulls latest changes from Git
- Updates ChromaDB incrementally
- Graceful shutdown with timeout
4. UV Dependency Flags
- --frozen flag: Fails if dependencies don’t match the lockfile
- --no-sync flag: Prevents write attempts at runtime
5. Sensitive Variables: API keys are injected as environment variables created with upsun variable:create.
Performance and cost

Ingestion performance
- Documents Processed: 243 files
- Total Chunks Created: ~1,200 chunks
- Processing Time: 8-15 minutes (depends on OpenAI API latency)
- Incremental Updates: Only changed files processed (typically 1-5 files/day)
- Storage Size: ~5-10 MB for embeddings
- Cost: ~$0.10-0.20 per full ingestion (OpenAI embedding costs)
RAG Query Performance
- Keyword Generation: ~500ms (OpenAI API)
- Vector Similarity Search: ~50-100ms per keyword (ChromaDB)
- Total Retrieval Time: ~1-2 seconds for 3 keywords
- Results Returned: 6 most relevant chunks (~3,000-5,000 tokens)
- Cost: ~$0.01 per query (keyword generation + retrieval)
Agent response performance
- Static Knowledge Loading: ~50ms (cached after first load)
- RAG Retrieval: ~1-2 seconds
- LLM Generation: ~5-15 seconds (depends on output length)
- Total Response Time: ~8-20 seconds for complete article
- Cost: ~$0.05-0.15 per article (depends on length)
Best practices and lessons learned
After building and deploying this system, here are our key takeaways.
Context building beats prompt engineering
The most important lesson we learned is to separate concerns rather than cramming everything into prompts. Static context like brand guidelines and standards should be injected once in agent instructions, while dynamic context from documentation and examples should be retrieved on-demand via RAG. This clear separation scales to unlimited documentation without hitting token limits, and makes the system far more maintainable than traditional prompt engineering approaches.
Chunking strategy matters
Through experimentation, we found that chunk size dramatically affects retrieval quality. Chunks that are too small (500 tokens) fragment concepts and produce poor results, while chunks that are too large (3000 tokens) lose precision and retrieve irrelevant sections. Our sweet spot of 1500 tokens balances context preservation with search precision. Equally critical is the 200-token overlap between chunks, which prevents concepts split across boundaries from being lost during retrieval.
Two-stage RAG improves quality
While direct query-to-embedding search works, adding keyword generation as a first stage significantly improves results. Using a language model to generate 3-5 targeted search keywords from the user’s request improves recall by finding content even with different phrasing, enables better semantic matching since keywords capture intent more precisely, and produces more focused results by leaning on precise technical terminology.
Rich metadata pays dividends
Investing in rich metadata during ingestion enables powerful filtering capabilities later. You can search only official documentation versus blog posts, prioritize recently updated content, filter by content type like tutorials or guides, or query by tags for specific topics. Don’t skimp on metadata during ingestion—it’s one of the highest-value investments you can make in your RAG system.
Incremental processing saves resources
Full re-ingestion of our 243 documents takes 15-20 minutes and costs $0.10-0.20 in API fees. By implementing incremental updates that only process changed files and track state with Git commit hashes, we typically process just 1-5 files per day, saving 95% of processing time and costs. This makes daily updates practical and economical.
Monitor and iterate
Not all embeddings perform equally well. Continuously monitor similarity scores to understand retrieval relevance, gather user feedback on whether the agent finds the right content, and track false negatives where queries return no useful results. Use this feedback loop to refine your chunking strategy, adjust metadata, and improve keyword generation prompts.
Static knowledge ensures consistency
Without static knowledge injected into the agent’s instructions, outputs vary unpredictably—writing style shifts between responses, brand guidelines get missed, target audience considerations are forgotten, and formatting becomes inconsistent. Static knowledge provides the “personality” and standards that make every output consistent and on-brand, regardless of what dynamic content gets retrieved.
The future of internal AI agents
Building AI-Brain taught us that successful AI agents aren’t about having the best model; they are about building the best context. By separating static knowledge (brand voice, guidelines, standards) from dynamic retrieval (technical documentation, code examples), we created an agent that writes like our team by following brand guidelines consistently, stays technically accurate by retrieving current documentation on demand, scales effortlessly to handle unlimited documentation without hitting token limits, and remains maintainable because knowledge files can be updated independently without touching agent code.
This architecture adapts to any domain. For customer support, combine static response guidelines with dynamic knowledge base retrieval. For code generation, pair static coding patterns with dynamic code search across your repositories. For data analysis, merge static methodologies with dynamic data retrieval. For documentation, unite static writing standards with dynamic content search across your docs.
The tools are remarkably accessible. Google ADK provides a free, open-source agent framework that handles orchestration. ChromaDB offers a free, open-source vector database for semantic search. OpenAI and Claude operate on pay-per-use models with no infrastructure requirements. Upsun delivers a managed platform that eliminates DevOps complexity entirely.
Your next steps:
- Identify a use case: What internal task could benefit from AI assistance?
- Gather static knowledge: What guidelines should the agent always follow?
- Identify dynamic sources: What documentation should be searchable?
- Start small: Build a prototype with 10-20 documents
- Iterate: Add more documents, refine prompts, improve retrieval
- Deploy: Use Upsun’s cloud application platform for production
Resources:
- Google ADK Documentation
- ChromaDB Documentation
- OpenAI Embeddings Guide
- Upsun Multi-App Guide
- UV Package Manager