This picks up where tutorial 01 left off. If you skipped that one, go read the LangChain chatbot tutorial for project setup, security basics, and the base chat interface. We covered prompt injection defense, output filtering, rate limiting, and the streaming frontend. That stuff still applies here, and the chat UI is reused with minor additions (a debug panel for retrieved sources).
The difference: instead of stuffing all documentation into every prompt, we store it in a vector database and pull out only the bits we need. This is RAG (Retrieval-Augmented Generation), and it changes the economics of the whole thing.
## The problem with context stuffing
Tutorial 01’s chatbot sends the entire Upsun documentation with every request. That’s around 50,000 tokens. GPT-4o-mini charges $0.15 per million input tokens, so each question costs about $0.0075 before the model even starts thinking.
Worse, most of that context is irrelevant. Someone asks “how do I add PostgreSQL?” and we’re paying to send them pages about Ruby deployment, webhook configuration, and DNS setup.
RAG fixes this by retrieving only the chunks that matter:
| | Context stuffing | RAG |
|---|---|---|
| Tokens per request | ~50K | ~2K |
| Cost per request | $0.0075 | $0.0003 |
| What context? | Everything | Top 5 relevant chunks |
| Can you see what it used? | No | Yes, debug panel |
The math works out to about 25x cheaper per request. You pay upfront for embedding (once, at deploy time) and you need to run Qdrant, but for anything beyond a demo this pays for itself quickly.
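The arithmetic behind that claim can be sanity-checked in a few lines (these use the prices and token counts quoted above, not live API rates):

```typescript
// Cost model for the comparison table above. $0.15 per 1M input tokens
// is the GPT-4o-mini price quoted earlier; the token counts are the
// approximate figures from the table.
const PRICE_PER_MILLION_TOKENS = 0.15;

function costPerRequest(inputTokens: number): number {
  return (inputTokens / 1_000_000) * PRICE_PER_MILLION_TOKENS;
}

const stuffing = costPerRequest(50_000); // ≈ $0.0075
const rag = costPerRequest(2_000);       // ≈ $0.0003
const ratio = stuffing / rag;            // 25x
```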
## How it works
Two phases:

- **Deploy time:** we parse the documentation into chunks, generate embeddings with OpenAI, and store them in Qdrant. This happens once per deployment.
- **Query time:** we embed the user’s question, search Qdrant for similar chunks, and inject those into the prompt. The model only sees relevant context.
The debug panel shows exactly which chunks were retrieved and how similar they were to the query. When the bot gets something wrong, you can see whether it had the right context or not.
## Running Qdrant on Upsun
Qdrant isn’t a managed service on Upsun. You deploy it as a composable application alongside your Node.js app.
Create a qdrant/ directory with a config file:
```yaml
# qdrant/config.yaml
log_level: ERROR

storage:
  storage_path: ./storage
  snapshots_path: ./snapshots
  on_disk_payload: true
  wal:
    wal_capacity_mb: 32
  hnsw_index:
    m: 16
    ef_construct: 100
    full_scan_threshold_kb: 10000

service:
  host: 0.0.0.0
  http_port: 8888
  enable_cors: true
  telemetry_disabled: true
```
Port 8888 is what Upsun expects for HTTP traffic. The storage paths point to mounted directories so data survives redeployments.
Then update .upsun/config.yaml:
```yaml
applications:
  chatbot:
    type: "nodejs:24"
    # ... your existing config ...
    relationships:
      qdrant: "qdrant:http"

  qdrant:
    type: "composable:25.11"
    container_profile: HIGH_MEMORY
    stack:
      packages:
        - "qdrant"
    source:
      root: "qdrant"
    web:
      commands:
        start: "qdrant --config-path /app/config.yaml"
    mounts:
      "storage":
        source: "storage"
        source_path: "storage"
      "snapshots":
        source: "storage"
        source_path: "snapshots"
```
The `HIGH_MEMORY` profile gives Qdrant more RAM for vector operations. The relationship exposes the internal hostname and port through `PLATFORM_RELATIONSHIPS`.
## Chunking that doesn’t suck
Most RAG tutorials show naive chunking: split every 1000 characters, add some overlap, call it a day. This produces garbage. A chunk might start mid-paragraph about PostgreSQL and end mid-sentence about Redis. Good luck getting useful retrieval from that.
We chunk by markdown structure instead. Each chunk is a complete section under a heading, and it carries the heading hierarchy as a breadcrumb:
```
┌─────────────────────────────────────────────────────────────┐
│ Breadcrumb: "PostgreSQL > Configuration > Relationships"    │
│                                                             │
│ ## Relationships                                            │
│ To connect your app to PostgreSQL, define a relationship:   │
│ ```yaml                                                     │
│ applications:                                               │
│   myapp:                                                    │
│     relationships:                                          │
│       database: "db:postgresql"                             │
│ ```                                                         │
└─────────────────────────────────────────────────────────────┘
```
Code blocks stay intact. The breadcrumb gets embedded along with the content, so queries like “PostgreSQL relationships” match the hierarchy, not just random text.
The parsing logic tracks code fences to avoid treating # inside code as headings:
```typescript
function parseMarkdownSections(content: string): MarkdownSection[] {
  const lines = content.split("\n");
  const sections: MarkdownSection[] = [];
  let currentSection: MarkdownSection | null = null;
  let inCodeBlock = false;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];

    // Track code fences so "#" inside a code block isn't treated as a heading
    if (line.startsWith("```")) {
      inCodeBlock = !inCodeBlock;
    }

    const headingMatch = !inCodeBlock && line.match(/^(#{1,6})\s+(.+)$/);
    if (headingMatch) {
      if (currentSection) {
        sections.push(currentSection);
      }
      currentSection = {
        level: headingMatch[1].length,
        heading: headingMatch[2].trim(),
        content: "",
        startLine: i,
      };
    } else if (currentSection) {
      currentSection.content += line + "\n";
    }
  }

  // Don't drop the final section when the file ends
  if (currentSection) {
    sections.push(currentSection);
  }

  return sections;
}
```
When you hit an H2 under an H1, the breadcrumb becomes “H1 Title > H2 Title”. This context helps retrieval and makes debugging much easier.
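One way to turn that heading hierarchy into breadcrumbs (a sketch; the actual script may structure this differently) is to keep a stack of the most recent heading at each level, truncating it whenever a heading of equal or shallower depth appears:

```typescript
interface MarkdownSection {
  level: number;
  heading: string;
  content: string;
  startLine: number;
}

// Sketch: build one breadcrumb string per section by maintaining a stack
// of ancestor headings. An H2 pops any previous H2/H3 before pushing itself,
// so it ends up directly under the most recent H1.
function buildBreadcrumbs(sections: MarkdownSection[]): string[] {
  const stack: { level: number; heading: string }[] = [];
  return sections.map((section) => {
    while (stack.length > 0 && stack[stack.length - 1].level >= section.level) {
      stack.pop();
    }
    stack.push({ level: section.level, heading: section.heading });
    return stack.map((s) => s.heading).join(" > ");
  });
}
```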
## The embedding script
Install the dependencies:
```shell
pnpm add @qdrant/js-client-rest openai
```
Create scripts/embed-docs.ts. The interesting parts:
Reading connection info from Upsun’s relationship:
```typescript
function getQdrantConfig(): { host: string; port: number } {
  const relationships = process.env.PLATFORM_RELATIONSHIPS;
  if (relationships) {
    const decoded = JSON.parse(Buffer.from(relationships, "base64").toString("utf-8"));
    if (decoded.qdrant?.[0]) {
      return {
        host: decoded.qdrant[0].host,
        port: decoded.qdrant[0].port,
      };
    }
  }
  return {
    host: process.env.QDRANT_HOST || "localhost",
    port: Number.parseInt(process.env.QDRANT_PORT || "6333", 10),
  };
}
```
Generating embeddings in batches:
```typescript
async function embedBatch(openai: OpenAI, texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((d) => d.embedding);
}
```
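Before calling `embedBatch`, the chunks need to be split into fixed-size groups. A minimal batching helper might look like this (the batch size of 50 matches the `Batch 1/41 (50 chunks)` log lines shown later; the real script's constant may differ):

```typescript
// Sketch: split an array of chunks into batches of at most `size` items,
// so each OpenAI embeddings call stays well under the API's input limits.
const BATCH_SIZE = 50;

function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```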
Storing in Qdrant with metadata:
```typescript
const points = batch.map((chunk, idx) => ({
  id: totalUpserted + idx,
  vector: embeddings[idx],
  payload: {
    file_path: chunk.filePath,
    url: chunk.url,
    title: chunk.title,
    breadcrumb: chunk.breadcrumb,
    content: chunk.content,
  },
}));
await qdrant.upsert(COLLECTION_NAME, { wait: true, points });
```
The payload metadata comes back with search results. The breadcrumb shows up in the debug panel.
Run embedding in the deploy hook:
```yaml
hooks:
  deploy: |
    set -e
    pnpm embed
```
This runs after Qdrant is available but before the web server starts.
## Retrieval
Create src/qdrant.ts:
```typescript
export async function searchDocs(query: string): Promise<SearchResult[]> {
  const queryEmbedding = await embedQuery(query);
  const results = await qdrantClient.search(COLLECTION_NAME, {
    vector: queryEmbedding,
    limit: 5,
    score_threshold: 0.3,
    with_payload: true,
  });
  return results.map((r) => ({
    url: String(r.payload?.url),
    title: String(r.payload?.title),
    breadcrumb: String(r.payload?.breadcrumb),
    content: String(r.payload?.content),
    score: r.score,
  }));
}
```
`limit: 5` gets the top 5 matches. More isn’t better; you just dilute relevance and burn tokens. `score_threshold: 0.3` filters out garbage: below that, chunks are probably noise.
Update the chat to use retrieval:
```typescript
async *stream(message: string, history?: Array<...>) {
  const retrievedDocs = await searchDocs(message);

  // Send sources before the response starts
  const sources = retrievedDocs.map((doc) => ({
    url: doc.url,
    breadcrumb: doc.breadcrumb,
    score: doc.score,
    snippet: doc.content.slice(0, 200) + "...",
  }));
  yield { type: "sources", data: sources };

  const systemPrompt = buildSystemPrompt(retrievedDocs);
  // ... rest of chat logic
}
```
Sources are yielded first so the frontend can show them while the response streams in.
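`buildSystemPrompt` isn't shown above. A minimal sketch, assuming the `SearchResult` shape returned by `searchDocs` (the actual prompt wording is up to you), could be:

```typescript
interface SearchResult {
  url: string;
  title: string;
  breadcrumb: string;
  content: string;
  score: number;
}

// Sketch: inline each retrieved chunk under its breadcrumb so the model
// knows where in the docs every piece of context came from.
function buildSystemPrompt(docs: SearchResult[]): string {
  const context = docs
    .map((doc) => `### ${doc.breadcrumb}\n${doc.content}`)
    .join("\n\n");
  return [
    "You are a documentation assistant. Answer using ONLY the context below.",
    "If the context does not cover the question, say so instead of guessing.",
    "",
    context,
  ].join("\n");
}
```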
## The debug panel
The frontend handles a new SSE event type:
```typescript
if (line.startsWith("event: ")) {
  currentEventType = line.slice(7).trim();
}
if (currentEventType === "sources") {
  renderSources(wrapper, JSON.parse(data));
}
```
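`renderSources` builds the panel shown below. As a sketch of the formatting involved (the markup and function name here are illustrative, not the tutorial's exact implementation):

```typescript
interface SourceInfo {
  url: string;
  breadcrumb: string;
  score: number;
  snippet: string;
}

// Sketch: render one row per retrieved source, with the breadcrumb as a
// link, the similarity score as a percentage, and the content snippet.
function sourcesToHtml(sources: SourceInfo[]): string {
  const rows = sources
    .map(
      (s) =>
        `<li><a href="${s.url}">${s.breadcrumb}</a>` +
        ` <span class="score">${(s.score * 100).toFixed(1)}%</span>` +
        `<p>${s.snippet}</p></li>`,
    )
    .join("");
  return `<ul class="sources">${rows}</ul>`;
}
```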
It looks like this:
```
┌─────────────────────────────────────────────────────────────┐
│ Retrieved from Qdrant (5 sources)                           │
├─────────────────────────────────────────────────────────────┤
│ PostgreSQL > Configuration > Relationships           67.7%  │
│ To connect your app to PostgreSQL, define a...              │
├─────────────────────────────────────────────────────────────┤
│ PostgreSQL > Usage example                           65.2%  │
│ Use the steps and sample code below if your...              │
└─────────────────────────────────────────────────────────────┘
```
When the bot gives a wrong answer, check this panel. Either the right chunk wasn’t retrieved (retrieval problem) or it was but the model ignored it (prompting problem). Different fixes for each.
## Local testing
Start Qdrant:
```shell
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
```
Clone docs and embed:
```shell
git clone --depth 1 https://github.com/platformsh/platformsh-docs.git docs
pnpm embed
```
Takes a few minutes. You’ll see:
```
[embed] Found 291 markdown files
[embed] Created 2015 semantic chunks
[embed] Chunk sizes: min=200, avg=918, max=2000 chars
[embed] Batch 1/41 (50 chunks)...
...
[embed] Complete! 2015 semantic chunks indexed
```
Start the server, open http://localhost:3000, and try a question. Watch the debug panel.
## Deploying
The deploy hook runs embedding against the Qdrant app. Watch the logs: you’ll see embedding progress, then the chatbot starts.
## When retrieval goes wrong
Check the debug panel first.
If scores are low (below 0.5), the query doesn’t match well. You might need synonyms in your chunks, query expansion, or a lower threshold (but watch for noise creeping in).
If the wrong chunks come back, semantic similarity isn’t matching intent. Try smaller chunks, hybrid search (vectors plus keywords), or a re-ranker.
If the right chunks come back but the answer is still wrong, that’s a prompting problem, not retrieval. The model has the context and is misinterpreting it.
## Chunk size
Smaller chunks (500 chars) give more precise retrieval but less context per chunk. You need more of them to cover a topic.
Larger chunks (2000 chars) have more context but less precise matching. They might include irrelevant content that confuses the model.
Current settings (min 200, max 2000, target 1200) work for technical docs. Dense content might want smaller. Narrative content might want larger.
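Enforcing those bounds could look something like this (a sketch, not the tutorial's exact chunker: it merges undersized sections into the previous chunk and splits oversized ones on paragraph boundaries):

```typescript
// Sketch of min/max chunk-size enforcement over section texts.
// MIN_CHARS and MAX_CHARS mirror the min=200 / max=2000 settings above.
const MIN_CHARS = 200;
const MAX_CHARS = 2000;

function sizeChunks(sections: string[]): string[] {
  const chunks: string[] = [];
  for (const section of sections) {
    // Tiny sections get merged into the previous chunk instead of
    // becoming near-useless standalone embeddings.
    if (chunks.length > 0 && section.length < MIN_CHARS) {
      chunks[chunks.length - 1] += "\n\n" + section;
      continue;
    }
    if (section.length <= MAX_CHARS) {
      chunks.push(section);
      continue;
    }
    // Oversized sections split on paragraph boundaries, keeping each
    // piece under MAX_CHARS.
    let current = "";
    for (const para of section.split("\n\n")) {
      if (current && current.length + para.length + 2 > MAX_CHARS) {
        chunks.push(current);
        current = para;
      } else {
        current = current ? current + "\n\n" + para : para;
      }
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```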
## What’s next
For production you’d probably want hybrid search (vectors plus BM25 keywords), re-ranking with a cross-encoder, embedding caches for common queries, and incremental updates instead of re-embedding everything on each deploy.
You’d also want evaluation: a test set of questions with expected sources so you can measure retrieval quality and catch regressions.
Last modified on March 10, 2026