This picks up where tutorial 01 left off. If you skipped that one, go read the LangChain chatbot tutorial for project setup, security basics, and the base chat interface. We covered prompt injection defense, output filtering, rate limiting, and the streaming frontend. That stuff still applies here, and the chat UI is reused with minor additions (a debug panel for retrieved sources).
The difference: instead of stuffing all documentation into every prompt, we store it in a vector database and pull out only the bits we need. This is RAG (Retrieval-Augmented Generation), and it changes the economics of the whole thing.
## The problem with context stuffing
Tutorial 01’s chatbot sends the entire Upsun documentation with every request. That’s around 50,000 tokens. GPT-4o-mini charges $0.15 per million input tokens, so each question costs about $0.0075 before the model even starts thinking.
Worse, most of that context is irrelevant. Someone asks “how do I add PostgreSQL?” and we’re paying to send them pages about Ruby deployment, webhook configuration, and DNS setup.
RAG fixes this by retrieving only the chunks that matter:
| | Context stuffing | RAG |
|---|---|---|
| Tokens per request | ~50K | ~2K |
| Cost per request | $0.0075 | $0.0003 |
| What context? | Everything | Top 5 relevant chunks |
| Can you see what it used? | No | Yes, debug panel |
The math works out to about 25x cheaper per request. You pay upfront for embedding (once, at deploy time) and you need to run Qdrant, but for anything beyond a demo this pays for itself quickly.
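The arithmetic behind that claim can be sanity-checked in a few lines (these use the prices and token counts quoted above, not live API rates):

```typescript
// Cost model for the comparison table above. $0.15 per 1M input tokens
// is the GPT-4o-mini price quoted earlier; the token counts are the
// approximate figures from the table.
const PRICE_PER_MILLION_TOKENS = 0.15;

function costPerRequest(inputTokens: number): number {
  return (inputTokens / 1_000_000) * PRICE_PER_MILLION_TOKENS;
}

const stuffing = costPerRequest(50_000); // ≈ $0.0075
const rag = costPerRequest(2_000);       // ≈ $0.0003
const ratio = stuffing / rag;            // 25x
```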
## How it works
Two phases:

- **Deploy time:** we parse the documentation into chunks, generate embeddings with OpenAI, and store them in Qdrant. This happens once per deployment.
- **Query time:** we embed the user’s question, search Qdrant for similar chunks, and inject those into the prompt. The model only sees relevant context.
The debug panel shows exactly which chunks were retrieved and how similar they were to the query. When the bot gets something wrong, you can see whether it had the right context or not.
## Running Qdrant on Upsun
Qdrant isn’t a managed service on Upsun. You deploy it as a composable application alongside your Node.js app.
Create a qdrant/ directory with a config file:
```yaml
# qdrant/config.yaml
log_level: ERROR

storage:
  storage_path: ./storage
  snapshots_path: ./snapshots
  on_disk_payload: true
  wal:
    wal_capacity_mb: 32
  hnsw_index:
    m: 16
    ef_construct: 100
    full_scan_threshold_kb: 10000

service:
  host: 0.0.0.0
  http_port: 8888
  enable_cors: true
  telemetry_disabled: true
```
Port 8888 is what Upsun expects for HTTP traffic. The storage paths point to mounted directories so data survives redeployments.
Then update .upsun/config.yaml:
```yaml
applications:
  chatbot:
    type: "nodejs:24"
    # ... your existing config ...
    relationships:
      qdrant: "qdrant:http"

  qdrant:
    type: "composable:25.11"
    container_profile: HIGH_MEMORY
    stack:
      packages:
        - "qdrant"
    source:
      root: "qdrant"
    web:
      commands:
        start: "qdrant --config-path /app/config.yaml"
    mounts:
      "storage":
        source: "storage"
        source_path: "storage"
      "snapshots":
        source: "storage"
        source_path: "snapshots"
```
The `HIGH_MEMORY` profile gives Qdrant more RAM for vector operations. The relationship exposes the internal hostname and port through `PLATFORM_RELATIONSHIPS`.
## Chunking that doesn’t suck
Most RAG tutorials show naive chunking: split every 1000 characters, add some overlap, call it a day. This produces garbage. A chunk might start mid-paragraph about PostgreSQL and end mid-sentence about Redis. Good luck getting useful retrieval from that.
We chunk by markdown structure instead. Each chunk is a complete section under a heading, and it carries the heading hierarchy as a breadcrumb:
```
┌─────────────────────────────────────────────────────────────┐
│ Breadcrumb: "PostgreSQL > Configuration > Relationships"    │
│                                                             │
│ ## Relationships                                            │
│ To connect your app to PostgreSQL, define a relationship:   │
│ ```yaml                                                     │
│ applications:                                               │
│   myapp:                                                    │
│     relationships:                                          │
│       database: "db:postgresql"                             │
│ ```                                                         │
└─────────────────────────────────────────────────────────────┘
```
Code blocks stay intact. The breadcrumb gets embedded along with the content, so queries like “PostgreSQL relationships” match the hierarchy, not just random text.
The parsing logic tracks code fences to avoid treating # inside code as headings:
```typescript
function parseMarkdownSections(content: string): MarkdownSection[] {
  const lines = content.split("\n");
  const sections: MarkdownSection[] = [];
  let currentSection: MarkdownSection | null = null;
  let inCodeBlock = false;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];

    // Track code fences so "#" inside a code block isn't treated as a heading
    if (line.startsWith("```")) {
      inCodeBlock = !inCodeBlock;
    }

    const headingMatch = !inCodeBlock && line.match(/^(#{1,6})\s+(.+)$/);
    if (headingMatch) {
      if (currentSection) {
        sections.push(currentSection);
      }
      currentSection = {
        level: headingMatch[1].length,
        heading: headingMatch[2].trim(),
        content: "",
        startLine: i,
      };
    } else if (currentSection) {
      currentSection.content += line + "\n";
    }
  }

  // Don't drop the final section when the file ends
  if (currentSection) {
    sections.push(currentSection);
  }

  return sections;
}
```
When you hit an H2 under an H1, the breadcrumb becomes “H1 Title > H2 Title”. This context helps retrieval and makes debugging much easier.
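One way to turn that heading hierarchy into breadcrumbs (a sketch; the actual script may structure this differently) is to keep a stack of the most recent heading at each level, truncating it whenever a heading of equal or shallower depth appears:

```typescript
interface MarkdownSection {
  level: number;
  heading: string;
  content: string;
  startLine: number;
}

// Sketch: build one breadcrumb string per section by maintaining a stack
// of ancestor headings. An H2 pops any previous H2/H3 before pushing itself,
// so it ends up directly under the most recent H1.
function buildBreadcrumbs(sections: MarkdownSection[]): string[] {
  const stack: { level: number; heading: string }[] = [];
  return sections.map((section) => {
    while (stack.length > 0 && stack[stack.length - 1].level >= section.level) {
      stack.pop();
    }
    stack.push({ level: section.level, heading: section.heading });
    return stack.map((s) => s.heading).join(" > ");
  });
}
```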
## The embedding script
Install the dependencies:
```shell
pnpm add @qdrant/js-client-rest openai
```
Create scripts/embed-docs.ts. The interesting parts:
Reading connection info from Upsun’s relationship:
```typescript
function getQdrantConfig(): { host: string; port: number } {
  const relationships = process.env.PLATFORM_RELATIONSHIPS;
  if (relationships) {
    const decoded = JSON.parse(Buffer.from(relationships, "base64").toString("utf-8"));
    if (decoded.qdrant?.[0]) {
      return {
        host: decoded.qdrant[0].host,
        port: decoded.qdrant[0].port,
      };
    }
  }
  return {
    host: process.env.QDRANT_HOST || "localhost",
    port: Number.parseInt(process.env.QDRANT_PORT || "6333", 10),
  };
}
```
Generating embeddings in batches:
```typescript
async function embedBatch(openai: OpenAI, texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((d) => d.embedding);
}
```
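Before calling `embedBatch`, the chunks need to be split into fixed-size groups. A minimal batching helper might look like this (the batch size of 50 matches the `Batch 1/41 (50 chunks)` log lines shown later; the real script's constant may differ):

```typescript
// Sketch: split an array of chunks into batches of at most `size` items,
// so each OpenAI embeddings call stays well under the API's input limits.
const BATCH_SIZE = 50;

function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```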
Storing in Qdrant with metadata:
```typescript
const points = batch.map((chunk, idx) => ({
  id: totalUpserted + idx,
  vector: embeddings[idx],
  payload: {
    file_path: chunk.filePath,
    url: chunk.url,
    title: chunk.title,
    breadcrumb: chunk.breadcrumb,
    content: chunk.content,
  },
}));
await qdrant.upsert(COLLECTION_NAME, { wait: true, points });
```
The payload metadata comes back with search results. The breadcrumb shows up in the debug panel.
Run embedding in the deploy hook:
```yaml
hooks:
  deploy: |
    set -e
    pnpm embed
```
This runs after Qdrant is available but before the web server starts.
## Retrieval
Create src/qdrant.ts:
```typescript
export async function searchDocs(query: string): Promise<SearchResult[]> {
  const queryEmbedding = await embedQuery(query);
  const results = await qdrantClient.search(COLLECTION_NAME, {
    vector: queryEmbedding,
    limit: 5,
    score_threshold: 0.3,
    with_payload: true,
  });
  return results.map((r) => ({
    url: String(r.payload?.url),
    title: String(r.payload?.title),
    breadcrumb: String(r.payload?.breadcrumb),
    content: String(r.payload?.content),
    score: r.score,
  }));
}
```
`limit: 5` gets the top 5 matches. More isn’t better; you just dilute relevance and burn tokens. `score_threshold: 0.3` filters out garbage: below that, chunks are probably noise.
Update the chat to use retrieval:
```typescript
async *stream(message: string, history?: Array<...>) {
  const retrievedDocs = await searchDocs(message);

  // Send sources before the response starts
  const sources = retrievedDocs.map((doc) => ({
    url: doc.url,
    breadcrumb: doc.breadcrumb,
    score: doc.score,
    snippet: doc.content.slice(0, 200) + "...",
  }));
  yield { type: "sources", data: sources };

  const systemPrompt = buildSystemPrompt(retrievedDocs);
  // ... rest of chat logic
}
```
Sources are yielded first so the frontend can show them while the response streams in.
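`buildSystemPrompt` isn't shown above. A minimal sketch, assuming the `SearchResult` shape returned by `searchDocs` (the actual prompt wording is up to you), could be:

```typescript
interface SearchResult {
  url: string;
  title: string;
  breadcrumb: string;
  content: string;
  score: number;
}

// Sketch: inline each retrieved chunk under its breadcrumb so the model
// knows where in the docs every piece of context came from.
function buildSystemPrompt(docs: SearchResult[]): string {
  const context = docs
    .map((doc) => `### ${doc.breadcrumb}\n${doc.content}`)
    .join("\n\n");
  return [
    "You are a documentation assistant. Answer using ONLY the context below.",
    "If the context does not cover the question, say so instead of guessing.",
    "",
    context,
  ].join("\n");
}
```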
## The debug panel
The frontend handles a new SSE event type:
```typescript
if (line.startsWith("event: ")) {
  currentEventType = line.slice(7).trim();
}
if (currentEventType === "sources") {
  renderSources(wrapper, JSON.parse(data));
}
```
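`renderSources` builds the panel shown below. As a sketch of the formatting involved (the markup and function name here are illustrative, not the tutorial's exact implementation):

```typescript
interface SourceInfo {
  url: string;
  breadcrumb: string;
  score: number;
  snippet: string;
}

// Sketch: render one row per retrieved source, with the breadcrumb as a
// link, the similarity score as a percentage, and the content snippet.
function sourcesToHtml(sources: SourceInfo[]): string {
  const rows = sources
    .map(
      (s) =>
        `<li><a href="${s.url}">${s.breadcrumb}</a>` +
        ` <span class="score">${(s.score * 100).toFixed(1)}%</span>` +
        `<p>${s.snippet}</p></li>`,
    )
    .join("");
  return `<ul class="sources">${rows}</ul>`;
}
```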
It looks like this:
```
┌─────────────────────────────────────────────────────────────┐
│ Retrieved from Qdrant (5 sources)                           │
├─────────────────────────────────────────────────────────────┤
│ PostgreSQL > Configuration > Relationships           67.7%  │
│ To connect your app to PostgreSQL, define a...              │
├─────────────────────────────────────────────────────────────┤
│ PostgreSQL > Usage example                           65.2%  │
│ Use the steps and sample code below if your...              │
└─────────────────────────────────────────────────────────────┘
```
When the bot gives a wrong answer, check this panel. Either the right chunk wasn’t retrieved (retrieval problem) or it was but the model ignored it (prompting problem). Different fixes for each.
## Local testing
Start Qdrant:
```shell
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
```
Clone docs and embed:
```shell
git clone --depth 1 https://github.com/platformsh/platformsh-docs.git docs
pnpm embed
```
Takes a few minutes. You’ll see:
```
[embed] Found 291 markdown files
[embed] Created 2015 semantic chunks
[embed] Chunk sizes: min=200, avg=918, max=2000 chars
[embed] Batch 1/41 (50 chunks)...
...
[embed] Complete! 2015 semantic chunks indexed
```
Start the server, open http://localhost:3000, and try a question. Watch the debug panel.
## Deploying
The deploy hook runs embedding against the Qdrant app. Watch the logs: you’ll see embedding progress, then the chatbot starts.
## When retrieval goes wrong
Check the debug panel first.
If scores are low (below 0.5), the query doesn’t match well. You might need synonyms in your chunks, query expansion, or a lower threshold (but watch for noise creeping in).
If the wrong chunks come back, semantic similarity isn’t matching intent. Try smaller chunks, hybrid search (vectors plus keywords), or a re-ranker.
If the right chunks come back but the answer is still wrong, that’s a prompting problem, not retrieval. The model has the context and is misinterpreting it.
## Chunk size
Smaller chunks (500 chars) give more precise retrieval but less context per chunk. You need more of them to cover a topic.
Larger chunks (2000 chars) have more context but less precise matching. They might include irrelevant content that confuses the model.
Current settings (min 200, max 2000, target 1200) work for technical docs. Dense content might want smaller. Narrative content might want larger.
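Enforcing those bounds could look something like this (a sketch, not the tutorial's exact chunker: it merges undersized sections into the previous chunk and splits oversized ones on paragraph boundaries):

```typescript
// Sketch of min/max chunk-size enforcement over section texts.
// MIN_CHARS and MAX_CHARS mirror the min=200 / max=2000 settings above.
const MIN_CHARS = 200;
const MAX_CHARS = 2000;

function sizeChunks(sections: string[]): string[] {
  const chunks: string[] = [];
  for (const section of sections) {
    // Tiny sections get merged into the previous chunk instead of
    // becoming near-useless standalone embeddings.
    if (chunks.length > 0 && section.length < MIN_CHARS) {
      chunks[chunks.length - 1] += "\n\n" + section;
      continue;
    }
    if (section.length <= MAX_CHARS) {
      chunks.push(section);
      continue;
    }
    // Oversized sections split on paragraph boundaries, keeping each
    // piece under MAX_CHARS.
    let current = "";
    for (const para of section.split("\n\n")) {
      if (current && current.length + para.length + 2 > MAX_CHARS) {
        chunks.push(current);
        current = para;
      } else {
        current = current ? current + "\n\n" + para : para;
      }
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```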
## What’s next
For production you’d probably want hybrid search (vectors plus BM25 keywords), re-ranking with a cross-encoder, embedding caches for common queries, and incremental updates instead of re-embedding everything on each deploy.
You’d also want evaluation: a test set of questions with expected sources so you can measure retrieval quality and catch regressions.
Last modified on March 10, 2026