RAG for Dummies — Query Phase | amarnathresearch.com

OK so you've done the prep work (Part 1). Your documents are chopped up, converted to numbers, and saved in a database. Now someone asks a question.

What happens next is what people call "semantic retrieval with augmented generation", which is a five-word way of saying "find the relevant bits, then have the AI read them and answer."

Let's trace a real question through every step.

1

The question

Someone Types a Question.

Nothing fancy. Someone opens the terminal and types a question in plain English. No special syntax, no query language, no SQL.

How many vacation days do I get?

🎯 Notice anything interesting?

The question uses the word "vacation," but the document (from Part 1) uses the word "leave." A plain keyword search for "vacation" would find no match in that text. RAG still finds it. Keep reading to see why.
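To see the keyword problem with your own eyes, here is a minimal sketch. The `keyword_search` helper is hypothetical (it is not part of the pipeline from Part 1), and the two chunks are just short snippets borrowed from the examples in this article.

```python
# Two toy chunks; the wording is borrowed from the examples in this article.
chunks = [
    "Our company offers 20 days of paid leave per year for full-time employees.",
    "Quarterly revenue reached $2.4M...",
]


def keyword_search(term: str, texts: list[str]) -> list[str]:
    """Hypothetical helper: keep only texts containing the exact term."""
    needle = term.lower()
    return [text for text in texts if needle in text.lower()]


vacation_hits = keyword_search("vacation", chunks)  # the word the user typed
leave_hits = keyword_search("leave", chunks)        # the word the document uses

print(len(vacation_hits), len(leave_hits))
```

The right chunk is sitting there the whole time, but exact matching only finds it if you already know the document's wording.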

2

Same trick, different text

Convert the Question to Numbers. Same As Before.

Remember Step 3 from the indexing phase? Where we converted each chunk into 384 numbers? We do the exact same thing to the question. Same model. Same 384 numbers.

Why the same model? Because the numbers need to be in the same space. If you used a different model, the numbers would mean different things — like comparing temperatures in Celsius to Fahrenheit without converting.
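The "same model for chunks and questions" rule can be sketched in code. Big assumption: `toy_embed` below is a made-up stand-in that just hashes words into 384 slots. It does not understand meaning the way the real embedding model does. It only shows the shape of the idea: one function, used at indexing time and again at query time, always producing lists of the same length.

```python
import hashlib
import math

DIMENSIONS = 384  # matches the list length used in this article


def toy_embed(text: str, dimensions: int = DIMENSIONS) -> list[float]:
    """Hypothetical stand-in for the real embedding model.

    Hashes each word into one of `dimensions` slots, then scales the
    list to length 1. It captures word overlap only, not meaning.
    """
    vector = [0.0] * dimensions
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:4], "big") % dimensions
        vector[slot] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector


# Indexing time and query time go through the SAME function.
chunk_vector = toy_embed("Our company offers 20 days of paid leave per year")
question_vector = toy_embed("How many vacation days do I get?")

print(len(chunk_vector), len(question_vector))
```

If the question went through a different function, even one that also returned 384 numbers, slot 17 in one list would have nothing to do with slot 17 in the other. That is the Celsius versus Fahrenheit problem.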

Your question → same model → numbers

"How many vacation days...?" → 🧠 MiniLM → [0.019, -0.203, 0.410, ... ×384]

🧠 What the model "understands" from your question

High weight: time-off, employee benefits, quantity, leave policy
Low weight: "I", "do", "get" (common filler words)

In effect, the model downplays the noise and encodes the intent. That's why "vacation days" and "paid leave" end up as very similar number lists: the intent is the same.

3

The matchmaking

Find the Closest Numbers. That's the Search.

This is what people mean when they say "cosine similarity search in a vector database."

In plain English: We compare the question's 384 numbers against every chunk's 384 numbers and see which ones are most similar. ChromaDB does this in about 5 milliseconds — even with thousands of chunks.

The result? A ranked list. The chunk with the most similar numbers wins.

Your query vs. every stored chunk — ranked by similarity

Similarity	Chunk	Note
0.92	"Our company offers 20 days of paid leave..."	p4
0.84	"Emergency leave can be requested..."	p4
0.71	"Sick leave policy covers up to 15 days..."	p6
0.58	"Employee benefits include health insurance..."	p2
0.12	"Quarterly revenue reached $2.4M..."	❌ irrelevant

✅ The keyword "vacation" appears NOWHERE in the top result

The top chunk says "paid leave," not "vacation." But their number lists point in almost the same direction (0.92 out of 1.0). That's the whole point of embeddings: meaning, not keywords. This is why people call it "semantic search": it searches by meaning.
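The scoring and ranking in this step can be sketched in a few lines of plain Python. This is a brute-force illustration, not how ChromaDB is implemented: vector databases typically use an index so they don't have to score every chunk. The 3-number vectors and the `rank_chunks` helper are made up for readability; real ones would have 384 numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero list has no direction to compare
    return dot / (norm_a * norm_b)


def rank_chunks(query: list[float], stored: dict[str, list[float]], top_k: int = 3):
    """Hypothetical helper: score every stored vector, best match first."""
    scored = [(cosine_similarity(query, vector), label) for label, vector in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up toy vectors standing in for stored chunk embeddings.
stored = {
    "paid leave": [0.9, 0.1, 0.0],
    "sick leave": [0.6, 0.6, 0.1],
    "quarterly revenue": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # made-up vector for the vacation question

ranking = rank_chunks(query, stored)
for score, label in ranking:
    print(f"{score:.2f}  {label}")
```

The revenue chunk lands at the bottom because its numbers point in a nearly perpendicular direction, which is exactly what a score close to 0 means.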

4

The assembly

Stuff the Chunks Into a Prompt. Give It to Llama.

This is what "augmented" means in RAG. We augment the AI's input with the retrieved chunks. We literally paste the chunks into the prompt and say "answer using ONLY this stuff."

People call this "grounding the LLM in retrieved context." You can call it "giving the AI a cheat sheet."

🔒 System instruction

You are a helpful assistant. Answer the question using ONLY the provided context. If the answer isn't in the context, say so.

📎 Context — the retrieved chunks (with source labels)

[Source 1: company_handbook.pdf, Page 4]
Our company offers 20 days of paid leave per year for full-time employees. Part-time employees receive 10 days. Unused leave can be carried over...

[Source 2: company_handbook.pdf, Page 4]
Emergency leave can be requested with manager approval. The company also provides 10 paid public holidays...

... + 3 more chunks

❓ Question

How many vacation days do I get?

📐 Instructions

Be precise. Cite which document and page the info comes from.

⚠️ Why "ONLY the context" matters

Without this instruction, Llama might make up leave policies from its training data. With it, the model is steered toward using only what's in the chunks, and if the answer isn't there, it should say so. That's how you cut down on hallucination. It lowers the risk; it doesn't make it zero.
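The prompt layout in this step is plain string assembly. Here is a minimal sketch; the `RetrievedChunk` class and `build_prompt` function are hypothetical names, and the exact section labels ("Context:", "Question:") are an assumption about formatting, not a record of the original code.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical container for one search result."""
    source_file: str
    page: int
    text: str


SYSTEM_INSTRUCTION = (
    "You are a helpful assistant. Answer the question using ONLY the "
    "provided context. If the answer isn't in the context, say so."
)
ANSWER_RULES = "Be precise. Cite which document and page the info comes from."


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Paste labeled chunks between the system instruction and the question."""
    context_blocks = [
        f"[Source {number}: {chunk.source_file}, Page {chunk.page}]\n{chunk.text}"
        for number, chunk in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(context_blocks)
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        f"Instructions: {ANSWER_RULES}"
    )


chunks = [
    RetrievedChunk("company_handbook.pdf", 4,
                   "Our company offers 20 days of paid leave per year for full-time employees."),
    RetrievedChunk("company_handbook.pdf", 4,
                   "Emergency leave can be requested with manager approval."),
]
prompt = build_prompt("How many vacation days do I get?", chunks)
print(prompt)
```

The source labels matter as much as the text: they are the only way the model can say "Source 1, page 4" in its answer.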

5

The payoff

Llama Reads the Cheat Sheet. Answers the Question.

Llama 3.2 (running locally on your machine via Ollama) reads the entire prompt: system instruction, chunks with source labels, question, and rules. Then it generates an answer, one token at a time, grounded in the context.

This is the "generation" part of Retrieval-Augmented Generation. The AI isn't remembering facts from training — it's reading your documents right now and synthesizing an answer.
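Handing the prompt to a local Llama can look roughly like this. It is a sketch under assumptions: it talks to Ollama's HTTP API on its default local port, and it assumes the model was pulled under the tag `llama3.2`. The function names are hypothetical, and your own setup may use a client library instead.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.2",
                           temperature: float = 0.1) -> urllib.request.Request:
    """Prepare (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,          # assumed model tag
        "prompt": prompt,
        "stream": False,         # ask for one complete JSON reply
        "options": {"temperature": temperature},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_llama(prompt: str) -> str:
    """Send the prompt and return the generated text. Needs Ollama running."""
    with urllib.request.urlopen(build_generate_request(prompt)) as reply:
        return json.loads(reply.read())["response"]


# Building the request works offline; only ask_llama() touches the network.
request = build_generate_request("Context: ...\n\nQuestion: How many vacation days do I get?")
sent_payload = json.loads(request.data)
print(sent_payload["model"], sent_payload["options"])
```

Nothing leaves your machine here: the request goes to localhost, which is why the whole thing can run with no API key.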

💬 Llama's answer

According to the company handbook (Source 1, page 4), full-time employees receive 20 days of paid leave per year. Part-time employees get 10 days.

You can carry over up to 5 unused days to the following year. Leave requests must be submitted at least 2 weeks in advance through the HR portal.

Additionally, the company provides 10 paid public holidays per year (Source 2, page 4), and emergency leave is available with manager approval.

🔍 Count the citations

Every single fact traces back to a source. "20 days" → Source 1, page 4. "10 public holidays" → Source 2, page 4. Nothing is made up. If you don't trust the answer, you can open the PDF, go to page 4, and check with your own eyes.

Metric	Value	What it means
Generation time	~8 seconds	On your i5 CPU — no GPU needed
Tokens generated	~95	About a paragraph of text
Temperature	0.1	Near-deterministic: same question, almost always the same answer
Cost	$0.00	Everything runs locally on your machine
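The temperature row is easier to feel with numbers. The sketch below shows the standard trick of dividing the model's raw scores by the temperature before turning them into probabilities. The three scores are made up, and real models pick from tens of thousands of candidate tokens, not three.

```python
import math


def softmax_with_temperature(scores: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in scores]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate next tokens

cold = softmax_with_temperature(scores, 0.1)  # the setting from the table
warm = softmax_with_temperature(scores, 1.0)  # a more "creative" setting

print([round(p, 4) for p in cold])
print([round(p, 4) for p in warm])
```

At 0.1 the top candidate takes almost all of the probability, so the model picks it nearly every time. At 1.0 the runners-up get a real chance, which is where the variety comes from.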

6

The receipts

Sources. Because "Trust Me Bro" Isn't a Citation.

The final output includes the answer and the exact sources. Filename, page number, even the character offset so you can find the exact paragraph. This is what makes RAG trustworthy — every answer has receipts.

============================================================
❓ Question: How many vacation days do I get?
============================================================

💬 Answer:
According to the company handbook (Source 1, page 4),
full-time employees receive 20 days of paid leave...

📎 Sources Retrieved:
  [1] company_handbook.pdf — page 4, char offset 0
      Preview: Our company offers 20 days of paid leave...
  [2] company_handbook.pdf — page 4, char offset 580
      Preview: Emergency leave can be requested with...
  [3] company_handbook.pdf — page 6, char offset 0
      Preview: Sick leave policy covers up to 15 days...

Field	What it tells you
source_file	Which file → open it and verify
page	Which page → jump straight there
char offset	Which paragraph on the page → pinpoint the exact text
preview	First 120 characters → quick sanity check without opening the file
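Producing the receipts is mostly bookkeeping. A minimal sketch, assuming each retrieved chunk carries the four fields from the table; the `Source` class and `format_sources` function are hypothetical names, and the layout only loosely mirrors the listing in this step.

```python
from dataclasses import dataclass

PREVIEW_LENGTH = 120  # first 120 characters, as in the table


@dataclass
class Source:
    """Hypothetical record holding the metadata stored with each chunk."""
    source_file: str
    page: int
    char_offset: int
    text: str


def format_sources(sources: list[Source]) -> str:
    """Render a numbered source list with a short preview per chunk."""
    lines = ["Sources Retrieved:"]
    for number, source in enumerate(sources, start=1):
        preview = source.text[:PREVIEW_LENGTH]
        lines.append(
            f"  [{number}] {source.source_file}, page {source.page}, "
            f"char offset {source.char_offset}"
        )
        lines.append(f"      Preview: {preview}")
    return "\n".join(lines)


sources = [
    Source("company_handbook.pdf", 4, 0, "Our company offers 20 days of paid leave per year."),
    Source("company_handbook.pdf", 4, 580, "x" * 300),  # long text to show truncation
]
report = format_sources(sources)
print(report)
```

The numbering here should match the [Source N] labels pasted into the prompt, so a citation in the answer and an entry in this list point at the same chunk.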

The Entire Query Phase. One Breath.

"Vacation days?" → Convert to numbers → Find closest matches → Give chunks to Llama → Cited answer

Question → Numbers → Search → Read → Answer. ~8 seconds. $0. Fully local.

The Jargon Glossary

Next time someone uses these at a party, you'll know what they actually mean.

They say	They mean
"Vector embedding"	A list of numbers representing the meaning of text
"384-dimensional space"	The list has 384 numbers in it
"Cosine similarity"	How closely two lists of numbers point in the same direction (1 = same, 0 = unrelated, -1 = opposite)
"Semantic search"	Search by meaning, not by exact word matching
"Vector database"	A database that stores and searches lists of numbers
"Retrieval-Augmented Generation"	Give the AI relevant documents before it answers
"Chunking with overlap"	Cutting text into pieces with safety margins
"Grounding"	Instructing the AI to only use the provided documents
"Hallucination"	When the AI confidently makes stuff up
"Context window"	How much text the AI can read at once
"Temperature"	Randomness dial. Low = predictable. High = creative.
"Token"	A word or piece of a word. Roughly 1.3 tokens per English word.