Who pretend to know everything by using jargon.
Now the part where someone actually asks a question.
OK so you've done the prep work (Part 1). Your documents are chopped up, converted to numbers, and saved in a database. Now someone asks a question.
What happens next is what people call "semantic retrieval with augmented generation", which is a five-word way of saying "find the relevant bits, then have the AI read them and answer."
Let's trace a real question through every step.
Nothing fancy. Someone opens the terminal and types a question in plain English. No special syntax, no query language, no SQL.
Remember Step 3 from the indexing phase? Where we converted each chunk into 384 numbers? We do the exact same thing to the question. Same model. Same 384 numbers.
Why the same model? Because the numbers need to be in the same space. If you used a different model, the numbers would mean different things — like comparing temperatures in Celsius to Fahrenheit without converting.
This is what people mean when they say "cosine similarity search in a vector database."
In plain English: We compare the question's 384 numbers against every chunk's 384 numbers and see which ones are most similar. ChromaDB does this in about 5 milliseconds — even with thousands of chunks.
The result? A ranked list. The chunk with the most similar numbers wins.
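Here is a minimal sketch of that comparison in plain Python. It uses made-up 3-number vectors instead of real 384-number ones, and the names `cosine_similarity` and `rank_chunks` are illustrative, not part of ChromaDB. A real vector database typically uses an index instead of looping over every chunk, so treat this as the idea, not the implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(question_vec, chunk_vecs, top_k=3):
    """Return (chunk_id, score) pairs, most similar first."""
    scored = [
        (chunk_id, cosine_similarity(question_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors (hypothetical): 3 numbers each instead of 384.
question = [1.0, 0.0, 1.0]
chunks = {
    "chunk_a": [1.0, 0.1, 0.9],    # points almost the same way as the question
    "chunk_b": [0.0, 1.0, 0.0],    # unrelated direction
    "chunk_c": [-1.0, 0.0, -1.0],  # opposite direction
}

ranking = rank_chunks(question, chunks)
for chunk_id, score in ranking:
    print(f"{chunk_id}: {score:.3f}")
```

The chunk that points in nearly the same direction as the question lands at the top of the list, and the one pointing the opposite way lands at the bottom.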
This is what "augmented" means in RAG. We augment the AI's input with the retrieved chunks. We literally paste the chunks into the prompt and say "answer using ONLY this stuff."
People call this "grounding the LLM in retrieved context." You can call it "giving the AI a cheat sheet."
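The "paste the chunks into the prompt" step can be sketched like this. The function name `build_prompt`, the exact wording of the instructions, and the sample chunks are all assumptions for illustration; the real prompt template may look different.

```python
def build_prompt(question, chunks):
    """Paste retrieved chunks, each with a source label, above the question."""
    context_blocks = []
    for i, chunk in enumerate(chunks, start=1):
        label = f"[Source {i}: {chunk['source_file']}, page {chunk['page']}]"
        context_blocks.append(f"{label}\n{chunk['text']}")
    context = "\n\n".join(context_blocks)
    return (
        "You are a helpful assistant. Answer using ONLY the context below.\n"
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, shaped like the output of the search step.
chunks = [
    {"source_file": "handbook.pdf", "page": 4,
     "text": "Employees accrue 1.5 vacation days per month."},
    {"source_file": "handbook.pdf", "page": 5,
     "text": "Unused days carry over for one year."},
]

prompt = build_prompt("How many vacation days do I get?", chunks)
print(prompt)
```

Keeping the source label right next to each chunk is what lets the model (and you) point back to where an answer came from.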
Llama 3.2 (running locally on your machine via Ollama) reads the entire prompt: system instruction, chunks with source labels, question, and rules. Then it generates an answer one token (roughly one word) at a time, grounded in the context.
This is the "generation" part of Retrieval-Augmented Generation. The AI isn't remembering facts from training — it's reading your documents right now and synthesizing an answer.
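One way to hand the prompt to a local model is Ollama's REST endpoint, sketched below with only the standard library. This is an assumption about the wiring, not necessarily how this project does it: it presumes Ollama is listening on its default port (11434) and that the model is pulled under the tag `llama3.2`. The helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="llama3.2", temperature=0.1):
    """Build the JSON body for one non-streaming generation call."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return the whole answer at once
        "options": {"temperature": temperature},
    }

def generate(prompt):
    """Send the prompt to Ollama and return the generated text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

# Building the payload needs no server; calling generate() does.
payload = build_request("Context: ...\n\nQuestion: ...\nAnswer:")
print(json.dumps(payload, indent=2))
```

Calling `generate(prompt)` only works while the Ollama server is running; building the payload works anywhere, which makes it easy to inspect what is actually sent.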
| Metric | Value | What it means |
|---|---|---|
| Generation time | ~8 seconds | On your i5 CPU — no GPU needed |
| Tokens generated | ~95 | About a paragraph of text |
| Temperature | 0.1 | Near-deterministic: the same question gives almost the same answer every time |
| Cost | $0.00 | Everything runs locally on your machine |
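To see why a temperature of 0.1 behaves almost deterministically, here is a small sketch of the usual math: the model's raw scores for each candidate token are divided by the temperature before being turned into probabilities. The scores below are invented, and `softmax_with_temperature` is an illustrative name, not an Ollama function.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw next-token scores into probabilities, scaled by temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

low = softmax_with_temperature(scores, 0.1)   # the setting from the table
high = softmax_with_temperature(scores, 1.0)  # a more "creative" setting

print("temperature 0.1:", [round(p, 4) for p in low])
print("temperature 1.0:", [round(p, 4) for p in high])
```

At 0.1 nearly all of the probability piles onto the top-scoring token, so the model picks it almost every time. At 1.0 the other candidates keep a real chance of being chosen.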
The final output includes the answer and the exact sources. Filename, page number, even the character offset so you can find the exact paragraph. This is what makes RAG trustworthy — every answer has receipts.
| Field | What it tells you |
|---|---|
| source_file | Which file → open it and verify |
| page | Which page → jump straight there |
| char offset | Which paragraph on the page → pinpoint the exact text |
| preview | First 120 characters → quick sanity check without opening the file |
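A source record like the one in the table could be assembled as follows. This is a sketch under assumptions: the key name `char_offset`, the helper names, and the sample values are hypothetical, and the actual metadata stored with each chunk may use different keys.

```python
def make_citation(chunk_text, metadata, preview_chars=120):
    """Build the source record that travels with an answer."""
    return {
        "source_file": metadata["source_file"],
        "page": metadata["page"],
        "char_offset": metadata["char_offset"],
        "preview": chunk_text[:preview_chars],  # first 120 characters
    }

def format_citation(citation):
    """Render one source record as a single readable line."""
    return (
        f"{citation['source_file']}, page {citation['page']}, "
        f"char offset {citation['char_offset']}: {citation['preview']}"
    )

# Hypothetical chunk and metadata, as they might come back from the search step.
chunk_text = "Employees accrue 1.5 vacation days per month. " * 5
metadata = {"source_file": "handbook.pdf", "page": 4, "char_offset": 1280}

citation = make_citation(chunk_text, metadata)
line = format_citation(citation)
print(line)
```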
Question → Numbers → Search → Read → Answer. ~8 seconds. $0. Fully local.
Next time someone uses these terms at a party, you'll know what they actually mean.
| They say | They mean |
|---|---|
| "Vector embedding" | A list of numbers representing the meaning of text |
| "384-dimensional space" | The list has 384 numbers in it |
| "Cosine similarity" | How similar two lists of numbers are (1 = same direction, 0 = unrelated, -1 = opposite) |
| "Semantic search" | Search by meaning, not by exact word matching |
| "Vector database" | A database that stores and searches lists of numbers |
| "Retrieval-Augmented Generation" | Give the AI relevant documents before it answers |
| "Chunking with overlap" | Cutting text into pieces with safety margins |
| "Grounding" | Telling the AI to answer only from the provided documents |
| "Hallucination" | When the AI confidently makes stuff up |
| "Context window" | How much text the AI can read at once |
| "Temperature" | Randomness dial. Low = predictable. High = creative. |
| "Token" | A word or piece of a word. ~1.3 tokens per word. |
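The last two rows combine into a handy back-of-the-envelope check: will a prompt fit in the context window? The sketch below uses the ~1.3 tokens-per-word rule of thumb from the table. Real tokenizers will give different counts, and the window sizes and function names here are placeholders, not the model's actual limits.

```python
def estimate_tokens(text):
    """Rough token count: about 1.3 tokens per word, rounded up."""
    words = len(text.split())
    return -(-words * 13 // 10)  # integer ceiling of words * 1.3

def fits_in_context(prompt, context_window, reserved_for_answer=95):
    """Check that the prompt plus room for the answer fits in the window."""
    return estimate_tokens(prompt) + reserved_for_answer <= context_window

sample = "one two three four five six seven eight nine ten"
print(estimate_tokens(sample))        # 10 words -> 13 estimated tokens
print(fits_in_context(sample, 2048))  # placeholder window size
print(fits_in_context(sample, 100))   # too small once the answer is reserved
```

The default of 95 reserved tokens mirrors the ~95 generated tokens mentioned earlier. Leave more room if you expect longer answers.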