A Practical Introduction to RAG: Building Context-Aware LLM Apps with Vector DBs

Large language models (LLMs) are highly capable, but they have two major limitations: their knowledge stops at their training cutoff date, and they know nothing about your private data.

Retrieval-Augmented Generation (RAG) is a widely used pattern for addressing both limitations. Let us explore how it works and how vector databases make it possible.

1. What is RAG?

RAG is a three-step process that enhances an LLM prompt with external context before sending it to the model:

Retrieve: Find relevant documents or information from an external source (like a database or document collection) based on the user's query.
Augment: Combine the retrieved context with the user's original query to create a rich prompt.
Generate: Send the augmented prompt to the LLM so that it produces a response grounded in the retrieved context.
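
The three steps above can be sketched as a single function. This is a minimal illustration rather than a definitive implementation: the names Retriever, Generator, and answerWithRag are hypothetical, and the two stubs stand in for a real vector search and a real LLM call so the sketch runs without any external service.

```typescript
// Hypothetical function types: plug in a real vector search and a real LLM client.
type Retriever = (query: string, topK: number) => Promise<string[]>;
type Generator = (prompt: string) => Promise<string>;

// Runs the three RAG steps in order and returns the model's answer.
async function answerWithRag(
  question: string,
  retrieve: Retriever,
  generate: Generator,
  topK: number = 3,
): Promise<string> {
  // 1. Retrieve: fetch the documents most relevant to the question.
  const documents = await retrieve(question, topK);

  // 2. Augment: place the retrieved text ahead of the question.
  const prompt = `Context:\n${documents.join("\n")}\n\nQuestion: ${question}`;

  // 3. Generate: send the augmented prompt to the model.
  return generate(prompt);
}

// Stubs (assumptions for illustration only): a fixed document list and a fake model.
const stubRetrieve: Retriever = async (_query, topK) =>
  [
    "RAG stands for Retrieval-Augmented Generation.",
    "Vector databases store embeddings.",
  ].slice(0, topK);
const stubGenerate: Generator = async (prompt) =>
  `(model reply to a ${prompt.length}-character prompt)`;
```

Passing the retriever and the generator in as arguments keeps the orchestration independent of any particular vector database or model provider, so either side can be swapped without touching the other.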

2. The Role of Vector Embeddings

To retrieve the most relevant documents, simple keyword search is often not enough, because it misses text that expresses the same idea in different words. Instead, we use vector embeddings.

An embedding is a numerical representation of semantic meaning. Text is converted into a list of numbers (a vector) by an embedding model. If two pieces of text are semantically similar, their vectors will be close together in multi-dimensional space.

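// Example: converting text to a vector with the OpenAI Node.js SDK
import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable
const openai = new OpenAI();

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "What is Retrieval-Augmented Generation?",
});
const vector = response.data[0].embedding; // An array of 1536 numbers (the model's default size)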
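
To make "close together" concrete, here is a small sketch of cosine similarity, a common way to compare two embeddings. The three-dimensional vectors are made up for illustration; they are not output from a real embedding model, which would produce hundreds or thousands of dimensions.

```typescript
// Cosine similarity: the dot product of two vectors divided by the product of their lengths.
// The result lies in [-1, 1]; higher means the vectors point in more similar directions.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up vectors standing in for the embeddings of three pieces of text.
const cat = [0.9, 0.1, 0.0];
const kitten = [0.8, 0.2, 0.1];
const invoice = [0.0, 0.1, 0.9];

const similarPair = cosineSimilarity(cat, kitten); // ≈ 0.98
const unrelatedPair = cosineSimilarity(cat, invoice); // ≈ 0.01
```

The two vectors that point in nearly the same direction score close to 1, while the pair that points in different directions scores close to 0.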

3. Storing and Searching in Vector Databases

Vector databases (such as Pinecone, Milvus, or Postgres with the pgvector extension, which Supabase supports) are optimized for storing these high-dimensional vectors and for running fast similarity searches (for example, using cosine similarity) to find the stored vectors nearest to a user's query.

Here is the general workflow for query retrieval:

Embed the user's query using the same embedding model that was used to embed the stored documents.
Query the vector database to find the top K nearest vectors (e.g. K = 3).
Retrieve the matching original text documents.
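
The workflow above can be sketched with an in-memory array standing in for the vector database. This is a toy under stated assumptions: StoredDocument, SearchResult, normalize, and searchTopK are hypothetical names, the vectors are invented, and the query vector is assumed to have already been produced by the same embedding model as the documents.

```typescript
interface StoredDocument {
  text: string;
  vector: number[]; // assumed to come from the same embedding model as the query
}

interface SearchResult {
  text: string;
  score: number; // cosine similarity to the query
}

// Scales a vector to length 1 so that a plain dot product equals cosine similarity.
function normalize(v: number[]): number[] {
  const length = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map((x) => x / length);
}

// Brute-force search: score every document, sort by score, keep the best K.
function searchTopK(
  queryVector: number[],
  documents: StoredDocument[],
  k: number,
): SearchResult[] {
  const q = normalize(queryVector);
  return documents
    .map((doc) => {
      const d = normalize(doc.vector);
      const score = d.reduce((sum, x, i) => sum + x * q[i], 0);
      return { text: doc.text, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Made-up 3-dimensional vectors standing in for real embeddings.
const storedDocuments: StoredDocument[] = [
  { text: "RAG combines retrieval with generation.", vector: [0.9, 0.1, 0.1] },
  { text: "Our refund policy lasts 30 days.", vector: [0.1, 0.9, 0.2] },
  { text: "Vector databases index embeddings.", vector: [0.7, 0.2, 0.4] },
];
const queryVector = [0.8, 0.1, 0.2]; // pretend embedding of "How does RAG work?"

// Returns the RAG document first and the vector database document second.
const top2 = searchTopK(queryVector, storedDocuments, 2);
```

A linear scan like this is fine for a few thousand vectors. Vector databases typically replace it with an approximate nearest-neighbor index, so a query does not have to be compared against every stored vector.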

4. Assembling the Augmented Prompt

Once you have the text documents, assemble the prompt to send to the LLM:

Use the following context to answer the user question:

Context:
[Retrieved Document 1]
[Retrieved Document 2]

Question: [User's Original Question]

This steers the model toward answers grounded in the context you provide, which reduces hallucinations but does not eliminate them.
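
The template above can be assembled with a small helper. The function name buildAugmentedPrompt and the sample documents are illustrative assumptions; the wording of the prompt follows the template shown in this section.

```typescript
// Builds the augmented prompt from the retrieved documents and the user's question.
function buildAugmentedPrompt(documents: string[], question: string): string {
  return [
    "Use the following context to answer the user question:",
    "",
    "Context:",
    documents.join("\n"), // one retrieved document per line
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Sample inputs invented for illustration.
const augmentedPrompt = buildAugmentedPrompt(
  ["RAG retrieves documents before generating.", "Embeddings capture semantic meaning."],
  "What does RAG do before generating?",
);
```

The resulting string is what you send to the LLM as the prompt, in place of the user's raw question.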

Conclusion

RAG bridges the gap between static LLMs and dynamic, private business data. By combining embeddings with a vector database, you can build context-aware applications whose answers are grounded in your own documentation.