# RAG (Retrieval-Augmented Generation)
Have you run into situations like these?
* You have a product manual hundreds of pages long, and a user asks a very specific question. You flip through it for a long time but can't find where the answer is.
* Your company has accumulated years of internal documents, meeting minutes, and technical specifications. A new employee wants to understand a certain business rule but has no idea where to start looking.
* You have a bunch of e-books, research reports, and papers, and you want an AI to answer questions based on this content. But if you paste the entire book into the AI, it won't fit in the context window.
This is the problem RAG aims to solve.
RAG (Retrieval-Augmented Generation) is a technique that lets large language models (LLMs) answer questions based on a specific knowledge base.
The core idea of RAG is simple: first, find the most relevant content in your knowledge base, then send this content along with the user's question to the LLM, so that the LLM answers based on this context.
> Without RAG, an LLM only knows what's in its training data; with RAG, an LLM can use your private, up-to-date data.
* * *
## Why RAG is Needed
RAG is not the only solution to these problems, but it is currently one of the most practical and cost-effective solutions.
### LLM Knowledge Cutoff Problem
* All LLMs have a knowledge cutoff date: the training data only goes up to a certain point, and the model doesn't know about events after that.
* For example, if a model's knowledge cutoff is October 2023 (the date reported for some GPT-4 versions), it has no information about news from 2024 and can at best say it doesn't know.
* More importantly, your private data (internal documents, product manuals, company specifications) has never appeared in the LLM's training data. How could it possibly know?
### Private Data Cannot Be Directly Input to LLM
You might think: can't I just paste the document into the LLM?
The answer is: short documents are fine, long documents are not.
* First, an LLM's context window is limited. For example, the 16K-token version of GPT-3.5 holds roughly 12,000 English words. A manual hundreds of pages long simply won't fit.
* Second, even if the context window is large, stuffing the entire document into it doesn't work well. LLMs easily get lost in long text and fail to find the truly relevant information.
* Finally, there's the cost issue. GPT-4 with a 128K context was priced at $10 per million input tokens. If you stuff an entire book into every request, your wallet will cry.
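To make the cost point concrete, here is a small back-of-the-envelope sketch. It uses the $10 per million input tokens figure quoted above; the 2,000-token RAG prompt is a hypothetical size chosen for illustration, and real prices change over time.

```python
# Rough input-cost estimate for one request.
# Prices and token counts are illustrative assumptions, not current pricing.

def input_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Return the input-token cost of a single request in US dollars."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens

PRICE = 10.0  # USD per million input tokens, the figure quoted above

full_context = input_cost_usd(128_000, PRICE)  # fill the whole 128K window
rag_prompt = input_cost_usd(2_000, PRICE)      # hypothetical: question plus a few chunks

print(f"full context: ${full_context:.2f} per request")
print(f"RAG prompt:   ${rag_prompt:.2f} per request")
```

Under these assumptions, sending only the retrieved chunks is dozens of times cheaper per question than sending the whole book.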
### RAG vs Fine-tuning: How to Choose
Another way to make an LLM use new data is fine-tuning, which means continuing to train the model on that data.
But fine-tuning and RAG are fundamentally different and suit completely different scenarios:
| Comparison | RAG | Fine-tuning |
| --- | --- | --- |
| Data Freshness | Update anytime, immediately available | Requires retraining, long cycle |
| Applicable Data Volume | Very large (millions of documents) | Medium (thousands to tens of thousands of samples) |
| Source Citation | Can show which document the answer came from | Cannot trace the source |
| Hallucination Problem | Reduces hallucinations by providing context | May still produce hallucinations |
| Modifying Knowledge | Simply delete or update documents | Requires retraining, difficult to "forget" |
| Technical Threshold | Lower, can use existing frameworks | Higher, requires GPU and training experience |
| Cost | Mainly vector database and Embedding costs | High training cost, high inference cost |
> A simple rule of thumb: If you want the LLM to "know" certain factual knowledge (like product descriptions, company specifications), use RAG; if you want the LLM to "learn" a certain style or capability (like writing style, code standards), use fine-tuning.
* * *
## RAG Architecture Overview
RAG consists of two phases: an offline indexing phase and an online retrieval phase.
Let's look at a complete architecture diagram:

### Offline Indexing Phase (Data Preparation)
This phase runs in the background and is not directly visible to users.
Its task is to process your documents into a format that can be quickly retrieved.
The steps are:
1. **Document Loading**: Read documents in various formats like PDF, TXT, DOCX, and web pages.
2. **Document Chunking**: Split long documents into small text chunks, typically a few hundred to a thousand words.
3. **Vectorization**: Use an Embedding model to convert each text chunk into a vector (a string of numbers).
4. **Storage**: Store the vectors and original text together in a vector database.
This phase only needs to be done once, or rerun when documents are updated.
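The indexing steps can be sketched in plain Python. This is a toy illustration, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real Embedding model (it hashes words into a count vector, so it captures word overlap rather than meaning), a Python list stands in for the vector database, and loading is replaced by in-memory strings.

```python
import math
import zlib

DIM = 64  # toy embedding size; real models use hundreds or thousands of dimensions

def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Step 2: split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def toy_embed(text: str) -> list[float]:
    """Step 3 (stand-in): hash each word into a fixed-size count vector, then normalize."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_index(documents: dict[str, str], max_words: int = 50) -> list[dict]:
    """Steps 2-4: chunk every document, embed each chunk, keep vector and text together."""
    index = []
    for source, text in documents.items():
        for n, chunk in enumerate(chunk_text(text, max_words)):
            index.append({"id": f"{source}#{n}", "source": source,
                          "text": chunk, "vector": toy_embed(chunk)})
    return index

# Step 1 (loading) is replaced by in-memory strings to keep the sketch self-contained.
docs = {
    "manual.txt": "To reset the device hold the power button for ten seconds. " * 3,
    "policy.txt": "Refund requests must be filed within thirty days of purchase.",
}
index = build_index(docs, max_words=20)
print(len(index), "chunks indexed")
```

Storing the source name next to each vector is what later makes it possible to cite which document an answer came from.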
### Online Retrieval Phase (User Query)
This phase happens in real-time when users ask questions.
The steps are:
1. **Query Vectorization**: Use the same Embedding model to convert the user's question into a vector.
2. **Similarity Retrieval**: Find the text chunks closest to the query vector in the vector database.
3. **Build Prompt**: Assemble the retrieved text chunks as context along with the user's question into a Prompt.
4. **LLM Generation**: Send the Prompt to the LLM and let it answer based on the context.
5. **Return Answer**: Return the LLM's answer to the user, usually with citations to the source documents.
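The query-time steps can be wired together roughly as follows. This is a minimal sketch under stated assumptions: `fake_embed` and `fake_llm` are hypothetical stand-ins so the code runs without any external service, the index is a hand-built list, and the prompt wording is just one possible template.

```python
from typing import Callable

Vector = list[float]

def retrieve(query_vec: Vector, index: list[dict], top_k: int = 2) -> list[dict]:
    """Step 2: rank stored chunks by dot product with the query vector.
    For unit-length vectors the dot product equals cosine similarity."""
    def score(entry: dict) -> float:
        return sum(q * v for q, v in zip(query_vec, entry["vector"]))
    return sorted(index, key=score, reverse=True)[:top_k]

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Step 3: put the retrieved text ahead of the question as context."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return ("Answer using only the context below. "
            "If the answer is not there, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

def answer(question: str, index: list[dict],
           embed: Callable[[str], Vector], llm: Callable[[str], str],
           top_k: int = 2) -> dict:
    """Steps 1-5: embed the question, retrieve, build the prompt, generate, cite sources."""
    chunks = retrieve(embed(question), index, top_k)
    reply = llm(build_prompt(question, chunks))
    return {"answer": reply, "sources": [c["source"] for c in chunks]}

# Hypothetical stand-ins: a keyword-flag "embedding" and an echoing "LLM".
KEYWORDS = ["reset", "refund", "warranty"]

def fake_embed(text: str) -> Vector:
    lowered = text.lower()
    return [1.0 if k in lowered else 0.0 for k in KEYWORDS]

def fake_llm(prompt: str) -> str:
    return f"(model reply to a {len(prompt)}-character prompt)"

index = [
    {"source": "manual.txt", "text": "Hold the power button to reset.", "vector": fake_embed("reset")},
    {"source": "policy.txt", "text": "Refund within thirty days.", "vector": fake_embed("refund")},
    {"source": "terms.txt", "text": "Warranty lasts one year.", "vector": fake_embed("warranty")},
]
result = answer("How do I get a refund?", index, fake_embed, fake_llm, top_k=1)
print(result["sources"])
```

Passing `embed` and `llm` in as functions keeps the flow independent of any particular model provider; in a real system they would call an Embedding model and an LLM API.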
* * *
## Introduction to Vector Database
A vector database is one of the core components of RAG. To understand it, you must first understand what a vector is.
### What is a Vector (Embedding)
A vector (Embedding) is a string of numbers converted from text.
For example, the word "cat" might be converted into a vector of several hundred dimensions like [0.23, -0.45, 0.12, 0.89, ...].
The key point is: semantically similar texts will also be close to each other in vector space.
For example:
* The vectors for "cat" and "kitty" are very close to each other
* The vector for "cat" is closer to "dog" than to "car"
* The vectors for "I like cats" and "I love kitties" are very close to each other
This is the basis for "semantic search": matching by meaning rather than by keywords.
### Semantic Similarity Calculation Principle
The similarity between two vectors is usually calculated using "Cosine Similarity".
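The range of cosine similarity is -1 to 1:
* 1 means the two vectors point in the same direction (the texts are treated as equivalent in meaning)
* 0 means the vectors are orthogonal (unrelated)
* -1 means the vectors point in opposite directions
In practical applications, we usually only care about positive similarity: the closer to 1, the more relevant.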
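When you ask a question, the vector database computes the similarity between the query vector and the stored document vectors and returns the few most relevant chunks. For large collections it typically uses an approximate nearest-neighbor index instead of comparing against every vector.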
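Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements it and uses it for a brute-force top-k search. The 2-D vectors and the `chunk_*` names are hand-picked for illustration; real embeddings come from a model and have far more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|); the result lies in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def top_k(query: list[float], vectors: dict[str, list[float]],
          k: int = 2) -> list[tuple[str, float]]:
    """Brute-force search: score every stored vector, return the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction

# Hypothetical stored chunk vectors.
store = {"chunk_a": [0.9, 0.1], "chunk_b": [0.1, 0.9], "chunk_c": [-0.8, 0.2]}
print(top_k([1.0, 0.0], store, k=2))
```

Note that the first pair has different lengths but the same direction, so the similarity is still 1: cosine similarity compares direction, not magnitude.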
### Popular Vector Databases
There are many vector databases on the market, each with its own characteristics:
| Database | Type | Features | Applicable Scenarios |
| --- | --- | --- | --- |
| Chroma | Local/Open Source | Lightweight, easy to use, Python friendly | Prototype development, small-scale applications |
| Pinecone | Cloud Service/SaaS | Managed service, no maintenance needed, elastic scaling | Production environment, large-scale applications |
| Weaviate | Open Source/Managed | Rich features, supports GraphQL | Scenarios requiring advanced features |
| Qdrant | Open Source/Managed | High performance, written in Rust | Scenarios with high performance requirements |
| Milvus | Open Source/Managed | Full-featured, enterprise-grade | Large-scale enterprise applications |
| FAISS | Local Library | Developed by Facebook (Meta) AI Research, extremely fast | Scenarios that don't need a full database or built-in persistence |
For beginners, we recommend starting with Chroma: it's simple to install, requires no additional configuration, and is perfect for learning and prototype development.
* * *
## Document Processing Pipeline
Document processing is a part of RAG that is easily overlooked but actually very important.
Whether your retrieval works well largely depends on how well the documents are processed.
### Supported Document Formats
LangChain supports many document formats:
* Plain text: .txt, .md
* PDF: .pdf
* Office: .docx, .pptx, .xlsx
* Web: HTML, URL
* Code: .py, .js, .java, etc.
* JSON: .json, .jsonl
* CSV: .csv
Each format has a corresponding loader, making them easy to use.
### Document Chunking Strategies
Document chunking is the process of cutting long documents into small pieces.
This seems simple, but there are many considerations:
* Chunks too small: context may be lost, and a complete idea gets cut off in the middle