Retrieval Augmented Generation
RAG (Retrieval-Augmented Generation) is currently one of the most widely adopted architectures for building LLM applications.
The core idea of RAG is: **have the LLM first retrieve relevant content from an external knowledge base when answering a question, then generate the answer based on the retrieved results**, rather than relying solely on knowledge memorized during training.
This addresses two core pain points of LLMs: knowledge cutoff (the model doesn't know what happened after training) and hallucination (the model fabricates answers when uncertain).
* * *
## RAG Basic Principles
A complete RAG system consists of two pipelines: an **offline indexing pipeline** (preprocessing documents and storing them in a vector database) and an **online query pipeline** (receiving user questions, retrieving relevant chunks, and generating answers).
The offline phase splits raw documents into chunks, converts each chunk into a vector with an embedding model, and stores the vectors in a vector database.
The online phase converts the user's question into a vector with the same embedding model, finds the most similar document chunks in the database, concatenates them into a context, and passes that context to the LLM to generate the answer.
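The two pipelines can be sketched in a few lines of plain Python. This is a minimal illustration, not a production design: the bag-of-words `embed` function is only a stand-in for a real embedding model, an in-memory list stands in for the vector database, and all function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Offline pipeline: embed every chunk and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    """Online pipeline, step 1: rank stored chunks by similarity to the question."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Online pipeline, step 2: concatenate retrieved chunks into the LLM prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


# Hypothetical sample knowledge base, already split into chunks.
index = build_index([
    "The refund window is 30 days from the purchase date.",
    "Shipping to Europe usually takes five business days.",
    "Support is available by email on weekdays.",
])
question = "How many days is the refund window?"
top_chunks = retrieve(index, question, top_k=1)
prompt = build_prompt(question, top_chunks)
```

In a real system, `prompt` would then be sent to the LLM. The key design point is that the question and the documents go through the same embedding function, so their vectors are comparable.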
The following diagram shows the complete request flow of RAG:
* * *
## Data Preprocessing and Document Chunking
### Preliminary Challenge: Complex Document Parsing
Before chunking, RAG systems often face a **format parsing** challenge. In PDFs, Word documents, and scanned documents that contain tables, images, or multi-column layouts, plain text extraction can easily garble the content's meaning.
The common industry approach today is to use a **document parsing engine** (such as LlamaParse or Unstructured) or a multimodal large model to convert complex mixed text-and-image content into structured Markdown, which lays the foundation for high-quality chunking.
### Document Chunking Strategies
Document chunking is the foundation of RAG effectiveness, and chunk granularity directly affects retrieval quality: chunks that are too large introduce noise, while chunks that are too small lose context. Common strategies are as follows:
| Chunking Strategy | Applicable Scenarios | Advantages | Disadvantages |
| --- | --- | --- | --- |
| **Fixed-size Chunking** | General text | Simple to implement, fast | May cut sentences in the middle |
| **Recursive Character Chunking** | Structured text (Markdown, code) | Prefers to split at semantic boundaries such as paragraphs and sentences | Slightly more complex to implement; requires a sensible list of separators |
| **Semantic Chunking** | Long documents, books | Uses embeddings to compute the similarity between adjacent sentences and splits automatically at semantic turning points | High computational cost, slow preprocessing |
| **Parent-Child Document Retrieval (Small-to-Big)** | Comprehensive coverage scenarios | Uses "small chunks" for high-precision vector retrieval and, on a hit, returns the corresponding "large chunk" (parent document) to the LLM, balancing retrieval precision and context completeness | Higher database design and maintenance costs, since two levels of chunks must be stored and kept in sync |
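
The parent-child (small-to-big) strategy in the last row can be illustrated with a short sketch. It assumes a simple word-overlap (Jaccard) score in place of real vector similarity, and the document ids, texts, and function names are hypothetical. Small chunks (sentences) are matched against the question, but the full parent document is what gets returned.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, used here instead of a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap score standing in for vector similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Parent documents: the "large chunks" handed to the LLM.
parents = {
    "doc-refunds": "Refund policy. Refunds are accepted within 30 days. Items must be unused.",
    "doc-shipping": "Shipping policy. Orders ship within two days. Europe delivery takes five days.",
}

# Child chunks: one per sentence, each remembering the id of its parent.
children = [
    (parent_id, sentence.strip())
    for parent_id, text in parents.items()
    for sentence in text.split(".")
    if sentence.strip()
]


def retrieve_parent(question: str) -> str:
    """Match on the small chunks, then return the whole parent document."""
    q = tokens(question)
    best_parent_id, _ = max(children, key=lambda child: jaccard(q, tokens(child[1])))
    return parents[best_parent_id]


result = retrieve_parent("Within how many days are refunds accepted?")
```

Here the question matches a single sentence, yet `result` also carries the neighbouring sentences of the refund document, which is the context the LLM would otherwise miss.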
> In practice, **overlap** is often added during chunking: adjacent chunks share a short span of text so that important information is not cut off at a chunk boundary. A typical configuration is a chunk size of 512 tokens with an overlap of 50 to 100 tokens.
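
Fixed-size chunking with overlap can be sketched as a sliding window. This is an illustrative sketch only: the function name is hypothetical, and whitespace-separated words stand in for the tokens a real tokenizer would produce.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Tiny sizes so the overlap is easy to see; a typical setup would be 512 / 50.
words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
# chunks == [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

Each chunk begins with the last token of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk, provided it is shorter than the overlap.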
## Example: Using LangChain for Recursive Chunking
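```python
# Newer LangChain releases expose this class as
# `from langchain_text_splitters import RecursiveCharacterTextSplitter`.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # Maximum chunk length (measured in characters by default, not tokens)
    chunk_overlap=50,  # Overlap shared between adjacent chunks
)
```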