LangChain Document Loading and Chunking

In the previous articles, we typed the text in by hand. In real projects, however, documents come from PDFs, web pages, Markdown files, and other sources.

This section introduces how to use Document Loaders to load various document types, and how to use Text Splitters to split documents into smaller chunks suitable for retrieval.

Document Loader — Loading Documents

LangChain provides dozens of document loaders covering common file formats:

Loader	Source	Installation Package
TextLoader	.txt files	langchain-community
PyPDFLoader	PDF files	langchain-community + pypdf
WebBaseLoader	Web page URLs	langchain-community + beautifulsoup4
CSVLoader	CSV files	langchain-community
UnstructuredMarkdownLoader	Markdown files	langchain-community + unstructured
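
Whatever the source, a loader's job is the same: read the raw content and return a list of documents, each holding the text plus some metadata. The sketch below imitates that contract with the standard library only. `SimpleDocument` and `load_text_file` are hypothetical names made up for illustration, not LangChain classes, though the `page_content` and `metadata` fields mirror the attributes used in the examples in this section.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one file and wrap it in a single-element list, as a loader's load() would."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo: write a throwaway file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "knowledge.txt")
    Path(sample_path).write_text("Loaders return a list of documents.", encoding="utf-8")
    docs = load_text_file(sample_path)

print(f"Loaded {len(docs)} document(s)")
print(f"Content preview: {docs[0].page_content[:150]}")
```

Because the result is a list, the text of the first document is read with `docs[0].page_content`, not `docs.page_content`.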

Examples

# Load a text file
# pip install langchain-community
from langchain_community.document_loaders import TextLoader

loader = TextLoader("knowledge.txt", encoding="utf-8")
docs = loader.load()  # returns a list of Document objects

print(f"Loaded {len(docs)} document(s)")
print(f"Content preview: {docs[0].page_content[:150]}...")

# Load a web page
# pip install langchain-community beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("")  # pass the page URL here
docs = loader.load()

print(f"\nWeb page content: {docs[0].page_content[:150]}...")

Text Splitter — Document Chunking

Documents are often too long to retrieve effectively as a whole, so they are split into smaller chunks. The chunking strategy directly affects RAG performance.
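
To make the idea of recursive splitting concrete, here is a simplified sketch in plain Python. It tries the coarsest separator first and only falls back to finer ones for pieces that are still too long. The function `recursive_split` is a hypothetical illustration, not LangChain's implementation: it ignores overlap and drops the separator at each cut.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size characters, coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nFirst line of a long paragraph.\nSecond line of it.\n\nEnd."
chunks = recursive_split(sample, chunk_size=35, separators=["\n\n", "\n", " "])
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk!r}")
```

The middle paragraph is longer than 35 characters, so it is the only one that gets split again at the line break. The other paragraphs stay whole.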

Example

# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # at most 500 characters per chunk
    chunk_overlap=50,  # up to 50 characters shared between adjacent chunks
    # Try paragraph breaks first, then line breaks, then sentence
    # punctuation, and fall back to single characters last
    separators=["\n\n", "\n", "。", "！", "？", "；", "，", " ", ""],
)

# Example document
long_text = """（TUTORIAL）Runoob is a free programming learning platform.
The platform provides a wide range of programming language tutorials, including but not limited to:
- Python Tutorial: From Basic Syntax to Data Analysis
- Java Tutorial: From Object-Oriented Programming to the Spring Framework
- Front-End Tutorials: HTML, CSS, JavaScript, and Their Frameworks
All tutorials come with detailed code examples and an online execution environment.
Learners can quickly master programming skills through a learn-by-doing approach."""

# Split the document
chunks = text_splitter.split_text(long_text)

print(f"Original length: {len(long_text)} characters")
print(f"After splitting: {len(chunks)} chunks\n")

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(chunk)
    print()

Sample output. The character counts shown do not match the English sample text above, which is longer than 153 characters, so expect different numbers when you run the code:

Original length: 153 characters
After splitting: 3 chunks

--- Chunk 1 (54 chars) ---
（TUTORIAL）Runoob is a free programming learning platform.The platform provides a wide range of programming language tutorials, including but not limited to:

--- Chunk 2 (49 chars) ---
- Python Tutorial: From Basic Syntax to Data Analysis
- Java Tutorial: From Object-Oriented Programming to the Spring Framework

--- Chunk 3 (50 chars) ---
- Front-End Tutorials: HTML, CSS, JavaScript, and Their Frameworks
All tutorials come with detailed code examples and an online execution environment.

Note: chunk_overlap matters. Without overlap, a sentence that straddles a chunk boundary is cut in two and neither chunk carries its full meaning, so retrieval can miss key information. An overlap of 50–100 characters is common.
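
The effect of overlap is easiest to see with a plain character-level sliding window. This is a minimal sketch under that assumption: `sliding_chunks` is a hypothetical helper, and it is simpler than the way RecursiveCharacterTextSplitter applies overlap, which works on separator-delimited pieces.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    # Stop once the remaining text would add nothing beyond the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


windows = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for previous, current in zip(windows, windows[1:]):
    # The last 3 characters of one chunk reappear at the start of the next.
    print(f"{previous!r} -> {current!r} (shared: {previous[-3:]!r})")
```

With an overlap of 0 the windows would simply sit side by side, and any phrase crossing a boundary would appear in no single chunk.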

Chunking Parameter Guidelines

Scenario	chunk_size	chunk_overlap	Reason
FAQ Q&A	200~500	20~50	Q&A pairs are short; small chunks suffice
Technical documentation	500~1000	50~100	Technical content requires more context
Long articles / academic papers	1000~2000	100~200	Paragraph integrity must be preserved
Code repositories	500~1500	0~50	Functions/classes serve as natural boundaries
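
These parameters also determine how many chunks you will index. As a rough back-of-the-envelope estimate, assuming fixed-size windows, a text of length L yields about ceil((L - overlap) / (chunk_size - overlap)) chunks. The helper below is a hypothetical sketch of that arithmetic, and the 10,000-character document length is an assumed figure. A separator-aware splitter cuts at natural boundaries, so its real count is usually somewhat higher.

```python
def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the number of fixed-size chunks for a text of the given length."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    step = chunk_size - chunk_overlap
    # Integer ceiling of (text_length - chunk_overlap) / step
    return -(-(text_length - chunk_overlap) // step)


doc_length = 10_000  # assumed document length in characters
settings = [
    ("FAQ Q&A", 500, 50),
    ("Technical documentation", 1000, 100),
    ("Long articles / academic papers", 2000, 200),
]
for scenario, size, overlap in settings:
    count = estimate_chunk_count(doc_length, size, overlap)
    print(f"{scenario}: chunk_size={size}, chunk_overlap={overlap} -> about {count} chunks")
```

Smaller chunks mean more vectors to embed and store, which is part of the trade-off behind the ranges in the table.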

Full Workflow: Load → Split → Embed

Example

# pip install langchain-text-splitters langchain-openai langchain-chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# from langchain_community.document_loaders import TextLoader

# Step 1: Load
# loader = TextLoader("tutorial_knowledge.txt", encoding="utf-8")
# docs = loader.load()

# For demonstration, use sample text directly
docs = [
    "（TUTORIAL）Runoob is a free programming learning website.",
    "The site offers tutorials for many programming languages, including Python, Java, HTML, and more.",
    "The Python3 basic tutorial consists of 30 chapters, making it suitable for absolute beginners.",
    "The HTML basic tutorial consists of 25 chapters, covering forms, multimedia, and more.",
    "All of Runoob's basic tutorials are free.",
]

# Step 2: Split
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)
# create_documents() takes plain strings; for Document objects returned
# by a loader, use split_documents() instead
chunks = text_splitter.create_documents(docs)

# Step 3: Embed and store
# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./tutorial_db",
)
print(f"Index built: {len(chunks)} document chunks")

# Step 4: Retrieve
results = vector_store.similarity_search("How many chapters are in the Python tutorial?", k=2)
for doc in results:
    print(f"Retrieval result: {doc.page_content}")

Output:

Index built: 5 document chunks
Retrieval result: The Python3 basic tutorial consists of 30 chapters, making it suitable for absolute beginners.
Retrieval result: （TUTORIAL）Runoob is a free programming learning website.
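
To see what the embed and retrieve steps do conceptually without an API key, the sketch below swaps the neural embedding model for a toy bag-of-words vector and ranks chunks by cosine similarity. Everything here is an illustrative assumption: `embed`, `cosine`, and this `similarity_search` are hypothetical functions, the three sample chunks are made up, and word counts are a much cruder signal than real embeddings.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real model returns a dense vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "All basic tutorials are free.",
    "The HTML tutorial covers forms and multimedia.",
    "The Python tutorial has 30 chapters.",
]
top = similarity_search("How many chapters are in the Python tutorial?", toy_chunks, k=2)
for text in top:
    print(f"Retrieval result: {text}")
```

A vector store such as Chroma performs the same ranking step, but over model-generated vectors and with an index that avoids comparing the query against every chunk.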
