LangChain Document Loading and Chunking

In previous articles, we manually entered text. However, in real projects, documents may come from PDFs, web pages, Markdown files, and more.


This section introduces how to use Document Loaders to load various document types, and how to use Text Splitters to split documents into smaller chunks suitable for retrieval.


Document Loader: Loading Documents

LangChain provides dozens of document loaders covering common file formats:

| Loader | Source | Installation Package |
| --- | --- | --- |
| TextLoader | .txt files | langchain-community (no extra dependencies) |
| PyPDFLoader | PDF files | langchain-community + pypdf |
| WebBaseLoader | Web page URLs | langchain-community + beautifulsoup4 |
| CSVLoader | CSV files | langchain-community |
| UnstructuredMarkdownLoader | Markdown files | langchain-community + unstructured |
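The table above amounts to a lookup from source type to loader. As a minimal sketch, assuming a source is identified only by its URL scheme or file extension, the hypothetical helper `pick_loader_name` returns the class name listed in the table. It does not import or instantiate any LangChain class.

```python
from pathlib import Path

# Extension -> loader class name, copied from the table above.
LOADER_BY_EXTENSION = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".md": "UnstructuredMarkdownLoader",
}


def pick_loader_name(source: str) -> str:
    """Return the name of the loader the table suggests for this source."""
    if source.startswith(("http://", "https://")):
        return "WebBaseLoader"
    suffix = Path(source).suffix.lower()
    if suffix not in LOADER_BY_EXTENSION:
        raise ValueError(f"No loader listed for {suffix!r} files")
    return LOADER_BY_EXTENSION[suffix]


for source in ["knowledge.txt", "report.PDF", "https://example.com/docs"]:
    print(source, "->", pick_loader_name(source))
```

Keeping the mapping in one dictionary makes it easy to see which optional packages a project needs before any loader is imported.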

Examples

    # Load a text file (no extra parser package required)
    from langchain_community.document_loaders import TextLoader

    loader = TextLoader("knowledge.txt", encoding="utf-8")
    docs = loader.load()  # returns a list of Document objects

    print(f"Loaded {len(docs)} document(s)")
    print(f"Content preview: {docs[0].page_content[:150]}...")

    # Load a web page
    # pip install langchain-community beautifulsoup4
    from langchain_community.document_loaders import WebBaseLoader

    loader = WebBaseLoader("")  # the URL is blank in the source; pass the page URL here
    docs = loader.load()

    print(f"\nWeb page content: {docs[0].page_content[:150]}...")
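
Each loader returns a list of document objects, which is why the examples index into `docs` before reading `page_content`. The sketch below mimics that shape with a plain dataclass. `SimpleDocument` and `load_text_file` are hypothetical stand-ins, not LangChain classes, and the `source` metadata key is an assumption modeled on what `TextLoader` records.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Stand-in for a loaded document: the text plus where it came from."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one file and wrap it in a single-element list, as a loader would."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "knowledge.txt")
    Path(sample_path).write_text(
        "Runoob is a free programming learning platform.", encoding="utf-8"
    )
    docs = load_text_file(sample_path)

print(f"Loaded {len(docs)} document(s)")
print(f"Content preview: {docs[0].page_content[:20]}...")
```

Carrying the source path in `metadata` lets later stages report where a retrieved chunk came from.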

Text Splitter: Document Chunking

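Documents are often too long to index and retrieve as a whole, so they are split into smaller pieces called chunks. The chunking strategy directly affects RAG performance: chunks that are too large dilute a match with unrelated text, while chunks that are too small lose the context needed to answer a question.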


Example

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Create a text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,    # at most 500 characters per chunk
        chunk_overlap=50,  # up to 50 characters shared between adjacent chunks
        # Tried in order: paragraphs, then lines and sentence punctuation, finally characters
        separators=["\n\n", "\n", "。", "!", "?", "；", ",", " ", ""],
    )

    # Example document
    long_text = """(TUTORIAL) Runoob is a free programming learning platform.
    The platform provides a wide range of programming language tutorials, including but not limited to:
    - Python Tutorial: From Basic Syntax to Data Analysis
    - Java Tutorial: From Object-Oriented Programming to the Spring Framework
    - Front-End Tutorials: HTML, CSS, JavaScript, and Their Frameworks
    All tutorials come with detailed code examples and an online execution environment.
    Learners can quickly master programming skills through a learn-by-doing approach."""

    # Split the document
    chunks = text_splitter.split_text(long_text)

    print(f"Original length: {len(long_text)} characters")
    print(f"After splitting: {len(chunks)} chunks\n")

    for i, chunk in enumerate(chunks):
        print(f"--- Chunk {i+1} ({len(chunk)} chars) ---")
        print(chunk)
        print()

Output as shown in the source (its character counts do not match the English sample text above, so a fresh run will report different numbers):

    Original length: 153 characters
    After splitting: 3 chunks

    --- Chunk 1 (54 chars) ---
    (TUTORIAL) Runoob is a free programming learning platform.
    The platform provides a wide range of programming language tutorials, including but not limited to:

    --- Chunk 2 (49 chars) ---
    - Python Tutorial: From Basic Syntax to Data Analysis
    - Java Tutorial: From Object-Oriented Programming to the Spring Framework

    --- Chunk 3 (50 chars) ---
    - Front-End Tutorials: HTML, CSS, JavaScript, and Their Frameworks
    All tutorials come with detailed code examples and an online execution environment.
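
The separator list is tried from coarsest to finest: a piece that still exceeds the limit is split again with the next separator. The following is a simplified pure-Python sketch of that idea, not the LangChain implementation. The hypothetical `recursive_split` ignores `chunk_overlap` and drops the separator wherever it starts a new chunk.

```python
def hard_cut(text: str, chunk_size: int) -> list[str]:
    """Last resort: cut every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator present; recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for index, sep in enumerate(separators):
        if sep == "" or sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= chunk_size:
                current = candidate  # piece still fits in the chunk being built
                continue
            if current:
                chunks.append(current)
            if len(piece) <= chunk_size:
                current = piece  # start a new chunk with this piece
            else:
                # Piece alone is too long: retry with the finer separators.
                chunks.extend(recursive_split(piece, chunk_size, separators[index + 1:]))
                current = ""
        if current:
            chunks.append(current)
        return chunks
    return hard_cut(text, chunk_size)


sample = "Para one is short.\n\nPara two is a bit longer than the first one.\n\nThree."
chunks = recursive_split(sample, chunk_size=40, separators=["\n\n", "\n", " ", ""])
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Only the second paragraph exceeds the 40-character limit, so only that paragraph falls through to word-level splitting while the others stay intact.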

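Note: chunk_overlap matters. Without overlap, a sentence that straddles a chunk boundary is cut in two, so neither chunk carries the complete statement and retrieval can miss it. With overlap, the text around each boundary appears in both neighboring chunks. An overlap of 50–100 characters is common.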
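
The effect of overlap can be seen with a fixed-size sliding window. This is a minimal sketch using a hypothetical `sliding_chunks` helper that cuts purely by character position; real splitters also respect separators, so the exact boundaries will differ.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "The refund window is 30 days. Items must be unused."

without_overlap = sliding_chunks(text, chunk_size=25, chunk_overlap=0)
with_overlap = sliding_chunks(text, chunk_size=25, chunk_overlap=10)

# The key phrase straddles the first boundary when there is no overlap.
print(any("30 days" in chunk for chunk in without_overlap))
print(any("30 days" in chunk for chunk in with_overlap))
```

The trade-off is storage: every overlapping character is embedded twice, which is one reason overlap is usually kept to a small fraction of the chunk size.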


Chunking Parameter Guidelines

| Scenario | chunk_size (characters) | chunk_overlap (characters) | Reason |
| --- | --- | --- | --- |
| FAQ Q&A | 200–500 | 20–50 | Q&A pairs are short; small chunks suffice |
| Technical documentation | 500–1000 | 50–100 | Technical content requires more context |
| Long articles / academic papers | 1000–2000 | 100–200 | Paragraph integrity must be preserved |
| Code repositories | 500–1500 | 0–50 | Functions/classes serve as natural boundaries |
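
These ranges can be kept next to the splitter configuration as a quick sanity check. The sketch below encodes the table as data; the scenario keys and the `within_guideline` helper are hypothetical names chosen for illustration.

```python
# (min, max) ranges in characters, copied from the guideline table above.
GUIDELINES = {
    "faq": {"chunk_size": (200, 500), "chunk_overlap": (20, 50)},
    "technical_docs": {"chunk_size": (500, 1000), "chunk_overlap": (50, 100)},
    "long_articles": {"chunk_size": (1000, 2000), "chunk_overlap": (100, 200)},
    "code": {"chunk_size": (500, 1500), "chunk_overlap": (0, 50)},
}


def within_guideline(scenario: str, chunk_size: int, chunk_overlap: int) -> bool:
    """Check whether both parameters fall inside the suggested ranges."""
    ranges = GUIDELINES[scenario]
    size_low, size_high = ranges["chunk_size"]
    overlap_low, overlap_high = ranges["chunk_overlap"]
    return (
        size_low <= chunk_size <= size_high
        and overlap_low <= chunk_overlap <= overlap_high
    )


print(within_guideline("technical_docs", chunk_size=500, chunk_overlap=50))
print(within_guideline("code", chunk_size=500, chunk_overlap=100))
```

The ranges are starting points rather than hard limits, so a failed check is a prompt to review the setting, not an error.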

Full Workflow: Load → Split → Embed

Example

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings
    from langchain_chroma import Chroma
    # from langchain_community.document_loaders import TextLoader

    # Step 1: Load
    # loader = TextLoader("tutorial_knowledge.txt", encoding="utf-8")
    # docs = loader.load()

    # For demonstration, use sample text directly.
    # Each string is shorter than chunk_size, so each one becomes a single chunk.
    docs = [
        "(TUTORIAL) Runoob is a free programming learning website.",
        "The site has tutorials for various programming languages, including Python, Java, HTML, and more.",
        "The Python3 basic tutorial consists of 30 chapters, making it suitable for absolute beginners.",
        "The HTML basic tutorial consists of 25 chapters, covering forms, multimedia, and more.",
        "All of Runoob's basic tutorials are free.",
    ]

    # Step 2: Split
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=100,
        chunk_overlap=20,
    )
    # create_documents() takes raw strings; use split_documents() for loader output
    chunks = text_splitter.create_documents(docs)

    # Step 3: Embed and store (OpenAIEmbeddings reads the OPENAI_API_KEY environment variable)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./tutorial_db",
    )
    print(f"Index built: {len(chunks)} document chunks")

    # Step 4: Retrieve
    results = vector_store.similarity_search("How many chapters are in the Python tutorial?", k=2)
    for doc in results:
        print(f"Retrieval result: {doc.page_content}")

Output (the ranking of lower results can vary with the embedding model):

    Index built: 5 document chunks
    Retrieval result: The Python3 basic tutorial consists of 30 chapters, making it suitable for absolute beginners.
    Retrieval result: (TUTORIAL) Runoob is a free programming learning website.
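
To see what the embed and retrieve steps do without an API key or a vector database, the sketch below swaps in a toy word-count embedding and a plain list as the index. These are assumptions for illustration only: a real embedding model captures meaning rather than exact word overlap, and the sample chunks here are shortened variants of the ones above.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


chunks = [
    "Runoob is a free programming learning website.",
    "The Python tutorial has 30 chapters for beginners.",
    "The HTML tutorial has 25 chapters covering forms and multimedia.",
]

# "Vector store": each chunk stored next to its embedding.
index = [(chunk, embed(chunk)) for chunk in chunks]


def similarity_search(query: str, k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


results = similarity_search("How many chapters are in the Python tutorial?", k=2)
for result in results:
    print(f"Retrieval result: {result}")
```

The structure mirrors the full workflow: embed each chunk once at index time, embed the query at search time, and rank by similarity.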