LangChain Document Loading and Chunking

In previous articles, we entered text by hand. In real projects, documents come from PDFs, web pages, Markdown files, and other sources.

This section shows how to use Document Loaders to load various document types and how to use Text Splitters to split them into smaller chunks suitable for retrieval.

Document Loader: Loading Documents

LangChain provides dozens of document loaders covering common file formats:

| Loader | Source | Required packages |
|---|---|---|
| TextLoader | .txt files | langchain-community (no extra dependencies) |
| PyPDFLoader | PDF files | langchain-community + pypdf |
| WebBaseLoader | Web page URLs | langchain-community + beautifulsoup4 |
| CSVLoader | CSV files | langchain-community |
| UnstructuredMarkdownLoader | Markdown files | langchain-community + unstructured |

Examples

```python
# Load a text file (TextLoader needs no packages beyond langchain-community)
# pip install langchain-community
from langchain_community.document_loaders import TextLoader

loader = TextLoader("knowledge.txt", encoding="utf-8")
docs = loader.load()  # returns a list of Document objects
print(f"Loaded {len(docs)} document(s)")
print(f"Content preview: {docs[0].page_content[:150]}...")

# Load a web page
# pip install langchain-community beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("")  # pass the URL of the page to load
docs = loader.load()
print(f"\nWeb page content: {docs[0].page_content[:150]}...")
```

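
When a project mixes several source types, the loader can be chosen from the file extension. The sketch below is a minimal illustration using only the standard library: `pick_loader_name` and `LOADER_BY_EXTENSION` are hypothetical names, not part of LangChain, and the function returns the loader's class name as a string instead of importing the class, so it runs without any LangChain packages installed.

```python
from pathlib import Path
from urllib.parse import urlparse

# Loader class names from the table above, keyed by file extension.
LOADER_BY_EXTENSION = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".md": "UnstructuredMarkdownLoader",
}


def pick_loader_name(source: str) -> str:
    """Return the name of a loader class suited to `source` (hypothetical helper)."""
    # Anything with an http(s) scheme is treated as a web page.
    if urlparse(source).scheme in ("http", "https"):
        return "WebBaseLoader"
    extension = Path(source).suffix.lower()
    if extension not in LOADER_BY_EXTENSION:
        raise ValueError(f"No loader registered for {extension!r}")
    return LOADER_BY_EXTENSION[extension]


for source in ["knowledge.txt", "report.PDF", "https://example.com/page"]:
    print(f"{source} -> {pick_loader_name(source)}")
```

In a real pipeline the mapping would more likely hold the loader classes themselves, and the caller would instantiate the chosen class with the path or URL.
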
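Text Splitter: Document Chunking

Documents are often too long to retrieve from effectively as a whole, so they are split into smaller pieces called chunks. The chunking strategy directly affects RAG performance: chunks that are too large dilute a match with unrelated text, while chunks that are too small lose the surrounding context.

Example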

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # Maximum 500 characters per chunk
    chunk_overlap=50,  # 50 characters overlap between chunks
    # Separators are tried in order: paragraphs, lines, sentence punctuation
    # (full-width CJK marks here), words, and finally single characters
    separators=["\n\n", "\n", "。", "!", "?", ";", ",", " ", ""],
)

# Example document
long_text = """(TUTORIAL) Runoob is a free programming learning platform.
The platform provides a wide range of programming language tutorials, including but not limited to:
- Python Tutorial: From Basic Syntax to Data Analysis
- Java Tutorial: From Object-Oriented Programming to the Spring Framework
- Front-End Tutorials: HTML, CSS, JavaScript, and Their Frameworks
All tutorials come with detailed code examples and an online execution environment.
Learners can quickly master programming skills through a learn-by-doing approach."""

# Split the document
chunks = text_splitter.split_text(long_text)
print(f"Original length: {len(long_text)} characters")
print(f"After splitting: {len(chunks)} chunks\n")

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(chunk)
    print()
```

Output:

```text
Original length: 153 characters
After splitting: 3 chunks

--- Chunk 1 (54 chars) ---
(TUTORIAL) Runoob is a free programming learning platform.
The platform provides a wide range of programming language tutorials, including but not limited to:

--- Chunk 2 (49 chars) ---
- Python Tutorial: From Basic Syntax to Data Analysis
- Java Tutorial: From Object-Oriented Programming to the Spring Framework

--- Chunk 3 (50 chars) ---
- Front-End Tutorials: HTML, CSS, JavaScript, and Their Frameworks
All tutorials come with detailed code examples and an online execution environment.
```

The lengths and chunk count shown come from a run on a shorter, 153-character version of the sample. Both depend on the input text and on chunk_size, so a run on the text above reports different figures.

Note: chunk_overlap matters. Without overlap, a sentence that straddles a chunk boundary is cut in two and neither chunk contains it whole, so retrieval can miss key information. An overlap of 50 to 100 characters is common.

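
The effect of overlap can be seen with a much simpler splitter than RecursiveCharacterTextSplitter. The sketch below cuts text into fixed-size windows; `split_with_overlap` is a hypothetical helper written for illustration, and it does not reproduce how RecursiveCharacterTextSplitter builds its overlap, which works on separator-delimited pieces. The sample sentence and the tiny chunk_size of 20 are assumptions chosen to make the boundary visible.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into fixed-size windows; consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "Python3 has 30 chapters. HTML has 25 chapters."
key_fact = "30 chapters"

for overlap in (0, 10):
    chunks = split_with_overlap(text, chunk_size=20, chunk_overlap=overlap)
    intact = any(key_fact in chunk for chunk in chunks)
    print(f"overlap={overlap}: {len(chunks)} chunks, key fact intact: {intact}")
```

With no overlap, the phrase "30 chapters" is cut between the first two windows. With an overlap of 10, the second window starts early enough to contain it whole, at the cost of one extra chunk.
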
Chunking Parameter Guidelines

| Scenario | chunk_size (characters) | chunk_overlap (characters) | Reason |
|---|---|---|---|
| FAQ / Q&A | 200–500 | 20–50 | Q&A pairs are short; small chunks suffice |
| Technical documentation | 500–1000 | 50–100 | Technical content requires more context |
| Long articles / academic papers | 1000–2000 | 100–200 | Paragraph integrity must be preserved |
| Code repositories | 500–1500 | 0–50 | Functions/classes serve as natural boundaries |

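
To get a feel for what these ranges mean in practice, it helps to estimate how many chunks a document will produce. The sketch below is a rough aid under stated assumptions: `PRESETS` and `estimate_chunk_count` are hypothetical names, the preset values are simply the midpoints of the ranges in the table, and the formula assumes fixed-size windows that advance by chunk_size minus chunk_overlap. A splitter that breaks at separators usually produces chunks shorter than chunk_size, so real counts tend to be somewhat higher.

```python
import math

# (chunk_size, chunk_overlap): midpoints of the table's ranges, for illustration only.
PRESETS = {
    "faq": (350, 35),
    "technical_docs": (750, 75),
    "long_articles": (1500, 150),
    "code": (1000, 25),
}


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the number of fixed-size, overlapping windows needed to cover a text."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered by each extra window
    return 1 + math.ceil((text_length - chunk_size) / step)


document_length = 10_000  # characters, an arbitrary example size
for scenario, (size, overlap) in PRESETS.items():
    count = estimate_chunk_count(document_length, size, overlap)
    print(f"{scenario}: chunk_size={size}, chunk_overlap={overlap}, about {count} chunks")
```

A larger overlap shrinks the step, so the same document yields more chunks and therefore more embeddings to compute and store.
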
Full Workflow: Load → Split → Embed

Example

```python
# pip install langchain-text-splitters langchain-openai langchain-chroma
# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# from langchain_community.document_loaders import TextLoader

# Step 1: Load
# loader = TextLoader("tutorial_knowledge.txt", encoding="utf-8")
# docs = loader.load()
# (with loaded Document objects, call text_splitter.split_documents(docs) in Step 2)

# For demonstration, use sample text directly
docs = [
    "(TUTORIAL) Runoob is a free programming learning website.",
    "The website provides tutorials for many programming languages, including Python, Java, and HTML.",
    "The Python3 basic tutorial consists of 30 chapters, making it suitable for absolute beginners.",
    "The HTML basic tutorial consists of 25 chapters, covering forms, multimedia, and more.",
    "All of Runoob's basic tutorials are free.",
]

# Step 2: Split
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)
chunks = text_splitter.create_documents(docs)  # wraps each piece of text in a Document

# Step 3: Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./tutorial_db",  # the index is saved to this folder
)
print(f"Index built: {len(chunks)} document chunks")

# Step 4: Retrieve the two chunks most similar to the question
results = vector_store.similarity_search("How many chapters are in the Python tutorial?", k=2)
for doc in results:
    print(f"Retrieval result: {doc.page_content}")
```

Output:

```text
Index built: 5 document chunks
Retrieval result: The Python3 basic tutorial consists of 30 chapters, making it suitable for absolute beginners.
Retrieval result: (TUTORIAL) Runoob is a free programming learning website.
```
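
The embed-and-retrieve steps can also be tried without an API key or a vector database. The sketch below is a toy stand-in, not what OpenAIEmbeddings or Chroma do internally: `embed`, `cosine_similarity`, and `toy_similarity_search` are hypothetical helpers, the "embedding" is just a bag of word counts, and the three sample chunks are shortened assumptions. It only shows the shape of the idea: turn the query and every chunk into vectors, then return the chunks whose vectors are closest to the query's.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_similarity_search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "Runoob is a free programming learning website.",
    "The Python3 basic tutorial consists of 30 chapters.",
    "The HTML basic tutorial consists of 25 chapters.",
]

for hit in toy_similarity_search("How many chapters are in the Python3 tutorial?", chunks, k=1):
    print(f"Retrieval result: {hit}")
```

Word counts only match exact tokens, which is why the query here says "Python3"; a learned embedding model also places related wordings close together, which is the reason the full workflow uses one.
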