Splitting Text with RecursiveCharacterTextSplitter
Why split files into smaller chunks? Large books or manuals exceed an LLM's context window, and processing them in full is expensive. Text splitters slice long documents into small, coherent chunks before vector embeddings are generated and stored in a vector database.
1. Why Is Recursive Splitting Preferred?
Simple text splitters cut strings after a fixed character count, frequently splitting sentences in half and separating a subject from the details that describe it.
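A minimal sketch of such a fixed-size splitter makes the problem visible. The helper name splitFixed and the sample sentence are hypothetical and chosen only for illustration; they are not part of LangChain.

```typescript
// Hypothetical helper for illustration only: cut every `size` characters,
// ignoring sentence and word boundaries.
function splitFixed(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

const sample =
  "The product price is $250 per user. Monthly tiers are billed in advance.";
const fixedChunks = splitFixed(sample, 30);

// The first cut lands inside the first sentence, so "user." ends up
// in a different chunk than the price it belongs to.
console.log(fixedChunks);
```

With a 30-character limit the first sentence (35 characters) cannot stay intact, which is exactly the failure mode described above.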
RecursiveCharacterTextSplitter is smarter. It tries a prioritized list of separators in order, falling back to the next one only when a piece is still larger than the chunk size:
- Paragraph boundaries ("\n\n")
- Line boundaries ("\n")
- Space characters (" ")
- Empty string ("", which splits between individual characters)
This hierarchy keeps related sentences in the same chunk whenever they fit within the size limit.
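The fallback order above can be sketched in plain TypeScript. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: the function name recursiveSplit is hypothetical, chunk length is measured in characters, and overlap is left out to keep the logic short.

```typescript
// Simplified illustration of the separator hierarchy (no overlap handling).
const SEPARATORS = ["\n\n", "\n", " ", ""];

function recursiveSplit(
  text: string,
  maxLen: number,
  separators: string[] = SEPARATORS
): string[] {
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];

  const [sep, ...rest] = separators;

  // Last resort: no separator left, so cut at fixed character offsets.
  if (sep === undefined || sep === "") {
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) {
      out.push(text.slice(i, i + maxLen));
    }
    return out;
  }

  const chunks: string[] = [];
  let current = "";
  for (const piece of text.split(sep)) {
    // Greedily merge neighbouring pieces while they still fit.
    const candidate = current === "" ? piece : current + sep + piece;
    if (candidate.length <= maxLen) {
      current = candidate;
      continue;
    }
    if (current !== "") chunks.push(current);
    current = "";
    if (piece.length <= maxLen) {
      current = piece;
    } else {
      // Piece is still too large: retry with the next, finer separator.
      chunks.push(...recursiveSplit(piece, maxLen, rest));
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const doc =
  "Intro paragraph about pricing.\n\n" +
  "The product price is $250 per user.\n" +
  "Monthly tiers are billed in advance.";

console.log(recursiveSplit(doc, 40));
// [
//   "Intro paragraph about pricing.",
//   "The product price is $250 per user.",
//   "Monthly tiers are billed in advance."
// ]
```

The second paragraph is too long for one 40-character chunk, so the sketch falls back from "\n\n" to "\n" and each sentence stays whole.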
2. Implementing the Splitter in Node.js
Configure the chunk size and overlap:
// src/services/textSplitting.ts
// Note: recent LangChain releases also export this class from "@langchain/textsplitters".
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { Document } from "@langchain/core/documents";

export async function splitDocumentsIntoChunks(rawDocs: Document[]) {
  // 1. Instantiate the splitter with configuration parameters
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000, // Maximum chunk length (measured in characters by default)
    chunkOverlap: 200, // Number of characters shared between adjacent chunks
  });

  // 2. Split every document; each resulting chunk is itself a Document
  const splitChunks = await splitter.splitDocuments(rawDocs);

  console.log("Original docs count:", rawDocs.length);
  console.log("Generated chunks count:", splitChunks.length);

  return splitChunks;
}

3. Understanding Overlap
Setting a chunkOverlap value (e.g. 200 characters) means the last part of Chunk 1 is repeated at the start of Chunk 2. This overlap reduces the risk that a sentence or fact is cut in two at a chunk boundary and loses its context.
| Chunk 1 (Chars 0 - 1000) |
| ... the product price is $250 per user. |
| Chunk 2 (Chars 800 - 1800) |
| the product price is $250 per user. Monthly tiers are billed ... |
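The offsets in the illustration above follow from simple arithmetic: each new chunk starts chunkSize minus chunkOverlap characters after the previous one. The sketch below shows only that arithmetic with a hypothetical helper, chunkRanges, that uses fixed-size windows. The real splitter cuts on separators, so its actual boundaries and overlap lengths will usually differ from these exact offsets.

```typescript
// Hypothetical helper: start/end offsets of fixed-size windows with overlap.
function chunkRanges(
  textLength: number,
  chunkSize: number,
  chunkOverlap: number
): Array<[number, number]> {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap; // distance between chunk starts
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < textLength; start += step) {
    const end = Math.min(start + chunkSize, textLength);
    ranges.push([start, end]);
    if (end === textLength) break; // the last window reached the end of the text
  }
  return ranges;
}

console.log(chunkRanges(2500, 1000, 200));
// [ [ 0, 1000 ], [ 800, 1800 ], [ 1600, 2500 ] ]
```

A larger overlap preserves more context at each boundary but also produces more chunks, which means more embeddings to compute and store.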