What are LangChain Text Splitters
LangChain has become a go-to framework for building complex pipelines around LLMs. One of its important utilities is the langchain_text_splitters package, which contains various classes for splitting large text into more manageable chunks. Text splitters are typically used in a RAG architecture: a large document is chunked, and the chunks are converted into embeddings and stored in a vector database. Because LLMs have limited context windows, it is useful to retrieve only the relevant chunks from the vector database and pass them as context at inference time.
The package offers the following splitters, each suited to a different type of text or splitting requirement.
- CharacterTextSplitter
- TokenTextSplitter
- RecursiveCharacterTextSplitter
- RecursiveJsonSplitter
- HTMLHeaderTextSplitter
- HTMLSectionSplitter
- MarkdownHeaderTextSplitter
- Code splitter (RecursiveCharacterTextSplitter configured for a programming language)
We will cover each of these splitters in detail, with examples, in the following sections.
Installation of langchain_text_splitters
The LangChain Text Splitters package can be installed with the following command. It is a prerequisite for running the examples in the subsequent sections.
%pip install -qU langchain-text-splitters
LangChain CharacterTextSplitter : Split By Character
The CharacterTextSplitter is a basic LangChain text splitter that splits text on a separator, such as a space or a newline. By default, the separator is "\n\n". This splitter is useful for unstructured text that can be divided on special characters such as a newline, semicolon, or period.
Example
In this example, we first import the CharacterTextSplitter class from the langchain_text_splitters package. Next, we initialize the splitter with a comma as the separator parameter. We also pass a chunk_size of 200, which is measured in characters. Note that CharacterTextSplitter does not strictly enforce chunk_size: it always splits on the separator, and it only logs a warning if a resulting chunk is longer than chunk_size.
Finally, we split the text into chunks and print them. The output shows that the text has been split at the commas.
from langchain_text_splitters import CharacterTextSplitter

# Example text
text = """Vector databases have emerged as powerful tools for managing high-dimensional data, enabling efficient similarity searches and powering a wide range of AI-driven applications. As the demand for advanced data management solutions continues to grow, several vector databases have gained prominence in the market."""

# Initialize the CharacterTextSplitter
splitter = CharacterTextSplitter(chunk_size=200, separator=',')

# Split the text
chunks = splitter.split_text(text)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")
Output
Chunk 1: Vector databases have emerged as powerful tools for managing high-dimensional data
Chunk 2: enabling efficient similarity searches and powering a wide range of AI-driven applications. As the demand for advanced data management solutions continues to grow
Chunk 3: several vector databases have gained prominence in the market.
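To make the "separator first, chunk_size second" behaviour more concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's implementation: the function name split_on_separator and the sample string are assumptions made for illustration, and details such as chunk overlap are left out. Pieces are merged greedily up to chunk_size, and a piece that is too long on its own is kept as it is.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters.

    Hypothetical helper for illustration only; not part of langchain_text_splitters.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []  # pieces collected for the chunk being built
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(separator.join(current).strip())
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current).strip())
    return chunks


sample = "alpha beta, gamma delta, epsilon zeta eta theta iota, kappa"
for chunk in split_on_separator(sample, ",", chunk_size=25):
    note = " (longer than chunk_size)" if len(chunk) > 25 else ""
    print(f"{len(chunk):>3} chars: {chunk}{note}")
```

The second chunk in this run is 27 characters long even though chunk_size is 25, because the sketch never cuts inside a piece. This is the same situation in which CharacterTextSplitter logs its warning.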
LangChain RecursiveCharacterTextSplitter : Split Recursively by Character
To split unstructured text more generically, the RecursiveCharacterTextSplitter is recommended. It tries a list of separators in order, which by default is ["\n\n", "\n", " ", ""]. In other words, it first splits on double newlines (\n\n), then on single newlines (\n), then on spaces (" "), and finally on individual characters, moving to the next separator only when a piece is still larger than chunk_size. This approach keeps paragraphs, then lines, then words together for as long as chunk_size allows, so the natural structure and coherence of the text are preserved.
Example
We start by importing the RecursiveCharacterTextSplitter class from the langchain_text_splitters package and initialize it with a chunk_size of 100 and a chunk_overlap of 10. The chunk_overlap is the maximum number of characters that two consecutive chunks may share.
Finally, the text is split and the chunks are printed.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Example text
text = """Vector databases have emerged as powerful tools for managing high-dimensional data, enabling efficient similarity searches and powering a wide range of AI-driven applications. As the demand for advanced data management solutions continues to grow, several vector databases have gained prominence in the market."""

# Initialize the RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)

# Split the text
chunks = splitter.split_text(text)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")
Output
Chunk 1: Vector databases have emerged as powerful tools for managing high-dimensional data, enabling
Chunk 2: enabling efficient similarity searches and powering a wide range of AI-driven applications.
Chunk 3: As the demand for advanced data management solutions continues to grow, several vector databases
Chunk 4: databases have gained prominence in the market.
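The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's code: recursive_split and the sample text are hypothetical, chunk_overlap is ignored, and separators are dropped at chunk boundaries. The point is only to show how the splitter falls back to a finer separator when a piece is still too long.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split `text` with the first separator, recursing with finer ones when needed.

    Hypothetical helper for illustration only; it ignores chunk overlap.
    """
    if len(text) <= chunk_size or not separators:
        return [text] if text.strip() else []
    sep, finer = separators[0], separators[1:]
    pieces = list(text) if sep == "" else text.split(sep)
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = ("First paragraph is short.\n\n"
          "Second paragraph is a bit longer than the first one.\n\n"
          "Third.")
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

In this run the first and third paragraphs fit within 30 characters and are kept whole, while the second paragraph is too long and is therefore split again on spaces.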
LangChain TokenTextSplitter : Split By Token
The TokenTextSplitter splits text based on token count rather than character count. This is useful because LLM context windows are measured in tokens. How a token is constructed depends on the underlying tokenizer. LangChain currently supports token-based splitting with the following tokenizers: tiktoken, spaCy, SentenceTransformers, NLTK, Hugging Face tokenizers, and KoNLPy. The TokenTextSplitter itself uses tiktoken, so the tiktoken package must be installed as well (for example with pip install tiktoken).
Example
In this example, we start by importing the TokenTextSplitter class and then initialize it with a chunk_size of 20 and a chunk_overlap of 5. Here both values are counted in tokens, so chunk_overlap determines how many tokens two consecutive chunks share.
Finally, the text is split and the chunks are printed.
from langchain_text_splitters import TokenTextSplitter

# Example text
text = """Vector databases have emerged as powerful tools for managing high-dimensional data, enabling efficient similarity searches and powering a wide range of AI-driven applications. As the demand for advanced data management solutions continues to grow, several vector databases have gained prominence in the market."""

# Initialize the TokenTextSplitter
splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=5)

# Split the text
chunks = splitter.split_text(text)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")
Output
Chunk 1: Vector databases have emerged as powerful tools for managing high-dimensional data, enabling efficient similarity searches and powering
Chunk 2: efficient similarity searches and powering a wide range of AI-driven applications. As the demand for
Chunk 3: As the demand for advanced data management solutions continues to grow, several vector databases have gained prominence in
Chunk 4: databases have gained prominence in the market.
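The relationship between chunk_size and chunk_overlap is easiest to see as a sliding window over the token list. The sketch below is an illustration only: split_tokens is a hypothetical helper, and a whitespace split stands in for a real tokenizer such as tiktoken, so the token boundaries differ from what TokenTextSplitter would produce.

```python
def split_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap.

    Hypothetical helper for illustration only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: list[list[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
        start += step
    return windows


# Assumption: whitespace-separated words stand in for real tokens.
tokens = "a b c d e f g h i j".split()
for window in split_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(" ".join(window))
```

Each window starts with the last token of the previous one, which mirrors the repeated words at the chunk boundaries in the output above.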
LangChain RecursiveJsonSplitter : Recursively Split JSON
The RecursiveJsonSplitter splits JSON data into smaller JSON chunks by recursively traversing it while keeping the JSON structure intact in each chunk. It tries to keep each nested JSON object whole as long as the chunk stays within the minimum and maximum chunk sizes; otherwise, it splits the object further.
Example
In this example, we first import the RecursiveJsonSplitter class and then initialize it with max_chunk_size set to 500. Finally, we split the sample JSON data with split_text, which returns each chunk as a JSON string, and print the resulting chunks.
from langchain_text_splitters import RecursiveJsonSplitter

# Example JSON data
json_data = {
    "section1": {
        "title": "Main Section 1",
        "content": "This is the content for main section 1.",
        "subsections": [
            {"title": "Subsection 1.1", "content": "Content for subsection 1.1.",
             "details": ["Detail 1.1.1", "Detail 1.1.2", "Detail 1.1.3"]},
            {"title": "Subsection 1.2", "content": "Content for subsection 1.2.",
             "details": ["Detail 1.2.1", "Detail 1.2.2", "Detail 1.2.3"]},
        ],
    },
    "section2": {
        "title": "Main Section 2",
        "content": "This is the content for main section 2.",
        "subsections": [
            {"title": "Subsection 2.1", "content": "Content for subsection 2.1.",
             "details": ["Detail 2.1.1", "Detail 2.1.2", "Detail 2.1.3"]},
            {"title": "Subsection 2.2", "content": "Content for subsection 2.2.",
             "details": ["Detail 2.2.1", "Detail 2.2.2", "Detail 2.2.3"]},
        ],
    },
    "section3": {
        "title": "Main Section 3",
        "content": "This is the content for main section 3.",
        "subsections": [
            {"title": "Subsection 3.1", "content": "Content for subsection 3.1.",
             "details": ["Detail 3.1.1", "Detail 3.1.2", "Detail 3.1.3"]},
            {"title": "Subsection 3.2", "content": "Content for subsection 3.2.",
             "details": ["Detail 3.2.1", "Detail 3.2.2", "Detail 3.2.3"]},
        ],
    },
}

# Initialize the RecursiveJsonSplitter
splitter = RecursiveJsonSplitter(max_chunk_size=500)

# Split the JSON data (each chunk is returned as a JSON string)
chunks = splitter.split_text(json_data)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")
Output
Chunk 1: {"section1": {"title": "Main Section 1", "content": "This is the content for main section 1.", "subsections": [{"title": "Subsection 1.1", "content": "Content for subsection 1.1.", "details": ["Detail 1.1.1", "Detail 1.1.2", "Detail 1.1.3"]}, {"title": "Subsection 1.2", "content": "Content for subsection 1.2.", "details": ["Detail 1.2.1", "Detail 1.2.2", "Detail 1.2.3"]}]}}
Chunk 2: {"section2": {"title": "Main Section 2", "content": "This is the content for main section 2.", "subsections": [{"title": "Subsection 2.1", "content": "Content for subsection 2.1.", "details": ["Detail 2.1.1", "Detail 2.1.2", "Detail 2.1.3"]}, {"title": "Subsection 2.2", "content": "Content for subsection 2.2.", "details": ["Detail 2.2.1", "Detail 2.2.2", "Detail 2.2.3"]}]}}
Chunk 3: {"section3": {"title": "Main Section 3", "content": "This is the content for main section 3.", "subsections": [{"title": "Subsection 3.1", "content": "Content for subsection 3.1.", "details": ["Detail 3.1.1", "Detail 3.1.2", "Detail 3.1.3"]}, {"title": "Subsection 3.2", "content": "Content for subsection 3.2.", "details": ["Detail 3.2.1", "Detail 3.2.2", "Detail 3.2.3"]}]}}
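The following sketch shows, in plain Python, one way a structure-preserving JSON split can work. It is a simplified illustration and not the RecursiveJsonSplitter algorithm: split_json, set_path, and json_size are hypothetical helpers, there is no minimum chunk size, lists are treated as indivisible values, and the size check is approximate because it ignores the nesting added by parent keys.

```python
import json


def json_size(obj) -> int:
    """Length of the serialized object in characters."""
    return len(json.dumps(obj))


def set_path(target: dict, path: list[str], value) -> None:
    """Store `value` in `target` under the nested keys in `path`."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data: dict, max_chunk_size: int, path=None, chunks=None) -> list[dict]:
    """Greedy, approximate split that keeps every value under its full key path."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    for key, value in data.items():
        new_path = path + [key]
        remaining = max_chunk_size - json_size(chunks[-1])
        if json_size({key: value}) < remaining:
            set_path(chunks[-1], new_path, value)   # fits in the current chunk
        elif isinstance(value, dict):
            split_json(value, max_chunk_size, new_path, chunks)  # split the object
        else:
            if chunks[-1]:
                chunks.append({})                   # start a new chunk
            set_path(chunks[-1], new_path, value)
    return [chunk for chunk in chunks if chunk]


data = {"a": {"x": "1111111111", "y": "2222222222"}, "b": {"z": "3333333333"}}
for chunk in split_json(data, max_chunk_size=40):
    print(json.dumps(chunk))
```

Every chunk is still valid JSON, and each value keeps its parent keys ("a" or "b"), so a chunk can be understood without the rest of the document.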
LangChain MarkdownHeaderTextSplitter : Markdown Header Text Splitter
Markdown is a syntax that gives structure to text through headings, sub-headings, and so on. When chunking a Markdown document, it is therefore important to respect the boundaries between headings as much as possible. The MarkdownHeaderTextSplitter splits Markdown documents based on their header structure. It stores the headers that apply to each chunk in the chunk's metadata, which makes the splitting structure-aware.
Example
We start by importing the MarkdownHeaderTextSplitter class and then initialize it with the Markdown headers on which the document should be split, each paired with the metadata key to use for it. Finally, the sample Markdown text is split and the resulting chunks are printed.
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Example Markdown content
markdown_content = """
# Main Title
This is an introductory paragraph under the main title.
## Subsection 1
Content for subsection 1.
## Subsection 2
Content for subsection 2.
### Sub-subsection 2.1
Content for sub-subsection 2.1.
"""

# Initialize the MarkdownHeaderTextSplitter with headers to split on
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)

# Split the Markdown content
chunks = splitter.split_text(markdown_content)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Output
Chunk 1:
page_content='This is an introductory paragraph under the main title.' metadata={'Header 1': 'Main Title'}
Chunk 2:
page_content='Content for subsection 1.' metadata={'Header 1': 'Main Title', 'Header 2': 'Subsection 1'}
Chunk 3:
page_content='Content for subsection 2.' metadata={'Header 1': 'Main Title', 'Header 2': 'Subsection 2'}
Chunk 4:
page_content='Content for sub-subsection 2.1.' metadata={'Header 1': 'Main Title', 'Header 2': 'Subsection 2', 'Header 3': 'Sub-subsection 2.1'}
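The way header metadata is carried from one chunk to the next can be sketched without the library. The function below is a hypothetical, simplified illustration (it does not, for example, skip '#' lines inside code blocks): it keeps a record of the current header at each level and drops the deeper levels whenever a new header appears.

```python
def split_markdown_by_headers(markdown: str, levels=(1, 2, 3)) -> list[dict]:
    """Return chunks of body text, each tagged with the headers above it.

    Hypothetical helper for illustration only.
    """
    chunks: list[dict] = []
    headers: dict[int, str] = {}  # current header text per level
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            metadata = {f"Header {lvl}": txt for lvl, txt in sorted(headers.items())}
            chunks.append({"page_content": content, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        marker, _, title = line.strip().partition(" ")
        is_header = bool(marker) and set(marker) == {"#"} and len(marker) in levels and bool(title)
        if is_header:
            flush()
            level = len(marker)
            # A new header replaces any header at the same or a deeper level.
            for old in [lvl for lvl in headers if lvl >= level]:
                del headers[old]
            headers[level] = title.strip()
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n### Flags\nUse -v."
for chunk in split_markdown_by_headers(doc):
    print(chunk)
```

When "## Usage" is reached, the earlier "Setup" entry is discarded, which is why each chunk only lists the headers it actually sits under.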
LangChain HTMLHeaderTextSplitter : Split By HTML Header
The HTMLHeaderTextSplitter is used for splitting HTML data and is similar to the MarkdownHeaderTextSplitter discussed above. It splits an HTML document at its header elements and attaches the headers that apply to each chunk as metadata, which makes the splitting structure-aware.
Example
In this example, we first import the HTMLHeaderTextSplitter class from the langchain_text_splitters package. Next, we initialize it with the HTML heading tags on which the document should be split. Finally, we split the sample HTML into chunks and print the output.
from langchain_text_splitters import HTMLHeaderTextSplitter

# Example HTML content
html_content = """
<h1>This is H1 Tag</h1>
<p>Introductory paragraph under the H1 tag.</p>
<h2>This is H2 Tag</h2>
<p>Content for H2 Tag.</p>
<h2>This is H2 Tag</h2>
<p>Content for H2 Tag.</p>
<h3>This is H3 Tag</h3>
<p>Content for H2 Tag.</p>
"""

# Initialize the HTMLHeaderTextSplitter with the headers to split on
splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
)

# Split the HTML content
chunks = splitter.split_text(html_content)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Output
Chunk 1:
page_content='Introductory paragraph under the H1 tag.' metadata={'Header 1': 'This is H1 Tag'}
Chunk 2:
page_content='Content for H2 Tag. Content for H2 Tag.' metadata={'Header 1': 'This is H1 Tag', 'Header 2': 'This is H2 Tag'}
Chunk 3:
page_content='Content for H2 Tag.' metadata={'Header 1': 'This is H1 Tag', 'Header 2': 'This is H2 Tag', 'Header 3': 'This is H3 Tag'}
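For readers curious about the mechanics, header-based HTML splitting can be approximated with Python's built-in html.parser module. This is a rough sketch and not how HTMLHeaderTextSplitter is implemented: the HeaderChunker class is hypothetical, it uses the tag names themselves as metadata keys, and it assumes each header contains a single run of text.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Collect body text into chunks tagged with the headers above it.

    Hypothetical class for illustration only.
    """

    def __init__(self, header_tags=("h1", "h2", "h3")):
        super().__init__()
        self.header_tags = header_tags
        self.headers = {}        # tag name -> current header text
        self.chunks = []
        self._text = []          # body text collected since the last header
        self._in_header = None   # header tag currently open, if any

    def _flush(self):
        content = " ".join(self._text).strip()
        if content:
            self.chunks.append({"page_content": content, "metadata": dict(self.headers)})
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()
            level = self.header_tags.index(tag)
            # A new header replaces headers at the same or a deeper level.
            for deeper in self.header_tags[level:]:
                self.headers.pop(deeper, None)
            self._in_header = tag

    def handle_endtag(self, tag):
        if tag == self._in_header:
            self._in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_header:
            self.headers[self._in_header] = text
        else:
            self._text.append(text)

    def close(self):
        super().close()
        self._flush()


html_doc = "<h1>Intro</h1><p>Welcome.</p><h2>Setup</h2><p>Install.</p><p>Configure.</p>"
chunker = HeaderChunker()
chunker.feed(html_doc)
chunker.close()
for chunk in chunker.chunks:
    print(chunk)
```

The two paragraphs under "Setup" end up in one chunk because no header separates them, which is the same grouping behaviour described for the header-based splitter.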
LangChain HTMLSectionSplitter : Split by HTML Section
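The HTMLSectionSplitter is conceptually similar to the HTMLHeaderTextSplitter: it also splits an HTML document at the specified HTML elements. It is a structure-aware splitter as well, and it adds the header on which each split was made to the chunk's metadata. As the output below shows, it keeps the header text inside the chunk content and records only that section's own header, rather than the full header hierarchy.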
Example
Here, the HTMLSectionSplitter class is imported first. Then we initialize it with the HTML heading tags on which the document should be split. Finally, the sample HTML is split into chunks and the output is printed.
from langchain_text_splitters import HTMLSectionSplitter

# Example HTML content
html_content = """
<h1>This is H1 Tag</h1>
<p>Introductory paragraph under the H1 tag.</p>
<h2>This is H2 Tag</h2>
<p>Content for H2 Tag.</p>
<h2>This is H2 Tag</h2>
<p>Content for H2 Tag.</p>
<h3>This is H3 Tag</h3>
<p>Content for H2 Tag.</p>
"""

# Initialize the HTMLSectionSplitter with the headers to split on
splitter = HTMLSectionSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
)

# Split the HTML content
chunks = splitter.split_text(html_content)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Output
Chunk 1:
page_content='This is H1 Tag Introductory paragraph under the H1 tag.' metadata={'Header 1': 'This is H1 Tag'}
Chunk 2:
page_content='This is H2 Tag Content for H2 Tag.' metadata={'Header 2': 'This is H2 Tag'}
Chunk 3:
page_content='This is H2 Tag Content for H2 Tag.' metadata={'Header 2': 'This is H2 Tag'}
Chunk 4:
page_content='This is H3 Tag Content for H2 Tag.' metadata={'Header 3': 'This is H3 Tag'}
LangChain Code splitter
Code does not need a separate splitter in LangChain Text Splitters. Instead, RecursiveCharacterTextSplitter can split code through its from_language constructor, which takes a language parameter. According to the documentation at the time of writing, 24 programming languages are supported, and the list of separators used for splitting is predefined for each language.
Example
In this example, we first import the Language enum and the RecursiveCharacterTextSplitter class from the langchain_text_splitters package. We then create the splitter with from_language, passing Language.PYTHON as the language. Finally, we split the sample Python code and print the chunks.
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

PYTHON_CODE = """
def function_one():
    # Function one implementation
    pass

def function_two():
    # Function two implementation
    pass

class ExampleClass:
    def method_one(self):
        # Method one implementation
        pass

    def method_two(self):
        # Method two implementation
        pass
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=100, chunk_overlap=20
)
chunks = python_splitter.create_documents([PYTHON_CODE])

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Output
Chunk 1:
page_content='def function_one(): # Function one implementation pass'
Chunk 2:
page_content='def function_two(): # Function two implementation pass'
Chunk 3:
page_content='class ExampleClass: def method_one(self): # Method one implementation'
Chunk 4:
page_content='pass def method_two(self): # Method two implementation'
Chunk 5:
page_content='pass'
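The idea of language-specific separators can be illustrated with a small standalone sketch. The helper below is hypothetical and much simpler than the real splitter: it assumes that "\nclass " and "\ndef " are reasonable top-level boundaries for Python source and splits just before them, so every piece starts with its own keyword. It does not merge pieces or enforce a chunk size.

```python
import re


def split_top_level(code: str, separators=("\nclass ", "\ndef ")) -> list[str]:
    """Split source code just before each separator, keeping the keyword with its body.

    Hypothetical helper for illustration only; the separator list is an assumption.
    """
    # Zero-width lookaheads split *before* a separator without consuming it.
    pattern = "|".join(f"(?={re.escape(sep)})" for sep in separators)
    return [piece.strip("\n") for piece in re.split(pattern, code) if piece.strip()]


code = (
    "import os\n"
    "\n"
    "def one():\n"
    "    return 1\n"
    "\n"
    "class Two:\n"
    "    def method(self):\n"
    "        return 2\n"
)
for piece in split_top_level(code):
    print(piece)
    print("---")
```

The indented method inside the class is not split off, because only definitions that start at the beginning of a line match the assumed separators. A recursive splitter would fall back to finer separators such as blank lines only when a piece is still larger than chunk_size.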
Summary
Let us summarize the splitters of the LangChain Text Splitters package that we discussed in the sections above.
Splitter Type | Description |
---|---|
CharacterTextSplitter | Splits text based on a single character separator |
TokenTextSplitter | Splits text based on token count rather than character count |
RecursiveCharacterTextSplitter | Splits text based on a hierarchy of separators, prioritizing natural boundaries. Also used for splitting code in various programming languages based on language-specific separators. |
RecursiveJsonSplitter | Splits JSON-formatted text based on the JSON structure |
HTMLHeaderTextSplitter | Splits HTML documents based on their HTML header structure |
HTMLSectionSplitter | Splits HTML documents into sections at the specified header tags |
MarkdownHeaderTextSplitter | Splits Markdown documents based on their header structure |
Reference: LangChain Documentation