Detailed Guide to LangChain Text Splitters with Examples

What are LangChain Text Splitters

In recent times, LangChain has evolved into a go-to framework for building complex LLM pipelines. One of its important utilities is the langchain_text_splitters package, which contains various modules to split large textual data into more manageable chunks. LangChain Text Splitters are typically used in RAG architectures to chunk a large document and convert those chunks into embeddings stored in a vector DB. For LLMs with limited context windows, it is quite useful to retrieve the relevant chunks of a document from the vector DB and pass them as context during inference.

LangChain Text Splitters offers the following types of splitters, each suited to a different type of textual data or splitting requirement.

  • CharacterTextSplitter
  • TokenTextSplitter
  • RecursiveCharacterTextSplitter
  • RecursiveJsonSplitter
  • HTMLHeaderTextSplitter
  • HTMLSectionSplitter
  • MarkdownHeaderTextSplitter
  • Code Splitter

We will cover each of the above splitters of the langchain_text_splitters package in detail with examples in the following sections.

Installation of langchain_text_splitters

The LangChain Text Splitters package can be installed with the following command. Installing it is a prerequisite for running the langchain_text_splitters examples in the subsequent sections.

%pip install -qU langchain-text-splitters

 

LangChain CharacterTextSplitter : Split By Character

The CharacterTextSplitter is a basic LangChain text splitter that splits text on a single character separator, such as a space or a newline. By default, the separator is “\n\n”. This splitter is useful for unstructured text that can be split on specific characters like a newline, semicolon, or dot.

Example

In this example, we first import the CharacterTextSplitter module from the langchain_text_splitters package. Next, we initialize the splitter with the separator parameter set to a comma and chunk_size set to 200, measured in characters. Note that CharacterTextSplitter does not strictly enforce chunk_size: it splits only on the separator, and it logs a warning (rather than raising an error) if a resulting chunk exceeds chunk_size.

Finally, we split the text into chunks and print the output, where we can see the text has been split on the commas.

 

from langchain_text_splitters import CharacterTextSplitter

# Example text
text = """Vector databases have emerged as powerful tools for managing high-dimensional data, enabling efficient similarity searches and powering a wide range of AI-driven applications. 
As the demand for advanced data management solutions continues to grow, several vector databases have gained prominence in the market."""

# Initialize the CharacterTextSplitter
splitter = CharacterTextSplitter(chunk_size=200, separator=',')

# Split the text
chunks = splitter.split_text(text)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Output

Chunk 1: Vector databases have emerged as powerful tools for managing high-dimensional data

Chunk 2: enabling efficient similarity searches and powering a wide range of AI-driven applications. 
As the demand for advanced data management solutions continues to grow

Chunk 3: several vector databases have gained prominence in the market.
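 

As a quick illustration of the warning behavior mentioned above, here is a minimal sketch (the sample sentence is hypothetical): with the default “\n\n” separator, a text containing no separator at all becomes a single chunk longer than chunk_size, and the splitter logs a warning instead of raising an error.

from langchain_text_splitters import CharacterTextSplitter

# No "\n\n" occurs in this text, so the whole string stays as one chunk.
# Expect a logged warning along the lines of "Created a chunk of size ...,
# which is longer than the specified 50" rather than an exception.
splitter = CharacterTextSplitter(chunk_size=50, separator="\n\n")
chunks = splitter.split_text("A single long sentence without any double newlines anywhere in it.")
print(len(chunks))  # 1 -- the oversized chunk is kept intact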

 

LangChain RecursiveCharacterTextSplitter : Split Recursively by Character

To split unstructured text more generically, it is recommended to use the RecursiveCharacterTextSplitter, which tries separators in the order ["\n\n", "\n", " ", ""]. It starts by splitting on double newlines (\n\n), then single newlines (\n), then spaces, and finally individual characters. This approach ensures that, subject to chunk_size, the natural structure and coherence of the text are preserved by splitting at natural boundaries like paragraphs and sentences.

Example

We start by importing the RecursiveCharacterTextSplitter module from the langchain_text_splitters package. We then initialize the splitter with chunk_size set to 100 and chunk_overlap set to 10. The chunk_overlap is the number of characters that may overlap between two consecutive chunks.

Finally, the text is split and the chunks are printed as output.

 

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Example text
text = """Vector databases have emerged as powerful tools for managing high-dimensional data, enabling efficient similarity searches and powering a wide range of AI-driven applications. 
As the demand for advanced data management solutions continues to grow, several vector databases have gained prominence in the market."""

# Initialize the RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)

# Split the text  
chunks = splitter.split_text(text)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Output

Chunk 1: Vector databases have emerged as powerful tools for managing high-dimensional data, enabling

Chunk 2: enabling efficient similarity searches and powering a wide range of AI-driven applications.

Chunk 3: As the demand for advanced data management solutions continues to grow, several vector databases

Chunk 4: databases have gained prominence in the market.
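 

The separator hierarchy itself is configurable via the separators parameter of RecursiveCharacterTextSplitter. Below is a minimal sketch, reusing the sample text from above, that prefers sentence boundaries (". ") over plain spaces.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Custom hierarchy: paragraphs, lines, sentence ends, spaces, characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)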

 

LangChain TokenTextSplitter : Split By Token

The TokenTextSplitter splits text based on token count rather than character count. This is useful because popular LLMs define their context windows in tokens. How a token is constructed depends on the underlying tokenizer. LangChain currently supports the following tokenizers: Tiktoken, spaCy, SentenceTransformers, NLTK, Hugging Face tokenizers, and KoNLPY. By default, Tiktoken is used.

Example

In this example, we start by importing the TokenTextSplitter module and then initialize the splitter with chunk_size set to 20 and chunk_overlap set to 5. Here chunk_overlap determines how many tokens may overlap between two consecutive chunks.

Finally, the text is split and chunks are printed as output.

 

from langchain_text_splitters import TokenTextSplitter

# Example text
text = """Vector databases have emerged as powerful tools for managing high-dimensional data, enabling efficient similarity searches and powering a wide range of AI-driven applications. 
As the demand for advanced data management solutions continues to grow, several vector databases have gained prominence in the market."""

# Initialize the TokenTextSplitter
splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=5)

# Split the text
chunks = splitter.split_text(text)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Output

Chunk 1: Vector databases have emerged as powerful tools for managing high-dimensional data, enabling efficient similarity searches and powering

Chunk 2:  efficient similarity searches and powering a wide range of AI-driven applications. 
As the demand for

Chunk 3: 
As the demand for advanced data management solutions continues to grow, several vector databases have gained prominence in

Chunk 4:  databases have gained prominence in the market.
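 

To pin the splitter to a specific model's tokenizer, you can pass model_name or encoding_name, both standard Tiktoken-backed parameters of TokenTextSplitter. A minimal sketch, reusing the sample text from above:

from langchain_text_splitters import TokenTextSplitter

# Count tokens using the gpt-4 tokenizer instead of the default encoding.
splitter = TokenTextSplitter(model_name="gpt-4", chunk_size=20, chunk_overlap=5)
# Alternatively: TokenTextSplitter(encoding_name="cl100k_base", ...)
chunks = splitter.split_text(text)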

 

LangChain RecursiveJsonSplitter : Recursively Split JSON

The RecursiveJsonSplitter splits JSON data into smaller JSON chunks by traversing it recursively while keeping each chunk valid JSON. It tries to keep nested JSON objects whole as long as they fit within the min and max chunk size limits; otherwise, it splits them.

Example

In this example, we first import the RecursiveJsonSplitter module and then initialize the splitter with max_chunk_size set to 500. Finally, we split the sample JSON and print the resulting JSON chunks.

 

from langchain_text_splitters import RecursiveJsonSplitter

# Example JSON data
json_data = {
    "section1": {
        "title": "Main Section 1",
        "content": "This is the content for main section 1.",
        "subsections": [
            {
                "title": "Subsection 1.1",
                "content": "Content for subsection 1.1.",
                "details": ["Detail 1.1.1", "Detail 1.1.2", "Detail 1.1.3"]
            },
            {
                "title": "Subsection 1.2",
                "content": "Content for subsection 1.2.",
                "details": ["Detail 1.2.1", "Detail 1.2.2", "Detail 1.2.3"]
            }
        ]
    },
    "section2": {
        "title": "Main Section 2",
        "content": "This is the content for main section 2.",
        "subsections": [
            {
                "title": "Subsection 2.1",
                "content": "Content for subsection 2.1.",
                "details": ["Detail 2.1.1", "Detail 2.1.2", "Detail 2.1.3"]
            },
            {
                "title": "Subsection 2.2",
                "content": "Content for subsection 2.2.",
                "details": ["Detail 2.2.1", "Detail 2.2.2", "Detail 2.2.3"]
            }
        ]
    },
    "section3": {
        "title": "Main Section 3",
        "content": "This is the content for main section 3.",
        "subsections": [
            {
                "title": "Subsection 3.1",
                "content": "Content for subsection 3.1.",
                "details": ["Detail 3.1.1", "Detail 3.1.2", "Detail 3.1.3"]
            },
            {
                "title": "Subsection 3.2",
                "content": "Content for subsection 3.2.",
                "details": ["Detail 3.2.1", "Detail 3.2.2", "Detail 3.2.3"]
            }
        ]
    }
}


# Initialize the RecursiveJsonSplitter
splitter = RecursiveJsonSplitter(max_chunk_size=500)

# Split the JSON data
chunks = splitter.split_text(json_data)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Output

Chunk 1: {"section1": {"title": "Main Section 1", "content": "This is the content for main section 1.", "subsections": [{"title": "Subsection 1.1", "content": "Content for subsection 1.1.", "details": ["Detail 1.1.1", "Detail 1.1.2", "Detail 1.1.3"]}, {"title": "Subsection 1.2", "content": "Content for subsection 1.2.", "details": ["Detail 1.2.1", "Detail 1.2.2", "Detail 1.2.3"]}]}}

Chunk 2: {"section2": {"title": "Main Section 2", "content": "This is the content for main section 2.", "subsections": [{"title": "Subsection 2.1", "content": "Content for subsection 2.1.", "details": ["Detail 2.1.1", "Detail 2.1.2", "Detail 2.1.3"]}, {"title": "Subsection 2.2", "content": "Content for subsection 2.2.", "details": ["Detail 2.2.1", "Detail 2.2.2", "Detail 2.2.3"]}]}}

Chunk 3: {"section3": {"title": "Main Section 3", "content": "This is the content for main section 3.", "subsections": [{"title": "Subsection 3.1", "content": "Content for subsection 3.1.", "details": ["Detail 3.1.1", "Detail 3.1.2", "Detail 3.1.3"]}, {"title": "Subsection 3.2", "content": "Content for subsection 3.2.", "details": ["Detail 3.2.1", "Detail 3.2.2", "Detail 3.2.3"]}]}}
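 

If you prefer Python dicts or Document objects over JSON strings, the splitter also exposes split_json and create_documents. A short sketch, reusing the splitter and json_data from above:

# split_json returns the chunks as Python dicts instead of JSON strings.
json_chunks = splitter.split_json(json_data=json_data)
print(type(json_chunks[0]))  # <class 'dict'>

# create_documents wraps the chunks in Document objects for downstream use.
docs = splitter.create_documents(texts=[json_data])
print(docs[0].page_content)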

 

LangChain MarkdownHeaderTextSplitter : Markdown Header Text Splitter

Markdown is a syntax that gives textual data structure through headings and subheadings. When chunking a Markdown document, it is therefore important to preserve the boundaries between headings as much as possible. The MarkdownHeaderTextSplitter splits Markdown documents based on their header structure and preserves the header metadata in the resulting chunks, allowing for structure-aware splitting.

Example

We start by importing the MarkdownHeaderTextSplitter module and then initialize the splitter by defining the Markdown headers on which the document should be split. Finally, the sample Markdown text is split and the resulting chunks are printed.

 

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Example Markdown content
markdown_content = """
# Main Title
This is an introductory paragraph under the main title.

## Subsection 1
Content for subsection 1.

## Subsection 2
Content for subsection 2.

### Sub-subsection 2.1
Content for sub-subsection 2.1.
"""

# Initialize the MarkdownHeaderTextSplitter with headers to split on
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
])

# Split the Markdown content
chunks = splitter.split_text(markdown_content)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Output

Chunk 1:
page_content='This is an introductory paragraph under the main title.' metadata={'Header 1': 'Main Title'}

Chunk 2:
page_content='Content for subsection 1.' metadata={'Header 1': 'Main Title', 'Header 2': 'Subsection 1'}

Chunk 3:
page_content='Content for subsection 2.' metadata={'Header 1': 'Main Title', 'Header 2': 'Subsection 2'}

Chunk 4:
page_content='Content for sub-subsection 2.1.' metadata={'Header 1': 'Main Title', 'Header 2': 'Subsection 2', 'Header 3': 'Sub-subsection 2.1'}
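 

By default, the matched header line is stripped from page_content and kept only in the metadata. If you want the headers to remain in the chunk text as well, the splitter accepts a strip_headers flag; a brief sketch:

from langchain_text_splitters import MarkdownHeaderTextSplitter

# strip_headers=False keeps the "# Main Title" style lines in page_content.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")],
    strip_headers=False,
)
chunks = splitter.split_text(markdown_content)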

 

LangChain HTMLHeaderTextSplitter : Split By HTML Header

The HTMLHeaderTextSplitter is used for splitting HTML documents and is similar to the MarkdownHeaderTextSplitter discussed above. It splits the HTML document on header elements and attaches metadata for each header associated with a chunk, making it a structure-aware splitter.

Example

In this example, we first import the HTMLHeaderTextSplitter module from the langchain_text_splitters package. Next, we initialize the splitter by defining the HTML heading tags on which the document should be split. Finally, we split the sample HTML into chunks and display the output.

 

from langchain_text_splitters import HTMLHeaderTextSplitter

# Example HTML content
html_content = """
<h1>This is H1 Tag</h1>
<p>Introductory paragraph under the H1 tag.</p>
<h2>This is H2 Tag</h2>
<p>Content for H2 Tag.</p>
<h2>This is H2 Tag</h2>
<p>Content for H2 Tag.</p>
<h3>This is H3 Tag</h3>
<p>Content for H3 Tag.</p>
"""

# Initialize the HTMLHeaderTextSplitter with the headers to split on
splitter = HTMLHeaderTextSplitter(headers_to_split_on=[
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
])

# Split the HTML content
chunks = splitter.split_text(html_content)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Output

Chunk 1:
page_content='Introductory paragraph under the H1 tag.' metadata={'Header 1': 'This is H1 Tag'}

Chunk 2:
page_content='Content for H2 Tag.  
Content for H2 Tag.' metadata={'Header 1': 'This is H1 Tag', 'Header 2': 'This is H2 Tag'}

Chunk 3:
page_content='Content for H3 Tag.' metadata={'Header 1': 'This is H1 Tag', 'Header 2': 'This is H2 Tag', 'Header 3': 'This is H3 Tag'}
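 

Beyond raw strings, HTMLHeaderTextSplitter can also read HTML directly from a URL or a file via split_text_from_url and split_text_from_file. A minimal sketch, reusing the splitter from above (the URL and file name are placeholders):

# Split a live page (placeholder URL).
chunks = splitter.split_text_from_url("https://example.com/page.html")

# Split a local HTML file (placeholder file name).
with open("page.html") as f:
    chunks = splitter.split_text_from_file(f)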

 

LangChain HTMLSectionSplitter : Split by HTML Section

The HTMLSectionSplitter is conceptually similar to the HTMLHeaderTextSplitter: it splits an HTML document into sections based on its header elements. It is also a structure-aware splitter that adds metadata for the header on which each split was made.

Example

Here the HTMLSectionSplitter module is imported first. Then we initialize the splitter by defining the HTML heading tags on which the document should be split. Finally, the sample HTML is split into chunks and the output is printed.

 

from langchain_text_splitters import HTMLSectionSplitter

# Example HTML content
html_content = """
<h1>This is H1 Tag</h1>
<p>Introductory paragraph under the H1 tag.</p>
<h2>This is H2 Tag</h2>
<p>Content for H2 Tag.</p>
<h2>This is H2 Tag</h2>
<p>Content for H2 Tag.</p>
<h3>This is H3 Tag</h3>
<p>Content for H3 Tag.</p>
"""

# Initialize the HTMLSectionSplitter with the section tag to split on
splitter = HTMLSectionSplitter(headers_to_split_on=[
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
])

# Split the HTML content
chunks = splitter.split_text(html_content)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Output

Chunk 1:
page_content='This is H1 Tag 
 Introductory paragraph under the H1 tag.' metadata={'Header 1': 'This is H1 Tag'}

Chunk 2:
page_content='This is H2 Tag 
 Content for H2 Tag.' metadata={'Header 2': 'This is H2 Tag'}

Chunk 3:
page_content='This is H2 Tag 
 Content for H2 Tag.' metadata={'Header 2': 'This is H2 Tag'}

Chunk 4:
page_content='This is H3 Tag 
 Content for H3 Tag.' metadata={'Header 3': 'This is H3 Tag'}
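 

Since section chunks can still be arbitrarily long, a common follow-up is to pipe the resulting Documents through a character-level splitter. A short sketch, reusing the chunks from above:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Second-stage split: enforce a size limit while keeping section metadata.
char_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
final_chunks = char_splitter.split_documents(chunks)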

 

LangChain Code Splitter

LangChain Text Splitters does not include a separate splitter for chunking code. Instead, the RecursiveCharacterTextSplitter can split code via its language parameter. Currently, 24 programming languages are supported by LangChain, and the separators used for splitting are predefined for each language, as the snippet below shows.
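
You can inspect the predefined separators for any supported language with get_separators_for_language, a static method on RecursiveCharacterTextSplitter:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# List the supported languages and the Python-specific separators.
print([lang.value for lang in Language])
print(RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON))
# e.g. ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']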

Example

In this example, we first import the Language and RecursiveCharacterTextSplitter modules from the langchain_text_splitters package. We then initialize the splitter via from_language with the language parameter set to Language.PYTHON. Finally, we split the sample Python code and print the chunks.

 

from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

PYTHON_CODE = """
    def function_one():
        # Function one implementation
        pass

    def function_two():
        # Function two implementation
        pass

    class ExampleClass:
        def method_one(self):
            # Method one implementation
            pass

        def method_two(self):
            # Method two implementation
            pass
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=100, chunk_overlap=20
)

chunks = python_splitter.create_documents([PYTHON_CODE])

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Output

Chunk 1:
page_content='def function_one():
        # Function one implementation
        pass'

Chunk 2:
page_content='def function_two():
        # Function two implementation
        pass'

Chunk 3:
page_content='class ExampleClass:
        def method_one(self):
            # Method one implementation'

Chunk 4:
page_content='pass
        
        def method_two(self):
            # Method two implementation'

Chunk 5:
page_content='pass'

 

Summary

Let us summarize all the splitters of the LangChain Text Splitters package that we discussed in the sections above.

  • CharacterTextSplitter: Splits text based on a single character separator.
  • TokenTextSplitter: Splits text based on token count rather than character count.
  • RecursiveCharacterTextSplitter: Splits text based on a hierarchy of separators, prioritizing natural boundaries; also used for splitting code in various programming languages based on their syntax and predefined separators.
  • RecursiveJsonSplitter: Splits JSON data into smaller chunks while preserving the JSON structure.
  • HTMLHeaderTextSplitter: Splits HTML documents based on their header structure.
  • HTMLSectionSplitter: Splits HTML documents based on their sections.
  • MarkdownHeaderTextSplitter: Splits Markdown documents based on their header structure.

 

Reference: LangChain Documentation
