TokenTextSplitter
Finally, TokenTextSplitter splits a raw text string by first converting the text into BPE tokens, then splitting these tokens into chunks and converting the tokens within each chunk back into text.
import { TokenTextSplitter } from "langchain/text_splitter";

const text = "foo bar baz 123";

// chunkSize and chunkOverlap are measured in tokens, not characters.
const splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 10,
  chunkOverlap: 0,
});

// createDocuments takes an array of strings and returns Document objects.
const output = await splitter.createDocuments([text]);
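
To make the encode, chunk, decode steps concrete, here is a minimal standalone sketch of the idea. It is not the library's implementation: `chunkTokens`, `encode`, `decode`, and `splitByTokens` are hypothetical names, and the tokenizer is a whitespace stand-in rather than real BPE, so that the example runs without any dependencies. It assumes chunks are formed by sliding a window of `chunkSize` tokens forward by `chunkSize - chunkOverlap` tokens at a time.

```typescript
// Hypothetical helper: slide a window of `chunkSize` tokens over the input,
// advancing by `chunkSize - chunkOverlap` tokens on each step.
function chunkTokens<T>(tokens: T[], chunkSize: number, chunkOverlap: number): T[][] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in the range [0, chunkSize)");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the input.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Stand-in tokenizer (an assumption, NOT BPE): one token per whitespace-separated word.
const encode = (text: string): string[] =>
  text.split(/\s+/).filter((word) => word.length > 0);

// Inverse of the stand-in tokenizer: join tokens back together with single spaces.
const decode = (tokens: string[]): string => tokens.join(" ");

// Encode the text, chunk the tokens, then decode each chunk back into text.
function splitByTokens(text: string, chunkSize: number, chunkOverlap: number): string[] {
  return chunkTokens(encode(text), chunkSize, chunkOverlap).map(decode);
}
```

With this toy tokenizer, `splitByTokens("foo bar baz 123", 2, 0)` returns `["foo bar", "baz 123"]`. A real BPE encoding such as "gpt2" usually produces more than one token per word, so actual chunk boundaries can fall inside words.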