Text chunker → JSONL
Split plain text into fixed-size character windows with overlap and download the result as JSONL for RAG ingestion.
RAG Chunking Parameters and JSONL Export Guide
RAG systems do not push entire documents into model context. They split content into retrievable chunks and store them in a vector database. Chunks that are too large hurt precision; chunks that are too small lose context.
FormaX RAG Chunker uses fixed character windows with overlap. It is useful for knowledge bases, support docs, product manuals, and Markdown material during prototyping and parameter exploration.
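The windowing idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name `chunk_text`, its defaults, and the handling of the final short window are assumptions made for the example.

```python
from typing import Dict, List


def chunk_text(text: str, size: int = 400, overlap: int = 50) -> List[Dict]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # distance between consecutive window starts
    chunks: List[Dict] = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(
            {"chunk_index": len(chunks), "text": text[start:end], "start": start, "end": end}
        )
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


# Small demo with tiny numbers so the windows are easy to read.
sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, size=10, overlap=3):
    print(chunk["chunk_index"], chunk["start"], chunk["end"], chunk["text"])
```

With a size of 10 and an overlap of 3, the 26-letter sample produces four windows starting at offsets 0, 7, 14, and 21; the last one is shorter because it stops at the end of the text.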
How to Tune Parameters
- Start around 400 characters per chunk with 50 to 80 characters of overlap.
- Increase chunk size for technical specs and API docs; decrease it for chats and FAQ content.
- Keep chunk_index, start, and end after JSONL export so answers can cite the source text.
- Before production, evaluate retrieval using real user questions, not only the number of chunks.
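To show why the offsets are worth keeping, here is a small Python sketch that writes records to JSONL, reads them back, and uses start and end to recover the exact source span. The sample text and the hand-picked offsets are assumptions for illustration; an in-memory buffer stands in for a real file.

```python
import io
import json

source_text = (
    "Reset the router by holding the button for ten seconds. "
    "Then wait for the green light."
)

# Hand-picked (start, end) offsets, standing in for whatever a chunker produced.
offsets = [(0, 55), (45, len(source_text))]

# io.StringIO stands in for a file opened with open(path, "w", encoding="utf-8").
buffer = io.StringIO()
for index, (start, end) in enumerate(offsets):
    record = {
        "chunk_index": index,
        "text": source_text[start:end],
        "metadata": {"start": start, "end": end, "source": "formax-rag-chunker"},
    }
    # One JSON object per line is what makes the output JSONL.
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the records back and confirm each offset pair still points at its text.
buffer.seek(0)
records = [json.loads(line) for line in buffer if line.strip()]
for record in records:
    meta = record["metadata"]
    assert source_text[meta["start"]:meta["end"]] == record["text"]
    print(f'chunk {record["chunk_index"]} cites characters {meta["start"]}-{meta["end"]}')
```

If the offsets are dropped during ingestion, the same check is no longer possible, and a retrieved chunk can only be matched to its source by searching for the text.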
JSONL Output Shape
{"chunk_index":0,"text":"...","metadata":{"start":0,"end":400,"source":"formax-rag-chunker"}}
Boundaries
- Character windows are not token windows; re-check chunk lengths against your embedding model's token limit.
- Remove headers, footers, repeated copyright text, and navigation noise before chunking.
- More overlap is not always better; it increases the number of vectors, storage cost, and duplicate retrieval.
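The cost of overlap can be estimated before chunking anything. The sketch below counts windows under one assumed convention: each window starts `size - overlap` characters after the previous one, and the last window may be shorter. The function name `count_chunks` is hypothetical, and the tool's own edge-case handling may differ.

```python
def count_chunks(text_length: int, size: int, overlap: int) -> int:
    """Number of windows when each one starts `size - overlap` characters after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if text_length <= 0:
        return 0
    if text_length <= size:
        return 1
    step = size - overlap
    # Integer ceiling of (text_length - size) / step, plus the first window.
    return (text_length - size + step - 1) // step + 1


# Compare vector counts for a 100,000-character text at 400 characters per chunk.
for overlap in (0, 50, 80, 200):
    print(overlap, count_chunks(100_000, size=400, overlap=overlap))
```

Under these assumptions, a 100,000-character text yields 250 chunks with no overlap, 286 at 50 characters, 313 at 80, and 499 at 200. An overlap of half the window roughly doubles the number of vectors to embed and store.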
FAQ
What chunk size should I use?
There is no universal number. Start at 400 characters and adjust after checking whether real queries retrieve complete context.
Why use overlap?
Overlap reduces missed context when sentences are split at boundaries, but too much overlap increases duplication and cost.
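A tiny Python demonstration of the boundary effect, using a made-up sentence and deliberately small windows so the split is visible. The helper `windows` is a hypothetical stand-in for the chunker, not its real code.

```python
def windows(text: str, size: int, overlap: int) -> list:
    """Fixed-size character windows; each starts `size - overlap` after the previous one."""
    step = size - overlap
    # Stop once a window has already reached the end of the text.
    last_start_bound = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start_bound, step)]


text = "Hold the reset button for ten seconds until the light turns green."
phrase = "ten seconds"

for overlap in (0, 15):
    hits = [w for w in windows(text, size=30, overlap=overlap) if phrase in w]
    print(f"overlap={overlap}: phrase found whole in {len(hits)} chunk(s)")
```

With 30-character windows and no overlap, the boundary falls between "ten" and "seconds", so no chunk contains the phrase whole. With 15 characters of overlap, one chunk does, at the price of four chunks instead of three.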