FormaX Team · May 18, 2026 · 6 min

RAG Text Chunking Guide: Overlap, JSONL Export, and Tuning

Prepare documents for vector search and RAG with fixed windows, overlapping chunks, and JSONL export, all in your browser.

RAG · AI · Data Processing

Why chunking is non-negotiable for RAG

A retrieval-augmented generation (RAG) pipeline splits long documents into retrievable segments, embeds them, and pulls only the relevant chunks at query time. Poor chunking means the retriever misses context, and the answers built on it become more prone to hallucination.

Character windows vs token windows

Production stacks often chunk by token count so that chunks align with embedding model input limits. For fast experiments, fixed character windows with overlap, computed in the browser, are enough. FormaX RAG Text Chunker uses that model: the chunk size must be at least 32 characters and the overlap must be smaller than the chunk size. It is not token-accurate, but it is well suited to validating a pipeline before swapping in a tokenizer such as tiktoken.
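To make the character-window model concrete, here is a minimal Python sketch. It is not the FormaX implementation: the function name `chunk_text`, the default values, and the returned dictionary keys are assumptions chosen for illustration. Only the two constraints (chunk size of at least 32, overlap smaller than the chunk size) come from the text above.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 60) -> list[dict]:
    """Split text into fixed character windows that share `overlap` characters.

    Hypothetical helper for illustration; offsets are character positions,
    with `end` exclusive.
    """
    if chunk_size < 32:
        raise ValueError("chunk_size must be at least 32")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break  # the last window reached the end of the text
        start += stride
    return chunks


if __name__ == "__main__":
    sample = "word " * 200  # 1000 characters of filler text
    for chunk in chunk_text(sample):
        print(chunk["start"], chunk["end"], len(chunk["text"]))
```

Because each window advances by `chunk_size - overlap`, the last `overlap` characters of one chunk are repeated at the start of the next, which is what lets a sentence that straddles a boundary still appear whole in one of the two chunks.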

Tuning overlap

  • Too little: sentences split across boundaries may never appear whole in one chunk.
  • Too much: redundant vectors and storage cost.
  • Starting point: ~400 characters per chunk with 50–80 characters of overlap; increase the chunk size for specs, decrease it for chat logs.
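The trade-off in the list above can be estimated with simple arithmetic before embedding anything. The sketch below assumes each window advances by the chunk size minus the overlap and that the final window is cut at the end of the text; `chunk_stats` and its return keys are hypothetical names used only for this example.

```python
import math


def chunk_stats(text_len: int, chunk_size: int, overlap: int) -> dict:
    """Estimate chunk count and duplicated characters for a given setting."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    if text_len <= 0:
        count = 0
    elif text_len <= chunk_size:
        count = 1
    else:
        # One full first window, then one more window per stride (rounded up).
        count = 1 + math.ceil((text_len - chunk_size) / stride)
    # Every pair of neighbouring chunks shares exactly `overlap` characters.
    duplicated = max(count - 1, 0) * overlap
    return {
        "chunks": count,
        "stored_chars": text_len + duplicated,
        "redundancy": duplicated / text_len if text_len else 0.0,
    }


if __name__ == "__main__":
    for overlap in (0, 60, 200):
        print(overlap, chunk_stats(10_000, 400, overlap))
```

Under these assumptions, a 10,000-character document at 400 characters per chunk yields 30 chunks with 60 characters of overlap (about 17% duplicated text) but 49 chunks with 200 characters of overlap (96% duplicated text), which is the "redundant vectors and storage cost" case.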

JSONL shape

{"chunk_index":0,"text":"...","char_count":400,"metadata":{"start":0,"end":400,"source":"formax-rag-chunker"}}
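A short sketch of producing and reading records in this shape with Python's standard `json` module follows. The field names match the sample line above; the helper names `to_jsonl` and `read_jsonl`, the input format (dictionaries with `start`, `end`, `text`), and the reading of `end` as an exclusive offset are assumptions for illustration.

```python
import json


def to_jsonl(chunks: list[dict], source: str = "formax-rag-chunker") -> str:
    """Serialize chunks as JSONL: one JSON object per line."""
    lines = []
    for index, chunk in enumerate(chunks):
        record = {
            "chunk_index": index,
            "text": chunk["text"],
            "char_count": len(chunk["text"]),
            "metadata": {
                "start": chunk["start"],
                "end": chunk["end"],  # assumed exclusive, so end - start == char_count
                "source": source,
            },
        }
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


def read_jsonl(payload: str) -> list[dict]:
    """Parse JSONL back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


if __name__ == "__main__":
    demo = [
        {"start": 0, "end": 5, "text": "Hello"},
        {"start": 3, "end": 8, "text": "lo, w"},
    ]
    print(to_jsonl(demo), end="")
```

Since every record sits on its own line, the file can be streamed into a vector database loader without parsing the whole export at once.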

Suggested workflow

  1. Paste clean Markdown or plain text.
  2. Preview chunks, tune parameters.
  3. Download chunks.jsonl for your vector DB; keep the chunk indices so answers can cite their source chunks.
  4. Evaluate recall with real user questions before production.
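The last step can be sketched as a small recall-at-k check. Everything here is an illustrative assumption: the question format (`question`, `relevant_chunks`), the `recall_at_k` helper, and especially `keyword_retriever`, a toy word-overlap ranker that stands in for a real embedding search against your vector DB.

```python
from typing import Callable

Retriever = Callable[[str, int], list[int]]


def recall_at_k(questions: list[dict], retrieve: Retriever, k: int = 5) -> float:
    """Average, over questions, of the share of relevant chunks found in the top k."""
    if not questions:
        return 0.0
    total = 0.0
    for item in questions:
        relevant = set(item["relevant_chunks"])
        retrieved = set(retrieve(item["question"], k)[:k])
        total += len(relevant & retrieved) / len(relevant)
    return total / len(questions)


def keyword_retriever(chunks: list[str]) -> Retriever:
    """Toy stand-in for vector search: rank chunks by shared lowercase words."""
    def retrieve(question: str, k: int) -> list[int]:
        words = set(question.lower().split())
        ranked = sorted(
            range(len(chunks)),
            key=lambda i: -len(words & set(chunks[i].lower().split())),
        )
        return ranked[:k]
    return retrieve


if __name__ == "__main__":
    chunks = [
        "reset your password from the account page",
        "invoices are emailed monthly",
        "export chunks as jsonl",
    ]
    questions = [
        {"question": "how do I reset my password", "relevant_chunks": [0]},
        {"question": "when are invoices emailed", "relevant_chunks": [1]},
        {"question": "what format is the download", "relevant_chunks": [2]},
    ]
    print(recall_at_k(questions, keyword_retriever(chunks), k=1))
```

Running the same question set after each change to chunk size or overlap gives a like-for-like comparison, provided the expected chunk indices are relabeled whenever the chunk boundaries move.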

All processing stays local. Try the chunker.