Question 1

What chunk size should I use?

Accepted Answer

There is no universal number. Start at 400 characters and adjust after checking whether real queries retrieve complete context.

Question 2

Why use overlap?

Accepted Answer

Overlap reduces the chance of losing context when a sentence is split across a chunk boundary, but too much overlap increases duplication and cost.
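As an illustration of how overlap works with character windows, here is a minimal sketch. The function name `chunk_text`, its parameters, and the `start`/`end` keys are assumptions made for this example, not the tool's actual implementation.

```python
def chunk_text(text: str, size: int = 400, overlap: int = 60) -> list[dict]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 100  # 1000 characters
for piece in chunk_text(sample, size=400, overlap=60):
    print(piece["start"], piece["end"], len(piece["text"]))
```

Each window begins `size - overlap` characters after the previous one, so the last 60 characters of one chunk are repeated at the start of the next. That repetition is what keeps a sentence cut at the boundary intact in at least one chunk, and it is also the source of the extra duplication and cost.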

Question 3

What is the difference between characters and tokens?

Accepted Answer

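Chunk size here is measured in characters, while embedding models count input in tokens. As a rough guide, Chinese runs about 1-2 tokens per character and English about 4 characters per token. Re-estimate chunk size in tokens before choosing an embedding model.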
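The rule of thumb can be turned into a quick estimate. This is a rough sketch only: `estimate_tokens` is a hypothetical helper, 1.5 tokens per Chinese character is simply the midpoint of the 1-2 range, and real tokenizers vary, so use your embedding model's own tokenizer when you need exact counts.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the rule of thumb, not a real tokenizer."""
    # Count characters in the CJK Unified Ideographs block (assumed ~1.5 tokens each).
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # Treat everything else like English text (assumed ~4 characters per token).
    other = len(text) - cjk
    return round(cjk * 1.5 + other / 4)


print(estimate_tokens("a" * 400))   # 400 English-like characters
print(estimate_tokens("中" * 400))  # 400 Chinese characters
```

Under these assumptions, the same 400-character chunk comes to about 100 tokens of English but about 600 tokens of Chinese, which is why a character-based size should be checked against the model's token limit.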

Question 4

What is the exported JSONL format?

Accepted Answer

One JSON object per line with chunk_index, text, and metadata (including start and end positions), ready for vector database import.
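A sketch of what producing such a file could look like follows. The top-level fields `chunk_index`, `text`, and `metadata` come from the description above; the metadata key names `start` and `end`, the input shape, and the function name `to_jsonl` are assumptions for illustration.

```python
import json


def to_jsonl(chunks: list[dict]) -> str:
    """Serialize chunks (dicts with 'text', 'start', 'end') as one JSON object per line."""
    lines = []
    for index, chunk in enumerate(chunks):
        record = {
            "chunk_index": index,
            "text": chunk["text"],
            # Key names inside metadata are assumed; check the tool's actual output.
            "metadata": {"start": chunk["start"], "end": chunk["end"]},
        }
        # ensure_ascii=False keeps Chinese and other non-ASCII text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False) + "\n")
    return "".join(lines)


example = [
    {"text": "First chunk", "start": 0, "end": 11},
    {"text": "第二块", "start": 11, "end": 14},
]
print(to_jsonl(example), end="")
```

Because every line is a complete JSON object, an importer can read the file line by line with `json.loads` without loading the whole export into memory.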

Question 5

Do Markdown headings affect chunking?

Accepted Answer

Chunking uses character windows and does not automatically respect Markdown heading boundaries. For better results, split the document by headings first and chunk each section separately.
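One possible way to pre-split is sketched below. The helper `split_by_headings` is hypothetical; it only recognizes ATX headings (`#` through `######` followed by a space) and does not account for `#` lines inside fenced code blocks, so treat it as a starting point.

```python
import re

# A line that starts with 1-6 '#' characters followed by a space.
HEADING = re.compile(r"^#{1,6} .*$", re.MULTILINE)


def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into sections, each beginning at a heading line."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b] for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s.strip()]  # drop empty sections


doc = "intro\n# A\nalpha\n## B\nbeta\n"
for section in split_by_headings(doc):
    print(repr(section))
```

Each returned section can then be pasted into the chunker on its own, so no chunk straddles two headings.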

Question 6

Can it process PDF or Word documents?

Accepted Answer

The tool accepts plain text only. Extract the text from PDF or Word first (using FormaX PDF-to-Word or similar tools), then paste it into the chunker.

Question 7

What overlap setting works best?

Accepted Answer

Aim for 10%-20% of chunk size. For example, pair 400-character chunks with 50-80 characters of overlap, then fine-tune based on retrieval quality.
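The arithmetic behind this guideline can be sketched as follows. Both helpers are hypothetical, and `chunk_count` assumes plain sliding character windows that advance by `chunk_size - overlap` each step.

```python
import math


def overlap_range(chunk_size: int, low: float = 0.10, high: float = 0.20) -> tuple[int, int]:
    """Return the (min, max) overlap in characters for a 10%-20% guideline."""
    return round(chunk_size * low), round(chunk_size * high)


def chunk_count(text_length: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover text_length characters."""
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((text_length - chunk_size) / step)


print(overlap_range(400))
print(chunk_count(10_000, 400, 0), chunk_count(10_000, 400, 80))
```

For 400-character chunks the guideline works out to 40-80 characters, so the 50-80 example sits inside it. Under the sliding-window assumption, a 10,000-character document needs 25 chunks with no overlap and 31 with 80 characters of overlap, which gives a feel for the extra embedding and storage cost.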

Question 8

How do I evaluate chunking quality?

Accepted Answer

Test with 10-20 real user questions and check whether the retrieved chunks contain the complete answer context, rather than judging by chunk count alone.
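A small harness along these lines might look like the sketch below. Everything in it is an assumption for illustration: `context_hit_rate` uses a plain substring check as a crude proxy for "complete answer context", and `keyword_retrieve` is a toy word-overlap retriever standing in for a real vector search.

```python
def keyword_retrieve(chunks: list[str]):
    """Build a toy retriever that ranks chunks by words shared with the question."""
    def retrieve(question: str, k: int) -> list[str]:
        words = set(question.lower().split())
        ranked = sorted(
            chunks,
            key=lambda c: len(words & set(c.lower().split())),
            reverse=True,
        )
        return ranked[:k]
    return retrieve


def context_hit_rate(cases, retrieve, k: int = 3):
    """cases: (question, required_snippet) pairs. Returns (hit rate, missed questions)."""
    misses = []
    for question, snippet in cases:
        retrieved = retrieve(question, k)
        # A hit means at least one retrieved chunk contains the whole snippet.
        if not any(snippet in chunk for chunk in retrieved):
            misses.append(question)
    rate = (len(cases) - len(misses)) / len(cases) if cases else 0.0
    return rate, misses


chunks = ["refunds are issued within 14 days", "shipping takes 3 days"]
cases = [
    ("when are refunds issued", "within 14 days"),
    ("how long does shipping take", "5 days"),
]
print(context_hit_rate(cases, keyword_retrieve(chunks), k=1))
```

The list of missed questions is usually more useful than the rate itself: each miss points to a place where the chunk size or overlap split the answer, or where the retriever ranked the wrong chunk first.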

Text chunker → JSONL

RAG Chunking Parameters and JSONL Export Guide

Common Use Cases

How to Tune Parameters

JSONL Output Shape

Boundaries

FAQ