Splitting texts into chunks is a key step in building a good vector database. When embeddings are created for RAG systems, the size and semantic coherence of the segments directly affect the accuracy and relevance of search results. Chunks that are too short fragment the content, while chunks that are too long risk merging unrelated information, making queries less effective.

The whole process also depends on the tokenizer used: detecting sentence boundaries reliably is a prerequisite for applying good chunking strategies.
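As an illustration only (not the exact pipeline behind the graph), here is a minimal sketch of sentence-aware chunking in Python. The regex-based `split_sentences` is a naive stand-in for a proper sentence tokenizer, and `chunk_size` / `overlap` are measured in characters:

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on ., ! or ? followed by whitespace.
    # A real pipeline would use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences: list[str], chunk_size: int, overlap: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size
    characters, carrying roughly `overlap` characters into the next chunk."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sent in sentences:
        if current and current_len + len(sent) + 1 > chunk_size:
            chunks.append(" ".join(current))
            # Keep trailing sentences whose combined length is about `overlap`.
            kept: list[str] = []
            kept_len = 0
            for s in reversed(current):
                if kept_len + len(s) > overlap:
                    break
                kept.insert(0, s)
                kept_len += len(s)
            current, current_len = kept, kept_len
        current.append(sent)
        current_len += len(sent) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Packing whole sentences rather than cutting at a fixed character offset keeps each chunk readable and avoids splitting a sentence across two embeddings.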

[Figure: histogram of sentence lengths in the corpus]

The histogram in the figure, built from a literary text, shows the distribution of 1011 sentences, with an average length of 118.48 characters and a standard deviation of 94.49. Sentence lengths in the corpus therefore vary widely.
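These statistics can be computed with the standard library alone; the sketch below only assumes a list of sentences such as the one returned by the splitter above, while the figures quoted here (1011 sentences, mean ≈ 118.48, std ≈ 94.49) come from the original corpus, which is not reproduced:

```python
import statistics


def sentence_length_stats(sentences: list[str]) -> tuple[int, float, float]:
    """Return (count, mean length, standard deviation) of sentence lengths in characters."""
    lengths = [len(s) for s in sentences]
    mean = statistics.mean(lengths)
    # Sample vs. population standard deviation differ slightly;
    # the figure does not say which one was used.
    std = statistics.stdev(lengths)
    return len(lengths), mean, std
```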

One limit of the current approach stems from the asymmetric distribution of sentence lengths, with a long right tail (see graph): a single chunk_size value may not suit the whole corpus, and very long sentences may need special treatment.

The graph also shows that most sentences are under 200 characters, with many between 50 and 150.

The purple line marks the mean, while the dotted lines mark one standard deviation above and below it (212.97 and 23.99 characters), giving an at-a-glance view of the variability in the corpus.
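A similar annotated histogram can be produced with a few lines of matplotlib. This is a minimal sketch, not the original plotting code, and it assumes a list `lengths` of per-sentence character counts:

```python
import statistics

import matplotlib.pyplot as plt


def plot_length_histogram(lengths: list[int]) -> None:
    mean = statistics.mean(lengths)
    std = statistics.stdev(lengths)
    plt.hist(lengths, bins=50, color="steelblue", edgecolor="black")
    # Mean (solid) and mean ± one standard deviation (dotted).
    plt.axvline(mean, color="purple", label=f"mean = {mean:.2f}")
    plt.axvline(mean + std, color="gray", linestyle=":", label=f"mean + std = {mean + std:.2f}")
    plt.axvline(mean - std, color="gray", linestyle=":", label=f"mean - std = {mean - std:.2f}")
    plt.xlabel("Sentence length (characters)")
    plt.ylabel("Number of sentences")
    plt.legend()
    plt.show()
```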

Finally, two segmentation parameters were derived: a chunk size of 401 characters (red line) and an overlap of 141 characters (green line), for a total of about 461 chunks. The shaded areas, green for the overlap and pink for the chunk size, make it easy to see how these values relate to the actual sentence lengths.
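The reported figures are internally consistent: with a stride of chunk_size minus overlap characters per step, the expected number of chunks is roughly the total corpus length divided by that stride. A quick back-of-the-envelope check (not necessarily the exact formula used to produce the graph):

```python
# Rough consistency check of the reported values, using only the numbers above.
n_sentences = 1011
mean_len = 118.48
chunk_size = 401
overlap = 141

total_chars = n_sentences * mean_len      # ≈ 119,783 characters in the corpus
stride = chunk_size - overlap             # 260 characters of forward progress per chunk
estimated_chunks = total_chars / stride   # ≈ 460.7, i.e. about 461 chunks
print(round(estimated_chunks))
```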

The sentenza library, written over the holidays to help with text analysis and used to produce this graph, is still experimental, but it can already provide useful support for improving the text chunking process.

Link: sentenza