improvement(kb): improve chunkers, respect user-specified chunk configurations, added tests #2539
Greptile Summary
This PR refactors the chunking system to use consistent units (tokens vs. characters) and to respect user-specified chunk configurations across all chunker types. The changes improve clarity by renaming parameters.
Key improvements:
Issues found:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant UI as CreateBaseModal
participant API as /api/knowledge
participant Service as DocumentService
participant Processor as DocumentProcessor
participant Chunker as TextChunker/JsonYamlChunker/StructuredDataChunker
User->>UI: Configure chunking (maxSize, minSize, overlap)
Note over UI: Units: maxSize=tokens, minSize=chars, overlap=tokens
UI->>UI: Validate: minSize < (maxSize × 4)
UI->>API: POST with chunkingConfig
API->>API: Validate with Zod schema
Note over API: maxSize: 100-4000 tokens<br/>minSize: 1-2000 chars<br/>overlap: 0-500 tokens
API->>Service: Create KB with config
User->>Service: Upload document
Service->>Processor: processDocument(chunkSize, chunkOverlap, minCharactersPerChunk)
Note over Processor: Maps config:<br/>maxSize→chunkSize<br/>overlap→chunkOverlap<br/>minSize→minCharactersPerChunk
Processor->>Processor: Detect file type
alt JSON/YAML
Processor->>Chunker: JsonYamlChunker(chunkSize, minCharactersPerChunk)
Chunker->>Chunker: Split by structure, filter by minCharactersPerChunk
else CSV/XLSX
Processor->>Chunker: StructuredDataChunker(chunkSize)
Chunker->>Chunker: Calculate rows/chunk based on chunkSize
else Text/Markdown
Processor->>Chunker: TextChunker(chunkSize, chunkOverlap, minCharactersPerChunk)
Chunker->>Chunker: Clamp overlap to 50% of chunkSize
Chunker->>Chunker: Split hierarchically by separators
Chunker->>Chunker: Add overlap (tokens→chars conversion)
Chunker->>Chunker: Calculate metadata (startIndex, endIndex)
end
Chunker-->>Processor: Return chunks with token counts
Processor-->>Service: Return processed chunks
Service->>Service: Generate embeddings
Service->>Service: Store in vector DB
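The validation, mapping, and overlap steps in the diagram can be sketched in TypeScript as follows. This is a minimal illustration, not the PR's actual code: the diagram says the API validates with a Zod schema, whereas this sketch uses hand-written range checks so it stays dependency-free. The names `ChunkingConfig`, `ProcessorOptions`, `validateChunkingConfig`, `toProcessorOptions`, `clampOverlap`, and `tokensToChars` are hypothetical, and the ratio of 4 characters per token is an assumption inferred from the UI check `minSize < (maxSize × 4)`.

```typescript
// Hypothetical types: field names follow the diagram, the type names do not come from the PR.
interface ChunkingConfig {
  maxSize: number // tokens
  minSize: number // characters
  overlap: number // tokens
}

interface ProcessorOptions {
  chunkSize: number // tokens
  chunkOverlap: number // tokens
  minCharactersPerChunk: number // characters
}

// Assumption: about 4 characters per token, inferred from the UI rule minSize < maxSize × 4.
const CHARS_PER_TOKEN = 4

const inRange = (value: number, lo: number, hi: number): boolean =>
  value >= lo && value <= hi

// Stand-in for the Zod schema: the ranges are the ones stated in the diagram notes.
function validateChunkingConfig(config: ChunkingConfig): string[] {
  const errors: string[] = []
  if (!inRange(config.maxSize, 100, 4000)) errors.push('maxSize must be 100-4000 tokens')
  if (!inRange(config.minSize, 1, 2000)) errors.push('minSize must be 1-2000 characters')
  if (!inRange(config.overlap, 0, 500)) errors.push('overlap must be 0-500 tokens')
  // Cross-field check from the UI step: compare characters against tokens × 4.
  if (config.minSize >= config.maxSize * CHARS_PER_TOKEN) {
    errors.push('minSize must be less than maxSize × 4 characters')
  }
  return errors
}

// Mapping shown in the Processor note: maxSize→chunkSize, overlap→chunkOverlap,
// minSize→minCharactersPerChunk.
function toProcessorOptions(config: ChunkingConfig): ProcessorOptions {
  return {
    chunkSize: config.maxSize,
    chunkOverlap: config.overlap,
    minCharactersPerChunk: config.minSize,
  }
}

// TextChunker step: clamp overlap to at most 50% of chunkSize (both in tokens).
function clampOverlap(chunkSize: number, chunkOverlap: number): number {
  return Math.min(chunkOverlap, Math.floor(chunkSize / 2))
}

// TextChunker step: convert a token count to an approximate character count.
function tokensToChars(tokens: number): number {
  return tokens * CHARS_PER_TOKEN
}
```

For example, a config of `{ maxSize: 100, minSize: 400, overlap: 600 }` would fail two checks in this sketch: the overlap exceeds 500 tokens, and `minSize` is not below `maxSize × 4`.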
19 files reviewed, 5 comments
19 files reviewed, 2 comments
…gurations, added tests (#2539)
* improvement(kb): improve chunkers, respect user-specified chunk configurations, added tests
* ack PR comments
* updated docs
* cleanup
Summary
minCharactersPerChunk, maxChunkSize, chunkOverlap
Fixes #2510
Type of Change
Testing
Tested manually
Checklist