Skip to content

Conversation

@waleedlatif1
Copy link
Collaborator

@waleedlatif1 waleedlatif1 commented Dec 23, 2025

Summary

  • improve chunkers
  • respect user-specified chunk configurations for non-text chunkers
  • added UI hints for units for minCharactersPerChunk, maxChunkSize, chunkOverlap
  • added tests

fixes #2510

Type of Change

  • Other: Improvement

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link

vercel bot commented Dec 23, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Review Updated (UTC)
docs Skipped Skipped Dec 23, 2025 4:47am

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 23, 2025

Greptile Summary

This PR refactors the chunking system to use consistent units (tokens vs characters) and respect user-specified chunk configurations across all chunker types. The changes improve clarity by renaming parameters (chunkOverlap, minCharactersPerChunk) and adding comprehensive documentation throughout the codebase.

Key improvements:

  • Renamed overlapchunkOverlap and minChunkSizeminCharactersPerChunk for clarity
  • Added unit specifications: maxSize in tokens, minSize in characters, overlap in tokens
  • Non-text chunkers (JSON/YAML, structured data) now respect user's chunkSize preference
  • Default minSize changed from 1 to 100 characters to filter tiny fragments
  • UI labels now show units: "Max Chunk Size (tokens)", "Min Chunk Size (characters)", etc.
  • Added 265 lines of comprehensive tests for TextChunker
  • API validation properly compares cross-unit values (tokens × 4 vs characters)

Issues found:

  • Update route documentation incorrectly states overlap is in "characters" instead of "tokens"
  • Silent overlap clamping (max 50% of chunk size) may surprise users who set higher values

Confidence Score: 4/5

  • This PR is safe to merge with one documentation fix needed
  • The refactor is well-executed with comprehensive tests, clear documentation, and proper unit handling. One logic issue exists: the update route has incorrect documentation (overlap units). The silent overlap clamping is a design choice that could be improved with validation feedback. Overall implementation is sound, respects user preferences, and improves code clarity significantly.
  • Pay attention to apps/sim/app/api/knowledge/[id]/route.ts - fix the overlap documentation from "characters" to "tokens"

Important Files Changed

Filename Overview
apps/sim/lib/chunkers/text-chunker.ts Refactored to use tokens consistently with overlap clamping and improved documentation. Silent clamping could surprise users.
apps/sim/lib/chunkers/types.ts Renamed parameters for clarity (chunkOverlap, minCharactersPerChunk) with clear unit documentation. Clean refactor.
apps/sim/app/api/knowledge/route.ts Updated validation with improved unit documentation and cross-unit comparison (tokens to characters). Good defaults changed from minSize: 1 to 100.
apps/sim/app/api/knowledge/[id]/route.ts Added validation for update schema, but overlap documentation is incorrect (says characters, should be tokens).
apps/sim/lib/chunkers/text-chunker.test.ts Comprehensive test suite added with 265 lines covering edge cases, overlap, hierarchical splitting, and unicode text.

Sequence Diagram

sequenceDiagram
    participant User
    participant UI as CreateBaseModal
    participant API as /api/knowledge
    participant Service as DocumentService
    participant Processor as DocumentProcessor
    participant Chunker as TextChunker/JsonYamlChunker/StructuredDataChunker
    
    User->>UI: Configure chunking (maxSize, minSize, overlap)
    Note over UI: Units: maxSize=tokens, minSize=chars, overlap=tokens
    UI->>UI: Validate: minSize < (maxSize × 4)
    UI->>API: POST with chunkingConfig
    API->>API: Validate with Zod schema
    Note over API: maxSize: 100-4000 tokens<br/>minSize: 1-2000 chars<br/>overlap: 0-500 tokens
    API->>Service: Create KB with config
    
    User->>Service: Upload document
    Service->>Processor: processDocument(chunkSize, chunkOverlap, minCharactersPerChunk)
    Note over Processor: Maps config:<br/>maxSize→chunkSize<br/>overlap→chunkOverlap<br/>minSize→minCharactersPerChunk
    
    Processor->>Processor: Detect file type
    
    alt JSON/YAML
        Processor->>Chunker: JsonYamlChunker(chunkSize, minCharactersPerChunk)
        Chunker->>Chunker: Split by structure, filter by minCharactersPerChunk
    else CSV/XLSX
        Processor->>Chunker: StructuredDataChunker(chunkSize)
        Chunker->>Chunker: Calculate rows/chunk based on chunkSize
    else Text/Markdown
        Processor->>Chunker: TextChunker(chunkSize, chunkOverlap, minCharactersPerChunk)
        Chunker->>Chunker: Clamp overlap to 50% of chunkSize
        Chunker->>Chunker: Split hierarchically by separators
        Chunker->>Chunker: Add overlap (tokens→chars conversion)
        Chunker->>Chunker: Calculate metadata (startIndex, endIndex)
    end
    
    Chunker-->>Processor: Return chunks with token counts
    Processor-->>Service: Return processed chunks
    Service->>Service: Generate embeddings
    Service->>Service: Store in vector DB
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

19 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

@waleedlatif1
Copy link
Collaborator Author

@greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

19 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@waleedlatif1 waleedlatif1 merged commit 37443a7 into staging Dec 23, 2025
6 checks passed
@waleedlatif1 waleedlatif1 deleted the improvement/kb branch December 23, 2025 04:47
waleedlatif1 added a commit that referenced this pull request Dec 23, 2025
…gurations, added tests (#2539)

* improvement(kb): improve chunkers, respect user-specified chunk configurations, added tests

* ack PR commnets

* updated docs

* cleanup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants