# code-chunk

AST-aware code chunking for semantic search and RAG pipelines.

Uses tree-sitter to split source code at semantic boundaries (functions, classes, methods) rather than arbitrary character limits. Each chunk includes rich context: scope chain, imports, siblings, and entity signatures.
- Features
- How It Works
- Installation
- Quickstart
- API Reference
- License
## Features

- **AST-aware**: Splits at semantic boundaries, never mid-function
- **Rich context**: Scope chain, imports, siblings, entity signatures
- **Contextualized text**: Pre-formatted for embedding models
- **Multi-language**: TypeScript, JavaScript, Python, Rust, Go, Java
- **Batch processing**: Process entire codebases with controlled concurrency
- **Streaming**: Process large files incrementally
- **Effect support**: First-class Effect integration
## How It Works

Traditional text splitters chunk code by character count or line breaks, often cutting functions in half or separating related code. code-chunk takes a different approach:
Source code is parsed into an Abstract Syntax Tree (AST) using tree-sitter. This gives us a structured representation of the code that understands language grammar.
We traverse the AST to extract semantic entities: functions, methods, classes, interfaces, types, and imports. For each entity, we capture:
- Name and type
- Full signature (e.g., `async getUser(id: string): Promise<User>`)
- Docstring/comments if present
- Byte and line ranges
Entities are organized into a hierarchical scope tree that captures nesting relationships. A method inside a class knows its parent; a nested function knows its containing function. This lets us provide scope context like `UserService > getUser`.
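The scope tree and chain can be pictured with a small sketch. The `Entity` shape and `scopeChain` helper here are illustrative only, not the library's actual exports:

```typescript
// Hypothetical entity shape for illustration; not the library's real types.
interface Entity {
  name: string
  type: 'class' | 'method' | 'function'
  signature: string
  children: Entity[]
}

// A scope tree for a class containing one method.
const userService: Entity = {
  name: 'UserService',
  type: 'class',
  signature: 'class UserService',
  children: [
    {
      name: 'getUser',
      type: 'method',
      signature: 'async getUser(id: string): Promise<User>',
      children: [],
    },
  ],
}

// Walking from the root to a nested entity yields its scope chain.
function scopeChain(root: Entity, target: string, trail: string[] = []): string[] | null {
  const path = [...trail, root.name]
  if (root.name === target) return path
  for (const child of root.children) {
    const found = scopeChain(child, target, path)
    if (found) return found
  }
  return null
}

console.log(scopeChain(userService, 'getUser')?.join(' > ')) // "UserService > getUser"
```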
Code is split at semantic boundaries while respecting the `maxChunkSize` limit. The chunker:
- Prefers to keep complete entities together
- Splits oversized entities at logical points (statement boundaries)
- Never cuts mid-expression or mid-statement
- Merges small adjacent chunks to reduce fragmentation
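The keep-together/merge behavior above can be sketched as a greedy packer over entity spans. This is a simplified illustration, not the library's actual algorithm (the real chunker also splits oversized entities at statement boundaries):

```typescript
// Simplified sketch: pack whole-entity spans into chunks no larger than
// maxChunkSize, merging small adjacent spans by keeping them in one chunk.
interface Span { text: string; bytes: number }

function pack(spans: Span[], maxChunkSize: number): string[] {
  const chunks: string[] = []
  let current: string[] = []
  let size = 0
  for (const span of spans) {
    // Start a new chunk only when adding this entity would overflow.
    if (size + span.bytes > maxChunkSize && current.length > 0) {
      chunks.push(current.join('\n'))
      current = []
      size = 0
    }
    current.push(span.text)
    size += span.bytes
  }
  if (current.length > 0) chunks.push(current.join('\n'))
  return chunks
}

const spans = [
  { text: 'function a() {}', bytes: 15 },
  { text: 'function b() {}', bytes: 15 },
  { text: 'function c() {}', bytes: 15 },
]
console.log(pack(spans, 32).length) // 2 — a and b merge into one chunk, c starts another
```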
Each chunk is enriched with contextual metadata:
- Scope chain: Where this code lives (e.g., inside which class/function)
- Entities: What's defined in this chunk
- Siblings: What comes before/after (for continuity)
- Imports: What dependencies are used
This context is formatted into `contextualizedText`, optimized for embedding models to understand semantic relationships.
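The enrichment step amounts to assembling a context header and prepending it to the raw code. A minimal sketch, with field names mirroring the example output shown later in this README (the function and interface names here are illustrative):

```typescript
// Illustrative sketch of prepending semantic context to chunk text;
// not the library's internal implementation.
interface ChunkContext {
  filepath: string
  scope: string[]
  defines: string[]
  uses: string[]
}

function contextualize(ctx: ChunkContext, text: string): string {
  const header = [
    ctx.filepath,
    `Scope: ${ctx.scope.join(' > ')}`,
    `Defines: ${ctx.defines.join(', ')}`,
    `Uses: ${ctx.uses.join(', ')}`,
  ].join('\n')
  return `${header}\n\n${text}`
}

const out = contextualize(
  { filepath: 'src/services/user.ts', scope: ['UserService'], defines: ['getUser'], uses: ['Database'] },
  'async getUser(id: string) { /* ... */ }',
)
console.log(out.startsWith('src/services/user.ts\nScope: UserService')) // true
```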
## Installation

```bash
bun add code-chunk
```

or

```bash
npm install code-chunk
```
## Quickstart

```typescript
import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  console.log(c.text)
  console.log(c.context.scope)    // [{ name: 'UserService', type: 'class' }]
  console.log(c.context.entities) // [{ name: 'getUser', type: 'method', ... }]
}
```
Use `contextualizedText` for better embedding quality in RAG systems:

```typescript
for (const c of chunks) {
  const embedding = await embed(c.contextualizedText)
  await vectorDB.upsert({
    id: `${filepath}:${c.index}`,
    embedding,
    metadata: { filepath, lines: c.lineRange }
  })
}
```
The `contextualizedText` prepends semantic context to the raw code:

```
src/services/user.ts
Scope: UserService
Defines: async getUser(id: string): Promise<User>
Uses: Database
After: constructor

async getUser(id: string): Promise<User> {
  return this.db.query('SELECT * FROM users WHERE id = ?', [id])
}
```
Process chunks incrementally without loading everything into memory:

```typescript
import { chunkStream } from 'code-chunk'

for await (const c of chunkStream('src/large.ts', code)) {
  await process(c)
}
```
Create a chunker instance when processing multiple files with the same config:

```typescript
import { createChunker } from 'code-chunk'

const chunker = createChunker({
  maxChunkSize: 2048,
  contextMode: 'full',
  siblingDetail: 'signatures',
})

for (const file of files) {
  const chunks = await chunker.chunk(file.path, file.content)
}
```
Process multiple files concurrently with error handling per file:

```typescript
import { chunkBatch } from 'code-chunk'

const files = [
  { filepath: 'src/user.ts', code: userCode },
  { filepath: 'src/auth.ts', code: authCode },
  { filepath: 'lib/utils.py', code: utilsCode },
]

const results = await chunkBatch(files, {
  maxChunkSize: 1500,
  concurrency: 10,
  onProgress: (done, total, path, success) => {
    console.log(`[${done}/${total}] ${path}: ${success ? 'ok' : 'failed'}`)
  }
})

for (const result of results) {
  if (result.error) {
    console.error(`Failed: ${result.filepath}`, result.error)
  } else {
    await indexChunks(result.filepath, result.chunks)
  }
}
```
Stream results as they complete:

```typescript
import { chunkBatchStream } from 'code-chunk'

for await (const result of chunkBatchStream(files, { concurrency: 5 })) {
  if (result.chunks) {
    await indexChunks(result.filepath, result.chunks)
  }
}
```
For Effect-based pipelines:

```typescript
import { chunkStreamEffect } from 'code-chunk'
import { Effect, Stream } from 'effect'

const program = Stream.runForEach(
  chunkStreamEffect('src/utils.ts', code),
  (chunk) => Effect.log(chunk.text)
)

await Effect.runPromise(program)
```
## API Reference

### chunk(filepath, code, options?)

Chunk source code into semantic pieces with context.

Parameters:

- `filepath`: File path (used for language detection)
- `code`: Source code string
- `options`: Optional configuration

Returns: `Promise` resolving to an array of chunks

Throws: `ChunkingError`, `UnsupportedLanguageError`
---
### chunkStream(filepath, code, options?)

Stream chunks as they're generated. Useful for large files.

Returns: `AsyncGenerator` yielding chunks

Note: `chunk.totalChunks` is `-1` in streaming mode (unknown upfront).
---
### chunkStreamEffect(filepath, code, options?)

Effect-native streaming API for composable pipelines.

Returns: `Stream.Stream` of chunks
---
### createChunker(options?)

Create a reusable chunker instance with default options.

Returns: a `Chunker` with `chunk()`, `stream()`, `chunkBatch()`, and `chunkBatchStream()` methods
---
### chunkBatch(files, options?)

Process multiple files concurrently with per-file error handling.

Parameters:

- `files`: Array of `{ filepath, code, options? }`
- `options`: Batch options (extends `ChunkOptions` with `concurrency` and `onProgress`)

Returns: `Promise` of an array where each result has `{ filepath, chunks, error }`
---
### chunkBatchStream(files, options?)

Stream batch results as files complete processing.

Returns: `AsyncGenerator` yielding per-file results
---
Effect-native batch processing.

Returns: `Effect.Effect` producing the batch results
---
Effect-native streaming batch processing.

Returns: `Stream.Stream` of per-file results
---
Format chunk text with semantic context prepended. Useful for custom embedding pipelines.

Returns: `string`
---
Detect programming language from file extension.

Returns: `Language | null`
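Extension-based detection can be pictured as a lookup table mirroring the supported-languages table below. The `detectByExtension` function and the mapping here are illustrative, not the library's internal implementation:

```typescript
// Illustrative extension → language mapping; the real library exposes its
// own detection function and Language type.
const EXT_TO_LANG: Record<string, string> = {
  '.ts': 'typescript', '.tsx': 'typescript', '.mts': 'typescript', '.cts': 'typescript',
  '.js': 'javascript', '.jsx': 'javascript', '.mjs': 'javascript', '.cjs': 'javascript',
  '.py': 'python', '.pyi': 'python',
  '.rs': 'rust',
  '.go': 'go',
  '.java': 'java',
}

function detectByExtension(filepath: string): string | null {
  const dot = filepath.lastIndexOf('.')
  if (dot === -1) return null
  return EXT_TO_LANG[filepath.slice(dot)] ?? null
}

console.log(detectByExtension('lib/utils.py')) // "python"
console.log(detectByExtension('README'))       // null
```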
---
### ChunkOptions

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `maxChunkSize` | `number` | `1500` | Maximum chunk size in bytes |
| `contextMode` | `'none' \| 'minimal' \| 'full'` | `'full'` | How much context to include |
| `siblingDetail` | `'none' \| 'names' \| 'signatures'` | `'signatures'` | Level of sibling detail |
| `filterImports` | `boolean` | `false` | Filter out import statements |
| `language` | `Language` | auto | Override language detection |
| `overlapLines` | `number` | `10` | Lines from previous chunk to include in `contextualizedText` |
Extends `ChunkOptions` with:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `concurrency` | `number` | `10` | Maximum files to process concurrently |
| `onProgress` | `function` | - | Callback `(completed, total, filepath, success) => void` |
---
### Supported Languages

| Language | Extensions |
|----------|------------|
| TypeScript | `.ts`, `.tsx`, `.mts`, `.cts` |
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` |
| Python | `.py`, `.pyi` |
| Rust | `.rs` |
| Go | `.go` |
| Java | `.java` |
---
- `ChunkingError`: Thrown when chunking fails (parsing error, extraction error, etc.)
- `UnsupportedLanguageError`: Thrown when the file extension is not supported

Both errors have a `_tag` property for Effect-style error handling.
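Discriminating on `_tag` can look like the following sketch. The error classes here are local stand-ins to show the pattern; in practice you would import the real classes from `code-chunk`:

```typescript
// Stand-in error classes illustrating the `_tag` convention; the real
// classes come from 'code-chunk'.
class ChunkingError extends Error {
  readonly _tag = 'ChunkingError'
}
class UnsupportedLanguageError extends Error {
  readonly _tag = 'UnsupportedLanguageError'
}

function isTagged(e: unknown): e is { _tag: string } {
  return typeof e === 'object' && e !== null && '_tag' in e
}

// Route errors by tag instead of instanceof checks.
function describe(err: unknown): string {
  if (!isTagged(err)) return 'unknown error'
  switch (err._tag) {
    case 'UnsupportedLanguageError':
      return 'skip file: unsupported extension'
    case 'ChunkingError':
      return 'log and continue: chunking failed'
    default:
      return 'unknown error'
  }
}

console.log(describe(new UnsupportedLanguageError())) // "skip file: unsupported extension"
```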
## License

MIT