A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability.
- Segment reconstruction – marks filler words, hints, and time gaps to create natural paragraph boundaries and merges overly short segments back into their predecessors.
- Timestamped formatting – produces human-friendly transcripts with optional custom formatting callbacks and automatic timestamp rendering.
- Ground-truth alignment – synchronises AI-generated tokens with human-edited text, interpolating timings for missing words and removing unknown tokens when applying the ground truth.
- Selection helpers – exposes utilities to find tokens for string queries or cursor selections, enabling rich text editors to jump to precise timestamps.
- Hint system (Arabic-first) – robust multi-word hint matching using normalization (diacritics/punctuation tolerant), plus hard boundary insertion via `ALWAYS_BREAK`.
- Auto-hint generation – mines frequent repeated phrases from `Token[]` or `Segment[]` and returns sorted hint candidates for Arabic-heavy transcripts.
- Utility toolkit – includes helpers for timestamp formatting, punctuation detection (including Arabic punctuation), ground-truth tokenization, and normalization utilities.
- Bun-native toolchain – powered by the upstream tsdown CLI for bundling and Biome for linting, so the same commands run locally and in CI without any custom wrappers.
- Hints are normalized by default: `createHints(...)` now uses Arabic-first normalization for matching and mining. If you relied on exact string matching, update your expectations and/or pass explicit normalization options.
- `ALWAYS_BREAK` is a true hard boundary: segments/lines after an `ALWAYS_BREAK` must not be merged into previous segments.
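To make "diacritics/punctuation tolerant" concrete, here is a minimal sketch of the kind of normalization such matching implies. It is not the library's implementation: `normalizeForMatching` is a hypothetical helper, and the character ranges it uses are assumptions chosen for illustration. The library's own entry point for this is `normalizeTokenText`.

```typescript
// Hypothetical helper, not part of paragrafs: shows how a vocalized token and
// an unvocalized hint can be made to compare equal.
const ARABIC_MARKS = /[\u064B-\u0652\u0670\u0640]/g; // tashkeel, dagger alef, tatweel
const ALEF_VARIANTS = /[\u0622\u0623\u0625]/g; // alef with madda or hamza
const PUNCTUATION = /[\u060C\u061B\u061F.,!?;:]/g; // Arabic and Latin punctuation

function normalizeForMatching(text: string): string {
    return text
        .replace(ARABIC_MARKS, '')
        .replace(ALEF_VARIANTS, '\u0627') // fold every variant to a bare alef
        .replace(PUNCTUATION, '')
        .trim();
}

// The first word of the phrase, fully vocalized, versus its bare spelling.
const vocalized = normalizeForMatching('\u0623\u064E\u062D\u0652\u0633\u064E\u0646\u064E');
const bare = normalizeForMatching('\u0627\u062D\u0633\u0646');
console.log(vocalized === bare); // true
```

Because both sides pass through the same function, a hint typed without diacritics still matches a transcript that includes them, which is why code that relied on exact string equality may see different results.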
```bash
npm install paragrafs
```

or

```bash
pnpm install paragrafs
```

or

```bash
yarn add paragrafs
```

or

```bash
bun add paragrafs
```
```typescript
import { estimateSegmentFromToken, mapSegmentsIntoFormattedSegments } from 'paragrafs';

// Example token from transcription
const token = {
    start: 0,
    end: 5,
    text: 'This is a sample text. It should be properly segmented.',
};

// Estimate segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Format the segment into readable text
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);

console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."
```
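One way to picture what "word-level tokens" means here is to divide the token's duration between its words. The sketch below assumes an even split per whitespace-separated word. That is an assumption for illustration only, and `splitEvenly` is a hypothetical stand-in, not the logic of `estimateSegmentFromToken`.

```typescript
type Token = { start: number; end: number; text: string };
type Segment = Token & { tokens: Token[] };

// Hypothetical stand-in: shares the token's duration evenly between its words.
function splitEvenly(token: Token): Segment {
    const words = token.text.split(/\s+/).filter(Boolean);
    const step = words.length > 0 ? (token.end - token.start) / words.length : 0;
    const tokens = words.map((text, i) => ({
        start: token.start + i * step,
        end: token.start + (i + 1) * step,
        text,
    }));
    return { ...token, tokens };
}

const estimated = splitEvenly({ start: 0, end: 5, text: 'one two three four five' });
console.log(estimated.tokens.map((t) => `${t.start}-${t.end} ${t.text}`).join(', '));
// 0-1 one, 1-2 two, 2-3 three, 3-4 four, 4-5 five
```

Per-word timings are what allow the later steps to place paragraph breaks inside what was originally a single long token.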
```typescript
import {
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Example transcription segments
const segments = [
    {
        start: 0,
        end: 6.5,
        text: 'The quick brown fox!',
        tokens: [
            { start: 0, end: 1, text: 'The' },
            { start: 1, end: 2, text: 'quick' },
            { start: 2, end: 3, text: 'brown' },
            { start: 3, end: 6.5, text: 'fox!' },
        ],
    },
    {
        start: 8,
        end: 13,
        text: 'Jumps right over the',
        tokens: [
            { start: 8, end: 9, text: 'Jumps' },
            { start: 9, end: 10, text: 'right' },
            { start: 10, end: 11, text: 'over' },
            { start: 12, end: 13, text: 'the' },
        ],
    },
];

// Options for segment formatting
const options = {
    fillers: ['uh', 'umm', 'hmmm'],
    gapThreshold: 3,
    maxSecondsPerSegment: 12,
    minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

// Get timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10);

console.log(transcript);
// Output:
// 0:00: The quick brown fox!
// 0:08: Jumps right over the
```
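The output above can be reasoned about with a much smaller model of the pipeline. The following is a rough sketch under stated assumptions: a line ends after a token with trailing punctuation, or when the silence before the next token reaches the gap threshold. Fillers, hints, `maxSecondsPerSegment` and `minWordsPerSegment` are left out, and `splitIntoLines` and `toTimestamp` are hypothetical names, not library exports.

```typescript
type Token = { start: number; end: number; text: string };
type Line = { start: number; end: number; text: string };

// Hypothetical simplification of the marking and grouping steps.
function splitIntoLines(tokens: Token[], gapThreshold: number): Line[] {
    const lines: Line[] = [];
    let current: Line | undefined;
    for (const token of tokens) {
        const breakHere =
            current !== undefined &&
            (/[.!?\u061F]$/.test(current.text) || token.start - current.end >= gapThreshold);
        if (current === undefined || breakHere) {
            current = { start: token.start, end: token.end, text: token.text };
            lines.push(current);
        } else {
            current.end = token.end;
            current.text += ` ${token.text}`;
        }
    }
    return lines;
}

// Renders seconds as m:ss, or h:mm:ss from one hour upwards.
function toTimestamp(seconds: number): string {
    const total = Math.floor(seconds);
    const h = Math.floor(total / 3600);
    const m = Math.floor((total % 3600) / 60);
    const s = String(total % 60).padStart(2, '0');
    return h > 0 ? `${h}:${String(m).padStart(2, '0')}:${s}` : `${m}:${s}`;
}

const words: Token[] = [
    { start: 0, end: 1, text: 'The' },
    { start: 1, end: 2, text: 'quick' },
    { start: 2, end: 3, text: 'brown' },
    { start: 3, end: 6.5, text: 'fox!' },
    { start: 8, end: 9, text: 'Jumps' },
    { start: 9, end: 10, text: 'right' },
    { start: 10, end: 11, text: 'over' },
    { start: 12, end: 13, text: 'the' },
];

const rendered = splitIntoLines(words, 3)
    .map((line) => `${toTimestamp(line.start)}: ${line.text}`)
    .join('\n');
console.log(rendered);
// 0:00: The quick brown fox!
// 0:08: Jumps right over the
```

In this sketch the break comes from the exclamation mark, not from the 1.5 second pause, which is shorter than the threshold of 3.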
```typescript
import { updateSegmentWithGroundTruth } from 'paragrafs';

const rawSegment = {
    start: 0,
    end: 10,
    text: 'The Buick crown flock jumps right over the crazy dog.',
    tokens: [
        /* AI-generated word timestamps */
    ],
};

const aligned = updateSegmentWithGroundTruth(rawSegment, 'The quick brown fox jumps right over the lazy dog.');

console.log(aligned.tokens);
// Each token now matches the ground-truth words exactly,
// with missing words interpolated where needed.
```
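The alignment is described as LCS-based, so the core idea can be shown with a plain longest-common-subsequence pass over the two word lists. This sketch only finds the words that both versions agree on, which act as anchors. It does not interpolate timings or flag unknown tokens, and `lcsPairs` is a hypothetical helper, not a library export.

```typescript
// Returns index pairs [i, j] where a[i] and b[j] are matched in a longest
// common subsequence of the two word lists (classic dynamic programming).
function lcsPairs(a: string[], b: string[]): Array<[number, number]> {
    // table[i][j] holds the LCS length of a[i..] and b[j..].
    const table: number[][] = Array.from({ length: a.length + 1 }, () =>
        new Array<number>(b.length + 1).fill(0),
    );
    for (let i = a.length - 1; i >= 0; i--) {
        for (let j = b.length - 1; j >= 0; j--) {
            table[i][j] =
                a[i] === b[j] ? table[i + 1][j + 1] + 1 : Math.max(table[i + 1][j], table[i][j + 1]);
        }
    }

    // Walk the table from the start, collecting matched positions.
    const pairs: Array<[number, number]> = [];
    let i = 0;
    let j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] === b[j]) {
            pairs.push([i, j]);
            i++;
            j++;
        } else if (table[i + 1][j] >= table[i][j + 1]) {
            i++;
        } else {
            j++;
        }
    }
    return pairs;
}

const heard = 'The Buick crown flock jumps right over the crazy dog.'.split(' ');
const truth = 'The quick brown fox jumps right over the lazy dog.'.split(' ');
const anchors = lcsPairs(heard, truth);
console.log(anchors.map(([index]) => heard[index]).join(' '));
// The jumps right over the dog.
```

The matched words keep their AI-generated timings, and the ground-truth words that fall between two anchors are the ones whose timings have to be interpolated.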
Use this when you have a corpus of tokens/segments and want to discover repeated phrases like "احسن الله اليكم".
```typescript
import { createHints, generateHintsFromTokens, markTokensWithDividers } from 'paragrafs';

const tokens = [
    { start: 0, end: 1, text: 'أَحْسَنَ' },
    { start: 1, end: 2, text: 'الله' },
    { start: 2, end: 3, text: 'إليكم،' },
    // ... repeated in the stream ...
];

const mined = generateHintsFromTokens(tokens, {
    minN: 2,
    maxN: 4,
    minCount: 2,
    dedupe: 'closed',
    normalization: { normalizeAlef: true },
});

// Turn mined phrases into matching hints
const hints = createHints({ normalizeAlef: true }, ...mined.slice(0, 25).map((h) => h.phrase));

const marked = markTokensWithDividers(tokens, { fillers: [], gapThreshold: 999, hints });
```
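Mining repeated phrases amounts to counting n-grams. The sketch below is a simplified illustration: it counts every run of `minN` to `maxN` words, keeps those seen at least `minCount` times, and sorts by count and then by length. The ordering direction is an assumption, the `dedupe` and `normalization` options shown above are not modelled, and `mineNgrams` is a hypothetical function, not the library's miner.

```typescript
type Candidate = { phrase: string; count: number; length: number };

// Hypothetical miner: frequent n-grams, most frequent (then longest) first.
function mineNgrams(words: string[], minN: number, maxN: number, minCount: number): Candidate[] {
    const counts = new Map<string, Candidate>();
    for (let n = minN; n <= maxN; n++) {
        for (let i = 0; i + n <= words.length; i++) {
            const phrase = words.slice(i, i + n).join(' ');
            const entry = counts.get(phrase) ?? { phrase, count: 0, length: n };
            entry.count++;
            counts.set(phrase, entry);
        }
    }
    return Array.from(counts.values())
        .filter((candidate) => candidate.count >= minCount)
        .sort((x, y) => y.count - x.count || y.length - x.length);
}

const stream = 'thank you all welcome back thank you all for coming'.split(' ');
const frequent = mineNgrams(stream, 2, 3, 2);
console.log(frequent.map((candidate) => candidate.phrase).join(' | '));
// thank you all | thank you | you all
```

Note that the shorter phrases are contained in the longer one with the same count. Collapsing such overlaps is the sort of job a dedupe option would handle, and this sketch leaves it out.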
- `bun run build` – compiles the library with the official tsdown pipeline configured in `tsdown.config.ts`.
- `bun run lint` – runs Biome’s formatter and linter against the repository root.
- `bun test` – executes the Bun test suite.
- `bun test --coverage` – runs tests with coverage reporting (useful for refactors of segmentation/matching logic).
This repo includes a minimal static demo app in demo/ that exercises the major exported functions with configurable JSON/text inputs. It’s intended to be deployed to paragrafs.surge.sh.
- Live demo: paragrafs.surge.sh
- Install: `bun run demo:install`
- Dev: `bun run demo:dev`
- Build: `bun run demo:build`
- Deploy to Surge: `bun run demo:deploy`
Notes:
- The demo depends on the local package via `file:..`, so `demo:build` runs `bun run build` first to ensure `dist/` exists.
- Deploy target folder is `demo/dist`.
- `estimateSegmentFromToken(token: Token): Segment` – splits multi-word tokens into per-word timings so they can participate in downstream processing.
- `markTokensWithDividers(tokens: Token[], options: MarkTokensWithDividersOptions): MarkedToken[]` – inserts divider markers based on fillers, hints, punctuation, and timing gaps.
- `groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[]` – chunks marked tokens into bounded-length segments.
- `mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[]` – merges segments that contain fewer than the required word count into their predecessors.
- `cleanupIsolatedTokens(markedTokens: MarkedToken[]): MarkedToken[]` – removes redundant divider markers that would isolate a single token on a line.
- `markAndCombineSegments(segments: Segment[], options): MarkedSegment[]` – convenience pipeline that flattens tokens, marks dividers, groups, and merges short runs in one call.
- `mapSegmentsIntoFormattedSegments(segments: MarkedSegment[], maxSecondsPerLine?: number): Segment[]` – flattens marked segments into readable text while respecting optional line duration caps.
- `formatSegmentsToTimestampedTranscript(segments: MarkedSegment[], maxSecondsPerLine: number, formatTokens?: (buffer: Token) => string): string` – emits newline separated transcript lines with timestamps or a custom formatter.
- `updateSegmentWithGroundTruth(segment: Segment, groundTruth: string): GroundedSegment` – applies LCS-based alignment to replace tokens with the ground-truth words while flagging unmatched entries.
- `applyGroundTruthToSegment(segment: Segment, groundTruth: string): Segment` – wraps `updateSegmentWithGroundTruth` and filters unknown tokens for production-ready output.
- `mergeSegments(segments: Segment[], delimiter?: string): Segment` – concatenates sequential segments into one continuous block, preserving timing.
- `splitSegment(segment: Segment, splitTime: number): Segment[]` – divides a segment into two at a specific timestamp.
- `getFirstMatchingToken(tokens: Token[], query: string): Token | null` – scans for the first occurrence of a hint sequence produced by `createHints`.
- `getFirstTokenForSelection(segment: Segment, selectionStart: number, selectionEnd: number): Token | null` – maps character selections within `segment.text` back to the corresponding timed token.
- `createHints(first: ArabicNormalizationOptions | string, ...rest: string[]): Hints` – creates normalized hints for robust Arabic matching (diacritics/punctuation tolerant).
- `formatSecondsToTimestamp(seconds: number): string` – renders numeric durations into `m:ss` or `h:mm:ss` strings.
- `isEndingWithPunctuation(text: string): boolean` – checks for trailing punctuation, including Arabic variants.
- `tokenizeGroundTruth(groundTruth: string): string[]` – tokenises human transcripts while attaching punctuation to the preceding word.
- `normalizeTokenText(text: string, options?: ArabicNormalizationOptions): string` – Arabic-first normalization used by hint matching and hint mining.
- `generateHintsFromTokens(tokens: Token[], options?: GenerateHintsOptions): GeneratedHint[]` – mines frequent n-grams from a token stream and returns candidates sorted by count/length.
- `generateHintsFromSegments(segments: Segment[], options?: GenerateHintsOptions): GeneratedHint[]` – mines frequent n-grams from segments; by default phrases do not cross segment boundaries.
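The selection helpers listed above map a character range in the displayed text back to a timed token. A minimal sketch of that idea follows, assuming the segment text is exactly the token texts joined by single spaces. That assumption may not hold for the library, and `tokenAtSelection` is a hypothetical function, not `getFirstTokenForSelection` itself.

```typescript
type Token = { start: number; end: number; text: string };

// Hypothetical lookup: returns the first token whose character range overlaps
// the selection, or null when the selection only touches whitespace.
function tokenAtSelection(tokens: Token[], selectionStart: number, selectionEnd: number): Token | null {
    // Treat a collapsed cursor as a one-character selection.
    const end = Math.max(selectionEnd, selectionStart + 1);
    let offset = 0;
    for (const token of tokens) {
        const tokenEnd = offset + token.text.length;
        if (selectionStart < tokenEnd && end > offset) {
            return token;
        }
        offset = tokenEnd + 1; // skip the joining space
    }
    return null;
}

const timed: Token[] = [
    { start: 0, end: 1, text: 'The' },
    { start: 1, end: 2, text: 'quick' },
    { start: 2, end: 3, text: 'brown' },
    { start: 3, end: 6.5, text: 'fox!' },
];
const fullText = timed.map((t) => t.text).join(' ');
const from = fullText.indexOf('brown');
const hit = tokenAtSelection(timed, from, from + 'brown'.length);
console.log(hit?.start); // 2
```

An editor could use the returned token's `start` to seek the audio player to the moment the selected word was spoken.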
```typescript
type Token = {
    start: number;
    end: number;
    text: string;
};

type Segment = Token & {
    tokens: Token[];
};

type MarkedToken = Token | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

type MarkedSegment = {
    start: number;
    end: number;
    tokens: MarkedToken[];
};

type GroundedToken = Token & { isUnknown?: boolean };

// Assumed shape: a Segment whose tokens are GroundedToken[]
type GroundedSegment = Omit<Segment, 'tokens'> & { tokens: GroundedToken[] };
```
- Transcript Formatting: Convert raw transcriptions into readable text
- Subtitle Generation: Create properly formatted subtitles from audio transcriptions
- Document Reconstruction: Rebuild properly formatted documents from extracted text
Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests.
To get started:
1. Fork the repository
2. Install dependencies: `bun install` (requires Bun)
3. Make your changes
4. Run linting: `bun run lint`
5. Build the package: `bun run build`
6. Run tests: `bun test --coverage`
7. Submit a pull request
`paragrafs` is released under the MIT License. See the LICENSE.MD file for more details.
Ragaeeb Haq
---
Built with TypeScript and Bun. Uses ESM module format.