A specialized tokenizer library for Japanese lyrics analysis that provides intelligent text segmentation using the Kuromoji morphological analyzer.


@ioris/tokenizer-kuromoji integrates with the @ioris/core framework to provide advanced lyrics tokenization capabilities. The library focuses on natural phrase breaks and proper handling of mixed Japanese/English content, making it ideal for:
- Karaoke Applications - Generate natural phrase breaks for synchronized lyrics display
- Music Apps - Improve lyrics readability through intelligent segmentation
- Lyrics Analysis - Analyze song structure and linguistic patterns
- Subtitle Generation - Create formatted subtitles for music videos
- Language Learning - Study Japanese lyrics with proper phrase boundaries
### Features

- Advanced rule-based system for natural phrase breaks
- Part-of-speech analysis for accurate break placement
- Configurable boundary rules with score-based strength evaluation
- Seamless processing of Japanese and English text
- Script type detection (Japanese/Latin/Number)
- Script change boundary detection
- Specialized handling of parentheses, quotation marks, and repetition patterns
- Timeline preservation (maintains temporal relationships while adding logical segmentation)
- Whitespace break detection
- Customizable boundary rules (see the hypothetical sketch after this list)
- Token-based and position-based rule conditions
- Multiple break strength levels (Strong/Medium/Weak/None)
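As a rough illustration of how such rules might be expressed, here is a hypothetical sketch; the type and field names below are assumptions for illustration only, not the library's actual API.

```typescript
// Hypothetical shapes for illustration only; not the library's actual exports.
type BreakStrength = 'Strong' | 'Medium' | 'Weak' | 'None';

interface BoundaryRule {
  // Token-based conditions: match on part-of-speech and/or surface form.
  pos?: string;
  surface?: RegExp;
  // Position-based conditions: e.g. right after whitespace or a closing bracket.
  afterWhitespace?: boolean;
  afterClosingBracket?: boolean;
  // Score contributed when the rule matches; higher totals map to stronger breaks.
  score: number;
}

const exampleRules: BoundaryRule[] = [
  { pos: '助詞', score: 2 },            // particles often end a natural phrase
  { afterWhitespace: true, score: 3 },  // whitespace is a strong boundary hint
  { surface: /[)）」』]/, score: 2 },    // closing brackets/quotes close a phrase
];
```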
### Installation

```bash
npm install @ioris/tokenizer-kuromoji
```

### Quick Start

```typescript
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';
const text = '桜の花が咲いている Beautiful spring day';
const result = await LineArgsTokenizer({ text });
console.log(result.phrases);
// Output: Array of Phrase objects with intelligent segmentation
```
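Lyrics are usually processed line by line; here is a minimal sketch using only the documented call (the lines below are placeholder lyrics):

```typescript
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

// Placeholder lyric lines; each line is tokenized on its own.
const lyricLines = [
  '夜空を見上げて Shining stars',
  '君と歩いた道 Walking together',
];

for (const text of lyricLines) {
  const result = await LineArgsTokenizer({ text });
  console.log(result.phrases); // phrases for this single line
}
```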
### Advanced Usage

For finer-grained control, the lower-level functions can be used directly:

```typescript
import { LineArgsTokenizer, parseWithKuromoji, generateBreaksOnTokens, segmentByBreakAfter } from '@ioris/tokenizer-kuromoji';
// Parse text with Kuromoji
const tokens = await parseWithKuromoji(text);
// Generate breaks based on rules
const tokensWithBreaks = generateBreaksOnTokens(tokens);
// Segment into phrases
const phrases = segmentByBreakAfter(tokensWithBreaks, text);
```
The tokenization process follows this flow:
```mermaid
flowchart TD
Start([Input Text]) --> Parse[Parse with Kuromoji]
Parse --> Tokens[Morphological Tokens]
Tokens --> Script[Detect Script Types]
Script --> Rules[Apply Boundary Rules]
Rules --> CheckRules{Evaluate Rules}
CheckRules -->|Token-based| TokenRule[Check POS, surface, etc.]
CheckRules -->|Position-based| PosRule[Check text position]
TokenRule --> Score[Calculate Break Score]
PosRule --> Score
Score --> Strength[Map to Break Strength]
Strength -->|Strong| StrongBreak[Strong Break]
Strength -->|Medium| MediumBreak[Medium Break]
Strength -->|Weak| WeakBreak[Weak Break]
Strength -->|None| NoBreak[No Break]
StrongBreak --> Segment
MediumBreak --> Segment
WeakBreak --> Segment
NoBreak --> Segment
Segment[Segment by Breaks] --> BuildPhrase[Build Phrases]
BuildPhrase --> Timeline[Apply Timeline]
Timeline --> Result([Output Phrases])
style Start fill:#e1f5ff
style Result fill:#e1f5ff
style Parse fill:#fff4e1
style Rules fill:#fff4e1
style Segment fill:#fff4e1
style BuildPhrase fill:#fff4e1
```
1. Morphological Analysis: Parse input text using Kuromoji to get tokens with part-of-speech information
2. Script Detection: Identify script types (Japanese/Latin/Number) for each token
3. Rule Application: Evaluate boundary rules based on:
- Token properties (POS, surface form, reading)
- Position in text (brackets, quotes, whitespace)
- Script changes between tokens
4. Break Scoring: Calculate break strength score from matched rules
5. Strength Mapping: Convert scores to break strength levels (Strong/Medium/Weak/None); a rough sketch of this mapping follows after this list
6. Segmentation: Split tokens into phrases based on break points
7. Phrase Building: Construct phrase objects with proper text and metadata
8. Timeline Application: Apply temporal information (startTime/endTime) to phrases
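To make steps 4 and 5 concrete, the score-to-strength conversion can be pictured as a simple threshold function; the thresholds and values below are assumptions for illustration only, not the library's actual numbers.

```typescript
type BreakStrength = 'Strong' | 'Medium' | 'Weak' | 'None';

// Hypothetical thresholds: a higher accumulated rule score yields a stronger break.
function toBreakStrength(score: number): BreakStrength {
  if (score >= 5) return 'Strong';
  if (score >= 3) return 'Medium';
  if (score >= 1) return 'Weak';
  return 'None';
}

// Example: a token that matched a particle rule (2) and a whitespace rule (3).
console.log(toBreakStrength(2 + 3)); // "Strong"
```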
### API

#### `LineArgsTokenizer(args: LineArgsTokenizerArgs): Promise<TokenizeResult>`
Main tokenizer function that processes text and returns segmented phrases.
Parameters:
- args.text (string) - The text to tokenize
- args.startTime (number, optional) - Start timestamp
- args.endTime (number, optional) - End timestamp
Returns: Promise resolving to TokenizeResult containing phrases array
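A usage sketch combining the documented parameters (the timestamps are arbitrary sample values; their unit depends on the timeline you feed in):

```typescript
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

const result = await LineArgsTokenizer({
  text: '君に会えてよかった I was glad to meet you',
  startTime: 12.5, // sample timestamps
  endTime: 18.0,
});

console.log(result.phrases); // segmented phrases, with the timeline preserved
```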
#### `parseWithKuromoji(text: string): Promise<IpadicFeatures[]>`
Parse text using Kuromoji morphological analyzer.
#### `generateBreaksOnTokens(tokens: IpadicFeatures[]): TokenWithBreak[]`
Apply boundary rules to tokens and generate break information.
#### `segmentByBreakAfter(tokens: TokenWithBreak[], text: string): Phrase[]`
Segment tokens into phrases based on break information.
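These three functions compose into the pipeline shown in the advanced example above; the sketch below also logs the intermediate tokens so you can inspect the break information that `generateBreaksOnTokens` attaches (the sample text is arbitrary):

```typescript
import {
  parseWithKuromoji,
  generateBreaksOnTokens,
  segmentByBreakAfter,
} from '@ioris/tokenizer-kuromoji';

const text = '雨上がりの空 After the rain'; // arbitrary sample line

const tokens = await parseWithKuromoji(text);
const tokensWithBreaks = generateBreaksOnTokens(tokens);

// Inspect each token together with its generated break information.
for (const token of tokensWithBreaks) {
  console.log(token);
}

const phrases = segmentByBreakAfter(tokensWithBreaks, text);
console.log(phrases.length);
```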
### Development

```bash
# Install dependencies
npm install
```

### Project Structure
```plaintext
.
├── src/
│   ├── Tokenizer.Kuromoji.ts   # Main tokenizer implementation
│   ├── types.ts                # Type definitions
│   ├── rules.ts                # Boundary rule definitions
│   ├── constants.ts            # Constants
│   ├── index.ts                # Entry point
│   ├── *.test.ts               # Integration tests
│   └── *.unit.test.ts          # Unit tests
├── dist/                       # Build output
└── coverage/                   # Test coverage reports
```

### Testing
The project uses Vitest for testing with comprehensive test coverage:
- Unit tests for individual functions
- Integration tests for complete tokenization flows
- Coverage reporting with @vitest/coverage-v8

Run `npm run test:coverage` to generate coverage reports.
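A minimal Vitest sketch of an integration-style test; the assertion is only illustrative, while the real coverage lives in the `*.test.ts` and `*.unit.test.ts` files listed above:

```typescript
import { describe, expect, it } from 'vitest';
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

describe('LineArgsTokenizer', () => {
  it('segments mixed Japanese/English text into phrases', async () => {
    const result = await LineArgsTokenizer({
      text: '桜の花が咲いている Beautiful spring day',
    });
    expect(result.phrases.length).toBeGreaterThan(0);
  });
});
```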
### Technical Details

#### Requirements
- Node.js >= 16.0
- TypeScript >= 5.0
#### Dependencies
- `@ioris/core` - Ioris framework core
- `kuromoji` - Japanese morphological analyzer

Development tooling:

- Build: esbuild + TypeScript compiler
- Testing: Vitest
- Linting/Formatting: Biome
- Task Runner: npm-run-all
### License

MIT

### Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

- Repository: https://github.com/8beeeaaat/ioris_tokenizer_kuromoji
- Issues: https://github.com/8beeeaaat/ioris_tokenizer_kuromoji/issues