A specialized tokenizer library for Japanese lyrics analysis that provides intelligent text segmentation using the Kuromoji morphological analyzer.


@ioris/tokenizer-kuromoji integrates with the @ioris/core framework to provide advanced lyrics tokenization capabilities. The library focuses on natural phrase breaks and proper handling of mixed Japanese/English content, making it ideal for:
- Karaoke Applications - Generate natural phrase breaks for synchronized lyrics display
- Music Apps - Improve lyrics readability through intelligent segmentation
- Lyrics Analysis - Analyze song structure and linguistic patterns
- Subtitle Generation - Create formatted subtitles for music videos
- Language Learning - Study Japanese lyrics with proper phrase boundaries
### Features

- Advanced rule-based system for natural phrase breaks
- Part-of-speech analysis for accurate break placement
- Configurable boundary rules with score-based strength evaluation
- Seamless processing of Japanese and English text
- Script type detection (Japanese/Latin/Number)
- Script change boundary detection
- Specialized handling of parentheses, quotation marks, and repetition patterns
- Timeline preservation (maintains temporal relationships while adding logical segmentation)
- Whitespace break detection
- Customizable boundary rules (see the hypothetical sketch after this list)
- Token-based and position-based rule conditions
- Multiple break strength levels (Strong/Medium/Weak/None)
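As a rough illustration of how such rules might be expressed, here is a hypothetical sketch; the type and field names below are assumptions for illustration only, not the library's actual API.

```typescript
// Hypothetical shapes for illustration only; not the library's actual exports.
type BreakStrength = 'Strong' | 'Medium' | 'Weak' | 'None';

interface BoundaryRule {
  // Token-based conditions: match on part-of-speech and/or surface form.
  pos?: string;
  surface?: RegExp;
  // Position-based conditions: e.g. right after whitespace or a closing bracket.
  afterWhitespace?: boolean;
  afterClosingBracket?: boolean;
  // Score contributed when the rule matches; higher totals map to stronger breaks.
  score: number;
}

const exampleRules: BoundaryRule[] = [
  { pos: '助詞', score: 2 },            // particles often end a natural phrase
  { afterWhitespace: true, score: 3 },  // whitespace is a strong boundary hint
  { surface: /[)）」』]/, score: 2 },    // closing brackets/quotes close a phrase
];
```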
### Installation

```bash
npm install @ioris/tokenizer-kuromoji
```

### Quick Start

```typescript
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';
const text = '桜の花が咲いている Beautiful spring day';
const result = await LineArgsTokenizer({ text });
console.log(result.phrases);
// Output: Array of Phrase objects with intelligent segmentation
```
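Lyrics are usually processed line by line; here is a minimal sketch using only the documented call (the lines below are placeholder lyrics):

```typescript
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

// Placeholder lyric lines; each line is tokenized on its own.
const lyricLines = [
  '夜空を見上げて Shining stars',
  '君と歩いた道 Walking together',
];

for (const text of lyricLines) {
  const result = await LineArgsTokenizer({ text });
  console.log(result.phrases); // phrases for this single line
}
```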
### Advanced Usage

For finer-grained control, the lower-level functions can be used directly:

```typescript
import { LineArgsTokenizer, parseWithKuromoji, generateBreaksOnTokens, segmentByBreakAfter } from '@ioris/tokenizer-kuromoji';
// Parse text with Kuromoji
const tokens = await parseWithKuromoji(text);
// Generate breaks based on rules
const tokensWithBreaks = generateBreaksOnTokens(tokens);
// Segment into phrases
const phrases = segmentByBreakAfter(tokensWithBreaks, text);
```
The tokenization process follows this flow:
```mermaid
flowchart TD
Start([Input Text]) --> Parse[Parse with Kuromoji]
Parse --> Tokens[Morphological Tokens]
Tokens --> Script[Detect Script Types]
Script --> Rules[Apply Boundary Rules]
Rules --> CheckRules{Evaluate Rules}
CheckRules -->|Token-based| TokenRule[Check POS, surface, etc.]
CheckRules -->|Position-based| PosRule[Check text position]
TokenRule --> Score[Calculate Break Score]
PosRule --> Score
Score --> Strength[Map to Break Strength]
Strength -->|Strong| StrongBreak[Strong Break]
Strength -->|Medium| MediumBreak[Medium Break]
Strength -->|Weak| WeakBreak[Weak Break]
Strength -->|None| NoBreak[No Break]
StrongBreak --> Segment
MediumBreak --> Segment
WeakBreak --> Segment
NoBreak --> Segment
Segment[Segment by Breaks] --> BuildPhrase[Build Phrases]
BuildPhrase --> Timeline[Apply Timeline]
Timeline --> Result([Output Phrases])
style Start fill:#e1f5ff
style Result fill:#e1f5ff
style Parse fill:#fff4e1
style Rules fill:#fff4e1
style Segment fill:#fff4e1
style BuildPhrase fill:#fff4e1
```
1. Morphological Analysis: Parse input text using Kuromoji to get tokens with part-of-speech information
2. Script Detection: Identify script types (Japanese/Latin/Number) for each token
3. Rule Application: Evaluate boundary rules based on:
- Token properties (POS, surface form, reading)
- Position in text (brackets, quotes, whitespace)
- Script changes between tokens
4. Break Scoring: Calculate break strength score from matched rules
5. Strength Mapping: Convert scores to break strength levels (Strong/Medium/Weak/None); a rough sketch of this mapping follows after this list
6. Segmentation: Split tokens into phrases based on break points
7. Phrase Building: Construct phrase objects with proper text and metadata
8. Timeline Application: Apply temporal information (startTime/endTime) to phrases
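To make steps 4 and 5 concrete, the score-to-strength conversion can be pictured as a simple threshold function; the thresholds and values below are assumptions for illustration only, not the library's actual numbers.

```typescript
type BreakStrength = 'Strong' | 'Medium' | 'Weak' | 'None';

// Hypothetical thresholds: a higher accumulated rule score yields a stronger break.
function toBreakStrength(score: number): BreakStrength {
  if (score >= 5) return 'Strong';
  if (score >= 3) return 'Medium';
  if (score >= 1) return 'Weak';
  return 'None';
}

// Example: a token that matched a particle rule (2) and a whitespace rule (3).
console.log(toBreakStrength(2 + 3)); // "Strong"
```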
### API

#### `LineArgsTokenizer(args: LineArgsTokenizerArgs): Promise<TokenizeResult>`
Main tokenizer function that processes text and returns segmented phrases.
Parameters:
- args.text (string) - The text to tokenize
- args.startTime (number, optional) - Start timestamp
- args.endTime (number, optional) - End timestamp
Returns: Promise resolving to TokenizeResult containing phrases array
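A usage sketch combining the documented parameters (the timestamps are arbitrary sample values; their unit depends on the timeline you feed in):

```typescript
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

const result = await LineArgsTokenizer({
  text: '君に会えてよかった I was glad to meet you',
  startTime: 12.5, // sample timestamps
  endTime: 18.0,
});

console.log(result.phrases); // segmented phrases, with the timeline preserved
```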
#### `parseWithKuromoji(text: string): Promise<IpadicFeatures[]>`
Parse text using Kuromoji morphological analyzer.
#### `generateBreaksOnTokens(tokens: IpadicFeatures[]): TokenWithBreak[]`
Apply boundary rules to tokens and generate break information.
#### `segmentByBreakAfter(tokens: TokenWithBreak[], text: string): Phrase[]`
Segment tokens into phrases based on break information.
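These three functions compose into the pipeline shown in the advanced example above; the sketch below also logs the intermediate tokens so you can inspect the break information that `generateBreaksOnTokens` attaches (the sample text is arbitrary):

```typescript
import {
  parseWithKuromoji,
  generateBreaksOnTokens,
  segmentByBreakAfter,
} from '@ioris/tokenizer-kuromoji';

const text = '雨上がりの空 After the rain'; // arbitrary sample line

const tokens = await parseWithKuromoji(text);
const tokensWithBreaks = generateBreaksOnTokens(tokens);

// Inspect each token together with its generated break information.
for (const token of tokensWithBreaks) {
  console.log(token);
}

const phrases = segmentByBreakAfter(tokensWithBreaks, text);
console.log(phrases.length);
```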
### Development

```bash
# Install dependencies
npm install
```

### Project Structure
```plaintext
.
├── src/
│   ├── Tokenizer.Kuromoji.ts   # Main tokenizer implementation
│   ├── types.ts                # Type definitions
│   ├── rules.ts                # Boundary rule definitions
│   ├── constants.ts            # Constants
│   ├── index.ts                # Entry point
│   ├── *.test.ts               # Integration tests
│   └── *.unit.test.ts          # Unit tests
├── dist/                       # Build output
└── coverage/                   # Test coverage reports
```

### Testing
The project uses Vitest for testing with comprehensive test coverage:
- Unit tests for individual functions
- Integration tests for complete tokenization flows
- Coverage reporting with @vitest/coverage-v8

Run `npm run test:coverage` to generate coverage reports.
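A minimal Vitest sketch of an integration-style test; the assertion is only illustrative, while the real coverage lives in the `*.test.ts` and `*.unit.test.ts` files listed above:

```typescript
import { describe, expect, it } from 'vitest';
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

describe('LineArgsTokenizer', () => {
  it('segments mixed Japanese/English text into phrases', async () => {
    const result = await LineArgsTokenizer({
      text: '桜の花が咲いている Beautiful spring day',
    });
    expect(result.phrases.length).toBeGreaterThan(0);
  });
});
```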
### Technical Details

#### Requirements
- Node.js >= 16.0
- TypeScript >= 5.0
#### Dependencies
- `@ioris/core` - Ioris framework core
- `kuromoji` - Japanese morphological analyzer

Development tooling:

- Build: esbuild + TypeScript compiler
- Testing: Vitest
- Linting/Formatting: Biome
- Task Runner: npm-run-all
### License

MIT

### Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

- Repository: https://github.com/8beeeaaat/ioris_tokenizer_kuromoji
- Issues: https://github.com/8beeeaaat/ioris_tokenizer_kuromoji/issues