@visionengine/subtitle-generate

[English](README.md) | [中文文档](README-zh.md)

VisionEngine Subtitle Generation MCP Server - Generate subtitles from audio/video files with automatic timing alignment using ByteDance's speech recognition API.

Features

- Subtitle Generation - Generate subtitles from audio/video files with speech recognition
- Subtitle Alignment - Align existing subtitle text with audio for precise timing
- Multiple Languages - Support for Chinese, English, Japanese, Korean, and more
- SRT Output - Automatically save subtitles in SRT format
- Word-level Timing - Get precise timing for each word/character
- Speaker Detection - Optional speaker identification

Installation

Add to your MCP client configuration:

```json
{
  "mcpServers": {
    "ve-subtitle-generate": {
      "type": "local",
      "command": "npx",
      "args": ["-y", "@visionengine/subtitle-generate@latest"],
      "transport": "stdio",
      "env": {
        "APP_ID": "your_app_id",
        "ACCESS_TOKEN": "your_access_token",
        "WORKDIR": "./media"
      }
    }
  }
}
```

Or install the package globally:

```bash
npm install -g @visionengine/subtitle-generate
```





Configuration



Environment variables:



- `API_BASE_URL` - API endpoint (default: `https://openspeech.bytedance.com`)
- `APP_ID` - Your application ID (required)
- `ACCESS_TOKEN` - Your Bearer token for authentication (required)
- `WORKDIR` - Base directory for relative file paths (default: `./`)
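
As a rough illustration of how these variables and their defaults fit together, the sketch below resolves a configuration object from an environment map. The `ServerConfig` type and `loadConfig` function are hypothetical names used only for this example, not part of the package; only the variable names and default values come from the list above.

```typescript
// Hypothetical config shape; field names are illustrative, not the package's API.
interface ServerConfig {
  apiBaseUrl: string;
  appId: string;
  accessToken: string;
  workdir: string;
}

// Resolve settings from an environment map, applying the documented defaults.
function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const appId = env["APP_ID"];
  const accessToken = env["ACCESS_TOKEN"];
  if (!appId) throw new Error("APP_ID is required");
  if (!accessToken) throw new Error("ACCESS_TOKEN is required");
  return {
    apiBaseUrl: env["API_BASE_URL"] ?? "https://openspeech.bytedance.com",
    appId,
    accessToken,
    workdir: env["WORKDIR"] ?? "./",
  };
}

// In Node.js this would typically be called as loadConfig(process.env).
const config = loadConfig({
  APP_ID: "your_app_id",
  ACCESS_TOKEN: "your_access_token",
});
console.log(config.apiBaseUrl, config.workdir);
```

Failing fast on a missing `APP_ID` or `ACCESS_TOKEN` is a design choice of this sketch; the server itself may report missing credentials differently.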



Tools



`subtitle_generate`



Generate subtitles from audio/video files using speech recognition.



Parameters:

- `audioPath` (string, required) - Audio/video file path (relative to `WORKDIR` or absolute)
- `language` (string, optional) - Language code: `zh-CN`, `en-US`, `ja-JP`, `ko-KR`, etc.
- `wordsPerLine` (number, optional) - Maximum words per line (default: 46)
- `maxLines` (number, optional) - Maximum lines per screen (default: 1)
- `useItn` (boolean, optional) - Convert Chinese numbers to Arabic numerals
- `captionType` (enum, optional) - `'auto'`, `'speech'`, or `'singing'`
- `usePunc` (boolean, optional) - Add punctuation marks
- `useDdc` (boolean, optional) - Add silence annotations
- `withSpeakerInfo` (boolean, optional) - Return speaker information



Supported Languages:



| Language | Code | Recommended `wordsPerLine` |
|----------|------|----------------------------|
| Chinese (Simplified) | zh-CN | 15 |
| Cantonese | yue | 15 |
| English (US) | en-US | 55 |
| Japanese | ja-JP | 32 |
| Korean | ko-KR | 32 |
| Spanish | es-MX | 55 |
| Russian | ru-RU | 55 |
| French | fr-FR | 55 |
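
The recommendations in the table can be turned into a small lookup when building tool arguments. This is a minimal sketch: `recommendedWordsPerLine` is a hypothetical helper, and falling back to the documented default of 46 for unlisted languages is an assumption of this example.

```typescript
// Recommended wordsPerLine per language code, copied from the table above.
const RECOMMENDED_WORDS_PER_LINE: Record<string, number> = {
  "zh-CN": 15,
  "yue": 15,
  "en-US": 55,
  "ja-JP": 32,
  "ko-KR": 32,
  "es-MX": 55,
  "ru-RU": 55,
  "fr-FR": 55,
};

// Documented default for wordsPerLine.
const DEFAULT_WORDS_PER_LINE = 46;

// Hypothetical helper: pick a wordsPerLine value for a language code.
function recommendedWordsPerLine(language?: string): number {
  if (language === undefined) return DEFAULT_WORDS_PER_LINE;
  const value = RECOMMENDED_WORDS_PER_LINE[language];
  return value === undefined ? DEFAULT_WORDS_PER_LINE : value;
}

// Arguments for a subtitle_generate call on Japanese audio.
const generateArgs = {
  audioPath: "./video.mp4",
  language: "ja-JP",
  wordsPerLine: recommendedWordsPerLine("ja-JP"),
};
console.log(generateArgs.wordsPerLine); // 32
```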



Example:

```typescript
// Basic usage
await subtitle_generate({
  audioPath: "./video.mp4"
});

// With options
await subtitle_generate({
  audioPath: "./video.mp4",
  language: "zh-CN",
  wordsPerLine: 15,
  maxLines: 2,
  captionType: "speech",
  usePunc: true
});
```





Output:

- SRT file saved in the same directory as the input file

- JSON response with utterances and timing information



`subtitle_align`



Align existing subtitle text with audio for precise timing.



Parameters:

- `audioPath` (string, required) - Audio/video file path (relative to `WORKDIR` or absolute)
- `subtitleText` (string, required) - The subtitle text to align with the audio
- `captionType` (enum, required) - `'speech'` or `'singing'`
- `staPuncMode` (enum, optional) - Punctuation mode: `'1'`, `'2'`, or `'3'`



Punctuation Modes:

- `'1'` (default) - Omit trailing punctuation from alignment results
- `'2'` - Replace punctuation with spaces
- `'3'` - Keep original punctuation
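
The constraints above (required fields, a `captionType` limited to `'speech'` or `'singing'`, and a string-valued `staPuncMode`) could be checked on the client side before calling the tool. The following is a sketch under those assumptions; `AlignInput` and `validateAlignArgs` are hypothetical names, and the server's own validation may behave differently.

```typescript
// Loosely typed input, as it might arrive from user code or an MCP client.
interface AlignInput {
  audioPath?: unknown;
  subtitleText?: unknown;
  captionType?: unknown;
  staPuncMode?: unknown;
}

// Hypothetical validator: returns a list of problems, empty when the arguments look valid.
function validateAlignArgs(input: AlignInput): string[] {
  const problems: string[] = [];
  if (typeof input.audioPath !== "string" || input.audioPath.length === 0) {
    problems.push("audioPath is required");
  }
  if (typeof input.subtitleText !== "string" || input.subtitleText.length === 0) {
    problems.push("subtitleText is required");
  }
  // Unlike subtitle_generate, 'auto' is not listed as a valid captionType here.
  if (input.captionType !== "speech" && input.captionType !== "singing") {
    problems.push("captionType must be 'speech' or 'singing'");
  }
  // staPuncMode is optional; when present it must be the string '1', '2', or '3'.
  const mode = input.staPuncMode;
  if (mode !== undefined && mode !== "1" && mode !== "2" && mode !== "3") {
    problems.push("staPuncMode must be '1', '2', or '3'");
  }
  return problems;
}

console.log(validateAlignArgs({
  audioPath: "./speech.wav",
  subtitleText: "Hello, welcome to our presentation today.",
  captionType: "speech",
}));
console.log(validateAlignArgs({
  audioPath: "./song.mp3",
  subtitleText: "la la la",
  captionType: "auto",
  staPuncMode: 3,
}));
```

The second call reports two problems: `'auto'` is only documented for `subtitle_generate`, and the numeric `3` is not the string `'3'`.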



Example:

```typescript
// Align speech subtitle
await subtitle_align({
  audioPath: "./speech.wav",
  subtitleText: "Hello, welcome to our presentation today.",
  captionType: "speech"
});

// Align song lyrics
await subtitle_align({
  audioPath: "./song.mp3",
  subtitleText: "这是一首美丽的歌曲",
  captionType: "singing",
  staPuncMode: "3"
});
```





Output:

- SRT file saved with `_aligned` suffix

- JSON response with word-level timing information
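
To make the two output locations concrete, the sketch below derives an SRT path from an input path. The exact file naming is an assumption (input extension replaced by `.srt`, with `_aligned` appended to the base name for `subtitle_align`); `srtPathFor` is a hypothetical helper and the server's actual naming may differ.

```typescript
// Hypothetical helper: derive an SRT path next to the input file.
// Assumed naming: "<base>.srt" for generation, "<base>_aligned.srt" for alignment.
function srtPathFor(inputPath: string, aligned: boolean): string {
  const slash = Math.max(inputPath.lastIndexOf("/"), inputPath.lastIndexOf("\\"));
  const dot = inputPath.lastIndexOf(".");
  // Strip the extension only when the dot sits inside the file name,
  // not in a directory name, a leading "./", or a dotfile prefix.
  const stem = dot > slash + 1 ? inputPath.slice(0, dot) : inputPath;
  return stem + (aligned ? "_aligned" : "") + ".srt";
}

console.log(srtPathFor("./video.mp4", false)); // ./video.srt
console.log(srtPathFor("./song.mp3", true));   // ./song_aligned.srt
```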



Usage Examples






Once configured as an MCP server, the tools are available through your MCP client:



> Generate subtitles for video.mp4

> Align these lyrics with song.mp3: "今天天气真好..."

Running the server from the command line:

```bash
# Install globally
npm install -g @visionengine/subtitle-generate

# Set environment variables
export APP_ID="your_app_id"
export ACCESS_TOKEN="your_access_token"
export WORKDIR="./media"

# Run the server
ve-subtitle-generate
```
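
Once the server is running, an MCP client invokes its tools with JSON-RPC `tools/call` requests after the usual MCP initialization handshake. The sketch below only builds such a request as a plain object; `buildToolCall` is a hypothetical helper, and a real client would normally use an MCP SDK rather than constructing messages by hand.

```typescript
// Shape of an MCP tools/call request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: build a tools/call request for a named tool.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

const request = buildToolCall(1, "subtitle_generate", {
  audioPath: "./video.mp4",
  language: "en-US",
});

// Over the stdio transport, each message is written as a single line of JSON.
const wireLine = JSON.stringify(request) + "\n";
console.log(wireLine);
```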





Output Format



The tools save subtitles in SRT format:



```
1
00:00:00,000 --> 00:00:03,197
如果您没有其他需要举报的话这边就先挂断了

2
00:00:03,442 --> 00:00:04,877
祝您生活愉快再见
```
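
For readers who want to post-process the JSON response themselves, this sketch shows how timing data maps onto the SRT layout above. The `Cue` shape (millisecond start and end times plus text) is an assumption for illustration; the field names in the actual response may differ.

```typescript
// Hypothetical cue shape; the real JSON response fields may differ.
interface Cue {
  startMs: number;
  endMs: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format non-negative integer milliseconds as an SRT timestamp: HH:MM:SS,mmm.
function formatSrtTime(ms: number): string {
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return pad(hours, 2) + ":" + pad(minutes, 2) + ":" + pad(seconds, 2) + "," + pad(millis, 3);
}

// Render cues as numbered SRT blocks separated by blank lines.
function toSrt(cues: Cue[]): string {
  return cues
    .map((cue, i) =>
      (i + 1) + "\n" +
      formatSrtTime(cue.startMs) + " --> " + formatSrtTime(cue.endMs) + "\n" +
      cue.text + "\n")
    .join("\n");
}

console.log(toSrt([
  { startMs: 0, endMs: 3197, text: "First line" },
  { startMs: 3442, endMs: 4877, text: "Second line" },
]));
```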





Error Codes



| Code | Meaning | Description |
|------|---------|-------------|
| 0 | Success | - |
| 2000 | Processing | Task is being processed |
| 1001 | Invalid parameters | Missing/invalid request parameters |
| 1002 | No permission | Token invalid/expired |
| 1003 | Rate limited | QPS exceeded |
| 1010 | Audio too long | Duration exceeded threshold |
| 1012 | Invalid audio format | Audio decode failure |
| 1013 | Silent audio | No speech detected |
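
A caller might map these codes to follow-up actions. The sketch below copies the meanings from the table; the policy in `actionFor` (keep polling on 2000, retry later on 1003, fail on everything else) is an assumption of this example rather than documented behavior, and both function names are hypothetical.

```typescript
// Code meanings, copied from the table above.
const ERROR_MEANINGS: Record<number, string> = {
  0: "Success",
  2000: "Processing",
  1001: "Invalid parameters",
  1002: "No permission",
  1003: "Rate limited",
  1010: "Audio too long",
  1012: "Invalid audio format",
  1013: "Silent audio",
};

type Action = "done" | "poll" | "retry" | "fail";

// Assumed policy, not documented behavior: keep polling while the task is
// processing, retry later when rate limited, and fail on any other code.
function actionFor(code: number): Action {
  switch (code) {
    case 0:
      return "done";
    case 2000:
      return "poll";
    case 1003:
      return "retry";
    default:
      return "fail";
  }
}

// Human-readable meaning for a code, with a fallback for unlisted codes.
function describeCode(code: number): string {
  const meaning = ERROR_MEANINGS[code];
  return meaning === undefined ? "Unknown code " + code : meaning;
}

console.log(describeCode(1012), actionFor(1012)); // Invalid audio format fail
```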



Development



Build:

```bash
pnpm build
```

Test:

```bash
pnpm test
```

Local development:

```bash
# Build first
pnpm build

# Run locally
node dist/index.js
```

Support

For issues and questions:

- Email: team@visionengine-tech.com
- Website: https://visionengine-tech.com
