Our library `@lenml/llama2-tokenizer` has been deprecated. We are excited to introduce our new library `@lenml/tokenizers` as its replacement, offering a broader set of features and an enhanced experience.
Why @lenml/tokenizers?
- transformers.js Interfaces: Seamlessly supports all interfaces defined in transformers.js, making migration and integration effortless.
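Because `@lenml/tokenizers` targets the transformers.js interfaces, migrated code should read much like transformers.js itself. A minimal sketch, assuming a `fromPreTrained`-style loader is exported (the import path and loader name below are assumptions; check the `@lenml/tokenizers` README for the actual entry points):

```typescript
// Sketch only: the package and loader names below are assumptions for illustration.
import { fromPreTrained } from "@lenml/tokenizer-llama2";

const tokenizer = fromPreTrained();

// encode/decode follow the transformers.js tokenizer interface that the
// new library claims to support.
const ids = tokenizer.encode("hello world");
console.log(tokenizer.decode(ids));
```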
Installation

```bash
npm install @lenml/llama2-tokenizer
```
Install vocab
```bash
npm install @lenml/llama2-tokenizer-vocab-llama2
npm install @lenml/llama2-tokenizer-vocab-baichuan2
npm install @lenml/llama2-tokenizer-vocab-chatglm3
npm install @lenml/llama2-tokenizer-vocab-falcon
npm install @lenml/llama2-tokenizer-vocab-internlm2
npm install @lenml/llama2-tokenizer-vocab-yi
```
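Each vocab package is expected to expose the same `load_vocab` entry point as the llama2 package used in the Usage section below (an assumption for the non-llama2 packages), so switching models is just a different import:

```typescript
// Assumption: the baichuan2 vocab package mirrors the llama2 package's API.
import { Llama2Tokenizer } from "@lenml/llama2-tokenizer";
import { load_vocab } from "@lenml/llama2-tokenizer-vocab-baichuan2";

const tokenizer = new Llama2Tokenizer();
tokenizer.install_vocab(load_vocab());
```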
Usage
Import
```typescript
import { Llama2Tokenizer } from "@lenml/llama2-tokenizer";
import { load_vocab } from "@lenml/llama2-tokenizer-vocab-llama2";
```
Load vocab
```typescript
const tokenizer = new Llama2Tokenizer();
const vocab_model = load_vocab();
tokenizer.install_vocab(vocab_model);
```
Tokenize text
```typescript
const text = "你好,世界!";
const tokens = tokenizer.tokenize(text);
console.log(tokens);
// Output: ["你", "好", ",", "世", "界", "!"]
```
Encode text
```typescript
const text = "你好,世界!";
const ids = tokenizer.encode(text);
console.log(ids);
// Output: [2448, 1960, 8021, 1999, 1039, 8013]
```
Decode token IDs
```typescript
const ids = [2448, 1960, 8021, 1999, 1039, 8013];
const decodedText = tokenizer.decode(ids);
console.log(decodedText);
// Output: "你好,世界!"
```
Add special tokens
```typescript
// Register a single special token, or several at once.
// "<|endoftext|>" is an illustrative stand-in; the token in the original
// example was lost in formatting.
tokenizer.add_special_token("<|endoftext|>");
tokenizer.add_special_tokens(["<|im_start|>", "<|im_end|>"]);
```
> Bracketed tokens such as `[CLS]` or `[PAD]` are not recommended as special tokens in this scheme, because they easily conflict with the existing vocabulary: `"_["` is itself a valid token, so this case cannot be handled correctly without reordering the word list.
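A hedged sketch of the failure mode described above (the exact split depends on vocabulary order, so the output shown is indicative only):

```typescript
// Illustration only: registering a bracketed special token can collide with
// the ordinary "_[" token during matching, depending on vocab order.
tokenizer.add_special_token("[PAD]");
console.log(tokenizer.tokenize("foo [PAD] bar"));
// May come out as ["_foo", "_[", "PAD", "]", "_bar"] rather than treating
// "[PAD]" as a single special token.
```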
Get vocabulary
```typescript
const vocabulary = tokenizer.get_vocab();
console.log(vocabulary);
// Output: { "你": 2448, "好": 1960, ",": 8021, "世": 1999, "界": 1039, "!": 8013, ... }
```
Other APIs
- vocab_size: Get the total vocabulary size.
- max_id: Get the maximum token ID.
- convert_tokens_to_string: Convert a sequence of tokens to a single string.
- convert_tokens_to_ids: Convert a sequence of tokens to a sequence of IDs.
- convert_ids_to_tokens: Convert a sequence of IDs to a sequence of tokens.
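A quick sketch of these helpers, assuming the tokenizer set up in the Usage section above (whether `vocab_size` and `max_id` are getters or methods is not specified here, so plain properties are assumed):

```typescript
// Assumes `tokenizer` already has a vocab installed, as in the Usage section.
console.log(tokenizer.vocab_size); // total vocabulary size (assumed getter)
console.log(tokenizer.max_id); // maximum token ID (assumed getter)

const tokens = tokenizer.tokenize("hello world");
const ids = tokenizer.convert_tokens_to_ids(tokens);
console.log(ids); // token IDs for the tokens above
console.log(tokenizer.convert_ids_to_tokens(ids)); // round-trips back to tokens
console.log(tokenizer.convert_tokens_to_string(tokens)); // "hello world"
```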
Example
```typescript
import { Llama2Tokenizer } from "@lenml/llama2-tokenizer";
import { load_vocab } from "@lenml/llama2-tokenizer-vocab-llama2";

const main = async () => {
  const tokenizer = new Llama2Tokenizer();
  const vocab_model = load_vocab();
  tokenizer.install_vocab(vocab_model);

  console.log(tokenizer.tokenize("你好,世界!"));
  console.log(tokenizer.encode("你好,世界!"));
  console.log(tokenizer.decode([29383, 29530, 28924, 30050, 29822, 29267]));
};
main();
```
Benchmark
We benchmarked the tokenizer by repeatedly tokenizing a fixed input text. The results for 1000 iterations are as follows:
Input Text:
🌸🍻🍅🍓🍒🏁🚩🎌🏴🏳️🏳️🌈
Lorem ipsum dolor sit amet, duo te voluptua detraxit liberavisse, vim ad vidisse gubergren consequuntur, duo noster labitur ei. Eum minim postulant ad, timeam docendi te per, quem putent persius pri ei. Te pro quodsi argumentum. Sea ne detracto recusabo, ius error doming honestatis ut, no saepe indoctum cum.
Ex natum singulis necessitatibus usu. Id vix brute docendi imperdiet, te libris corrumpit gubergren sea. Libris deleniti placerat an qui, velit atomorum constituto te sit, est viris iriure convenire ad. Feugait periculis at mel, libris dissentias liberavisse pri et. Quo mutat iudico audiam id.
Results:
```bash
Benchmark Results (1000 iterations):
Total Time: 0.88822 seconds
Average Time per Iteration: 0.00089 seconds
```
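The harness used for these numbers isn't included in the README; a minimal sketch of an equivalent loop (timing with `performance.now()` is an assumption about the original setup) could look like:

```typescript
import { Llama2Tokenizer } from "@lenml/llama2-tokenizer";
import { load_vocab } from "@lenml/llama2-tokenizer-vocab-llama2";

const tokenizer = new Llama2Tokenizer();
tokenizer.install_vocab(load_vocab());

const text = "..."; // the input text shown above
const iterations = 1000;

// Time the tokenize loop and report total and per-iteration averages.
const start = performance.now();
for (let i = 0; i < iterations; i++) {
  tokenizer.tokenize(text);
}
const totalSeconds = (performance.now() - start) / 1000;

console.log(`Benchmark Results (${iterations} iterations):`);
console.log(`Total Time: ${totalSeconds.toFixed(5)} seconds`);
console.log(
  `Average Time per Iteration: ${(totalSeconds / iterations).toFixed(5)} seconds`
);
```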