Unicode-aware Han characters (hanzi, kanji, hanja) detection
```shell
npm i @scriptin/is-han
```
> **Note:** To process all Han characters, you need to use Unicode-aware
> methods and operators in JavaScript, such as `Array.from(str)` and `for...of` loops.
> JavaScript strings are UTF-16, and some Han characters have code points
> that don't fit into 16 bits.
Examples of correct usage:
```js
import { isHan } from "@scriptin/is-han";

for (const char of "漢字") {
  console.log(isHan(char));
}

// or
Array.from("漢字").filter(isHan);
```
Incorrect usage:
```js
import { isHan } from "@scriptin/is-han";

"𠀋".split("").filter(isHan); // -> empty array
// because the code point of '𠀋' is U+2000B, which does not fit into 16 bits,
// so split("") breaks it into the two halves of a surrogate pair
console.log("𠀋".split("")); // -> ['\uD840', '\uDC0B']

// Compare to:
console.log(Array.from("𠀋")); // -> ['𠀋']
```
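
The surrogate pair in the example above follows from the UTF-16 encoding rule for code points above U+FFFF. A minimal sketch of that arithmetic is shown here; `toSurrogatePair` is a hypothetical helper written for illustration and is not part of this library.

```js
// Hypothetical helper (not part of @scriptin/is-han): splits a code point
// above U+FFFF into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x2000b);
console.log(high.toString(16), low.toString(16)); // d840 dc0b
console.log(String.fromCharCode(high, low) === "\u{2000B}"); // true
```

Either half on its own is an unpaired surrogate rather than a character, which is why code-unit-based splitting leaves nothing for `isHan` to match.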
- `isHan(char: string): boolean` - Checks if a character is a Han script character: hanzi, kanji, hanja.
- `isHanExt(char: string): boolean` - Checks if a character is an "extended" Han script character.
  Useful when you're looking for obscure characters which contain Han script,
  e.g. symbols like 🈲, 🈯, 🈳, 🉐, 🉑, ㊄, ㋋, ㏾, ㍰, etc.
  "Extended" covers all Unicode characters which:
  - contain Han characters with additional wrappers, such as characters inside brackets, circles, etc.
  - contain multiple "compacted" Han characters, such as Japanese "square era names", etc.
  - contain parts of Han characters, such as CJK strokes
  - are one of the two special marks:
    - 々 IDEOGRAPHIC ITERATION MARK (see below)
    - 〆 IDEOGRAPHIC CLOSING MARK (see below)
- `isIterationMark(char: string): boolean` - Checks if a character is 々 IDEOGRAPHIC ITERATION MARK.
  This mark means "repeat the previous character". Can be useful if you want to replace this mark with
  the character it repeats/represents.
  See Wiktionary article about 々
- `isClosingMark(char: string): boolean` - Checks if a character is 〆 IDEOGRAPHIC CLOSING MARK.
  This mark is used in place of another Han character.
  See Wiktionary article about 〆
- Some constants are also exported in case you need to extend the functionality.
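
As a sketch of the `isIterationMark` use case mentioned above (replacing 々 with the character it repeats), the following hypothetical `expandIterationMarks` helper walks a string by code point and substitutes the previous character. To keep the snippet self-contained, it uses a local stand-in for `isIterationMark` that compares against U+3005; in real code you would presumably import the function from `@scriptin/is-han` instead. The one-character substitution rule is a simplifying assumption.

```js
// Local stand-in for the library's isIterationMark (assumption: it matches
// exactly U+3005 IDEOGRAPHIC ITERATION MARK).
const isIterationMark = (char) => char === "\u3005";

// Hypothetical helper, not part of the library: replaces each 々 with the
// character that precedes it.
function expandIterationMarks(text) {
  const out = [];
  // for...of iterates by code point, so characters outside the 16-bit range stay intact
  for (const char of text) {
    const prev = out[out.length - 1];
    // A mark with nothing before it is left unchanged
    out.push(isIterationMark(char) && prev !== undefined ? prev : char);
  }
  return out.join("");
}

console.log(expandIterationMarks("人々")); // 人人
console.log(expandIterationMarks("時々")); // 時時
```

This naive version always repeats a single character, so treat it as a starting point rather than a complete treatment of how the mark is used in real text.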
#### ❓ Why do I have to use Array.from(str) and for/of?
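Because JavaScript (and TypeScript) strings use UTF-16, and some of the more recent
additions to Unicode don't fit into 16 bits. Such characters are represented
with surrogate pairs, i.e. two 16-bit code units.
`Array.from()` and `for...of` were added in ES2015 and iterate over strings by code point, so they are Unicode-aware.
This library cannot change this JavaScript behavior, so you have to use these two approaches
and avoid methods that work on UTF-16 code units, such as `String.prototype.split("")`, `String.prototype.charAt()`, `String.prototype.charCodeAt()`, etc.
`String.prototype.codePointAt()` returns a full code point only when its index points at the first code unit of a character.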
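
To illustrate the difference, here is a small sketch that looks at the same string through code-unit-based and code-point-based APIs. It uses only built-in JavaScript features and does not depend on this library.

```js
// U+2000B (outside the 16-bit range) followed by U+5B57 (inside it)
const text = "\u{2000B}\u5B57";

// Code-unit view: the first character occupies two units.
console.log(text.length);                      // 3
console.log(text.charCodeAt(0).toString(16));  // d840 (high surrogate only)
console.log(text.codePointAt(1).toString(16)); // dc0b (index 1 lands mid-character)

// Code-point view: iteration yields whole characters.
const codePoints = Array.from(text, (char) => char.codePointAt(0).toString(16));
console.log(codePoints); // [ '2000b', '5b57' ]
```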
#### ❓ Can I detect language (Chinese/Japanese/Korean) for a given Han character?
No. Because of Han unification,
most CJK characters are represented with shared code points.
Each code point can be associated with multiple versions/variants of the same character,
including regional, stylistic, and other variations. In order to determine a language,
you need to know some context. For example, language can be set as an attribute
of a web page or a PDF document, or as a setting in an operating system.
This library doesn't provide methods to distinguish between languages.
#### ❓ Can I distinguish between Traditional and Simplified Chinese characters?
In some cases, yes. In others, traditional and simplified variants
share the same code points. See this article.
For a sufficiently long text, you can determine if it's traditional or simplified
by looking for code points that are used in only one of the two scripts.
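
A minimal sketch of that idea follows. The function `guessChineseVariant` and both character sets are hypothetical and hand-picked for illustration; they are not part of this library, and a usable detector would need far larger lists.

```js
// Tiny illustrative samples chosen by hand for this sketch.
const TRADITIONAL_ONLY = new Set(["這", "們", "書", "馬", "龍", "門", "語"]);
const SIMPLIFIED_ONLY = new Set(["这", "们", "书", "马", "龙", "门", "语"]);

// Counts characters that appear in only one of the two sets and reports
// whichever kind is more frequent.
function guessChineseVariant(text) {
  let traditional = 0;
  let simplified = 0;
  for (const char of text) {
    if (TRADITIONAL_ONLY.has(char)) traditional += 1;
    else if (SIMPLIFIED_ONLY.has(char)) simplified += 1;
  }
  if (traditional === simplified) return "unknown";
  return traditional > simplified ? "traditional" : "simplified";
}

console.log(guessChineseVariant("這是我們的書")); // traditional
console.log(guessChineseVariant("这是我们的书")); // simplified
console.log(guessChineseVariant("山水"));         // unknown
```

Short inputs often contain no distinguishing characters at all, which is why the sketch returns `"unknown"` instead of guessing. Some traditional forms are also used in Japanese text, so the result says nothing about the language.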
This library doesn't provide methods to distinguish between traditional and simplified scripts.