ubase.js is a JavaScript library for removing accents, diacritics
(and more) from utf8 strings.
Many utf8 characters are "based" on latin letters. That is clear for
accented letters like "é", which is based on "e", but it also holds for rarer
symbols like "🅴" or "Ǝ"! The idea of this simple library is to give
you back the base letter of these characters.
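
For comparison, here is a minimal sketch of the usual built-in approach: Unicode NFD normalization followed by removal of combining marks. This is not how ubase.js works, and `stripCombiningMarks` is a hypothetical name used only for illustration. It shows why a dedicated table of base letters helps: letters such as "ø" or "Ǝ" have no canonical decomposition, so the built-in approach leaves them unchanged.

```js
// Built-in approach: decompose, then drop combining diacritical marks.
// Hypothetical helper, not part of ubase.js.
function stripCombiningMarks(s) {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

console.log(stripCombiningMarks("\u00e9")); // "é" becomes "e"
console.log(stripCombiningMarks("\u00f8")); // "ø" stays "ø": no decomposition
console.log(stripCombiningMarks("\u018e")); // "Ǝ" stays "Ǝ" for the same reason
```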
```
npm install ubase.js
```
or simply copy the `ubase.js` file where you need it.

## Library

You just need the `ubase.js` file. Usage is straightforward. The main
function is `basify`:
```js
> const ubase = require ("ubase.js");
undefined
> ubase.basify ('Bøǹĵöůɍ');
'Bonjour'
```
If you just copied the `ubase.js` file to your current directory,
replace the first line above with:
```js
> const ubase = require ("./ubase.js");
```
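
As an illustration of a typical use, the sketch below builds an accent-insensitive, case-insensitive comparison on top of a `basify`-like function. `makeLooseEquals` is a hypothetical helper, not part of ubase.js. It takes the function as a parameter so the snippet does not depend on how the library was loaded.

```js
// Hypothetical helper: compare two strings by their base letters,
// ignoring case. `basify` is any function mapping a string to its
// base-letter form, for instance the one exported by ubase.js.
function makeLooseEquals(basify) {
  return (a, b) => basify(a).toLowerCase() === basify(b).toLowerCase();
}

// With the library loaded as shown above (not run here):
//   const looseEquals = makeLooseEquals((s) => ubase.basify(s));
//   looseEquals('Bøǹĵöůɍ', 'bonjour');
```

Wrapping the call as `(s) => ubase.basify(s)` avoids assuming that `basify` can be detached from the module object.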

You may control the behaviour of `basify` in case of malformed
utf8, or non-latin characters:

+ `set_malformed ( s )`: the given string `s` will be used to replace
  any malformed utf8 char (which should almost never happen in
  JavaScript). Default is `'?'`.

+ `set_strip ( s )`: `s` can be either a string, or `undefined`. If `s`
  is a string, it will replace any non-ASCII utf8 char that is not
  based on a latin char, like '→'. It is allowed for `s` to be the
  empty string (hence the name "strip"). If `s` is `undefined`, no
  replacement takes place (this is the default).

If the strings given to `set_malformed` and `set_strip` both contain only
ASCII characters, then the result of `basify` is guaranteed to contain only
ASCII characters.
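
The ASCII guarantee can be sketched as follows. `asciiOnly` is a hypothetical helper, and the snippet assumes that `set_malformed`, `set_strip` and `basify` are all exposed on the same module object as `basify` in the example above. The module is passed in as a parameter so the sketch stays independent of how it was loaded.

```js
// Hypothetical helper: configure an ubase-like module so that basify
// returns ASCII only, then check that guarantee on the result.
function asciiOnly(ubase, text) {
  ubase.set_malformed("?"); // ASCII replacement for malformed utf8
  ubase.set_strip("");      // drop non-ASCII chars that have no latin base
  const result = ubase.basify(text);
  // Sanity check: every code unit must be in the ASCII range.
  if (!/^[\x00-\x7f]*$/.test(result)) {
    throw new Error("unexpected non-ASCII output: " + result);
  }
  return result;
}

// With the real library (not run here):
//   asciiOnly(require("ubase.js"), "Bøǹĵöůɍ → ok");
```

Note that the two setters change the module's configuration, so later calls to `basify` keep using these settings.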

Other helper functions:

+ `isolatin_to_utf8 ( s )`: convert the isolatin-encoded string `s` to
  utf8.

+ `cp1252_to_utf8 ( s )`: convert the cp1252-encoded (aka Windows
  encoding) string `s` to utf8.
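
To help choose between the two helpers, the sketch below shows one byte that the two encodings interpret differently. It uses only Node.js built-ins (`TextDecoder` and `Buffer`), not the ubase.js helpers, and assumes a Node.js build with full ICU data (the default in current releases).

```js
// One byte that isolatin (ISO-8859-1) and cp1252 decode differently.
const bytes = Uint8Array.from([0x80]);

// cp1252 maps 0x80 to the euro sign U+20AC.
const asCp1252 = new TextDecoder("windows-1252").decode(bytes);

// ISO-8859-1 maps each byte to the code point of the same value,
// so 0x80 becomes the control character U+0080.
const asLatin1 = Buffer.from(bytes).toString("latin1");

console.log(asCp1252 === "\u20ac"); // true
console.log(asLatin1 === "\u0080"); // true
```

Bytes in the range 0x80 to 0x9F are where the two encodings disagree, so text containing characters such as "€" or curly quotes is more likely cp1252 than isolatin.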

## Executable

The standalone executable version of ubase is `ubasex.js`. You can
test it with `node`:
```
$ node ubasex.js Bøǹĵöůɍ
Bonjour
```

This library is automatically generated from the
OCaml ubase version using
js-of-ocaml.

`ubase.js` covers more than 2000 utf8 chars, so it should be quite
complete. File an issue if some character is not properly basified.