Basic user input sanitization library
`npm install s10n`

> s10n stands for _"sanitization"_.
> Just like _l10n_ stands for _"localization"_.
> See also i18n, l10n et al
A library to make basic user input sanitization
and subsequent validation an easier job.
- Use cases
- Example 1. Username
- Example 2. Arbitrary text
- API
- Modifiers
- Treating line break characters
- Line break character
- Elementary transformers
- Transform whitespaces
- Handle line breaks
- Keep/Remove/Replace
- Other transformations
- Compound transformers
- Semantic sanitizers
- Custom transformations
- Getting sanitized value
- Utility methods
Sanitization is NOT validation, but
it can help make validation an easier job
and/or help to suggest to a user an input variation
that better matches input expectations or requirements.
As with validation, sanitization, if in place, should
be applied on both the frontend and the backend, since a user
can bypass sanitization and validation on the frontend and
send input directly to a backend endpoint.
Let's assume the following scenario of a username input.
The rule is that only a-z, A-Z, numbers, underscore and dash
are expected in valid input.
A user submits the string `#UsEr #$%"' NaMe 5_6-9`.
The input is rejected as invalid, the rule is presented to the user,
and the user is expected to remove all invalid characters.
The input then becomes the valid string `UsErNaMe5_6-9`.
Alternatively, an app might have suggested (or enforced)
a valid input. The examples below demonstrate the
default and tuned behaviour of a relevant semantic sanitizer
(in the tuned variant, spaces get replaced with underscores).
```javascript
let input = " UsEr #$%' NaMe 5_6-9 ";
s10n(input).keepUsername().value; // "UsErNaMe5_6-9"
s10n(input).keepUsernameLC().value; // "username5_6-9"
s10n(input).keepUsername("_").value; // "UsEr_NaMe_5_6-9"
s10n(input).keepUsernameLC("_").value; // "user_name_5_6-9"
```
Semantic sanitizers applied are a combination of elementary and compound
transformers with an optional parameter to replace spaces
(in this particular use case).
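
The sanitize-then-validate flow described above can be sketched in plain JavaScript, without the library. This is only an illustration under assumptions: `suggestUsername`, `checkUsername` and the 3 to 20 character length rule are hypothetical and are not part of s10n.

```javascript
// Hypothetical helpers, not part of s10n: validate the raw input,
// and offer a sanitized suggestion when validation fails.
const USERNAME_RULE = /^[A-Za-z0-9_-]{3,20}$/; // assumed validation rule

function suggestUsername(raw, spaceReplacement = "") {
  return raw
    .replace(/[^A-Za-z0-9_\s-]/g, "") // drop everything outside the allowed set
    .trim() // remove leading and trailing whitespace
    .replace(/\s+/g, spaceReplacement); // merge and replace remaining whitespace
}

function checkUsername(raw) {
  const suggestion = suggestUsername(raw, "_");
  return {
    valid: USERNAME_RULE.test(raw), // validation of the input as submitted
    suggestion, // sanitized variant to offer to the user
    suggestionValid: USERNAME_RULE.test(suggestion),
  };
}

const usernameCheck = checkUsername(" UsEr #$%' NaMe 5_6-9 ");
// usernameCheck.valid is false, usernameCheck.suggestion is "UsEr_NaMe_5_6-9"
```

The same check would run again on the backend, since the frontend step can be bypassed.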
Let's assume the input received from a user is
" \n\r\n \u200B\u200C\u200D\u2060 \t\uFEFF\xA0 Sensible text \n Line 2 \n\r\n\r\r "
Here are issues worth attention and optimization:
- it contains problematic whitespaces
- it contains sequences of 2 or more whitespaces
- it contains leading and trailing whitespaces
- there is a variety of line break characters,
potentially hazardous (CRLF injection)
- there are leading and trailing empty lines
- line break characters are invalid in a one line input
Any of the above can be considered unnecessary
contamination of the data.
With all issues fixed, the above input would be:
- `"Sensible text\nLine 2"` for a multiline input
- `"Sensible text Line 2"` for a simple string input
```javascript
let input = " \n\r\n \u200B\u200C\u200D\u2060 \t\uFEFF\xA0 Sensible text \n Line 2 \n\r\n\r\r ";
s10n(input).minimizeWhitespaces().value; // "Sensible text Line 2"
s10n(input)
  .preserveLineBreaks() // modifier for subsequent methods to preserve line breaks
  .minimizeWhitespaces().value; // "Sensible text\nLine 2"
```
`minimizeWhitespaces` does the following:
- normalizes line break characters, i.e.
  CRLF (`\r\n`) and individual CR (`\r`)
  are converted into LF (`\n`) (default behaviour)
- normalizes whitespaces into the standard space character (`\x20`)
- merges continuous whitespaces into a single space character
- normalizes lines in a multiline input
  (strips leading and trailing spaces in each line of a multiline input)
- trims leading and trailing whitespaces
- trims leading and trailing line breaks
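
The steps listed above can be approximated with a few regular expressions. The sketch below is an assumption-laden illustration, not s10n's actual implementation: `minimizeWhitespacesSketch` and its `preserveLineBreaks` flag are hypothetical names, and the whitespace set is limited to the characters used in this example.

```javascript
// Illustrative only: the library's real implementation may differ.
function minimizeWhitespacesSketch(input, preserveLineBreaks = false) {
  let value = input.replace(/\r\n|\r/g, "\n"); // normalize CRLF and CR to LF
  if (!preserveLineBreaks) {
    value = value.replace(/\n/g, " "); // single-line mode: line breaks count as whitespace
  }
  return value
    .replace(/[ \t\u200B\u200C\u200D\u2060\uFEFF\xA0]+/g, " ") // normalize and merge whitespace runs
    .replace(/ ?\n ?/g, "\n") // strip spaces around line breaks
    .replace(/\n+/g, "\n") // merge clusters of line breaks
    .replace(/^[ \n]+|[ \n]+$/g, ""); // trim leading and trailing spaces and line breaks
}

const messyText =
  " \n\r\n \u200B\u200C\u200D\u2060 \t\uFEFF\xA0 Sensible text \n Line 2 \n\r\n\r\r ";
const singleLine = minimizeWhitespacesSketch(messyText); // "Sensible text Line 2"
const multiLine = minimizeWhitespacesSketch(messyText, true); // "Sensible text\nLine 2"
```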
s10n offers a number of elementary, compound and semantic
transformers and sanitizers as well as a method to apply
an arbitrary sanitizer.
Below are the usage examples to give a general impression
of the API.
```javascript
s10n(" Some text \n Yet basically valid \n\n ")
  .preserveLineBreaks()
  .minimizeWhitespaces().value; // "Some text\nYet basically valid"
s10n(" My User Name ").keepUsernameLC("_").value; // "my_user_name"
let input = " Some arbitrary \t \xA0 text ";
s10n(input)
  .normalizeWhitespaces()
  .trim().value; // "Some arbitrary text"
s10n(input)
  .mergeWhitespaces()
  .trim().value; // "Some arbitrary text"
```
Modifiers affect behaviour of subsequent transformers.
#### Treating line break characters
Defines whether to preserve or disregard line break
characters when applying transformers.
Default behaviour is to disregard line break characters.
This setting doesn't affect some transformers (e.g. trimLineBreaks()).
These are marked correspondingly.
```javascript
let input = " \n\n\n ";
s10n(input).trim().value; // ""
s10n(input)
  .preserveLineBreaks()
  .trim().value; // "\n\n\n"
```
Call `disregardLineBreaks` when subsequent sanitizers should
disregard line breaks after any preceding transformations
have been affected by `preserveLineBreaks`.
#### Line break character
By default, whenever any sanitizer affects line break characters,
`\n` is considered the valid or target line break character.
This behaviour can be changed for subsequent sanitizers
(e.g. `setLineBreakCharacter('\r')`).
Whenever line breaks in a string get normalized,
CRLF (`\r\n`) is converted into a single line break character
(`\n` by default, or a value assigned by the `setLineBreakCharacter` method).
```javascript
let input = "\r\n\n\n\r\r";
s10n(input).normalizeLineBreaks().value; // "\n\n\n\n\n"
s10n(input)
  .setLineBreakCharacter("\r")
  .normalizeLineBreaks().value; // "\r\r\r\r\r"
```
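
The normalization rule above (CRLF counts as one line break, and every line break becomes the target character) can be expressed with a single regular expression. A minimal sketch, assuming a hypothetical standalone function `normalizeLineBreaksSketch` rather than the library's own code:

```javascript
// Hypothetical standalone equivalent of the rule described above.
function normalizeLineBreaksSketch(input, lineBreakCharacter = "\n") {
  // "\r\n" is listed first so that a CRLF pair is replaced once, not twice
  return input.replace(/\r\n|\r|\n/g, lineBreakCharacter);
}

const mixedBreaks = "\r\n\n\n\r\r";
const asLF = normalizeLineBreaksSketch(mixedBreaks); // "\n\n\n\n\n"
const asCR = normalizeLineBreaksSketch(mixedBreaks, "\r"); // "\r\r\r\r\r"
```

The order of alternatives matters: with `\r` tried before `\r\n`, each CRLF would turn into two line breaks.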
Elementary transformers have a deliberately limited scope of responsibility.
They are normally used for basic transformations
and as building blocks by compound transformers,
semantic sanitizers and custom transformers/sanitizers.
#### Transform whitespaces
s10n treats an extended set of characters, including
`\x20\u200B\u200C\u200D\u2060\uFEFF\xA0`, as whitespaces.
Characters `\n` and `\r` are not considered whitespaces
when the `preserveLineBreaks` modifier is applied.
- `trim()` - removes leading and trailing whitespaces
- `trimLineBreaks()` - always removes leading and trailing
  line break characters, disregarding the LineBreak modifier setting
- `mergeLineBreaks()` - normalizes and merges consecutive line breaks,
  disregarding the LineBreak modifier setting
- `normalizeWhitespaces()` - all whitespaces are
  converted into space characters (`\x20`)
- `mergeWhitespaces()` - merges continuous clusters of whitespaces
  into a single space character (`\x20`)
- `stripWhitespaces()` - strips all whitespaces from input
Examples:
```javascript
let input = "\n Z\tY \x0A \n X W\uFEFFV \n\n \n";
s10n(input).trim().value; // "Z\tY \x0A \n X W\uFEFFV"
s10n(input)
  .preserveLineBreaks()
  .trim().value; //
s10n(input).trimLineBreaks().value; // " Z\tY \x0A \n X W\uFEFFV \n\n "
s10n(input)
  .preserveLineBreaks()
  .trimLineBreaks().value; //
s10n("\n\r\r\r\n\nfoo\n\r\nbar\n\r\r\r\n\n").mergeLineBreaks().value; // "\nfoo\nbar\n"
s10n("\n \r\r\r\n\nfoo\n \r\nbar\n\r\r \r\n\n").mergeLineBreaks().value; // "\n \nfoo\n \nbar\n \n"
s10n(input).normalizeWhitespaces().value; // " Z Y X W V "
s10n(input)
  .preserveLineBreaks()
  .normalizeWhitespaces().value; // "\n Z Y \n X W V \n\n \n"
s10n(input).mergeWhitespaces().value; // " Z Y X W V "
s10n(input)
  .preserveLineBreaks()
  .mergeWhitespaces().value; // "\n Z Y \n X W V \n\n \n"
s10n(input).stripWhitespaces().value; // "ZYXWV"
s10n(input)
  .preserveLineBreaks()
  .stripWhitespaces().value; // "\nZY\nXWV\n\n\n"
s10n(input)
  .preserveLineBreaks()
  .normalizeWhitespaces()
  .mergeWhitespaces()
  .trimLineBreaks()
  .trim().value; // "Z Y \n X W V \n\n"
```
See also normalizeLineBreaks()
#### Handle line breaks
- `normalizeLineBreaks(lineBreakCharacter = undefined)` -
  transforms CRLF, CR, LF into a line break character defined following the rules below:
  - as specified by the `lineBreakCharacter` argument
  - if the `lineBreakCharacter` param is undefined, then as set by `setLineBreakCharacter()`
  - if `setLineBreakCharacter()` wasn't applied, then defaults to LF (`'\n'`)
- `normalizeMultiline()` -
  strips whitespaces that immediately precede
  or follow line break characters;
  ignores the LineBreak modifier setting
Examples:
```javascript
let input = "\r\n\r abc \r\n def \r \t ghi \n \t\t \n \r\n\n\r\n\n\r\r \r\r\n";
s10n(input).normalizeLineBreaks().value; // "\n\n abc \n def \n \t ghi \n \t\t \n \n\n\n\n\n\n \n\n"
s10n(input).normalizeMultiline().value; // "\r\n\rabc\r\ndef\rghi\n\n\r\n\n\r\n\n\r\r\r\r\n"
s10n(input)
  .normalizeLineBreaks()
  .normalizeMultiline().value; // "\n\nabc\ndef\nghi\n\n\n\n\n\n\n\n\n\n"
```
See also minimizeWhitespaces()
#### Keep/Remove/Replace
These methods' behaviour is NOT affected
by the LineBreak modifier
(disregarded by default, i.e. the `\s` RegExp token comprises `\r` and `\n`).
Specify `\n` and/or `\r`
explicitly whenever those should be kept or removed.
The method argument should follow the RegExp character class
specification.
- keepOnlyCharset(allowedChars = "-A-Za-z0-9_\\x20.,}{\\]\\)(", regexpFlags) -
keep listed characters only
- keepOnlyRegExp(regexp, regexpFlags) - keep characters as per RegExp
(RegExp object or regexp body as a string)
- remove(disallowedChars, regexpFlags) - remove listed characters
- replace(needle, replacement = "", regexpFlags) -
replaces a needle (which is a string, or a RegExp object)
with the replacement string
`regexpFlags` in the methods above is an optional parameter and
defaults to the flags as specified in `_regexp` (`"gu"`).
Examples:
```javascript
let input1 = "ABCDabcd01239 _-.,(abcd){defg}[hijk]";
s10n(input1).keepOnlyCharset("}{][)(").value; // "(){}[]"
s10n(input1).keepOnlyRegExp(/\{.*?\}|\[.*?\]|\(.*?\)/gu).value; // "(abcd){defg}[hijk]"
let input2 = "ABCDEFGHabcdefghABCDEFGHabcdefgh";
s10n(input2).remove("ABCD").value; // "EFGHabcdefghEFGHabcdefgh"
s10n(input2).remove("ABCD", "giu").value; // "EFGHefghEFGHefgh"
s10n(input2).remove(/ABCD/).value; // "EFGHabcdefghABCDEFGHabcdefgh"
s10n(input2).remove(/ABCD/giu).value; // "EFGHefghEFGHefgh"
s10n(input2).remove(/ABCD/, "giu").value; // "EFGHefghEFGHefgh"
```
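
To make the "character class" wording concrete, here is one way a charset string could be turned into a RegExp. This is a sketch under assumptions, not the library's code: `keepOnlyCharsetSketch` and `removeSketch` are hypothetical names, and the square brackets are escaped explicitly because the string is placed inside `[...]` as is.

```javascript
// Hypothetical helpers showing how a charset string maps onto a character class.
function keepOnlyCharsetSketch(input, allowedChars, flags = "gu") {
  // a negated class removes every character that is NOT listed
  return input.replace(new RegExp(`[^${allowedChars}]`, flags), "");
}

function removeSketch(input, disallowedChars, flags = "gu") {
  // a plain class removes every character that IS listed
  return input.replace(new RegExp(`[${disallowedChars}]`, flags), "");
}

const bracketsOnly = keepOnlyCharsetSketch(
  "ABCDabcd01239 _-.,(abcd){defg}[hijk]",
  "}{\\]\\[)(" // "]" and "[" are escaped so that they stay inside the class
); // "(){}[]"
const caseSensitive = removeSketch("ABCDEFGHabcdefgh", "ABCD"); // "EFGHabcdefgh"
const caseInsensitive = removeSketch("ABCDEFGHabcdefgh", "ABCD", "giu"); // "EFGHefgh"
```

Since the argument is interpreted as a class body, ranges such as `A-Z` work, and a literal `-`, `]` or `\` has to be escaped or positioned accordingly.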
#### Other transformations
- toLowerCase() - converts to lower case
- toUpperCase() - converts to upper case
Examples:
```javascript
let input = "aBcD01";
s10n(input).toLowerCase().value; // "abcd01"
s10n(input).toUpperCase().value; // "ABCD01"
```
Compound transformers implement complex
transformation rules applying multiple transformations,
often using elementary transformers.
- `keepBase10Digits()` - strips out anything but `0-9`
- `keepBase16Digits()` - (alias: `keepHexDigits()`) -
  strips out anything but `0-9a-fA-F`;
  best chained with `toLowerCase()` or `toUpperCase()`
  for a consistent result
- `minimizeWhitespaces()` - removes leading, trailing
  and continuous clusters of whitespaces and line breaks;
  when preceded with `preserveLineBreaks()` treats input as
  a multiline string and thus trims spaces in every line
Examples:
```javascript
let input1 = " XYZ 20fE\n\n ";
s10n(input1).keepBase10Digits().value; // "20"
s10n(input1).keepBase16Digits().value; // "20fE"
s10n(input1)
  .keepBase16Digits()
  .toLowerCase().value; // "20fe"
s10n(input1)
  .keepHexDigits()
  .toLowerCase().value; // "20fe"
let input2 = " Some text \n Yet basically valid \n\n ";
s10n(input2).minimizeWhitespaces().value; // "Some text Yet basically valid"
s10n(input2)
  .preserveLineBreaks()
  .minimizeWhitespaces().value; // "Some text\nYet basically valid"
```
Semantic sanitizers implement semantically meaningful
yet heavily opinionated sanitization rules for particular use cases.
- `keepOnlyEmailPopularCharset(commonUse = false)` - by default keeps a charset
  as per rfc (`` A-Za-z0-9_\\-@.+)( \":;<>\\\\,\\[\\]}{!#$%&'*/=?^|~ ` ``);
  pass `true` if a lesser (more common) charset (`A-Za-z0-9_\-@.+)( `)
  fits your particular use case better
- `keepUsername(whiteSpaceReplacement = "")` - keeps
  only `a-zA-Z0-9_-`; whitespaces are stripped, or
  are merged and replaced with `whiteSpaceReplacement` if any is
  specified
- `keepUsernameLC(whiteSpaceReplacement = "")` - same as `keepUsername`
  but the result is converted to lower case
Examples:
```javascript
let input = " UsEr #$%\"' NaMe + 5_6-9 @ .Co.Uk ";
s10n(input).keepOnlyEmailPopularCharset(true).value; // "UsErNaMe+5_6-9@.co.uk"
s10n(input)
  .keepOnlyEmailPopularCharset(true)
  .toLowerCase().value; // "username+5_6-9@.co.uk"
s10n(input).keepUsername().value; // "UsErNaMe5_6-9CoUk"
s10n(input).keepUsernameLC().value; // "username5_6-9couk"
s10n(input).keepUsername("_").value; // "UsEr_NaMe_5_6-9_CoUk"
s10n(input).keepUsernameLC("_").value; // "user_name_5_6-9_couk"
```
Note: the sanitized email input is still invalid, but it is (arguably)
easier to double-check and fix.
What if those semantic sanitizers do not fit my needs?
Consider implementing a customized transformer.
A custom transformer is a method to apply complex sanitization
logic using elementary or compound transformers, semantic sanitizers
or applying a completely unique rule set.
- `apply(callback, ...arguments)` - callback will receive
  current value, calling context (reference to current s10n object as `this`),
  and any extra arguments passed
- `extend(methodName, method)` -
  registers a re-usable custom transformation method
  - `extend` should be called on the s10n object itself rather than in
    a sanitization chain
  - the method is accessible at every sanitization chain once registered
  - the method should transform `this.value` and/or call other
    built-in or registered custom transformers/sanitizers
  - the method should return `this` to make it chainable
  - do not define the method as an arrow function
Example:
```javascript
s10n("c00l").apply(
  (value, context, needle, replacement) => value.replace(context._regexp(needle), replacement),
  "0",
  "o"
).value; // "cool"
s10n.extend("makeCool", function() {
  // replaces 'o' and 'O' followed with whitespaces (extended set) with a single '0'
  this.replace(this._regexp("o\\s+", "gi"), "0");
  return this;
});
s10n("coO\x0A o\t l").makeCool().value; // "co00l"
```
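
The rules for `extend` (transform `this.value`, return `this`, no arrow functions) follow from how chainable methods generally work in JavaScript. The miniature class below is a hypothetical stand-in written for illustration; it is not s10n's source code, and `MiniChain` and `shout` are made-up names.

```javascript
// Hypothetical miniature chain, used only to illustrate the extension rules.
class MiniChain {
  constructor(value) {
    this.value = String(value);
  }
  trim() {
    this.value = this.value.trim();
    return this; // returning `this` is what keeps the chain going
  }
  static extend(methodName, method) {
    // registered once on the prototype, available to every later chain
    MiniChain.prototype[methodName] = method;
  }
}

// A regular function gets the chain instance as `this` when called as a method.
// An arrow function would not: it has no `this` binding of its own.
MiniChain.extend("shout", function () {
  this.value = this.value.toUpperCase() + "!";
  return this;
});

const shouted = new MiniChain("  hello ").trim().shout().value; // "HELLO!"
```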
Getting the sanitized value (as a string)
is as simple as terminating the
transformation chain with `.value`.
E.g. `s10n(" my User Name ").keepUsernameLC().value`.
In a string context `.value` is optional, as a string
is returned by default.
E.g. `` `Username: ${s10n(" my User Name ").keepUsernameLC()}` ``
or `s10n(" my User Name ").keepUsernameLC() + ''`.
Explicit value access methods:
- value - value as is
- `toString()` - same as `.value`
- `toNumber()` - converts the sanitized string
  into a Number. Use with caution as it will return `NaN`
  if the sanitized string contains anything else but a valid
  Number literal.
Examples:
```javascript
let input = "65";
s10n(input).value; // "65"
s10n(input).toString(); // "65"
`${s10n(input)}`; // "65"
s10n(input) + ""; // "65"
s10n(input).toNumber(); // 65
```
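
The "string context" behaviour relies on a standard JavaScript mechanism: template literals and string concatenation call an object's `toString()`. A minimal sketch with a hypothetical `ValueHolder` class (an assumption for illustration, not the library's internals), where `toNumber()` is assumed to behave like the built-in `Number()` conversion:

```javascript
// Hypothetical wrapper illustrating implicit string conversion.
class ValueHolder {
  constructor(value) {
    this.value = String(value);
  }
  toString() {
    return this.value; // used by template literals and by `+ ""`
  }
  toNumber() {
    return Number(this.value); // assumed conversion; NaN for non-numeric text
  }
}

const holder = new ValueHolder("65");
const viaTemplate = `${holder}`; // "65"
const viaConcat = holder + ""; // "65"
const asNumber = holder.toNumber(); // 65
const notNumeric = new ValueHolder("65 apples").toNumber(); // NaN
```

One caveat of the built-in `Number()` used in this sketch: an empty string converts to `0`, not `NaN`, so an emptied input may need a separate check.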
- `_regexp(patternString, flags = "gu")` -
  using this utility will ensure that `\s` entities
  in the pattern string are replaced with an extended set
  of whitespaces. Recommended for use in an `apply` callback.
Example:
```javascript
s10n("\t \xA0 ABC\n\t \uFEFF").apply((value, context) =>
  // replaces extended set of whitespaces with dashes
  value.replace(context._regexp("\\s"), "-")
).value; // "----ABC----"
```
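
Why an extended set matters: the built-in `\s` token does not match zero-width characters such as `\u200B`. The sketch below shows one naive way a `\s` token in a pattern string could be widened. `extendedRegexp` and `EXTENDED_WS` are hypothetical names, and this is not the library's `_regexp` implementation; in particular it does not handle `\s` that already sits inside a character class.

```javascript
// Naive illustration: widen every "\s" token in a pattern string.
const EXTENDED_WS = "[\\s\\u200B\\u200C\\u200D\\u2060]"; // assumed extended set

function extendedRegexp(patternString, flags = "gu") {
  // a replacer function is used so the replacement is inserted literally
  return new RegExp(patternString.replace(/\\s/g, () => EXTENDED_WS), flags);
}

const plainMatchesZeroWidth = /\s/.test("\u200B"); // false
const extendedMatchesZeroWidth = extendedRegexp("\\s").test("\u200B"); // true
const dashed = "\t \xA0 ABC\n\t \uFEFF".replace(extendedRegexp("\\s"), "-"); // "----ABC----"
```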
Refer to CONTRIBUTING.md for details.