purify-html

![codecov](https://codecov.io/gh/oleksandr-dukhovnyy/purify-html)
!GitHub issues
!GitHub closed issues

A minimalist client library for cleaning up strings so they can be safely used as HTML.

Do simple things simply: zero configuration to completely strip a string of HTML, with the ability to incrementally customize as needed.

Try it! (demo at codesandbox)

The basic idea is to use the browser API to parse and modify the DOM.
Thus, several goals are achieved at once:

- Actual native parser. The library uses the HTML parser that the browser will use to parse the HTML. Thus, there will be no situation when the payload will work in the browser, but does not work in the parser that the library uses.
- Huge bundle size savings. Since the HTML standard is quite extensive, a quality parser is a rather heavy program that can be easily dispensed with.
- Speed of work. The parser inside DOMParser is native, i.e. written not in JavaScript, but in high-performance C++ (for example, in v8). Thus, parsing is faster than any JavaScript library.
- High % browser support. Although DOMParser has extensive support, you can use a polyfill or even your own parser. See parserSetup for details.

As a result, parsing with DOMParser is more reliable, faster and does not require precious kilobytes of space in the build.

---

Install

npm

``

bash

npm install purify-html





yarn

bash

yarn add purify-html

CDN

html





---



API

javascript

import PurifyHTML, { setParser } from 'purify-html';



const sanitizer = new PurifyHTML(options);

$3

javascript

setParser({

  parse(str: string): Element

  stringify(elem: Element): string

}): void





See here for details.








$3



####

sanitize(string): string





Performs string cleanup according to the rules passed in

options

javascript

import PurifyHTML, { setParser } from 'purify-html';



const sanitizer = new PurifyHTML(options);



const untrustedString = '...';



console.log(sanitizer.sanitize(untrustedString));










####

toHTMLEntities(string): string





Coerces a string to HTML entities. Very similar to escaping, but for the HTML interpreter.

In HTML, such characters will be rendered "as is".

See more here.

javascript

const str = '
';



console.log(

  sanitizer.toHTMLEntities(str); // => '<br />', displays at the page like '
'

);





---



$3



An array with rules for the sanitizer.

typescript

// Deprecated!

type AttributeRulePresetName =

  | '%correct-link%'

  | '%http-link%'

  | '%https-link%'

  | '%ftp-link%'

  | '%https-link-without-search-params%'

  | '%http-link-without-search-params%'

  | '%same-origin%';



interface AttributeRule = {

  // attribute name

  name: string;



  // rules for attribute value

  value?:

    | string

    | string[]

    | RegExp

    | { preset: AttributeRulePresetName } // Deprecated!

    | ((attributeValue: string) => boolean);



};



interface TagRule = {

  // tagname

  name: string;



  // rules for attributes

  attributes: AttributeRule[];



  // Don't remove comments in THIS node.

  // Comments in children nodes will be saved

  dontRemoveComments?: boolean;

  /*

    Example:



    Config

    [

      { name: 'div', dontRemoveComments: true },

      'span'

    ]



    Input:

    

      

        

    



    Output:

    

      

        

    

  */

};

javascript

import PurifyHTML, { setParser } from 'purify-html';



const sanitizer = new PurifyHTML([

  'hr',

  { name: 'br' },

  { name: 'img', attributes: [{ name: 'src' }] },

  {

    name: 'a',

    attributes: [

      { name: 'target', value: ['_blank', '_self', '_parent', '_top'] },

      { name: 'href', value: /^https?:\/\/*/ },

    ],

  },

]);





NOTE When using regular expressions to check for untrusted strings, don't forget to check your regular expressions for ReDoS vulnerabilities.

_A successful exploit of the ReDoS vulnerability is to cause the program to hang when trying to parse a specially crafted string._ 


See more: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS



Examples



via bundler:

javascript

import PurifyHTML from 'purify-html';



const allowedTags = [

  // only string

  'hr',



  // as object

  { name: 'br' },



  // attributes check

  { name: 'b', attributes: ['class'] },



  // advanced attributes check

  { name: 'p', attributes: [{ name: 'class' }] },



  // check attributes values (string)

  { name: 'strong', attributes: [{ name: 'id', value: 'awesome-strong-id' }] },



  // check attributes values (RegExp)

  { name: 'em', attributes: [{ name: 'id', value: /awesome-em-id?s/ }] },



  // check attributes values (array of strings)

  {

    name: 'em',

    attributes: [

      { name: 'id', value: ['awesome-strong-id', 'other-awesome-strong-id'] },

    ],

  },



  // check attribute value (function)

  {

    name: 'em',

    attributes: [{ name: 'id', value: value => value.startsWith('awesome-') }],

  },



  // use attributes checks preset (Deprecated)

  {

    name: 'a',

    attributes: [{ name: 'href', value: { preset: '%https-link%' } }], // presets are deprecated

  },

];



const sanitizer = new PurifyHTML(allowedTags);



const dangerString =

img src="1" onerror="alert(1)">

Bold
Bold
Bold

123

;



const safeString = sanitizer.sanitize(dangerString);



console.log(safeString);



/*

  <img src="1" onerror="alert(1)">

  



  Bold

  Bold

  Bold

  123

  

  

*/





Try it.



---



$3

html





Usage for the browser is slightly different from usage with faucets. This is bad, but it had to be done in order not to clog the global scope.



---



HTML comments



$3



For example the line:

.



Technically, inserting it into the DOM will not lead to code execution, but it cannot be considered safe either. The result of the

sanitize

 method is declared to be sanitized using the rules specified when the sanitizer was initialized.



> NOTE


> CDATA comments (or CDATA sections), although made for XML, are also supported in HTML. In

purify-html

 they are treated as regular HTML comments.
 See more about CDATA Section on MDN



Therefore, you are given the opportunity to control in which places you will leave comments, and in which not.



$3



By default, HTML comments are stripped. You can change it like this:

javascript

import PurifyHTML from 'purify-html';



const sanitizer = new PurifyHTML(['#comments' / ... /]);



sanitizer.sanitize(/ ... /);





If you want comments to be removed everywhere except for specific tags, then you can specify it like this:

javascript

import PurifyHTML from 'purify-html';



const rules = ['#comments', { name: 'div', dontRemoveComments: true }];

const sanitizer = new PurifyHTML(rules);



sanitizer.sanitize(/ ... /);





node-js



When used in an environment where the standard DOMParser is absent, you need to install a parser manually.



For example:

js

import { JSDOM } from 'jsdom';

global.DOMParser = new JSDOM().window.DOMParser;



import PurifyHTML from 'purify-html';

const sanitizer = new PurifyHTML(); // works

Or

js

import { JSDOM } from 'jsdom';

import PurifyHTML, { setParser } from 'purify-html';



// Scope elem variable, reuse DOMParser instance for performance

{

  const elem: Element = new DOMParser()

    .parseFromString('', 'text/html')

    .querySelector('body');



  // Set methods

  setParser({

    parse(string: string): Element {

      elem.innerHTML = string;



      return elem;

    },

    stringify(elem: Element): string {

      return elem.innerHTML;

    },

  });

}





setParser



In some cases, you may want to be able to use your parser instead of DOMParser.



This can be done like this:

javascript

import PurifyHTML, { setParser } from 'purify-html';



setParser({

  parse(HTMLString: string): HTMLElement {

    // ...

  },

  stringify(element: HTMLElement): string {

    // ...

  },

});



const sanitizer = new PurifyHTML();

// ...





Note! The root element will be passed to the

stringify

 function, and the CONTENT of the element will be expected as a result.

javascript

const input = document.createElement('div');



input.innerHTML = 'span';



stringify(input); // 'span' => OK

stringify(input); // 'span' => WRONG

---

`Why not` document.createElement(...)





Because by processing the string like this:

javascript

const parse = str => {

  const node = document.createElement('div');

  div.innerHTML = str;



  return div;

};





In fact, this function, having received a special payload, will RUN it. The following payload will send a network request:





And in the case of using DOMParser, the code does not run.



---



Attributes checks preset list



-

%correct-link%

 - only currect link

-

%http-link%

 - only http link

-

%https-link%

 - only https link

-

%ftp-link%

 - only ftp link

-

%https-link-without-search-params%

 - delete all search params and force https protocol

-

%http-link-without-search-params%

 - delete all search params and force https protocol

-

%same-https-origin% - only link that lead to the same origin that is currently in self.location.origin

. + force https protocol

-

%same-http-origin% - only link that lead to the same origin that is currently in self.location.origin`. + force http protocol

---

Browser support

Although browser support is already over 97%, the specification for DOMParser is not yet fully established. More details.

!Caniuse

---

If you find this project helpful, please give it a ⭐️ on GitHub to show your support. I would also appreciate it if you shared it with others who might find it useful!

purify-html

Install

npm

``

bash

npm install purify-html





yarn

bash

yarn add purify-html

CDN

html





---



API

javascript

import PurifyHTML, { setParser } from 'purify-html';



const sanitizer = new PurifyHTML(options);

$3

javascript

setParser({

  parse(str: string): Element

  stringify(elem: Element): string

}): void





See here for details.








$3



####

sanitize(string): string





Performs string cleanup according to the rules passed in

options

javascript

import PurifyHTML, { setParser } from 'purify-html';



const sanitizer = new PurifyHTML(options);



const untrustedString = '...';



console.log(sanitizer.sanitize(untrustedString));










####

toHTMLEntities(string): string





Coerces a string to HTML entities. Very similar to escaping, but for the HTML interpreter.

In HTML, such characters will be rendered "as is".

See more here.

javascript

const str = '
';



console.log(

  sanitizer.toHTMLEntities(str); // => '<br />', displays at the page like '
'

);





---



$3



An array with rules for the sanitizer.

typescript

// Deprecated!

type AttributeRulePresetName =

  | '%correct-link%'

  | '%http-link%'

  | '%https-link%'

  | '%ftp-link%'

  | '%https-link-without-search-params%'

  | '%http-link-without-search-params%'

  | '%same-origin%';



interface AttributeRule = {

  // attribute name

  name: string;



  // rules for attribute value

  value?:

    | string

    | string[]

    | RegExp

    | { preset: AttributeRulePresetName } // Deprecated!

    | ((attributeValue: string) => boolean);



};



interface TagRule = {

  // tagname

  name: string;



  // rules for attributes

  attributes: AttributeRule[];



  // Don't remove comments in THIS node.

  // Comments in children nodes will be saved

  dontRemoveComments?: boolean;

  /*

    Example:



    Config

    [

      { name: 'div', dontRemoveComments: true },

      'span'

    ]



    Input:

    

      

        

    



    Output:

    

      

        

    

  */

};

javascript

import PurifyHTML, { setParser } from 'purify-html';



const sanitizer = new PurifyHTML([

  'hr',

  { name: 'br' },

  { name: 'img', attributes: [{ name: 'src' }] },

  {

    name: 'a',

    attributes: [

      { name: 'target', value: ['_blank', '_self', '_parent', '_top'] },

      { name: 'href', value: /^https?:\/\/*/ },

    ],

  },

]);





NOTE When using regular expressions to check for untrusted strings, don't forget to check your regular expressions for ReDoS vulnerabilities.

_A successful exploit of the ReDoS vulnerability is to cause the program to hang when trying to parse a specially crafted string._ 


See more: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS



Examples



via bundler:

javascript

import PurifyHTML from 'purify-html';



const allowedTags = [

  // only string

  'hr',



  // as object

  { name: 'br' },



  // attributes check

  { name: 'b', attributes: ['class'] },



  // advanced attributes check

  { name: 'p', attributes: [{ name: 'class' }] },



  // check attributes values (string)

  { name: 'strong', attributes: [{ name: 'id', value: 'awesome-strong-id' }] },



  // check attributes values (RegExp)

  { name: 'em', attributes: [{ name: 'id', value: /awesome-em-id?s/ }] },



  // check attributes values (array of strings)

  {

    name: 'em',

    attributes: [

      { name: 'id', value: ['awesome-strong-id', 'other-awesome-strong-id'] },

    ],

  },



  // check attribute value (function)

  {

    name: 'em',

    attributes: [{ name: 'id', value: value => value.startsWith('awesome-') }],

  },



  // use attributes checks preset (Deprecated)

  {

    name: 'a',

    attributes: [{ name: 'href', value: { preset: '%https-link%' } }], // presets are deprecated

  },

];



const sanitizer = new PurifyHTML(allowedTags);



const dangerString =

img src="1" onerror="alert(1)">

Bold
Bold
Bold

123

;



const safeString = sanitizer.sanitize(dangerString);



console.log(safeString);



/*

  <img src="1" onerror="alert(1)">

  



  Bold

  Bold

  Bold

  123

  

  

*/





Try it.



---



$3

html





Usage for the browser is slightly different from usage with faucets. This is bad, but it had to be done in order not to clog the global scope.



---



HTML comments



$3



For example the line:

.



Technically, inserting it into the DOM will not lead to code execution, but it cannot be considered safe either. The result of the

sanitize

 method is declared to be sanitized using the rules specified when the sanitizer was initialized.



> NOTE


> CDATA comments (or CDATA sections), although made for XML, are also supported in HTML. In

purify-html

 they are treated as regular HTML comments.
 See more about CDATA Section on MDN



Therefore, you are given the opportunity to control in which places you will leave comments, and in which not.



$3



By default, HTML comments are stripped. You can change it like this:

javascript

import PurifyHTML from 'purify-html';



const sanitizer = new PurifyHTML(['#comments' / ... /]);



sanitizer.sanitize(/ ... /);





If you want comments to be removed everywhere except for specific tags, then you can specify it like this:

javascript

import PurifyHTML from 'purify-html';



const rules = ['#comments', { name: 'div', dontRemoveComments: true }];

const sanitizer = new PurifyHTML(rules);



sanitizer.sanitize(/ ... /);





node-js



When used in an environment where the standard DOMParser is absent, you need to install a parser manually.



For example:

js

import { JSDOM } from 'jsdom';

global.DOMParser = new JSDOM().window.DOMParser;



import PurifyHTML from 'purify-html';

const sanitizer = new PurifyHTML(); // works

Or

js

import { JSDOM } from 'jsdom';

import PurifyHTML, { setParser } from 'purify-html';



// Scope elem variable, reuse DOMParser instance for performance

{

  const elem: Element = new DOMParser()

    .parseFromString('', 'text/html')

    .querySelector('body');



  // Set methods

  setParser({

    parse(string: string): Element {

      elem.innerHTML = string;



      return elem;

    },

    stringify(elem: Element): string {

      return elem.innerHTML;

    },

  });

}





setParser



In some cases, you may want to be able to use your parser instead of DOMParser.



This can be done like this:

javascript

import PurifyHTML, { setParser } from 'purify-html';



setParser({

  parse(HTMLString: string): HTMLElement {

    // ...

  },

  stringify(element: HTMLElement): string {

    // ...

  },

});



const sanitizer = new PurifyHTML();

// ...





Note! The root element will be passed to the

stringify

 function, and the CONTENT of the element will be expected as a result.

javascript

const input = document.createElement('div');



input.innerHTML = 'span';



stringify(input); // 'span' => OK

stringify(input); // 'span' => WRONG

---

`Why not` document.createElement(...)





Because by processing the string like this:

javascript

const parse = str => {

  const node = document.createElement('div');

  div.innerHTML = str;



  return div;

};





In fact, this function, having received a special payload, will RUN it. The following payload will send a network request:





And in the case of using DOMParser, the code does not run.



---



Attributes checks preset list



-

%correct-link%

 - only currect link

-

%http-link%

 - only http link

-

%https-link%

 - only https link

-

%ftp-link%

 - only ftp link

-

%https-link-without-search-params%

 - delete all search params and force https protocol

-

%http-link-without-search-params%

 - delete all search params and force https protocol

-

%same-https-origin% - only link that lead to the same origin that is currently in self.location.origin

. + force https protocol

-

%same-http-origin% - only link that lead to the same origin that is currently in self.location.origin`. + force http protocol

---

purify-html

purify-html

Install

API

$3

$3

$3

Examples

$3

HTML comments

$3

$3

node-js

setParser

Why not document.createElement(...)

Attributes checks preset list

Browser support

purify-html

purify-html

Install

API

$3

$3

$3

Examples

$3

HTML comments

$3

$3

node-js

setParser

Why not document.createElement(...)

Attributes checks preset list

Browser support

`Why not` document.createElement(...)

`Why not` document.createElement(...)