Very fast link checker for CI.
A command-line tool to find broken links in your static site.

* Fast. docs.sentry.io produces
1.1 GB of HTML files. hyperlink handles this amount of data in 4 seconds on
a MacBook Pro 2018. See Alternatives for a performance comparison.
* Pay for what you need. By default, hyperlink checks for hard 404s in
internal links only. Anything beyond that is opt-in. See Options
for a list of features to enable.
* Maps errors back to source files. If your static site was created from
Markdown files, hyperlink can try to find the original broken link by
fuzzy-matching the content around it. See the --sources option.
* Traverses file-system paths only, no arbitrary URLs. hyperlink does not know
  how to make network calls. However, it does have tools to extract external
  links (see External links below).
* Does not honor robots.txt. A broken link is still broken for users even if
not indexed by Google.
* Does not parse CSS files, as broken links in CSS have not been a practical
  concern for us. We are concerned about broken links in the page content, not
  the chrome around it.
* Only supports UTF-8 encoded HTML files.

Download the latest binary and:

```bash
# Check a folder of HTML
./hyperlink public/
```

As a GitHub action

```yaml
- uses: untitaker/hyperlink@0.2.0
  with:
    args: public/ --sources src/
```

Via npm

```bash
npm install -g @untitaker/hyperlink
hyperlink public/ --sources src/
```

Via Docker

```bash
docker run -v $PWD:/check ghcr.io/untitaker/hyperlink:0.2.0 /check/public/ --sources /check/src/
# specific commit
docker run -v $PWD:/check ghcr.io/untitaker/hyperlink:sha-82ca78c /check/public/ --sources /check/src
```

From source

```bash
cargo install --locked hyperlink # latest stable release
cargo install --locked --git https://github.com/untitaker/hyperlink # latest git SHA
```

Options

When invoked without options, hyperlink only checks for 404s of internal
links. However, it can do more:

* -j/--jobs: How many threads to spawn for parsing HTML. By default hyperlink
  will attempt to saturate your CPU.
* --check-anchors: Opt-in, check the validity of anchors on pages. Broken
  anchors are considered warnings, meaning that hyperlink will exit 2 if there
  are only broken anchors but no hard 404s.
* --sources: A folder of markdown files that were the input for the HTML
  hyperlink has to check. This is used to provide better error messages that
  point at the actual file to edit. hyperlink does very simple content-based
  matching to figure out which markdown files may have been involved in the
  creation of an HTML file.

  Why not just crawl and validate links in Markdown at this point? Answer:

  * There are countless proprietary extensions to markdown out there for
    creating intra-page links that are generally not supported by
    link-checking tools.
  * The structure of your markdown content does not necessarily match the
    structure of your HTML (i.e. what the user actually sees). With this
    setup, hyperlink does not have to assume anything about your build
    pipeline.
* --github-actions: Emit GitHub Actions errors, i.e. add error messages
  in-line to PR diffs. This is only useful with --sources set. If you are
  using hyperlink through the GitHub action, this option is already set; it
  only matters if you are downloading/building and running hyperlink yourself
  in CI.

Exit codes

* exit 1: There have been errors (hard 404s)
* exit 2: There have been only warnings (broken anchors)
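
For example, a CI step might distinguish the two exit codes like this (a
minimal sketch; the paths and messages are placeholders):

```bash
#!/bin/sh
# Fail the build on hard 404s (exit 1) but tolerate broken-anchor
# warnings (exit 2).
hyperlink public/ --check-anchors
status=$?
if [ "$status" -eq 1 ]; then
    echo "hyperlink found hard 404s" >&2
    exit 1
elif [ "$status" -eq 2 ]; then
    echo "hyperlink found only broken anchors, not failing the build" >&2
fi
```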

External links

hyperlink does not know how to check external links, but it gives you some
tools to extract them. Output is just the external URLs, separated by
newlines.

```
hyperlink dump-external-links build/
```

Output:

```
http://example.com/myurl
...
```

This allows you to plug in your own logic that fits the requirements for your
site (special handling for social networks, custom URI schemes, ...):
```bash
# filter for HTTP URLs and turn off all link-checking for our social media
# handles, as twitter.com is unreliable and we already know those links are correct.
hyperlink dump-external-links build/ | \
  rg '^https?://' | \
  rg -v '^https://twitter.com/untitaker' | \
  xargs -P20 -I{} bash -c 'curl -ILf "{}" &> /dev/null || (echo "{}" && exit 1)'
```

...and allows hyperlink to focus on its main job of traversing and parsing HTML.

Alternatives
*(roughly ranked by performance, determined by some unserious benchmark. This
section contains partially dated measurements and is not continuously updated
with regard to either performance or featureset.)*
None of the listed alternatives has an equivalent to hyperlink's --sources
and --github-actions features.

* lychee, like hyperlink, is a great choice for obscenely large static sites.
  Additionally it can check external/outbound links. An invocation of
  lychee --offline public/ is more or less equivalent to hyperlink public/.
* liche seems to be fairly fast, but is unmaintained.
* htmltest seems to be fairly fast as well, and is more of a general-purpose
  HTML linting tool.
* muffet seems to have performance similar to htmltest. We tested muffet with
  http-server and webfsd without noticing a change in timings.
* linkcheck is faster than linkchecker but still quite slow on large sites.
  We tried linkcheck together with http-server on localhost, although that
  does not seem to be the bottleneck at all.
* wummel/linkchecker seems to be fairly feature-rich, but was a non-starter
  due to performance. This applies to countless other link checkers we tried
  that are not mentioned here.

Redirects

Since 0.2.0, hyperlink supports reading configured redirects from a file. At
the root of your site, make a file _redirects:

```
# lines starting with # are ignored
/old-url.html /new-url.html
# on the next line, trailing data like the 301 status code is ignored
/old-url2.html /new-url2.html 301
# /old-url.html will become a valid link target.
# hyperlink will validate that /new-url.html exists.
```

This format is supported by at least Netlify, Codeberg Pages and Grebedoc.
References for this format can be found at Codeberg and Netlify.
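
As a quick sanity check, assuming hyperlink picks up _redirects from the root
of the checked folder as described above, a sketch like this should pass once
/new-url.html exists:

```bash
# Declare one redirect, then check the site. /old-url.html becomes a valid
# link target as long as /new-url.html actually exists in public/.
printf '/old-url.html /new-url.html\n' > public/_redirects
hyperlink public/
```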
The major things missing from the implementation are:

* hyperlink completely ignores any status codes or country code conditions.
  The only things it parses are the from and to paths; the rest is ignored.
* "Splat sources" (/articles/*) and "splat targets" (/posts/:splat) are not
  supported.
* Generally speaking, hyperlink does not support "pretty URLs", i.e. one
  cannot request /mypage and expect mypage.html to be loaded.

Testimonials

> We use Hyperlink to check for dead links on Graphviz's static-site user
> documentation, because:
>
> * Hyperlink is blazingly fast, checking 700 HTML pages in 220ms (default)
>   and 850ms (with --check-anchors).
> * Hyperlink's single-binary release, with no library dependencies, was
>   trivial to integrate into our continuous integration tests.
> * High coverage: Hyperlink immediately spotted over a thousand broken page
>   links within both `<a>` tags and HTML redirects, and a further 62 broken
>   anchor-links with --check-anchors.
> * Hyperlink's design decision to crawl only static files (avoiding HTTP)
>   avoids test flakiness from network requests, allowing me to confidently
>   block merging if Hyperlink reports an error.
>
> In conclusion, Hyperlink fills the "static site continuous testing" niche
> really nicely.
>
> -- Mark Hansen, Graphviz documentation maintainer

License

See ./LICENSE.