# idedupebox

> Image deduplication CLI tool: use perceptual hashing (phash) to deduplicate a directory of images
idedupebox is a simple tool that recursively deduplicates the images in a directory using perceptual hashing (phash), via the [sharp-phash](https://www.npmjs.com/package/sharp-phash) library.
Full parallelisation is on the todo list, but the tool is reasonably efficient as-is (tested on ~166K images, which were handled in about half an hour on a Ryzen 7900). If this is a deal breaker for you, please get in touch and I'll add better parallelisation (or, even better, open a PR?).
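The core idea behind phash deduplication can be sketched in a few lines: two images are considered duplicates when the Hamming distance between their perceptual hashes falls below a threshold. A minimal illustration in plain JavaScript (the hash strings and threshold here are made up for demonstration; idedupebox obtains real hashes via sharp-phash):

```javascript
// Hamming distance between two equal-length binary hash strings,
// as produced by perceptual hashing libraries such as sharp-phash.
function hammingDistance(a, b) {
	if (a.length !== b.length) throw new Error("Hash lengths differ");
	let dist = 0;
	for (let i = 0; i < a.length; i++)
		if (a[i] !== b[i]) dist++;
	return dist;
}

// Hypothetical 64-bit hashes of two near-identical images
const hashA = "1100101011110000".repeat(4);
const hashB = "1100101011110001".repeat(4);

// Treat images as duplicates when the distance is under a threshold
const THRESHOLD = 5;
console.log(hammingDistance(hashA, hashB) <= THRESHOLD); // true for these example hashes
```

Unlike cryptographic hashes, perceptual hashes of visually similar images differ in only a few bits, which is why a distance threshold (rather than exact equality) is used.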
## Installation

### npm (bundled by default)

Install via npm:

```bash
npm install -g idedupebox
```

...you may need to run the above command with `sudo`.

This will expose a command `idedupebox`.
### From source

Clone the repository:

```bash
git clone https://codeberg.org/sbrl/idedupebox.git;
cd idedupebox;
```
Then, install dependencies:
```bash
npm install
```
Now follow the getting started instructions below, replacing `idedupebox` with `src/index.mjs` - don't forget to `cd` into the repository's directory first.
## Getting started

idedupebox has 3 subcommands:

1. `dedupe`: Walks a directory recursively, hashing all images as it goes. Spits out a list of duplicate clusters in a `.jsonl` file.
2. `visualise`: Uses the `.jsonl` file from `dedupe` to create a subdirectory that hard-links all the clusters together into 1 folder for manual review.
3. `delete`: Deletes duplicates, leaving 1 image per cluster (careful to have a backup!).
These subcommands should be used in this order.
To get detailed help, run this command:
```bash
idedupebox --help
```
Generate a duplicates file for a directory:
```bash
idedupebox dedupe --dirpath path/to/dir --output /tmp/x/20301120-duplicates.jsonl
```
Visualise an existing duplicates file:
```bash
idedupebox visualise --verbose --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```
Backup a directory:
```bash
tar -caf 20301120-backup.tar.gz path/to/same_dir_as_above
```
Dry-run a deletion of duplicates:
```bash
idedupebox delete --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```

(note: add `--force` to actually delete the duplicates)
> [!NOTE]
> A built-in check ensures that the last file existing on disk in each cluster is never deleted. When a deletion is required, which file in a given duplicates cluster gets deleted is undefined: the candidates are shuffled with the Fisher–Yates algorithm.
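The keep-one-per-cluster behaviour described above can be sketched as follows. This is a simplified illustration of the idea, not the tool's actual code; `pickDeletions` is a hypothetical helper name:

```javascript
// Fisher–Yates shuffle (in place); returns the array for convenience.
function fisherYates(arr) {
	for (let i = arr.length - 1; i > 0; i--) {
		const j = Math.floor(Math.random() * (i + 1));
		[arr[i], arr[j]] = [arr[j], arr[i]];
	}
	return arr;
}

// Given a duplicates cluster, keep one randomly-chosen survivor and
// return the filepaths that would be deleted.
function pickDeletions(cluster) {
	const shuffled = fisherYates(cluster.filepaths.slice());
	return shuffled.slice(1); // index 0 is the survivor
}

// Hypothetical cluster matching the jsonl output structure
const cluster = { id: 0, filepaths: ["a.jpg", "b.jpg", "c.jpg"] };
console.log(pickDeletions(cluster).length); // 2
```

Because the shuffle is uniform, every file in a cluster is equally likely to be the survivor, which is why the choice is documented as undefined.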
## Output formats

There are a number of possible output formats. Their names (section headings) and example output structures are given below.
### `jsonl`

```jsonl
{ "id": number, "filepaths": string[] }
...
```
### `tsv`

```tsv
filepath	cluster	phash
path/to/cat.jpg	0	base64_here
...
```

## Contributing
Contributions are very welcome - both issues and pull requests! Please mention in your pull request that you release your work under the AGPL-3.0 (see below).

## Licence
idedupebox is released under the GNU Affero General Public License 3.0. The full license text is included in the `LICENSE` file in this repository. TLDRLegal has a great summary of the license if you're interested.