# idedupebox

> Image deduplication CLI tool: use perceptual hashing (phash) to deduplicate a directory of images
idedupebox is a simple tool that recursively deduplicates the images in a directory using perceptual hashing (phash), via the [sharp-phash](https://www.npmjs.com/package/sharp-phash) library.
Full parallelisation is on the todo list, but the tool is reasonably efficient as-is (tested on ~166K images, which were handled in about half an hour on a Ryzen 7900). If this is a deal breaker for you, please get in touch and I'll add better parallelisation (or, even better, open a PR?).
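The core idea behind phash deduplication can be sketched in a few lines: two images are considered duplicates when the Hamming distance between their perceptual hashes falls below a threshold. A minimal illustration in plain JavaScript (the hash strings and threshold here are made up for demonstration; idedupebox obtains real hashes via sharp-phash):

```javascript
// Hamming distance between two equal-length binary hash strings,
// as produced by perceptual hashing libraries such as sharp-phash.
function hammingDistance(a, b) {
	if (a.length !== b.length) throw new Error("Hash lengths differ");
	let dist = 0;
	for (let i = 0; i < a.length; i++)
		if (a[i] !== b[i]) dist++;
	return dist;
}

// Hypothetical 64-bit hashes of two near-identical images
const hashA = "1100101011110000".repeat(4);
const hashB = "1100101011110001".repeat(4);

// Treat images as duplicates when the distance is under a threshold
const THRESHOLD = 5;
console.log(hammingDistance(hashA, hashB) <= THRESHOLD); // true for these example hashes
```

Unlike cryptographic hashes, perceptual hashes of visually similar images differ in only a few bits, which is why a distance threshold (rather than exact equality) is used.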
## Installation

### npm (bundled by default)

Install via npm:

```bash
npm install -g idedupebox
```

...you may need to run the above command with `sudo`.

This will expose a command `idedupebox`.
### From source

Clone the repository:

```bash
git clone https://codeberg.org/sbrl/idedupebox.git;
cd idedupebox;
```
Then, install dependencies:
```bash
npm install
```
Now follow the getting started instructions below, replacing `idedupebox` with `src/index.mjs` - don't forget to `cd` into the repository's directory first.
## Getting started

idedupebox has 3 subcommands:

1. `dedupe`: Walks a directory recursively, hashing all images as it goes. Spits out a list of duplicate clusters in a `.jsonl` file.
2. `visualise`: Uses the `.jsonl` file from `dedupe` to create a subdirectory that hard-links all the clusters together into 1 folder for manual review.
3. `delete`: Deletes duplicates, leaving 1 image per cluster (careful to have a backup!).
These subcommands should be used in this order.
To get detailed help, run this command:
```bash
idedupebox --help
```
Generate a duplicates file for a directory:
```bash
idedupebox dedupe --dirpath path/to/dir --output /tmp/x/20301120-duplicates.jsonl
```
Visualise an existing duplicates file:
```bash
idedupebox visualise --verbose --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```
Backup a directory:
```bash
tar -caf 20301120-backup.tar.gz path/to/same_dir_as_above
```
Dry-run a deletion of duplicates:
```bash
idedupebox delete --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```

(note: add `--force` to actually delete the duplicates)
> [!NOTE]
> A built-in check ensures that the last file existing on disk in each cluster is never deleted. When a deletion is required, which file in a given duplicates cluster gets deleted is undefined: the candidates are shuffled with the Fisher–Yates algorithm.
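The keep-one-per-cluster behaviour described above can be sketched as follows. This is a simplified illustration of the idea, not the tool's actual code; `pickDeletions` is a hypothetical helper name:

```javascript
// Fisher–Yates shuffle (in place); returns the array for convenience.
function fisherYates(arr) {
	for (let i = arr.length - 1; i > 0; i--) {
		const j = Math.floor(Math.random() * (i + 1));
		[arr[i], arr[j]] = [arr[j], arr[i]];
	}
	return arr;
}

// Given a duplicates cluster, keep one randomly-chosen survivor and
// return the filepaths that would be deleted.
function pickDeletions(cluster) {
	const shuffled = fisherYates(cluster.filepaths.slice());
	return shuffled.slice(1); // index 0 is the survivor
}

// Hypothetical cluster matching the jsonl output structure
const cluster = { id: 0, filepaths: ["a.jpg", "b.jpg", "c.jpg"] };
console.log(pickDeletions(cluster).length); // 2
```

Because the shuffle is uniform, every file in a cluster is equally likely to be the survivor, which is why the choice is documented as undefined.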
## Output formats

There are a number of possible output formats. Their names (section headings) and example output structures are given below.
### `jsonl`

```jsonl
{ "id": number, "filepaths": string[] }
...
```
### `tsv`

```tsv
filepath	cluster	phash
path/to/cat.jpg	0	base64_here
...
```

## Contributing
Contributions are very welcome - both issues and pull requests! Please mention in your pull request that you release your work under the AGPL-3.0 (see below).

## Licence
idedupebox is released under the GNU Affero General Public License 3.0. The full license text is included in the `LICENSE` file in this repository. TLDRLegal has a great summary of the license if you're interested.