# FileSqueeze

A file compression tool that uses Huffman coding.

```bash
npm install filesqueeze
```
FileSqueeze supports `.txt`, `.json`, `.docx`, and `.pdf` files. It uses the Huffman coding technique to reduce file size by encoding characters based on their frequency of occurrence in the source data.

## Supported Formats

- `.txt`
- `.json`
- `.docx`
- `.pdf` (only the text is compressed; embedded images are not)

## Prerequisites

- npm or yarn for managing packages

## Installation
1. Clone the repository:

   ```bash
   git clone https://github.com/HUMBLEF0OL/file-squeeze.git
   ```

2. Navigate to the project directory:

   ```bash
   cd file-squeeze
   ```

3. Install the required dependencies:

   ```bash
   npm install
   ```
## Usage
1. **Compress a file**

   Use the `filesqueeze` command with the `compress` option to compress a file.

   ```bash
   filesqueeze compress <input-file> [--output <output-dir>]
   ```

   - `<input-file>`: The file to be compressed (e.g., `sample.txt`).
   - `--output <output-dir>`: The directory to store the compressed files (defaults to `./output`).
2. **Decompress a file**

   To decompress a previously compressed file, use the `decompress` command.

   ```bash
   filesqueeze decompress <input-dir> [--output <output-dir>]
   ```

   - `<input-dir>`: The directory containing the compressed files (`encoded.bin` and `metaData.bin`).
   - `--output <output-dir>`: The directory to store the decompressed files (defaults to `./output`).
## Compression Report
The project generates a compression report for each file processed. The report includes:
- Original File Size: Size of the file before compression.
- Compressed File Size: Size of the file after compression.
- Compression Ratio: The ratio of the original file size to the compressed file size.
- Time Taken: Time spent to process and compress the file.
You can view the results in the console after the compression completes.
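The metrics in the report relate to each other as follows — a minimal sketch, not FileSqueeze's actual implementation (`buildReport` is a hypothetical helper):

```javascript
// Hypothetical helper showing how the report metrics are derived;
// not the tool's real code.
function buildReport(originalBytes, compressedBytes, elapsedMs) {
  return {
    originalSize: `${(originalBytes / 1024).toFixed(2)} KB`,
    compressedSize: `${(compressedBytes / 1024).toFixed(2)} KB`,
    // Ratio greater than 1 means the file shrank.
    compressionRatio: +(originalBytes / compressedBytes).toFixed(3),
    timeTaken: `${elapsedMs} ms`,
  };
}

console.log(buildReport(90 * 1024, 48 * 1024, 12));
// compressionRatio: 1.875
```

Note that the ratio is original size divided by compressed size, so larger is better.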
## Compression Algorithm Overview

1. **Frequency Analysis**: The algorithm starts by analyzing the frequency of each character in the input file.
2. **Priority Queue**: A priority queue (min-heap) is built from the frequency data, ensuring that the least frequent characters are processed first.
3. **Tree Construction**: The Huffman tree is built by repeatedly merging the two nodes with the lowest frequencies into a parent node, until only one node (the root) remains.
4. **Code Assignment**: Once the tree is built, a binary code is assigned to each character based on its position in the tree. More frequent characters sit closer to the root and therefore receive shorter codes, which is what makes the compression effective.
5. **Tree Serialization**: The Huffman tree is serialized and saved in binary format for use during decompression.
6. **Encoding**: The input text is encoded using the generated Huffman codes. Both the compressed data and the metadata (the serialized Huffman tree) are saved to files.
7. **Decompression**: The decompression process reads the serialized Huffman tree and decodes the compressed data back into its original form.
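The steps above can be sketched in a few lines of JavaScript. This is an illustrative toy implementation, not FileSqueeze's actual code — it uses a sorted array in place of a real min-heap and skips serialization:

```javascript
// Step 1: frequency analysis.
function countFrequencies(text) {
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);
  return freq;
}

// Steps 2-3: repeatedly merge the two lowest-frequency nodes.
// A re-sorted array stands in for the min-heap to keep the sketch short.
function buildTree(freq) {
  const nodes = [...freq].map(([ch, f]) => ({ ch, f }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.f - b.f);
    const [left, right] = nodes.splice(0, 2);
    nodes.push({ f: left.f + right.f, left, right });
  }
  return nodes[0];
}

// Step 4: walk the tree; left edge appends "0", right edge appends "1".
function assignCodes(node, prefix = '', codes = {}) {
  if (node.ch !== undefined) {
    codes[node.ch] = prefix || '0'; // degenerate single-character input
  } else {
    assignCodes(node.left, prefix + '0', codes);
    assignCodes(node.right, prefix + '1', codes);
  }
  return codes;
}

// Step 6: encode each character as its bit string.
function encode(text, codes) {
  return [...text].map((ch) => codes[ch]).join('');
}

// Step 7: decode by walking the tree bit by bit.
function decode(bits, root) {
  let out = '';
  let node = root;
  for (const bit of bits) {
    node = bit === '0' ? node.left : node.right;
    if (node.ch !== undefined) {
      out += node.ch;
      node = root;
    }
  }
  return out;
}
```

Because Huffman codes are prefix-free, the decoder never needs look-ahead: each walk from the root ends at exactly one character.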
## Example

Given an input file containing:

```txt
hello world
```

After running the `compress` command:

- The file is compressed into a binary file (`encoded.bin`), and metadata is saved in a separate file (`metaData.bin`).
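For this input, the first step of the algorithm (frequency analysis) produces the following counts — a quick illustration, independent of the tool's internals:

```javascript
// Character frequencies for "hello world" (step 1 of the algorithm).
const sample = 'hello world';
const counts = {};
for (const ch of sample) counts[ch] = (counts[ch] || 0) + 1;

console.log(counts);
// { h: 1, e: 1, l: 3, o: 2, ' ': 1, w: 1, r: 1, d: 1 }
```

The most frequent character, `l`, will end up nearest the root of the Huffman tree and so receive the shortest code.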
## Metrics Example

- Original File Size: 90 KB
- Compressed File Size: 48 KB
- Compression Ratio: 1.875 (original size / compressed size)
## Contributing

If you'd like to contribute to this project, feel free to open a pull request. For bug reports or suggestions, please create an issue in the GitHub repository.

## License

This project is licensed under the MIT License.

## Acknowledgments

- The core compression algorithm is based on the Huffman coding technique. You can read more about it here: [Huffman coding - Wikipedia](https://en.wikipedia.org/wiki/Huffman_coding).
- Special thanks to libraries like `pdf-lib` and `pdf-parse` for PDF text extraction and manipulation.