OCR Document Classification

Overview

The OCR Document Classification package provides a utility to classify documents based on their content. It uses OCR (Optical Character Recognition) to extract text from images and then determines the document type by matching extracted words with predefined target words using string similarity.
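To make the idea of matching by string similarity concrete, here is a minimal sketch of a bigram Sørensen-Dice coefficient, a common way to score how alike two words are. It is not the package's actual code: the helper names `bigrams`, `diceSimilarity`, and `fuzzyIncludes`, and the `SIMILARITY_THRESHOLD` value, are assumptions made for illustration.

```typescript
// Split a string into overlapping two-character chunks (bigrams),
// ignoring case and whitespace.
function bigrams(s: string): string[] {
  const normalized = s.toLowerCase().replace(/\s+/g, "");
  const out: string[] = [];
  for (let i = 0; i < normalized.length - 1; i++) {
    out.push(normalized.slice(i, i + 2));
  }
  return out;
}

// Sørensen-Dice coefficient over bigrams: 2 * |shared| / (|A| + |B|).
// Returns a value between 0 (nothing in common) and 1 (identical).
function diceSimilarity(a: string, b: string): number {
  const aGrams = bigrams(a);
  const bGrams = bigrams(b);
  if (aGrams.length === 0 || bGrams.length === 0) {
    // Strings shorter than two characters have no bigrams; compare directly.
    return a.toLowerCase() === b.toLowerCase() ? 1 : 0;
  }
  const counts = new Map<string, number>();
  for (const g of aGrams) {
    counts.set(g, (counts.get(g) ?? 0) + 1);
  }
  let shared = 0;
  for (const g of bGrams) {
    const remaining = counts.get(g) ?? 0;
    if (remaining > 0) {
      shared++;
      counts.set(g, remaining - 1);
    }
  }
  return (2 * shared) / (aGrams.length + bGrams.length);
}

// Hypothetical cut-off; the threshold the package uses is not documented here.
const SIMILARITY_THRESHOLD = 0.8;

// True if any OCR-extracted word is similar enough to the target word.
function fuzzyIncludes(extractedWords: string[], target: string): boolean {
  return extractedWords.some(
    (word) => diceSimilarity(word, target) >= SIMILARITY_THRESHOLD
  );
}
```

A similarity score rather than an exact comparison lets a word with a single misread character, such as `politiattcst`, still count as a match for `politiattest`.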

Installation

To install this package, use npm:

```bash
npm install ocr-document-classification
```

Usage

The main function exported by this package is classifyDocument. Below is a detailed guide on how to use it.

```typescript
import { classifyDocument } from "ocr-document-classification";
import type { documentDictionary } from "ocr-document-classification";
```


#### Parameters

- `file`: The image file (`File` object) of the document to be classified.
- `options` (optional): An object containing the following optional properties:
  - `onProgress`: A callback function to receive progress updates. It accepts a number between 0 and 100.
  - `customDocumentDictionary`: An object containing custom document types and their associated target words.
  - `maxNumPages`: A number specifying the maximum number of pages to process. Defaults to `Infinity`.

#### Returns

A Promise that resolves with an object containing:

- `classification`: The determined document type.
- `text`: The extracted text from the document.

The package ships with a few default classes that cover the most common documents. In the default dictionary, each key maps to several arrays of target words. A document matches a class when every word in at least one of that class's arrays is found in the text after OCR. You can also add your own classes by creating a `customDocumentDictionary`.

```typescript
const defaultDocumentDictionary: documentDictionary = {
  MILITÆRBEVIS: [
    ["førstegangstjeneste", "bevis", "avtjent"],
    ["attest", "førstegangstjeneste"],
    ["fullført", "førstegangstjeneste"],
  ],
  POLITIATTEST: [["politiattest", "politidistrikt"], ["police certificate"]],
  KOMPETANSEBEVIS: [["omfatter", "opplæring", "utdanningsprogram"]],
  LEGEERKLÆRING: [["legeerklæring", "fødselsnummer"]],
  BOSTEDSATTEST: [
    ["registrerte", "opplysninger", "folkeregisteret"],
    ["bostedsattest", "bostedsadresse", "registrert"],
    ["registrert", "adressehistorikk", "folkeregisteret"],
  ],
};
```
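
The matching rule for the dictionary (every word of at least one array must be found) can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `classifyText`, `DocumentDictionary`, and `exampleDictionary` are hypothetical names, plain case-insensitive substring checks stand in for the string-similarity comparison, and returning `null` when nothing matches is an assumption about the no-match case.

```typescript
// Hypothetical type mirroring the dictionary shape: each document type maps
// to a list of alternative word groups.
type DocumentDictionary = Record<string, string[][]>;

// Returns the first document type for which at least one word group is fully
// present in the text, or null if no type matches (assumed behaviour).
function classifyText(
  text: string,
  dictionary: DocumentDictionary
): string | null {
  const haystack = text.toLowerCase();
  for (const documentType of Object.keys(dictionary)) {
    const alternatives = dictionary[documentType];
    const matched = alternatives.some(
      (targetWords) =>
        targetWords.length > 0 &&
        // Every word in this group must occur somewhere in the text.
        targetWords.every((word) => haystack.includes(word.toLowerCase()))
    );
    if (matched) {
      return documentType;
    }
  }
  return null;
}

// A small excerpt in the same shape as the default dictionary.
const exampleDictionary: DocumentDictionary = {
  POLITIATTEST: [["politiattest", "politidistrikt"], ["police certificate"]],
  MILITÆRBEVIS: [["attest", "førstegangstjeneste"]],
};

const documentType = classifyText(
  "Politiattest utstedt av Oslo politidistrikt",
  exampleDictionary
);
console.log(documentType);
```

Listing several arrays per class lets one class cover different wordings of the same document, for example a Norwegian and an English version of a police certificate.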


Here is an example of how to use the package with a custom document dictionary in React:

```tsx
import React, { useState, useEffect } from "react";
import { classifyDocument } from "ocr-document-classification";

function UploadClassification() {
  const [documentFile, setDocumentFile] = useState<File | null>(null);
  const [classification, setClassification] = useState("");
  const [outputText, setOutputText] = useState("");
  const [progress, setProgress] = useState(0);

  // Store the first selected file, or null if the selection was cleared.
  const handleFileChange = (event: React.ChangeEvent<HTMLInputElement>) => {
    const file = event.target.files?.[0] ?? null;
    setDocumentFile(file);
  };

  // Custom class that is passed to classifyDocument via its options.
  const customDocumentDictionary = {
    Jobbsøknad: [["søknad", "stilling", "ledig"]],
  };

  useEffect(() => {
    console.log("Progress: ", progress);
  }, [progress]);

  useEffect(() => {
    // Clear the previous result before starting a new classification.
    resetOCR();
    if (documentFile) {
      classifyDocument(documentFile, {
        onProgress: setProgress,
        customDocumentDictionary: customDocumentDictionary,
      })
        .then(({ classification, text }) => {
          setClassification(classification);
          setOutputText(text);
        })
        .catch((err) => {
          console.error(err);
          setOutputText("Error during OCR processing");
        });
    }
  }, [documentFile]);

  function resetOCR() {
    setClassification("");
    setOutputText("");
    setProgress(0);
  }

  return (
    <>
      <input accept="image/jpeg, image/png" type="file" onChange={handleFileChange} />
      <div>
        <h2>OCR result</h2>
        <p>{classification ? outputText : "Loading ..."}</p>
        <p>{classification}</p>
      </div>
    </>
  );
}
export default UploadClassification;
```

Dependencies

This package relies on the following dependencies:

- `string-similarity-js`: For calculating the similarity between strings.
- `tesseract.js`: For performing OCR on the document image.
- `pdfjs-dist`: For handling PDFs.

LICENSE

This package is currently UNLICENSED.
