Document processing library with Google Cloud Vision OCR, text extraction from PDFs and images, and document parsing utilities. Includes text normalization and a base parser framework.
```bash
npm install @egintegrations/document-services
```

For Google Cloud Vision OCR:

```bash
npm install @google-cloud/vision
```
- Google Cloud Vision OCR: Extract text from images and PDFs
- Text Processing: Normalize whitespace, clean OCR artifacts, extract dates/amounts
- TypeScript: Full type safety
```typescript
import { GoogleVisionOCR } from '@egintegrations/document-services';

const ocr = new GoogleVisionOCR({
  credentials: {
    client_email: process.env.GCP_CLIENT_EMAIL,
    private_key: process.env.GCP_PRIVATE_KEY,
  },
  projectId: process.env.GCP_PROJECT_ID,
});

// Extract text from image
const result = await ocr.extractText({
  data: imageBuffer,
  mimeType: 'image/jpeg',
  filename: 'receipt.jpg',
});

if (result.success) {
  console.log(result.extractedText);
}

// Extract text from PDF
const pdfResult = await ocr.extractText({
  data: pdfBuffer,
  mimeType: 'application/pdf',
  filename: 'invoice.pdf',
});
```
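The `extractText` calls above take a `data` buffer, a `mimeType`, and a `filename`. As a minimal sketch of how those three fields could be built from a file on disk, the helper below reads the file and picks a MIME type from its extension. `MIME_TYPES`, `mimeTypeFor`, and `loadDocument` are hypothetical names that are not part of this package, and the list of extensions is an assumption rather than a statement of what the OCR backend accepts.

```typescript
import { readFile } from 'node:fs/promises';
import { basename, extname } from 'node:path';

// Hypothetical lookup table (assumption): extend it with the formats you actually send.
const MIME_TYPES: Record<string, string> = {
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.png': 'image/png',
  '.pdf': 'application/pdf',
};

// Hypothetical helper: maps a file name to a MIME type, throwing on unknown extensions.
function mimeTypeFor(filename: string): string {
  const mimeType = MIME_TYPES[extname(filename).toLowerCase()];
  if (!mimeType) {
    throw new Error(`Unsupported file type: ${filename}`);
  }
  return mimeType;
}

// Hypothetical helper: reads a file and returns an object shaped like the
// argument passed to extractText in the example above.
async function loadDocument(path: string) {
  return {
    data: await readFile(path),
    mimeType: mimeTypeFor(path),
    filename: basename(path),
  };
}
```

With this in place, a call such as `ocr.extractText(await loadDocument('./receipt.jpg'))` would replace the hand-written object literal, assuming the input shape matches the one shown in the example.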
```typescript
import {
  cleanOCRText,
  extractAmounts,
  extractDates,
  extractLines,
} from '@egintegrations/document-services';

const rawText = 'Total: $ 100 . 00 Date: 01 / 15 / 2023';

// Clean OCR artifacts
const cleaned = cleanOCRText(rawText);
// "Total: $100.00 Date: 01/15/2023"

// Extract amounts
const amounts = extractAmounts(cleaned);
// [100.00]

// Extract dates
const dates = extractDates(cleaned);
// ['01/15/2023']

// Extract lines
const lines = extractLines(cleaned);
```
```typescript
interface OCRConfig {
  credentials?: {
    client_email: string;
    private_key: string;
  };
  projectId?: string;
}

class GoogleVisionOCR {
  constructor(config: OCRConfig);
  // Resolves to a result object exposing `success` and `extractedText`
  extractText(document: DocumentInput): Promise;
}
```
- `normalizeWhitespace(text: string): string` - Normalize whitespace
- `cleanOCRText(text: string): string` - Clean OCR artifacts
- `extractLines(text: string): string[]` - Extract non-empty lines
- `extractAmounts(text: string): number[]` - Extract currency amounts
- `extractDates(text: string): string[]` - Extract dates (MM/DD/YYYY)
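To make the behavior described in this list concrete, here is a rough sketch of how helpers like these could be written with regular expressions. These are not the library's implementations: the `sketch*` names are hypothetical, the patterns are assumptions (US-style `$` amounts and `MM/DD/YYYY` dates only), and the real functions may handle more cases.

```typescript
// Illustrative only: these are NOT the library's implementations.

// Collapse every run of whitespace into one space and trim the ends.
function sketchNormalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}

// Split on line breaks, trim each line, and drop the empty ones.
function sketchExtractLines(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Find two-digit/two-digit/four-digit sequences such as 01/15/2023.
// The month and day ranges are not validated.
function sketchExtractDates(text: string): string[] {
  return text.match(/\b\d{2}\/\d{2}\/\d{4}\b/g) ?? [];
}

// Find dollar amounts such as $100.00 or $1,234.50 and convert them to numbers.
function sketchExtractAmounts(text: string): number[] {
  const matches = text.match(/\$\d[\d,]*(?:\.\d{2})?/g) ?? [];
  return matches.map((m) => Number(m.replace(/[$,]/g, '')));
}
```

Returning amounts as `number` follows the signature in the list; for accounting work it may be safer to keep the matched strings and parse them with a decimal library.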
For Google Cloud Vision:
- `GCP_CLIENT_EMAIL` - Service account email
- `GCP_PRIVATE_KEY` - Service account private key
- `GCP_PROJECT_ID` - Google Cloud project ID
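One way to turn these variables into a config object is sketched below. `configFromEnv` is a hypothetical helper, not part of this package, and the local `OCRConfig` interface simply mirrors the shape documented above. It fails fast when the credentials are missing (a design choice of this sketch, since the documented config marks them optional) and converts literal `\n` sequences in the private key into real newlines, which is often needed when a PEM key is stored in a single-line environment variable.

```typescript
// Mirrors the OCRConfig shape documented above so the sketch is self-contained.
interface OCRConfig {
  credentials?: {
    client_email: string;
    private_key: string;
  };
  projectId?: string;
}

// Hypothetical helper: builds an OCRConfig from an environment-like object.
function configFromEnv(env: Record<string, string | undefined>): OCRConfig {
  const { GCP_CLIENT_EMAIL, GCP_PRIVATE_KEY, GCP_PROJECT_ID } = env;

  // Assumption of this sketch: treat missing credentials as an error.
  if (!GCP_CLIENT_EMAIL || !GCP_PRIVATE_KEY) {
    throw new Error('GCP_CLIENT_EMAIL and GCP_PRIVATE_KEY must be set');
  }

  return {
    credentials: {
      client_email: GCP_CLIENT_EMAIL,
      // Replace escaped "\n" sequences with real newlines in the PEM key.
      private_key: GCP_PRIVATE_KEY.replace(/\\n/g, '\n'),
    },
    projectId: GCP_PROJECT_ID,
  };
}
```

The result could then be passed to the constructor, for example `new GoogleVisionOCR(configFromEnv(process.env))`.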
License: MIT
Extracted from BRS-Inbox-Scanner with Google Cloud Vision OCR integration.