pdftojson
=========
 
pdftojson is a pdftotext wrapper that generates JSON with bounding box data. It takes care of overlapping duplicated characters, which often exist in MS-Word-generated PDF files with floating images and text.

Why bother with a wrapper for pdftotext?
------------------------------
Consider this PDF file:
`pdftotext -bbox theFile.pdf` would generate this:

```html
...
...
```

pdftotext does a great job "undoing" the physical layout (columns, hyphenation, etc.) of a PDF document. However, its output contains some overlapping and duplicate words. PDF layout engines sometimes generate these quirks when images and text are mixed within a page.
On the other hand, `pdftojson theFile.pdf` could generate this:

```js
...
  {
    "xMin": 103.2,
    "xMax": 348.29439,
    "yMin": 547.3557,
    "yMax": 561.32172,
    "text": "(6)綠線 G01 站延伸至大溪、龍潭先進公"
  },
  {
    "xMin": 124.68,
    "xMax": 320.813062,
    "yMin": 572.3757,
    "yMax": 586.34172,
    "text": "共運輸系統發展委託可行性研究"
  }
...
```

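To make "overlapping duplicates" concrete, here is a minimal sketch of one way such entries could be detected from their bounding boxes. It is not taken from pdftojson's source and may differ from what the tool actually does; the helper names `overlapRatio` and `dropOverlappingDuplicates`, the `0.8` threshold, and the sample words are all assumptions made for illustration.

```js
// Hypothetical helper: share of the smaller box's area covered by the intersection.
function overlapRatio(a, b) {
  const w = Math.min(a.xMax, b.xMax) - Math.max(a.xMin, b.xMin);
  const h = Math.min(a.yMax, b.yMax) - Math.max(a.yMin, b.yMin);
  if (w <= 0 || h <= 0) return 0; // the boxes do not intersect
  const area = (r) => (r.xMax - r.xMin) * (r.yMax - r.yMin);
  return (w * h) / Math.min(area(a), area(b));
}

// Keep the first occurrence of each word; drop later words with the same
// text whose box overlaps an already-kept word by more than `threshold`.
function dropOverlappingDuplicates(words, threshold = 0.8) {
  const kept = [];
  for (const word of words) {
    const isDuplicate = kept.some(
      (k) => k.text === word.text && overlapRatio(k, word) > threshold
    );
    if (!isDuplicate) kept.push(word);
  }
  return kept;
}

// Made-up sample: the second entry repeats the first at almost the same position.
const sampleWords = [
  { xMin: 100, xMax: 150, yMin: 50, yMax: 62, text: 'Hello' },
  { xMin: 100.2, xMax: 150.2, yMin: 50.1, yMax: 62.1, text: 'Hello' },
  { xMin: 160, xMax: 210, yMin: 50, yMax: 62, text: 'world' },
];

console.log(dropOverlappingDuplicates(sampleWords).map((w) => w.text));
```

Comparing both text and geometry avoids dropping a word that legitimately appears twice at different positions on the page.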
Install
-------
```
$ npm install pdftojson
```

pdftojson uses pdftotext. Please make sure pdftotext is available in PATH.
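
If you want to verify this requirement from a script before calling pdftojson, a check along the following lines may help. This is only a sketch: `isOnPath` is a hypothetical helper, not part of pdftojson, and it inspects only whether the process could be spawned, since exit codes for `-v` may vary between pdftotext builds.

```js
const { spawnSync } = require('child_process');

// Hypothetical helper: returns true if `command` can be launched.
// A missing executable surfaces as `result.error` (typically ENOENT).
function isOnPath(command, args = ['-v']) {
  const result = spawnSync(command, args, { stdio: 'ignore' });
  return !result.error;
}

if (!isOnPath('pdftotext')) {
  console.error('pdftotext was not found in PATH.');
}
```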
Usage
-----
pdftojson is available as a command-line tool and a Node.js library.

```
# outputs some.json
$ pdftojson some.pdf
```

The library exposes a single function that takes the name of a PDF file
and returns a promise.

```js
import pdftojson from 'pdftojson';

pdftojson("./some.pdf").then((output) => {
  // output is a JavaScript object.
});
```

The resolved output is an array of pages with the following structure. All numeric values are in `pt`.

```js
[
  { // Page
    width: (Number) page width,
    height: (Number) page height,
    words: [
      {
        text: (String) the text enclosed in the bounding box,

        // All coordinates are calculated from the top-left corner of the page
        xMin: (Number) left edge of the bounding box,
        xMax: (Number) right edge of the bounding box,
        yMin: (Number) top edge of the bounding box,
        yMax: (Number) bottom edge of the bounding box
      }, // ...
    ]
  }, // ...
]
```
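
As an illustration of working with this structure, the sketch below selects the words of a page that fall inside a rectangular region. `wordsInRegion` is a hypothetical helper, not part of pdftojson. The `output` array is a hand-written stand-in for a real result: the page size is assumed (A4 in pt), and the two words are the entries from the example output earlier in this README.

```js
// Hypothetical helper: words of one page whose boxes lie entirely inside `region`.
function wordsInRegion(page, region) {
  return page.words.filter(
    (w) =>
      w.xMin >= region.xMin &&
      w.xMax <= region.xMax &&
      w.yMin >= region.yMin &&
      w.yMax <= region.yMax
  );
}

// Stand-in for the resolved output; width and height are assumed values.
const output = [
  {
    width: 595,
    height: 842,
    words: [
      { xMin: 103.2, xMax: 348.29439, yMin: 547.3557, yMax: 561.32172, text: '(6)綠線 G01 站延伸至大溪、龍潭先進公' },
      { xMin: 124.68, xMax: 320.813062, yMin: 572.3757, yMax: 586.34172, text: '共運輸系統發展委託可行性研究' },
    ],
  },
];

// A horizontal band between y = 570 and y = 590, measured from the top of the page.
const band = { xMin: 0, xMax: output[0].width, yMin: 570, yMax: 590 };
console.log(wordsInRegion(output[0], band).map((w) => w.text));
```

Because coordinates start at the top-left corner, a larger `yMin` means the word sits lower on the page.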