pdftojson
=========

![Build Status](https://travis-ci.org/MrOrz/pdftojson) ![Coverage Status](https://coveralls.io/github/MrOrz/pdftojson?branch=master)

pdftojson is a pdftotext wrapper that generates JSON with bounding box data. It takes care of overlapping, duplicated characters, which often exist in PDF files generated by MS Word that contain floating images and text.

Why bother with a wrapper for pdftotext?
------------------------------

Consider this PDF file:

(Image: PDF sample)

`pdftotext -bbox theFile.pdf` would generate this:

```html
...
(6)綠線 G01 G 站延伸伸至大溪、龍潭先進進公共運輸輸系統發展展委託可行行性研究
...
```

pdftotext does a great job "undoing" the physical layout (columns, hyphenation, etc.) of a PDF document. However, its output contains some overlapping, duplicated characters, such as the repeated 伸, 進, 輸, 展 and 行 above. PDF layout engines sometimes generate these quirks when images and text are mixed within a page.

On the other hand, `pdftojson theFile.pdf` would generate this:

```js
...
{
  "xMin": 103.2,
  "xMax": 348.29439,
  "yMin": 547.3557,
  "yMax": 561.32172,
  "text": "(6)綠線 G01 站延伸至大溪、龍潭先進公"
},
{
  "xMin": 124.68,
  "xMax": 320.813062,
  "yMin": 572.3757,
  "yMax": 586.34172,
  "text": "共運輸系統發展委託可行性研究"
}
...
```
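To make the idea of overlapping, duplicated characters concrete, here is a minimal sketch of one way such duplicates could be filtered using bounding boxes. It is not pdftojson's actual implementation: `overlapRatio`, `dropOverlappingDuplicates` and the 0.5 threshold are assumptions made up for illustration.

```js
// Hypothetical helper: share of the smaller box covered by the intersection.
function overlapRatio(a, b) {
  const w = Math.min(a.xMax, b.xMax) - Math.max(a.xMin, b.xMin);
  const h = Math.min(a.yMax, b.yMax) - Math.max(a.yMin, b.yMin);
  if (w <= 0 || h <= 0) return 0; // boxes do not intersect
  const area = (r) => (r.xMax - r.xMin) * (r.yMax - r.yMin);
  const smaller = Math.min(area(a), area(b));
  return smaller > 0 ? (w * h) / smaller : 0;
}

// Hypothetical helper: keep a box unless an already kept box has the
// same text and overlaps it by at least `threshold` (assumed value).
function dropOverlappingDuplicates(boxes, threshold = 0.5) {
  const kept = [];
  for (const box of boxes) {
    const isDuplicate = kept.some(
      (k) => k.text === box.text && overlapRatio(k, box) >= threshold
    );
    if (!isDuplicate) kept.push(box);
  }
  return kept;
}
```

Two boxes with the same text that merely sit next to each other do not overlap, so legitimately repeated characters survive this filter.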

Install
-------

`$ npm install pdftojson`

pdftojson uses pdftotext. Please make sure pdftotext is available in PATH.
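If you want to verify this requirement from Node.js before converting anything, a small check along the following lines may help. This is only a sketch: `isCommandAvailable` is a hypothetical helper, not part of pdftojson, and it assumes that running `pdftotext -v` is a harmless way to probe for the executable.

```js
const { spawnSync } = require('node:child_process');

// Hypothetical helper: true when `command` could be started,
// false when spawning failed (for example, not found in PATH).
function isCommandAvailable(command, args = ['-v']) {
  const result = spawnSync(command, args, { stdio: 'ignore' });
  // spawnSync sets result.error when the process could not be spawned.
  return !result.error;
}

if (!isCommandAvailable('pdftotext')) {
  console.error('pdftotext was not found in PATH; install it before using pdftojson.');
}
```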

Usage
-----

pdftojson is available as a command line tool and a Node.js library.

### Command line tool

```
# outputs some.json
$ pdftojson some.pdf

# converts pages 3 to 6 of some.pdf and outputs to some.json
$ pdftojson -c "-f 3 -l 6" some.pdf
```


### Node.js library

The library exposes a single function that takes the name of a PDF file
and returns a promise.

```js
import pdftojson from 'pdftojson';

pdftojson('./some.pdf').then((output) => {
  // output is a JavaScript object.
});
```
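As a follow-up, the sketch below shows one way the promise could be consumed with `async`/`await` to save the result as a JSON file. `convertToFile` is a hypothetical helper, not part of pdftojson; the converter is passed in as an argument, and the only assumption about it is the one stated above: it takes a PDF file name and returns a promise.

```js
const fs = require('node:fs/promises');

// Hypothetical helper: `convert` is any function that takes a PDF
// file name and returns a promise of the parsed output.
async function convertToFile(convert, pdfPath, jsonPath) {
  const output = await convert(pdfPath);
  await fs.writeFile(jsonPath, JSON.stringify(output, null, 2));
  return output.length; // number of top-level entries written
}
```

With the library loaded, this could be called as `convertToFile(pdftojson, './some.pdf', './some.json').catch(console.error)`.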

### Output format

All numeric values are in points (pt).

```js
[
  { // Page
    width: (Number) page width,
    height: (Number) page height,
    words: [
      {
        text: (String) the text enclosed in the bounding box,

        // All coordinates calculated from top-left corner of the page
        xMin: (Number) left edge of the bounding box,
        xMax: (Number) right edge of the bounding box,
        yMin: (Number) top edge of the bounding box,
        yMax: (Number) bottom edge of the bounding box
      },
      // ...
    ]
  },
  // ...
]
```
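Because every word carries its own bounding box, the output can be queried by position. The sketch below picks the words that fall inside a rectangle on a page. `wordsInRegion` is a hypothetical helper and the sample values are made up; only the field names come from the format described above.

```js
// Hypothetical helper: words on one page whose bounding box lies
// entirely inside `region` (same xMin/xMax/yMin/yMax convention).
function wordsInRegion(page, region) {
  return page.words.filter(
    (w) =>
      w.xMin >= region.xMin && w.xMax <= region.xMax &&
      w.yMin >= region.yMin && w.yMax <= region.yMax
  );
}

// Sample data in the documented shape (values invented for illustration).
const pages = [{
  width: 595,
  height: 842,
  words: [
    { text: 'Title', xMin: 72, xMax: 120, yMin: 50, yMax: 64 },
    { text: 'Body', xMin: 72, xMax: 110, yMin: 400, yMax: 414 },
  ],
}];

// Text found in the top quarter of the first page.
const page = pages[0];
const topQuarter = { xMin: 0, xMax: page.width, yMin: 0, yMax: page.height / 4 };
const header = wordsInRegion(page, topQuarter).map((w) => w.text).join(' ');
console.log(header); // prints "Title"
```

Since coordinates are measured from the top-left corner, a smaller `yMin` means the word sits higher on the page.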
