PDF Text Extract

Extract text from pdfs that contain searchable pdf text. The module is wrapper that calls the pdftotext command to perform the actual extraction

![Build Status](https://travis-ci.org/nisaacson/pdf-text-extract) ![Dependency Status](https://david-dm.org/nisaacson/pdf-text-extract)

Installation

bash
npm install --save pdf-text-extract

You will need thepdftotext binary available on your path. There are packages available for many different operating systems

See https://github.com/nisaacson/pdf-extract#osx for how to install the pdftotext command

`Usage`

`As a module`

extract(filePath, [options], [pdftotextcommand], callback)

Options and pdftotextcommand are not required.

`javascript var path = require('path') var filePath = path.join(__dirname, 'test/data/multipage.pdf') var extract = require('pdf-text-extract') extract(filePath, function (err, pages) { if (err) { console.dir(err) return } console.dir(pages) })`The output will be an array of where each entry is a page of text. If you want just a string of all pages you can set the option tosplitPages: false.

`javascript var filePath = path.join(__dirname, 'test/data/multipage.pdf') var extract = require('pdf-text-extract') extract(filePath, { splitPages: false }, function (err, text) { if (err) { console.dir(err) return } console.dir(text) })`

You can set the following options: -firstPage: First page to extract -lastPage: Last page to extract -resolution: in dpi, as is specified by pdftotext -r -crop: Should be an object { x:x, y:y, w:w, h:h } -layout: Should be either layout, raw or htmlmeta. Default: layout-encoding: Should be either UCS-2, ASCII7, Latin1, UTF-8, ZapfDingbats or Symbol. Default: UTF-8-eol: End of line convention. One of either: unix, dos or mac-ownerPassword: Owner password (for encrypted files) -userPassword: User password (for encrypted files) -splitPages: If true, the result will be and array of pages. Default: true.

If needed you can pass an optional arguments to the extract function. These will be passed to thechild_process.spawn call.

`javascript var filePath = path.join(__dirname, 'test/data/multipage.pdf') var extract = require('pdf-text-extract') var options = { cwd: "./" } extract(filePath, options, function (err, pages) { if (err) { console.dir(err) return } console.dir('extracted pages', pages) })`

You can also override the command for pdftotext if it is installed in a location that is not available in the PATH environment variable

`javascript var filePath = path.join(__dirname, 'test/data/multipage.pdf') var pdfToTextCommand = '/opt/bin/pdftotext' var extract = require('pdf-text-extract') var options = { cwd: "./" } extract(filePath, options, pdfToTextCommand, function (err, pages) { if (err) { console.dir(err) return } console.dir('extracted pages', pages) })`

`As a command line tool`

`bash npm install -g pdf-text-extract`

Execute with the filePath as an argument. Output will be json-formatted array of pages

`bash pdf-text-extract ./test/data/multipage.pdf

`outputs`


['', '']


Test

``bash

install dev dependencies

npm install

run tests

npm test

As a module

extract(filePath, [options], [pdftotextcommand], callback)

Options and pdftotextcommand are not required.

javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(pages)
})


The output will be an array of where each entry is a page of text. If you want just a string of all pages you can set the option to

splitPages: false.

javascript
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, { splitPages: false }, function (err, text) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(text)
})

You can set the following options:
-

firstPage

: First page to extract
-

lastPage

: Last page to extract
-

resolution

: in dpi, as is specified by pdftotext -r
-

crop

: Should be an object { x:x, y:y, w:w, h:h }
-

layout: Should be either layout, raw or htmlmeta. Default: layout

encoding: Should be either UCS-2, ASCII7, Latin1, UTF-8, ZapfDingbats or Symbol. Default: UTF-8

eol: End of line convention. One of either: unix, dos or mac

ownerPassword

: Owner password (for encrypted files)
-

userPassword

: User password (for encrypted files)
-

splitPages: If true, the result will be and array of pages. Default: true.


If needed you can pass an optional arguments to the extract function. These will be passed to the

child_process.spawn call.

javascript
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir('extracted pages', pages)
})

You can also override the command for pdftotext if it is installed in a location that is not available in the PATH environment variable

javascript
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var pdfToTextCommand = '/opt/bin/pdftotext'
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, pdfToTextCommand, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir('extracted pages', pages)
})