# node-tika #

Provides text extraction, metadata extraction, mime-type detection, text-encoding detection and language
detection, all via a native Java bridge to the Apache Tika content-analysis toolkit. Bundles Tika
1.13.

![Build Status](https://travis-ci.org/ICIJ/node-tika) ![npm version](https://badge.fury.io/js/tika)

Depends on node-java, which itself requires the JDK and Python 2 (not 3) to compile.

Requires JDK 7. Run `node version` to check the version that `node-java` is using. If the wrong version is
reported even though you installed JDK 1.7, make sure `JAVA_HOME` is set to the correct path, then delete `node_modules/java` and rerun `npm install`.
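Before deleting `node_modules/java` and reinstalling, it can help to confirm what `JAVA_HOME` the Node.js process actually sees. The following is a minimal sketch; `describeJavaHome` is a hypothetical helper and not part of node-tika or node-java.

```javascript
// Hypothetical helper, not part of node-tika: reports whether JAVA_HOME is
// set in the given environment object.
function describeJavaHome(env) {
  var javaHome = env.JAVA_HOME;
  if (!javaHome) {
    return 'JAVA_HOME is not set';
  }
  return 'JAVA_HOME is ' + javaHome;
}

console.log(describeJavaHome(process.env));
```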

## Extracting text ##

```javascript
var tika = require('tika');

var options = {

  // Hint the content-type. This is optional but would help Tika choose a parser in some cases.
  contentType: 'application/pdf'
};

tika.text('test/data/file.pdf', options, function(err, text) {
  console.log(text);
});
```

We can even extract directly from the Web. If the server returns a content-type header, it will be passed to Tika as a hint.

```javascript
tika.text('http://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf', function(err, text) {
  // ...
});
```

Or extract text using OCR (requires Tesseract).

```javascript
tika.text('test/data/ocr/simple.jpg', {
  ocrLanguage: 'eng'
}, function(err, text) {
  // ...
});
```
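When several sources need to be processed, the calls above can be chained one after another. The following is a minimal sketch under that assumption; `textAll` is a hypothetical helper, not part of node-tika, and it receives the module as an argument so that it only relies on the `tika.text(uri, cb)` form shown above.

```javascript
// Hypothetical helper, not part of node-tika: extracts text from each uri in
// turn and collects the results in order. Stops at the first error.
function textAll(tika, uris, cb) {
  var results = [];

  function next(index) {
    if (index === uris.length) {
      return cb(null, results);
    }
    tika.text(uris[index], function(err, text) {
      if (err) {
        return cb(err);
      }
      results.push(text);
      next(index + 1);
    });
  }

  next(0);
}

// Usage, assuming the module is installed:
// var tika = require('tika');
// textAll(tika, ['test/data/file.pdf', 'test/data/file.txt'], function(err, texts) {
//   console.log(texts.length);
// });
```

Running the extractions sequentially keeps the example simple; whether parallel calls are preferable depends on the workload.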

## API ##

All methods that accept a `uri` parameter accept relative or absolute file paths and `http:`, `https:` or `ftp:` URLs.
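To make the accepted forms concrete, here is a small sketch that classifies a value the same way the sentence above does. `describeUri` is a hypothetical helper for illustration only; it is not how node-tika itself decides how to open a source.

```javascript
var path = require('path');

// Illustration only, not part of node-tika: classifies a uri as one of the
// URL schemes listed above or as a relative or absolute file path.
function describeUri(uri) {
  var match = /^(https?|ftp):/i.exec(uri);
  if (match) {
    return { kind: 'url', scheme: match[1].toLowerCase() };
  }
  return { kind: 'file', absolute: path.isAbsolute(uri) };
}

console.log(describeUri('test/data/file.pdf'));
console.log(describeUri('http://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf'));
```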

The available options are the following.

- `contentType` to provide a hint to Tika on which parser to use.
- `outputEncoding` to specify the text output encoding. Defaults to UTF-8.
- `password` to set a password to be used for encrypted files.
- `maxLength` to specify the maximum number of characters to extract.

### OCR options ###

- `ocrLanguage` to set the language used by Tesseract. This option is required to enable OCR.
- `ocrPath` to set the path to the Tesseract binaries.
- `ocrMaxFileSize` to set the maximum file size in bytes to submit to OCR.
- `ocrMinFileSize` to set the minimum file size in bytes to submit to OCR.
- `ocrPageSegmentationMode` to set the Tesseract page segmentation mode.
- `ocrTimeout` to set the maximum time in seconds to wait for the Tesseract process to terminate.
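Because OCR is only enabled when `ocrLanguage` is set, it can be convenient to build OCR options through one function that insists on a language. The following is a sketch, not part of node-tika: `ocrOptions` is a hypothetical helper, and the timeout and size values are arbitrary examples.

```javascript
// Hypothetical helper, not part of node-tika: builds an options object that
// enables OCR. Only option names from the list above are used.
function ocrOptions(language, extra) {
  if (!language) {
    throw new Error('ocrLanguage is required to enable OCR');
  }
  var options = { ocrLanguage: language };
  Object.keys(extra || {}).forEach(function(key) {
    options[key] = extra[key];
  });
  return options;
}

var options = ocrOptions('eng', {
  ocrTimeout: 120, // Seconds to wait for Tesseract (example value).
  ocrMaxFileSize: 10 * 1024 * 1024 // Bytes (example value).
});
console.log(options);

// Usage, assuming the module is installed:
// var tika = require('tika');
// tika.text('test/data/ocr/simple.jpg', options, function(err, text) { /* ... */ });
```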

### PDF options ###

- `pdfAverageCharTolerance` see `PDFTextStripper.setAverageCharTolerance(float)`.
- `pdfEnableAutoSpace` to set whether the parser should estimate where spaces should be inserted between words (`true` by default).
- `pdfExtractAcroFormContent` to set whether content should be extracted from AcroForms at the end of the document (`true` by default).
- `pdfExtractAnnotationText` to set whether to extract text from annotations (`true` by default).
- `pdfExtractInlineImages` to set whether to extract inline embedded OBX images (`true` by default).
- `pdfExtractUniqueInlineImagesOnly` to set whether to extract each inline image only once, as multiple pages within a PDF file might refer to the same underlying image.
- `pdfSortByPosition` to set whether to sort text tokens by their x/y position before extracting text.
- `pdfSpacingTolerance` see `PDFTextStripper.setSpacingTolerance(float)`.
- `pdfSuppressDuplicateOverlappingText` to set whether the parser should try to remove duplicated text over the same region.
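With this many option names, a misspelled key is easy to miss. The sketch below checks an options object against the names listed above. `KNOWN_OPTIONS` and `unknownOptions` are hypothetical and not part of node-tika; the list is copied from this document and may not match every version of the module.

```javascript
// Option names copied from the lists above.
var KNOWN_OPTIONS = [
  'contentType', 'outputEncoding', 'password', 'maxLength',
  'ocrLanguage', 'ocrPath', 'ocrMaxFileSize', 'ocrMinFileSize',
  'ocrPageSegmentationMode', 'ocrTimeout',
  'pdfAverageCharTolerance', 'pdfEnableAutoSpace', 'pdfExtractAcroFormContent',
  'pdfExtractAnnotationText', 'pdfExtractInlineImages',
  'pdfExtractUniqueInlineImagesOnly', 'pdfSortByPosition',
  'pdfSpacingTolerance', 'pdfSuppressDuplicateOverlappingText'
];

// Hypothetical helper, not part of node-tika: returns the keys of `options`
// that do not appear in the list above, which usually indicates a typo.
function unknownOptions(options) {
  return Object.keys(options).filter(function(key) {
    return KNOWN_OPTIONS.indexOf(key) === -1;
  });
}

console.log(unknownOptions({ pdfSortByPosition: true, pdfSortByPostion: true }));
// Logs [ 'pdfSortByPostion' ].
```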

### tika.extract ###

Extract both text and metadata from a file.

```javascript
tika.extract('test/data/file.pdf', function(err, text, meta) {
  console.log(text); // Logs 'Just some text'.
  console.log(meta.producer[0]); // Logs 'LibreOffice 4.1'.
});
```
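The callback receives two result values, which matters when wrapping it in a promise: Node's `util.promisify` resolves with only the first one by default. The following is a minimal sketch of a hand-written wrapper that keeps both; `extractAsync` is a hypothetical helper, not part of node-tika.

```javascript
// Hypothetical helper, not part of node-tika: wraps the callback-style
// tika.extract(uri, cb) call in a Promise that resolves with both results.
function extractAsync(tika, uri) {
  return new Promise(function(resolve, reject) {
    tika.extract(uri, function(err, text, meta) {
      if (err) {
        reject(err);
        return;
      }
      resolve({ text: text, meta: meta });
    });
  });
}

// Usage, assuming the module is installed:
// var tika = require('tika');
// extractAsync(tika, 'test/data/file.pdf').then(function(doc) {
//   console.log(doc.text);
//   console.log(doc.meta.producer[0]);
// });
```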

### tika.text ###

Extract text from a file.

```javascript
tika.text('test/data/file.pdf', function(err, text) {
  console.log(text);
});
```

### tika.xhtml ###

Get an XHTML representation of the text extracted from a file.

```javascript
tika.xhtml('test/data/file.pdf', function(err, xhtml) {
  console.log(xhtml);
});
```

### tika.meta ###

Extract metadata from a file. Returns an object with names as keys and arrays as values.

```javascript
tika.meta('test/data/file.pdf', function(err, meta) {
  console.log(meta.producer[0]); // Logs 'LibreOffice 4.1'.
});
```
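Since every value is an array, code that only needs one value per name has to index into it each time. The following sketch flattens the object once instead; `firstValues` is a hypothetical helper, not part of node-tika, and it discards any additional values.

```javascript
// Hypothetical helper, not part of node-tika: keeps only the first value of
// each metadata entry, for cases where one value per name is enough.
function firstValues(meta) {
  var flat = {};
  Object.keys(meta).forEach(function(name) {
    flat[name] = meta[name][0];
  });
  return flat;
}

console.log(firstValues({ producer: ['LibreOffice 4.1'] }));
// Logs { producer: 'LibreOffice 4.1' }.
```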

### tika.type ###

Detect the content-type (MIME type) of a file.

```javascript
tika.type('test/data/file.pdf', function(err, contentType) {
  console.log(contentType); // Logs 'application/pdf'.
});
```
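Type detection combines naturally with text extraction, for example to skip files of unwanted types. The following is a sketch under that assumption; `textIfType` is a hypothetical helper, not part of node-tika, and it takes the module as an argument.

```javascript
// Hypothetical helper, not part of node-tika: detects the content-type first
// and only extracts text when the type is in the allowed list.
function textIfType(tika, uri, allowedTypes, cb) {
  tika.type(uri, function(err, contentType) {
    if (err) {
      return cb(err);
    }
    if (allowedTypes.indexOf(contentType) === -1) {
      // Not an allowed type: report the type but no text.
      return cb(null, null, contentType);
    }
    // Pass the detected type on as a parser hint.
    tika.text(uri, { contentType: contentType }, function(err, text) {
      cb(err, text, contentType);
    });
  });
}

// Usage, assuming the module is installed:
// var tika = require('tika');
// textIfType(tika, 'test/data/file.pdf', ['application/pdf'], function(err, text, type) {
//   console.log(type, text === null ? '(skipped)' : text);
// });
```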

### tika.charset ###

Detect the character set (text encoding) of a file.

```javascript
tika.charset('test/data/file.txt', function(err, charset) {
  console.log(charset); // Logs 'ISO-8859-1'.
});
```
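The detected name is a charset label, while Node.js `Buffer` and `fs` functions use their own encoding names, so a small translation step may be needed before reading the file directly. The following is a sketch; `NODE_ENCODINGS` and `toNodeEncoding` are hypothetical, and apart from `ISO-8859-1` the labels in the table are assumptions about what may be reported.

```javascript
// Assumed mapping from a few charset labels to Node.js Buffer encodings.
// Only 'ISO-8859-1' appears in the example above; the other labels are
// assumptions.
var NODE_ENCODINGS = {
  'ISO-8859-1': 'latin1',
  'US-ASCII': 'ascii',
  'UTF-8': 'utf8',
  'UTF-16LE': 'utf16le'
};

// Hypothetical helper, not part of node-tika: returns the Node.js encoding
// name for a charset label, or null when the label is not in the table.
function toNodeEncoding(charset) {
  var key = String(charset).toUpperCase();
  if (Object.prototype.hasOwnProperty.call(NODE_ENCODINGS, key)) {
    return NODE_ENCODINGS[key];
  }
  return null;
}

console.log(toNodeEncoding('ISO-8859-1')); // Logs 'latin1'.
```

Charsets outside the table return `null`, in which case a conversion library would be needed; that is outside the scope of this sketch.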

### tika.typeAndCharset ###

Detect the content-type and character set of a file.

The character set will be appended to the mime-type if available.

```javascript
tika.typeAndCharset('test/data/file.txt', function(err, typeAndCharset) {
  console.log(typeAndCharset); // Logs 'text/plain; charset=ISO-8859-1'.
});
```
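When the two parts are needed separately, the combined string can be split again. The following is a minimal sketch that assumes the `type; charset=name` shape shown in the example above; `parseTypeAndCharset` is a hypothetical helper, not part of node-tika.

```javascript
// Hypothetical helper, not part of node-tika: splits a string such as
// 'text/plain; charset=ISO-8859-1' into its two parts. The charset is null
// when none was appended.
function parseTypeAndCharset(value) {
  var parts = value.split(';');
  var result = { type: parts[0].trim(), charset: null };
  for (var i = 1; i < parts.length; i++) {
    var match = /^\s*charset=(.+?)\s*$/i.exec(parts[i]);
    if (match) {
      result.charset = match[1];
    }
  }
  return result;
}

console.log(parseTypeAndCharset('text/plain; charset=ISO-8859-1'));
// Logs { type: 'text/plain', charset: 'ISO-8859-1' }.
```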

### tika.language ###

Detect the language a given string is written in.

```javascript
tika.language('This is just some text in English.', function(err, language, reasonablyCertain) {
  console.log(language); // Logs 'en'.
  console.log(reasonablyCertain); // Logs true or false.
});
```
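The `reasonablyCertain` flag can be used to discard weak guesses, for instance on very short strings. The following is a sketch of that policy; `certainLanguage` is a hypothetical helper, not part of node-tika, and treating an uncertain result as `null` is a design choice of this example.

```javascript
// Hypothetical helper, not part of node-tika: reports the detected language
// only when the result is flagged as reasonably certain, otherwise null.
function certainLanguage(tika, text, cb) {
  tika.language(text, function(err, language, reasonablyCertain) {
    if (err) {
      return cb(err);
    }
    cb(null, reasonablyCertain ? language : null);
  });
}

// Usage, assuming the module is installed:
// var tika = require('tika');
// certainLanguage(tika, 'This is just some text in English.', function(err, language) {
//   console.log(language === null ? 'unsure' : language);
// });
```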

## Credits and collaboration ##

Developed by Matthew Caruana Galizia at the ICIJ.

Please feel free to submit an issue or pull request. Don't forget to add your name to the CONTRIBUTORS file.

## License ##

Apache Tika JAR distributed under the Apache License, Version 2.0.
