A fully asynchronous, pure JavaScript implementation of the Parquet file format.
npm install parquetjs-decimal
```sh
$ npm install parquetjs-lite
```
_parquet.js requires node.js >= 7.6.0_
Usage: Writing files
--------------------
Once you have installed the parquet.js library, you can import it as a single
module:
```js
var parquet = require('parquetjs-lite');
```
Parquet files have a strict schema, similar to tables in a SQL database. So,
in order to produce a Parquet file we first need to declare a new schema. Here
is a simple example that shows how to instantiate a ParquetSchema object:
```js
// declare a schema for the fruits table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  price: { type: 'DOUBLE' },
  date: { type: 'TIMESTAMP_MILLIS' },
  in_stock: { type: 'BOOLEAN' }
});
```
Note that the Parquet schema supports nesting, so you can store complex, arbitrarily
nested records into a single row (more on that later) while still maintaining good
compression.
Once we have a schema, we can create a ParquetWriter object. The writer will
take input rows as JSON objects, convert them to the Parquet format and store
them on disk.
```js
// create new ParquetWriter that writes to 'fruits.parquet'
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
```
Once we are finished adding rows to the file, we have to tell the writer object
to flush the metadata to disk and close the file by calling the `close()` method.
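
The write path described above (open a writer, append rows, close) can be sketched as a small helper. This is a minimal sketch rather than part of the library: `writeRows` is a hypothetical name, and its `parquet` argument is assumed to be the module returned by `require('parquetjs-lite')`. It uses only `ParquetWriter.openFile`, `appendRow` and `close`, and closes the writer in a `finally` block so that the metadata is flushed even if appending a row fails.

```js
// Hypothetical helper (not part of parquetjs-lite): writes `rows` to `path`
// and always closes the writer so the metadata is flushed to disk.
// `parquet` is assumed to be the module from require('parquetjs-lite').
async function writeRows(parquet, schema, path, rows) {
  const writer = await parquet.ParquetWriter.openFile(schema, path);
  let written = 0;
  try {
    for (const row of rows) {
      await writer.appendRow(row);
      written += 1;
    }
  } finally {
    // close() flushes the file metadata and closes the file
    await writer.close();
  }
  return written;
}
```

With the fruits schema declared earlier, a call such as `await writeRows(parquet, schema, 'fruits.parquet', [{ name: 'apples', quantity: 10, price: 2.5, date: new Date(), in_stock: true }])` would write a single row.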
Usage: Reading files
--------------------
A Parquet reader retrieves the rows of a Parquet file in order.
The basic usage is to create a reader and then retrieve a cursor/iterator,
which lets you consume row after row until all rows have been read.
You may open more than one cursor and use them concurrently. All cursors become
invalid once `close()` is called on the reader object.
```js
// create new ParquetReader that reads from 'fruits.parquet'
let reader = await parquet.ParquetReader.openFile('fruits.parquet');
```
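
The reader and cursor flow described above can be sketched as one helper that opens a file, drains a cursor and closes the reader. This is a minimal sketch under stated assumptions: `readRows` is a hypothetical name, and its `parquet` argument is assumed to be the module returned by `require('parquetjs-lite')`. The optional `columns` argument is passed straight to `getCursor()`.

```js
// Hypothetical helper (not part of parquetjs-lite): reads every row of the
// file at `path` into an array. `columns` is optional and restricts which
// columns the cursor returns.
// `parquet` is assumed to be the module from require('parquetjs-lite').
async function readRows(parquet, path, columns) {
  const reader = await parquet.ParquetReader.openFile(path);
  try {
    const cursor = columns ? reader.getCursor(columns) : reader.getCursor();
    const rows = [];
    let record = null;
    // the loop ends when cursor.next() resolves to a falsy value
    while ((record = await cursor.next())) {
      rows.push(record);
    }
    return rows;
  } finally {
    // closing the reader invalidates all of its cursors
    await reader.close();
  }
}
```

Collecting all rows into an array keeps the sketch short; for large files, processing each record inside the loop avoids holding the whole file in memory.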
When creating a cursor, you can optionally request that only a subset of the
columns should be read from disk. For example:
```js
// create a new cursor that will only return the name and price columns
let cursor = reader.getCursor(['name', 'price']);
```
It is important that you call close() after you are finished reading the file to
avoid leaking file descriptors.
```js
await reader.close();
```
### Reading data from a URL
Parquet files can be read from a URL without having to download the whole file.
You will have to supply the request library as the first argument and the request parameters
as the second argument to the function `ParquetReader.openUrl`.
```js
const request = require('request');
let reader = await parquet.ParquetReader.openUrl(request, 'https://domain/fruits.parquet');
```
### Reading data from S3
Parquet files can be read from an S3 object without having to download the whole file.
You will have to supply the aws-sdk client as the first argument and the bucket/key information
as the second argument to the function `ParquetReader.openS3`.
```js
const AWS = require('aws-sdk');
const client = new AWS.S3({
  accessKeyId: 'xxxxxxxxxxx',
  secretAccessKey: 'xxxxxxxxxxx'
});
const params = {
  Bucket: 'xxxxxxxxxxx',
  Key: 'xxxxxxxxxxx'
};
let reader = await parquet.ParquetReader.openS3(client, params);
```
### Reading data from a buffer
If the complete Parquet file is already in a buffer, it can be read directly from memory without incurring any additional I/O.
```js
const fs = require('fs');
const file = fs.readFileSync('fruits.parquet');
let reader = await parquet.ParquetReader.openBuffer(file);
```
Encodings
---------
Internally, the Parquet format will store values from each field as consecutive
arrays which can be compressed/encoded using a number of schemes.
#### Plain Encoding (PLAIN)
The simplest encoding scheme is the PLAIN encoding. It stores the
values as they are, without any compression. The PLAIN encoding is currently
the default for all types except BOOLEAN:
```js
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8', encoding: 'PLAIN' },
});
```
#### Run Length Encoding (RLE)
The Parquet hybrid run length and bit-packing encoding compresses runs
of numbers very efficiently. Note that the RLE encoding can only be used with
fields whose primitive type is BOOLEAN, INT32 or INT64 (this includes logical
types such as UINT_32 that are stored as INT32). The RLE encoding requires an
additional `bitWidth` parameter, which gives the number of bits required to
store the largest value of the field:
```js
var schema = new parquet.ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});
```
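
Choosing `bitWidth` is simple arithmetic: it is the number of binary digits in the largest value the field can hold. The following sketch shows one way to compute it. `bitWidthFor` is a hypothetical helper, not a library function, and clamping the result to at least 1 is a conservative choice made for this example.

```js
// Hypothetical helper: smallest number of bits that can represent every
// integer in the range 0..maxValue.
function bitWidthFor(maxValue) {
  if (!Number.isSafeInteger(maxValue) || maxValue < 0) {
    throw new RangeError('maxValue must be a non-negative safe integer');
  }
  let bits = 0;
  // count how many times the value can be halved before it reaches zero
  for (let v = maxValue; v > 0; v = Math.floor(v / 2)) {
    bits += 1;
  }
  // a width of at least 1 is an assumption made for this sketch
  return Math.max(bits, 1);
}

console.log(bitWidthFor(127)); // 7
console.log(bitWidthFor(128)); // 8
```

An `age` field capped at 127 therefore fits in the 7 bits used in the schema above, while a single larger value would require 8.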
Optional Fields
---------------
By default, all fields are required to be present in each row. You can also mark
a field as 'optional' which will let you store rows with that field missing:
```js
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
});

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
await writer.appendRow({name: 'apples', quantity: 10 });
await writer.appendRow({name: 'banana' }); // not in stock
```
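
The required/optional rule can be checked before a row is handed to the writer. The sketch below works on the plain schema definition object, not on a `ParquetSchema` instance. `missingRequiredFields` is a hypothetical helper that only looks at top-level fields. Treating both `undefined` and `null` as "missing", and skipping `repeated` fields, are assumptions about how the library behaves.

```js
// Hypothetical helper: lists the top-level fields that a row is missing
// although the plain schema definition does not mark them as optional.
function missingRequiredFields(schemaDefinition, row) {
  return Object.keys(schemaDefinition).filter((name) => {
    const definition = schemaDefinition[name];
    // assumption: optional and repeated fields may be absent from a row
    const mayBeAbsent = definition.optional === true || definition.repeated === true;
    const isMissing = row[name] === undefined || row[name] === null;
    return !mayBeAbsent && isMissing;
  });
}

const fruitFields = {
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
};

console.log(missingRequiredFields(fruitFields, { name: 'banana' })); // []
console.log(missingRequiredFields(fruitFields, { quantity: 10 }));   // [ 'name' ]
```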
Nested Rows & Arrays
--------------------
Parquet supports nested schemas that allow you to store rows that have a more
complex structure than a simple tuple of scalar values. To declare a schema
with a nested field, omit the `type` in the column definition and add a `fields`
list instead.
Consider this example, which allows us to store a more advanced "fruits" table
where each row contains a name, a list of colours and a list of "stock" objects:
```js
// advanced fruits table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  colours: { type: 'UTF8', repeated: true },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    }
  }
});

// the above schema allows us to store the following rows:
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

await writer.appendRow({
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 }
  ]
});

await writer.appendRow({
  name: 'apple',
  colours: ['red', 'green'],
  stock: [
    { price: 1.20, quantity: 42 },
    { price: 1.30, quantity: 230 }
  ]
});

await writer.close();

// reading nested rows with a list of explicit columns
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

let cursor = reader.getCursor([['name'], ['stock', 'price']]);
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

await reader.close();
```
It might not be obvious why one would want to implement or use such a feature when
the same can, in principle, be achieved by serializing the record using JSON
(or a similar scheme) and then storing it in a UTF8 field.
Putting aside the philosophical discussion on the merits of strict typing,
knowing about the structure and subtypes of all records (globally) means we do not
have to duplicate this metadata (i.e. the field names) for every record. On top
of that, knowing about the type of a field allows us to compress the remaining
data more efficiently.
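
To make the metadata argument concrete, the following sketch counts how many bytes of a JSON-serialized record are spent on field names alone. `jsonKeyBytes` is a hypothetical helper written for this illustration. It is not part of the library and says nothing about the size Parquet itself would produce for the same record.

```js
// Hypothetical helper: number of bytes that JSON.stringify(value) would
// spend on object keys (including their surrounding quotes).
function jsonKeyBytes(value) {
  if (Array.isArray(value)) {
    return value.reduce((total, item) => total + jsonKeyBytes(item), 0);
  }
  if (value !== null && typeof value === 'object') {
    return Object.entries(value).reduce(
      (total, [key, child]) =>
        total + Buffer.byteLength(JSON.stringify(key)) + jsonKeyBytes(child),
      0
    );
  }
  // scalars carry no field names
  return 0;
}

const banana = {
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 }
  ]
};

console.log(jsonKeyBytes(banana)); // 56
```

Every additional record, and every additional `stock` entry, repeats those names in the JSON form, whereas a schema declares them once for the whole file.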
Nested Lists for Hive / Athena
-----------------------
Lists have to be annotated to be queryable with AWS Athena. See parquet-format for more detail, and the test directory (test/list.js) for a full working example with comments.
List of Supported Types & Encodings
-----------------------------------
We aim to be feature-complete and add new features as they are added to the
Parquet specification; this is the list of currently implemented data types and
encodings:

| Logical Type     | Primitive Type | Encodings  |
|------------------|----------------|------------|
| UTF8             | BYTE_ARRAY     | PLAIN      |
| JSON             | BYTE_ARRAY     | PLAIN      |
| BSON             | BYTE_ARRAY     | PLAIN      |
| BYTE_ARRAY       | BYTE_ARRAY     | PLAIN      |
| TIME_MILLIS      | INT32          | PLAIN, RLE |
| TIME_MICROS      | INT64          | PLAIN, RLE |
| TIMESTAMP_MILLIS | INT64          | PLAIN, RLE |
| TIMESTAMP_MICROS | INT64          | PLAIN, RLE |
| BOOLEAN          | BOOLEAN        | PLAIN, RLE |
| FLOAT            | FLOAT          | PLAIN      |
| DOUBLE           | DOUBLE         | PLAIN      |
| INT32            | INT32          | PLAIN, RLE |
| INT64            | INT64          | PLAIN, RLE |
| INT96            | INT96          | PLAIN      |
| INT_8            | INT32          | PLAIN, RLE |
| INT_16           | INT32          | PLAIN, RLE |
| INT_32           | INT32          | PLAIN, RLE |
| INT_64           | INT64          | PLAIN, RLE |
| UINT_8           | INT32          | PLAIN, RLE |
| UINT_16          | INT32          | PLAIN, RLE |
| UINT_32          | INT32          | PLAIN, RLE |
| UINT_64          | INT64          | PLAIN, RLE |

Buffering & Row Group Size
--------------------------
When writing a Parquet file, the ParquetWriter will buffer rows in memory
until a row group is complete (or close() is called) and then write out the row
group to disk.
The size of a row group is configurable by the user and controls the maximum
number of rows that are buffered in memory at any given time as well as the number
of rows that are co-located on disk:
```js
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);
```
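
The buffering behaviour described above can be illustrated with a simplified model. This is not the library's implementation: `RowGroupBuffer` and its `flush` callback are hypothetical names used only to show how rows accumulate in memory and are written out in groups, with `close()` flushing whatever remains.

```js
// Simplified model of row-group buffering (NOT the library's implementation):
// rows are held in memory and handed to `flush` in groups of `rowGroupSize`.
class RowGroupBuffer {
  constructor(rowGroupSize, flush) {
    this.rowGroupSize = rowGroupSize;
    this.flush = flush;
    this.rows = [];
  }

  append(row) {
    this.rows.push(row);
    if (this.rows.length >= this.rowGroupSize) {
      this.flushPending();
    }
  }

  // used by append() when a group is full and by close() for the remainder
  flushPending() {
    if (this.rows.length > 0) {
      this.flush(this.rows);
      this.rows = [];
    }
  }

  close() {
    this.flushPending();
  }
}

const groupSizes = [];
const buffer = new RowGroupBuffer(3, (rows) => groupSizes.push(rows.length));
for (let i = 0; i < 7; i++) {
  buffer.append({ id: i });
}
buffer.close();
console.log(groupSizes); // [ 3, 3, 1 ]
```

In this model a larger row group size means fewer, larger writes at the cost of more rows held in memory at once, which is the trade-off the setting controls.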