parquet.js

fully asynchronous, pure node.js implementation of the Parquet file format

[Build Status](http://travis-ci.org/ironSource/parquetjs)
[License: MIT](https://opensource.org/licenses/MIT)
[npm version](https://badge.fury.io/js/parquetjs-lite-gas)

This package contains a fully asynchronous, pure JavaScript implementation of
the Parquet file format. The implementation conforms with the
Parquet specification and is tested
for compatibility with Apache's Java reference implementation.

This is a lite read-only version that is modified to work with Google Apps Script through UrlFetchApp.

What is Parquet?: Parquet is a column-oriented file format; it allows you to
write a large amount of structured data to a file, compress it and then read parts
of it back out efficiently. The Parquet format is based on Google's Dremel paper.

Installation
------------

Package it with https://github.com/mahaker/esbuild-gas-plugin and add the following
to the build parameters in build.js:

```js
define: {
  // util.js
  'process.env.NODE_DEBUG': false,
  // int53
  'console.assert': 'assert'
}
```

and expose a global assert function in GAS:

```js
function assert(condition, message) {
  if (!condition) {
    if (!message) {
      throw Error("Assertion failed");
    }
    throw Error(message);
  }
}
```

_parquet.js requires node.js >= 7.6.0_

Usage: Reading files
--------------------

A parquet reader allows retrieving the rows from a parquet file in order. The basic usage is to create a reader and then retrieve a cursor/iterator which allows you to consume row after row until all rows have been read.

You may open more than one cursor and use them concurrently. All cursors become invalid once close() is called on the reader object.

Parquet files can be read from a URL without having to download the whole file. Supply the UrlFetchApp library as the first argument, the URL as the second, request parameters as the third, and reader options as the fourth argument to the function `ParquetReader.openUrl`.

```js
// UrlFetchApp is the global Google Apps Script service
let reader = await parquet.ParquetReader.openUrl(UrlFetchApp, 'https://domain/fruits.parquet');
```
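Reading part of a file over HTTP is possible because the Parquet format keeps its metadata in a footer: the last 8 bytes of a file are a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The sketch below is not this library's internal code. `parseTrailer` and `footerRangeHeader` are hypothetical helpers that show how a reader could work out which byte range (for an HTTP `Range` header) holds the footer, given the file size and the trailing 8 bytes. It runs under Node.js and assumes `Buffer` is available.

```javascript
// Illustrative only: these helpers are hypothetical and not part of parquet.js.
const MAGIC = 'PAR1';
const TRAILER_SIZE = 8; // 4-byte footer length + 4-byte magic

// Read the footer (metadata) length from the last 8 bytes of a Parquet file.
function parseTrailer(trailer) {
  if (trailer.length !== TRAILER_SIZE) {
    throw new Error('expected exactly 8 trailer bytes');
  }
  if (trailer.toString('latin1', 4, 8) !== MAGIC) {
    throw new Error('not a Parquet file: missing PAR1 magic');
  }
  return trailer.readUInt32LE(0);
}

// Build the Range header value that covers only the footer metadata.
function footerRangeHeader(fileSize, footerLength) {
  const end = fileSize - TRAILER_SIZE - 1; // last metadata byte (inclusive)
  const start = end - footerLength + 1;    // first metadata byte
  return `bytes=${start}-${end}`;
}

// Example: a 1000-byte file whose footer metadata is 120 bytes long.
const trailer = Buffer.alloc(TRAILER_SIZE);
trailer.writeUInt32LE(120, 0);
trailer.write(MAGIC, 4, 'latin1');

const footerLength = parseTrailer(trailer);
console.log(footerRangeHeader(1000, footerLength)); // bytes=872-991
```

Once the footer is known, the reader only needs further range requests for the column chunks a cursor actually asks for.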

When creating a cursor, you can optionally request that only a subset of the columns should be read from disk. For example:

```js
// create a new cursor that will only return the `name` and `price` columns
let cursor = reader.getCursor(['name', 'price']);
```

It is important that you call close() after you are finished reading the file to avoid leaking file descriptors.

```js
await reader.close();
```
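One way to make sure `close()` is always called is to wrap the read loop in `try`/`finally`. The following is a minimal sketch: `readAllRows` is a hypothetical helper, and `makeFakeReader` is an in-memory stand-in that exists only so the example runs without a real Parquet file. It assumes, as in the examples in this README, that `cursor.next()` resolves to `null` once all rows have been read.

```javascript
// Hypothetical helper (not part of parquet.js): drain a cursor and always
// close the reader, even if reading throws.
async function readAllRows(reader, columns) {
  const rows = [];
  try {
    const cursor = columns ? reader.getCursor(columns) : reader.getCursor();
    let record = null;
    while ((record = await cursor.next())) {
      rows.push(record);
    }
  } finally {
    await reader.close();
  }
  return rows;
}

// Minimal in-memory stand-in for a reader, used only to make this runnable.
function makeFakeReader(data) {
  const state = { closed: false };
  return {
    state,
    getCursor(columns) {
      let i = 0;
      return {
        async next() {
          if (i >= data.length) return null;
          const row = data[i++];
          if (!columns) return row;
          // keep only the requested top-level columns
          return Object.fromEntries(columns.map((c) => [c, row[c]]));
        },
      };
    },
    async close() {
      state.closed = true;
    },
  };
}

const fakeReader = makeFakeReader([
  { name: 'apples', price: 2.5, quantity: 10 },
  { name: 'banana', price: 1.2, quantity: 0 },
]);

readAllRows(fakeReader, ['name', 'price']).then((rows) => {
  console.log(rows.length, fakeReader.state.closed); // 2 true
});
```

With a real reader, the same helper would be called as `await readAllRows(reader, ['name', 'price'])`.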


If the complete Parquet file is already in a buffer, it can be read directly from memory without incurring any additional I/O.

```js
const fs = require('fs');

const file = fs.readFileSync('fruits.parquet');
let reader = await parquet.ParquetReader.openBuffer(file);
```

Encodings
---------

Internally, the Parquet format will store values from each field as consecutive arrays which can be compressed/encoded using a number of schemes.

#### Plain Encoding (PLAIN)

The simplest encoding scheme is the PLAIN encoding. It stores the values as they are without any compression. The PLAIN encoding is currently the default for all types except `BOOLEAN`:

```js
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8', encoding: 'PLAIN' },
});
```

#### Run Length Encoding (RLE)

The Parquet hybrid run length and bitpacking encoding compresses runs of numbers very efficiently. Note that the RLE encoding can only be used in combination with the `BOOLEAN`, `INT32` and `INT64` types. The RLE encoding requires an additional `bitWidth` parameter that contains the maximum number of bits required to store the largest value of the field.

```js
var schema = new parquet.ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});
```
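To pick a `bitWidth`, count the bits needed for the largest value the field can hold. The helper below is a small sketch of that arithmetic; `bitWidthFor` is a hypothetical name and not part of this library, and using at least one bit even for an all-zero field is a choice made for this example.

```javascript
// Hypothetical helper (not part of parquet.js): smallest bitWidth that can
// represent every integer in [0, maxValue].
function bitWidthFor(maxValue) {
  if (!Number.isSafeInteger(maxValue) || maxValue < 0) {
    throw new RangeError('maxValue must be a non-negative safe integer');
  }
  let bits = 0;
  // count how many times the value can be halved before reaching zero
  for (let v = maxValue; v > 0; v = Math.floor(v / 2)) {
    bits++;
  }
  return Math.max(bits, 1); // use at least one bit, even if every value is 0
}

console.log(bitWidthFor(127)); // 7
console.log(bitWidthFor(128)); // 8
```

An `age` field limited to the range 0 to 127 therefore fits in the `bitWidth: 7` used in the schema above.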

Optional Fields
---------------

By default, all fields are required to be present in each row. You can also mark a field as 'optional' which will let you store rows with that field missing:

```js
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
});

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
await writer.appendRow({name: 'apples', quantity: 10 });
await writer.appendRow({name: 'banana' }); // not in stock
```
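In the Parquet format, a missing optional value does not take up a slot in the value array. Instead, each column carries a "definition level" per row that records whether a value is present. The sketch below illustrates the idea for a flat optional field only; `encodeOptionalColumn` is a hypothetical helper written for this explanation and is not this library's API or its actual encoder.

```javascript
// Hypothetical illustration (not part of parquet.js): split a flat optional
// field into definition levels (1 = present, 0 = missing) and the values
// that are actually stored.
function encodeOptionalColumn(rows, field) {
  const definitionLevels = [];
  const values = [];
  for (const row of rows) {
    const present = row[field] !== undefined && row[field] !== null;
    definitionLevels.push(present ? 1 : 0);
    if (present) values.push(row[field]);
  }
  return { definitionLevels, values };
}

const fruitRows = [
  { name: 'apples', quantity: 10 },
  { name: 'banana' }, // not in stock
];

console.log(JSON.stringify(encodeOptionalColumn(fruitRows, 'quantity')));
// {"definitionLevels":[1,0],"values":[10]}
```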

Nested Rows & Arrays
--------------------

Parquet supports nested schemas that allow you to store rows that have a more complex structure than a simple tuple of scalar values. To declare a schema with a nested field, omit the `type` in the column definition and add a `fields` list instead.

Consider this example, which allows us to store a more advanced "fruits" table where each row contains a name, a list of colours and a list of "stock" objects.

```js
// advanced fruits table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  colours: { type: 'UTF8', repeated: true },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    }
  }
});

// the above schema allows us to store the following rows:
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

await writer.appendRow({
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 }
  ]
});

await writer.appendRow({
  name: 'apple',
  colours: ['red', 'green'],
  stock: [
    { price: 1.20, quantity: 42 },
    { price: 1.30, quantity: 230 }
  ]
});

await writer.close();

// reading nested rows with a list of explicit columns
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

let cursor = reader.getCursor([['name'], ['stock', 'price']]);
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

await reader.close();
```

It might not be obvious why one would want to implement or use such a feature when the same can, in principle, be achieved by serializing the record using JSON (or a similar scheme) and then storing it in a UTF8 field.

Putting aside the philosophical discussion on the merits of strict typing, knowing about the structure and subtypes of all records (globally) means we do not have to duplicate this metadata (i.e. the field names) for every record. On top of that, knowing about the type of a field allows us to compress the remaining data more efficiently.
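The cost of repeating field names can be made concrete with a rough comparison. The sketch below is an illustration, not a benchmark: it measures 1000 made-up records serialized row by row as JSON against the same data grouped by column, where each field name appears once. Both layouts still use JSON text here; actual Parquet files use typed binary encodings and would differ further.

```javascript
// Illustrative comparison (not a benchmark): field names repeated in every
// JSON-serialized row versus stored once in a column-oriented layout.
const records = [];
for (let i = 0; i < 1000; i++) {
  records.push({ name: 'fruit' + i, quantity: i });
}

// Row-oriented: each record serialized to JSON, as it would be in a UTF8 field.
const rowBytes = records
  .map((r) => Buffer.byteLength(JSON.stringify(r)))
  .reduce((a, b) => a + b, 0);

// Column-oriented: field names appear once, values are grouped per field.
const columnar = {
  name: records.map((r) => r.name),
  quantity: records.map((r) => r.quantity),
};
const columnBytes = Buffer.byteLength(JSON.stringify(columnar));

console.log({ rowBytes, columnBytes });
console.log('columnar is smaller:', columnBytes < rowBytes);
```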

Nested Lists for Hive / Athena
------------------------------

Lists have to be annotated to be queryable with AWS Athena. See parquet-format for more detail, and the test directory (test/list.js) for a full working example with comments.
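The parquet-format specification describes an annotated list as a three-level structure: an outer group annotated with `LIST`, containing a repeated group named `list`, which in turn contains a field named `element`. Assuming rows use that same nested shape in JavaScript (an assumption; test/list.js is the authoritative example), the hypothetical helpers below convert between plain arrays and that shape. Neither function is part of this library.

```javascript
// Hypothetical helpers, not part of parquet.js.
// Shape assumed here: { list: [ { element: value }, ... ] }

// Wrap a plain array in the three-level LIST structure.
function toAnnotatedList(values) {
  return { list: values.map((element) => ({ element })) };
}

// Unwrap the three-level LIST structure back into a plain array.
function fromAnnotatedList(annotated) {
  if (!annotated || !Array.isArray(annotated.list)) return [];
  return annotated.list.map((entry) => entry.element);
}

const colours = toAnnotatedList(['red', 'green']);
console.log(JSON.stringify(colours));
// {"list":[{"element":"red"},{"element":"green"}]}
console.log(JSON.stringify(fromAnnotatedList(colours)));
// ["red","green"]
```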

List of Supported Types & Encodings
-----------------------------------

We aim to be feature-complete and add new features as they are added to the Parquet specification; this is the list of currently implemented data types and encodings:

| Logical Type | Primitive Type | Encodings |
| --- | --- | --- |
| UTF8 | BYTE_ARRAY | PLAIN |
| JSON | BYTE_ARRAY | PLAIN |
| BSON | BYTE_ARRAY | PLAIN |
| BYTE_ARRAY | BYTE_ARRAY | PLAIN |
| TIME_MILLIS | INT32 | PLAIN, RLE |
| TIME_MICROS | INT64 | PLAIN, RLE |
| TIMESTAMP_MILLIS | INT64 | PLAIN, RLE |
| TIMESTAMP_MICROS | INT64 | PLAIN, RLE |
| BOOLEAN | BOOLEAN | PLAIN, RLE |
| FLOAT | FLOAT | PLAIN |
| DOUBLE | DOUBLE | PLAIN |
| INT32 | INT32 | PLAIN, RLE |
| INT64 | INT64 | PLAIN, RLE |
| INT96 | INT96 | PLAIN |
| INT_8 | INT32 | PLAIN, RLE |
| INT_16 | INT32 | PLAIN, RLE |
| INT_32 | INT32 | PLAIN, RLE |
| INT_64 | INT64 | PLAIN, RLE |
| UINT_8 | INT32 | PLAIN, RLE |
| UINT_16 | INT32 | PLAIN, RLE |
| UINT_32 | INT32 | PLAIN, RLE |
| UINT_64 | INT64 | PLAIN, RLE |

Buffering & Row Group Size
--------------------------

When writing a Parquet file, the `ParquetWriter` will buffer rows in memory until a row group is complete (or `close()` is called) and then write out the row group to disk.

The size of a row group is configurable by the user and controls the maximum number of rows that are buffered in memory at any given time as well as the number of rows that are co-located on disk:

```js
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);
```
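The buffering behaviour described above can be sketched in a few lines. `RowGroupBuffer` is a hypothetical class written for this explanation; it only mimics the described behaviour (flush when a row group is full, flush the remainder on close) and is not the writer's actual implementation.

```javascript
// Illustrative sketch of row-group buffering; RowGroupBuffer is hypothetical
// and only mimics the behaviour described above.
class RowGroupBuffer {
  constructor(rowGroupSize, flush) {
    this.rowGroupSize = rowGroupSize;
    this.flush = flush; // called with an array of rows for each row group
    this.rows = [];
  }

  // Buffer a row; write out a row group once enough rows have accumulated.
  appendRow(row) {
    this.rows.push(row);
    if (this.rows.length >= this.rowGroupSize) this.flushRows();
  }

  // Write out whatever is still buffered as a final, possibly smaller, group.
  close() {
    if (this.rows.length > 0) this.flushRows();
  }

  flushRows() {
    const group = this.rows;
    this.rows = [];
    this.flush(group);
  }
}

const groupSizes = [];
const buffer = new RowGroupBuffer(3, (group) => groupSizes.push(group.length));
for (let i = 0; i < 7; i++) buffer.appendRow({ id: i });
buffer.close();
console.log(groupSizes); // [ 3, 3, 1 ]
```

A larger row group size means more rows held in memory before each write, and more rows stored together on disk.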

Dependencies
-------------

Parquet uses thrift to encode the schema and other
metadata, but the actual data does not use thrift.

Notes
-----

Currently parquet-cpp doesn't fully support DATA_PAGE_V2. You can work around this
by setting the useDataPageV2 option to false.

Contributions
-------------
Please make sure you sign the contributor license agreement in order for us to be able to accept your contribution. We thank you very much!

License
-------

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in the
Software without restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the
Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.