An experimental client for Apache Spark Connect written in TypeScript. This library allows JavaScript and TypeScript applications to interact with Apache Spark using the Spark Connect protocol over gRPC.
> ⚠️ Experimental: This library is in active development. APIs may change without notice.
API Documentation - Comprehensive API reference with examples
The API documentation is automatically generated from JSDoc comments and TypeScript definitions.
- TypeScript-first: Full TypeScript support with comprehensive type definitions
- Spark Connect Protocol: Uses the official Spark Connect gRPC protocol
- DataFrame API: Familiar DataFrame operations (select, filter, groupBy, join, etc.)
- Multiple Data Formats: Support for CSV, JSON, Parquet, ORC, and more
- SQL Support: Execute SQL queries directly
- Catalog API: Metadata operations (databases, tables, functions)
- Apache Arrow: Efficient data transfer using Apache Arrow
- Type-safe: Built-in type system for Spark data types
- Node.js: v20 or higher
- Apache Spark: 4.0+ with Spark Connect enabled
- For development/testing: Docker is used to run the Spark Connect server
Install from npm:
```bash
npm install spark.js
```
Or clone the repository for development:
```bash
git clone https://github.com/yaooqinn/spark.js.git
cd spark.js
npm install
```
Here's a minimal example to get started:
```typescript
import { SparkSession } from 'spark.js';

async function main() {
  // Create a SparkSession connected to a Spark Connect server
  const spark = await SparkSession.builder()
    .appName('MyApp')
    .remote('sc://localhost:15002')
    .getOrCreate();

  // Create a simple DataFrame
  const df = spark.range(1, 100)
    .selectExpr('id', 'id * 2 as doubled');

  // Show the first 10 rows
  await df.show(10);

  // Perform aggregation
  const count = await df.count();
  console.log(`Total rows: ${count}`);
}

main().catch(console.error);
```
The SparkSession is the entry point for all Spark operations:
```typescript
import { SparkSession } from 'spark.js';

// Connect to a remote Spark Connect server
const spark = await SparkSession.builder()
  .appName('MyApplication')
  .remote('sc://localhost:15002') // Spark Connect endpoint
  .getOrCreate();

// Get Spark version
const version = await spark.version();
console.log(`Spark version: ${version}`);
```
Use the DataFrameReader to load data from various sources:
```typescript
// Read CSV file
const csvDF = spark.read
  .option('header', true)
  .option('delimiter', ';')
  .csv('path/to/people.csv');

// Read JSON file
const jsonDF = spark.read.json('path/to/data.json');

// Read Parquet file
const parquetDF = spark.read.parquet('path/to/data.parquet');

// Read with schema inference
const df = spark.read
  .option('inferSchema', true)
  .option('header', true)
  .csv('data.csv');
```
Perform transformations and actions on DataFrames:
```typescript
import { functions } from 'spark.js';
const { col, lit } = functions;

// Select columns
const selected = df.select('name', 'age');

// Filter rows
const filtered = df.filter(col('age').gt(21));

// Add/modify columns
const transformed = df
  .withColumn('age_plus_one', col('age').plus(lit(1)))
  .withColumnRenamed('name', 'full_name');

// Group by and aggregate
const aggregated = df
  .groupBy('department')
  .agg({ salary: 'avg', age: 'max' });

// Join DataFrames
const joined = df1.join(df2, df1.col('id').equalTo(df2.col('user_id')), 'inner');

// Sort
const sorted = df.orderBy(col('age').desc());

// Limit
const limited = df.limit(100);

// Collect results
const rows = await df.collect();
rows.forEach(row => console.log(row.toJSON()));
```
Execute SQL queries directly:
```typescript
// Register DataFrame as temporary view
df.createOrReplaceTempView('people');

// Execute SQL
const resultDF = await spark.sql(`
  SELECT name, age, department
  FROM people
  WHERE age > 30
  ORDER BY age DESC`);

await resultDF.show();
```
Save DataFrames to various formats:
```typescript
// Write as Parquet (default mode: error if exists)
await df.write.parquet('output/data.parquet');

// Write as CSV with options
await df.write
  .option('header', true)
  .option('delimiter', '|')
  .mode('overwrite')
  .csv('output/data.csv');

// Write as JSON
await df.write
  .mode('append')
  .json('output/data.json');

// Partition by column
await df.write
  .partitionBy('year', 'month')
  .parquet('output/partitioned_data');

// V2 Writer API (advanced)
await df.writeTo('my_table')
  .using('parquet')
  .partitionBy('year', 'month')
  .tableProperty('compression', 'snappy')
  .create();
```
See guides/DataFrameWriterV2.md for more V2 Writer examples.
Explore metadata using the Catalog API:
```typescript
// List databases
const databases = await spark.catalog.listDatabases();

// List tables in current database
const tables = await spark.catalog.listTables();

// List columns of a table
const columns = await spark.catalog.listColumns('my_table');

// Check if table exists
const exists = await spark.catalog.tableExists('my_table');

// Get current database
const currentDB = await spark.catalog.currentDatabase();
```
- SparkSession: Main entry point for Spark functionality
- DataFrame: Distributed collection of data organized into named columns
- Column: Expression on a DataFrame column
- Row: Represents a row of data
- DataFrameReader: Interface for loading data
- DataFrameWriter: Interface for saving data
- DataFrameWriterV2: V2 writer with advanced options
- RuntimeConfig: Runtime configuration interface
- Catalog: Metadata and catalog operations
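A minimal sketch of how several of these classes fit together, built only from the APIs shown elsewhere in this README (the server address and file paths are placeholders):

```typescript
import { SparkSession, functions } from 'spark.js';
const { col } = functions;

// SparkSession is the entry point; RuntimeConfig is reachable via spark.conf, Catalog via spark.catalog
const spark = await SparkSession.builder()
  .appName('CoreClassesTour')
  .remote('sc://localhost:15002')
  .getOrCreate();

await spark.conf.set('spark.sql.shuffle.partitions', '8'); // RuntimeConfig
console.log(await spark.catalog.listTables());             // Catalog

// DataFrameReader -> DataFrame -> Column expressions
const df = spark.read.option('header', true).csv('path/to/people.csv');
const adults = df.filter(col('age').gt(21));

// Rows come back from actions such as collect()
const rows = await adults.collect();
rows.forEach(row => console.log(row.toJSON()));

// DataFrameWriter persists the result
await adults.write.mode('overwrite').parquet('output/adults.parquet');
```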
Import SQL functions from the functions module:
```typescript
import { functions } from 'spark.js';
const { col, lit, sum, avg, max, min, count, when, concat, upper } = functions;

const df = spark.read.csv('data.csv');

df.select(
  col('name'),
  upper(col('name')).as('upper_name'),
  when(col('age').gt(18), lit('adult')).otherwise(lit('minor')).as('category')
);
```
See guides/STATISTICAL_FUNCTIONS.md for statistical functions.
Define schemas using the type system:
```typescript
import { DataTypes, StructType, StructField } from 'spark.js';

const schema = new StructType([
  new StructField('name', DataTypes.StringType, false),
  new StructField('age', DataTypes.IntegerType, true),
  new StructField('salary', DataTypes.DoubleType, true)
]);

// 'data' holds the local rows to convert, matching the schema above
const df = spark.createDataFrame(data, schema);
```
Configure the connection to Spark Connect server:
```typescript
const spark = await SparkSession.builder()
  .appName('MyApp')
  .remote('sc://host:port') // Default: sc://localhost:15002
  .getOrCreate();
```
Set Spark configuration at runtime:
```typescript
// Set configuration
await spark.conf.set('spark.sql.shuffle.partitions', '200');

// Get configuration
const value = await spark.conf.get('spark.sql.shuffle.partitions');

// Get with default
const valueOrDefault = await spark.conf.get('my.config', 'default_value');
```
Logging is configured in log4js.json. Logs are written to both the console and the logs/ directory.
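For orientation, a minimal log4js configuration along these lines would route logs to both targets; the appender names and log file path below are illustrative, not the repository's actual log4js.json:

```json
{
  "appenders": {
    "console": { "type": "console" },
    "file": { "type": "file", "filename": "logs/spark.js.log" }
  },
  "categories": {
    "default": { "appenders": ["console", "file"], "level": "info" }
  }
}
```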
For contributors, comprehensive documentation is available in the Contributor Guide:
- Getting Started - Set up your development environment
- Code Style Guide - Coding conventions and best practices
- Build and Test - Building, testing, and running the project
- IDE Setup - Recommended IDE configurations
- Submitting Changes - How to submit pull requests
```bash
# Clone and install
git clone https://github.com/yaooqinn/spark.js.git
cd spark.js
npm install
```

For detailed instructions, see the Contributor Guide.

Project structure:

```
spark.js/
├── src/
│   ├── gen/                      # Generated protobuf code (DO NOT EDIT)
│   └── org/apache/spark/
│       ├── sql/                  # Main API implementation
│       │   ├── SparkSession.ts   # Entry point
│       │   ├── DataFrame.ts      # DataFrame API
│       │   ├── functions.ts      # SQL functions
│       │   ├── types/            # Type system
│       │   ├── catalog/          # Catalog API
│       │   ├── grpc/             # gRPC client
│       │   └── proto/            # Protocol builders
│       └── storage/              # Storage levels
├── tests/                        # Test suites
├── example/                      # Example applications
├── docs/                         # Additional documentation
├── protobuf/                     # Protocol buffer definitions
├── .github/
│   ├── workflows/                # CI/CD workflows
│   └── docker/                   # Spark Connect Docker setup
├── package.json                  # Dependencies and scripts
├── tsconfig.json                 # TypeScript configuration
├── jest.config.js                # Jest test configuration
├── eslint.config.mjs             # ESLint configuration
└── buf.gen.yaml                  # Buf protobuf generation config
```
Examples
The example/ directory contains several runnable examples:
- Pi.ts: Monte Carlo Pi estimation
- CSVExample.ts: Reading and writing CSV files
- ParquetExample.ts: Parquet file operations
- JsonExample.ts: JSON file operations
- JoinExample.ts: DataFrame join operations
- CatalogExample.ts: Catalog API usage
- StatisticalFunctionsExample.ts: Statistical functions
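To give a flavour of what these examples contain, here is a rough Monte Carlo Pi sketch built only from the DataFrame APIs shown earlier in this README; it is not the repository's Pi.ts, and the sample count and connection address are placeholders:

```typescript
import { SparkSession, functions } from 'spark.js';
const { col } = functions;

async function estimatePi(samples: number): Promise<number> {
  const spark = await SparkSession.builder()
    .appName('PiSketch')
    .remote('sc://localhost:15002')
    .getOrCreate();

  // Draw random points in the unit square, then count those outside the unit circle
  const outside = await spark.range(0, samples)
    .selectExpr('rand() as x', 'rand() as y')
    .selectExpr('x * x + y * y as d')
    .filter(col('d').gt(1))
    .count();

  // Fraction of points inside the quarter circle approximates pi / 4
  return (4 * (samples - outside)) / samples;
}

estimatePi(1_000_000)
  .then(pi => console.log(`Pi is roughly ${pi}`))
  .catch(console.error);
```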
To run an example:
```bash
# Make sure Spark Connect server is running
npx ts-node example/org/apache/spark/sql/example/Pi.ts
```
Contributing
Contributions are welcome! Please read the Contributor Guide for detailed information on:
- Setting up your development environment
- Code style and conventions
- Building and testing
- Submitting pull requests
1. Fork the repository and create a feature branch
2. Follow the Code Style Guide
3. Add tests for new functionality
4. Run checks: `npm run lint` and `npm test`
5. Submit a pull request with a clear description

See Submitting Changes for detailed instructions.
Roadmap
- [ ] Minor changes and per-class features: search the codebase for 'TODO'
- [ ] Support Retry / Reattachable execution
- [ ] Support Checkpoint for DataFrame
- [ ] Support DataFrameNaFunctions
- [ ] Support User-Defined Functions (UDF)
  - [ ] UDF registration via `spark.udf.register()`
  - [ ] Inline UDFs via `udf()` function
  - [x] Java UDF registration via `spark.udf.registerJava()`

This project is licensed under the Apache License 2.0.
---
Note: This is an experimental project. For production use, please refer to the official Apache Spark documentation and consider using official Spark clients.