# DataFrame class for Observable framework
What I've done here is waste a bunch of time... but along the way I've created a JavaScript implementation of a pandas-like DataFrame for data manipulation and analysis. Once again, yes: there are already pretty good implementations of this, like arquero. However:
1. I wanted to implement my own minimal version and nothing teaches you like implementing your own stuff.
2. I didn't think arquero was python-esque enough for me.
3. I don't expect anyone else to use this stuff except me.
With Claude and OpenAI these days, it's easier than ever to _learn_, which is exactly why I've done what I've done. And while I really am starting to love TypeScript (save for the analysis part), this implementation aims to provide a Python/pandas-like experience in JavaScript, making data manipulation and analysis more intuitive for Python developers like me.
---
A JavaScript implementation of a pandas-like DataFrame for data manipulation and analysis. This library provides a Python/pandas-like experience in JavaScript, making data transformation and analysis more intuitive.
## Installation

```bash
npm install observable-dataframe
```

## Table of Contents
1. Basic Usage
- Creating DataFrames
- Viewing Data
- Basic Operations
2. Data Manipulation
- Column Types and Transformations
- Adding or Modifying Columns
- Filtering Data
- Sorting Data
3. Aggregation Operations
- Basic Aggregation
- Grouping Data
- Time Window Aggregation
4. Statistical Analysis
- Descriptive Statistics
- Correlation Analysis
5. Advanced Features
- Concurrent Processing
- Iteration Methods
- Mathematical Operations
6. Method Reference
## Basic Usage

### Creating DataFrames

```javascript
import { DataFrame } from "observable-dataframe";
// From array of objects (CSV-like data)
const data = [
{ name: "Alice", age: 30, city: "New York" },
{ name: "Bob", age: 25, city: "Los Angeles" },
{ name: "Charlie", age: 35, city: "Chicago" },
];
const df = new DataFrame(data);
// From CSV file (using FileAttachment in Observable)
const csvData = await FileAttachment("./data/mydata.csv").csv();
const csv_df = new DataFrame(csvData);
// From column-based data
const columnData = {
name: ["Alice", "Bob", "Charlie"],
age: [30, 25, 35],
city: ["New York", "Los Angeles", "Chicago"],
};
const column_df = new DataFrame(columnData);
```

### Viewing Data

```javascript
// Get first few rows
df.head(3);
// Get last few rows
df.tail(3);
// Basic view (in Observable)
df.table();
// Print with options
df.print({ max_rows: 10, precision: 2 });
```

### Basic Operations

```javascript
// Select specific columns
df.select(["name", "age"]);
// Basic information
df.describe().table();
// Get raw data
df.to_data();
```

## Data Manipulation

### Column Types and Transformations

```javascript
// Set types for columns
df.setType("age", "number");
df.setTypes({ age: "number", timestamp: "date" });
// Apply function to a column
df.apply("name", (name) => name.toUpperCase());
// Map values in a column
df.map("status", { A: "Active", I: "Inactive" });
```

### Adding or Modifying Columns

```javascript
// Add new columns with math operations
df.with_columns({
// Using functions that work on each row
adult: (row) => (row.age >= 18 ? "Yes" : "No"),
name_length: (row) => row.name.length,
// Using simple calculations
age_in_months: (row) => row.age * 12,
// Adding constant values
cohort: "2023",
});
// Another way to add columns
df.assign("adult", (row) => (row.age >= 18 ? "Yes" : "No"));
```
#### Advanced Column Transformations
The .with_columns() method is powerful for data transformation and can be combined with other methods for complex workflows:
```javascript
// Load dataset of medical events
const events_df = new DataFrame(medical_events);
// Calculate derived risk measures
const risk_df = events_df
.filter()
.gt("baseline_measurement", 0) // Filter for valid baseline
.execute()
.with_columns({
// Calculate absolute risk
absolute_risk: (row) => (row.events / row.population) * 100,
// Calculate risk ratios compared to baseline
risk_ratio: (row) =>
row.events /
row.population /
(row.baseline_events / row.baseline_population),
// Flag high-risk events
high_risk: (row) =>
row.events / row.population > 0.05 ? "High" : "Normal",
// Calculate risk-adjusted costs
adjusted_cost: (row) => row.cost * Math.sqrt(row.risk_score),
});
// Chain with groupby for risk stratification
const risk_strata = risk_df
.groupby(["region", "high_risk"])
.agg({
absolute_risk: ["mean", "max"],
adjusted_cost: ["sum", "mean"],
})
.with_columns({
// Add cost per risk unit
cost_per_risk_unit: (row) => row.adjusted_cost_sum / row.absolute_risk_mean,
});
```

### Filtering Data

The DataFrame offers multiple ways to filter data:
#### Using filter() method with chainable conditions
```javascript
// Filter with multiple conditions
const filteredData = df
.filter()
.gt("age", 25) // age > 25
.lt("income", 100000) // income < 100000
.neq("status", "Inactive") // status != "Inactive"
.execute(); // Don't forget to execute!
// More complex filtering
const results = df
.filter()
.lte("P Value", 0.05) // P Value <= 0.05
.lte("Absolute Prevalence DiD", 0) // Correctly signed
.notNull("Prevalence Measurement Period (Comparator)") // Not null values
.execute()
.with_columns({
// Add calculated columns to filtered results
absolute_events: (row) => row["Prevalence"] * population,
events_prevented: (row) =>
row["Prevalence"] * population -
row["Prevalence"] population (1 + row["Relative DiD"]),
});
```
#### Available filter conditions
```javascript
// Comparison operators
df.filter()
.gt("column", value) // Greater than
.lt("column", value) // Less than
.gte("column", value) // Greater than or equal
.lte("column", value) // Less than or equal
.eq("column", value) // Equal to
.neq("column", value) // Not equal to
.between("column", lower, upper) // Between values (inclusive)
// Null/NaN handling
.isNull("column") // Is null or undefined
.notNull("column") // Is not null or undefined
.notNan("column") // Is not NaN (for numeric columns)
// Set membership
.isIn("column", [values]) // Value is in array
// Custom conditions
.where((row) => customLogic(row)) // Custom function
// Execute the filter
.execute();
```
#### Using query() with expression strings
```javascript
// Simple querying with strings
df.query("row.age > 30");
df.query("row.city === 'New York' && row.age < 40");
```
#### Using loc() with functions
```javascript
// Filtering with a function
df.loc((row) => row.age > 30 && row.city.startsWith("New"));
```
### Sorting Data

```javascript
// Sort by a column (ascending)
df.sort_values("age");
// Sort by a column (descending)
df.sort_values("age", false);
// Sort by multiple columns
df.sort_values(["city", "age"]);
// Sort by multiple columns with different directions
df.sort_values(["city", "age"], [true, false]); // city ascending, age descending
```

## Aggregation Operations

### Basic Aggregation

```javascript
// Sum of a column
df.sum("age");
// Mean of a column
df.mean("age");
// Multiple column aggregation
df.sum(["age", "income"]);
df.mean(["age", "income"]);
// Get unique values
df.unique("city");
df.unique(["city", "status"]);
// Count values
df.value_counts("city");
```

### Grouping Data

```javascript
// Basic groupby with aggregation
const summary = df.groupby(["city"]).agg({
age: ["min", "mean", "max", "sum"],
income: "mean",
});
// Multiple groupby columns
const detailedSummary = df.groupby(["city", "gender"]).agg({
age: ["min", "max", "mean"],
income: ["mean", "sum"],
});
// Other aggregation types
const aggResult = df.groupby(["city"]).agg({
status: ["first", "last", "nunique"], // first/last value, number of unique values
name: ["concat"], // concatenate strings
});
// Iterating through groups
for (const [key, group] of df.groupby("city", { iterable: true })) {
  console.log(`City: ${key.city}`);
  console.log(`Average age: ${group.mean("age")}`);
}
```
#### Available Aggregation Functions
Both groupby() and groupby_dynamic() support these aggregation functions:
| Aggregation                  | Description                | Works With      |
| ---------------------------- | -------------------------- | --------------- |
| "min"                        | Minimum value              | Numeric columns |
| "max"                        | Maximum value              | Numeric columns |
| "mean"                       | Average value              | Numeric columns |
| "sum"                        | Sum of values              | Numeric columns |
| "first"                      | First value in group       | Any column type |
| "last"                       | Last value in group        | Any column type |
| "nunique"                    | Count of unique values     | Any column type |
| "concat"                     | Concatenate string values  | String columns  |
| "nested_concat"              | Returns [head, tail] array | Any column type |
| "nested_concat_array"        | Groups identical values    | Any column type |
| "nested_concat_array_count"  | Count of identical values  | Any column type |
Example using each aggregation type:
```javascript
// Create a DataFrame with various data types
const sales_df = new DataFrame([
{
date: "2023-01-01",
region: "North",
product: "A",
quantity: 10,
price: 25.5,
tags: "sale,featured",
},
{
date: "2023-01-02",
region: "North",
product: "B",
quantity: 15,
price: 19.99,
tags: "new",
},
{
date: "2023-01-01",
region: "South",
product: "A",
quantity: 8,
price: 24.99,
tags: "sale",
},
{
date: "2023-01-02",
region: "South",
product: "C",
quantity: 20,
price: 34.5,
tags: "featured",
},
{
date: "2023-01-03",
region: "North",
product: "A",
quantity: 12,
price: 25.0,
tags: "sale",
},
]);
// Use various aggregation types
const aggregated = sales_df.groupby(["region", "product"]).agg({
quantity: ["min", "max", "mean", "sum"], // Numeric aggregations
price: ["mean", "min"], // More numeric aggregations
date: ["first", "last"], // First/last values
tags: ["nunique", "concat"], // Count unique, concatenate strings
product: ["nested_concat", "nested_concat_array"], // Nested aggregations
});
/*
Results include:
- quantity_min, quantity_max, quantity_mean, quantity_sum
- price_mean, price_min
- date_first, date_last
- tags_nunique, tags_concat
- product_nested_concat, product_nested_concat_array
*/
```
The _performAggregation method internally handles these aggregation types by:
1. Initializing aggregation accumulators for each group
2. Processing each row and updating the accumulators
3. Calculating final results based on accumulated values
4. Formatting the output as a new DataFrame
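As a rough illustration of steps 1–3, here is a minimal sketch of that accumulate-then-finalize pattern. The helper names (makeAccumulator, step, finalize) are hypothetical and not part of the library's API; the sketch just shows the general shape of the approach.

```javascript
// Hypothetical sketch of the accumulate/finalize pattern described above;
// not the library's actual implementation.
function makeAccumulator(aggType) {
  switch (aggType) {
    case "sum":
      return { init: 0, step: (acc, v) => acc + v, finalize: (acc) => acc };
    case "min":
      return { init: Infinity, step: (acc, v) => Math.min(acc, v), finalize: (acc) => acc };
    case "mean":
      return {
        init: { total: 0, count: 0 },
        step: (acc, v) => ({ total: acc.total + v, count: acc.count + 1 }),
        finalize: (acc) => acc.total / acc.count,
      };
    default:
      throw new Error(`Unsupported aggregation: ${aggType}`);
  }
}

// 1) initialize, 2) update per row, 3) finalize into a single value
const { init, step, finalize } = makeAccumulator("mean");
const quantities = [10, 15, 12]; // e.g. the "quantity" values of one group
const quantity_mean = finalize(quantities.reduce(step, init)); // 12.33…
```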
### Time Window Aggregation

The groupby_dynamic() method provides powerful time-series analysis capabilities by grouping data into time windows.
```javascript
// Time-based window aggregation
const timeSeriesDf = df.groupby_dynamic(
"timestamp", // Column with timestamps
{
value: ["mean", "sum"], // Columns to aggregate
},
{
every: "1d", // Window step size
period: "7d", // Window size
offset: "0h", // Time offset
closed: "left", // Window boundary inclusion
label: "left", // Window labeling strategy
include_boundaries: true, // Include window bounds in output
groupby_start_date: "day", // Align windows to calendar units
}
);
```
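To build intuition for how the every and period options interact, here is a small conceptual sketch (not the library's internals); the helper sketchWindows is made up for illustration. It shows that a 7-day period stepped every 1 day produces overlapping, rolling windows.

```javascript
// Illustration only: conceptual window generation for every: "1d", period: "7d".
// sketchWindows is NOT part of observable-dataframe.
const DAY_MS = 24 * 60 * 60 * 1000;

function sketchWindows(start, end, everyMs, periodMs) {
  const windows = [];
  for (let t = start.getTime(); t < end.getTime(); t += everyMs) {
    windows.push({
      lower: new Date(t),            // included when closed: "left"
      upper: new Date(t + periodMs), // excluded when closed: "left"
    });
  }
  return windows;
}

// Daily steps with 7-day windows => overlapping (rolling) weekly windows
const windows = sketchWindows(
  new Date("2023-01-01T00:00:00Z"),
  new Date("2023-01-10T00:00:00Z"),
  1 * DAY_MS, // every: "1d"
  7 * DAY_MS  // period: "7d"
);
// windows[0] covers Jan 1–8, windows[1] covers Jan 2–9, and so on.
```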
#### Dynamic Time Window Example
Here's a real-world example analyzing sensor data with rolling time windows:
```javascript
// Sensor data with timestamps
const sensor_df = new DataFrame(sensorData);
// Set column types
sensor_df.setTypes({
timestamp: "date",
temperature: "number",
humidity: "number",
pressure: "number",
});
// Calculate hourly rolling windows with 10-minute steps
const hourly_stats = sensor_df.groupby_dynamic(
"timestamp", // Time column
{
temperature: ["min", "mean", "max"],
humidity: ["mean", "max"],
pressure: ["mean"],
},
{
every: "10m", // Step forward every 10 minutes
period: "1h", // Use 1 hour windows
offset: "0m", // No offset
closed: "left", // Include left boundary, exclude right
label: "left", // Label windows by start time
include_boundaries: true,
groupby_start_date: "minute", // Align to calendar minutes
}
);
// Calculate day-over-day changes using window aggregation
const daily_changes = sensor_df
.groupby_dynamic(
"timestamp",
{
temperature: ["mean", "first", "last"],
humidity: ["mean"],
},
{
every: "1d", // Daily windows
period: "1d", // 1 day window size
}
)
.with_columns({
// Calculate temperature change
temp_change: (row) => row.temperature_last - row.temperature_first,
    // Calculate vs 24 hours before (if data available);
    // previous_day_temp is a placeholder for a lookup of the prior window's mean
    day_over_day_temp: (row) => {
      return row.temperature_mean - previous_day_temp;
    },
});
```

## Statistical Analysis

### Descriptive Statistics

```javascript
// Get detailed statistics
const stats = df.describe();
// Access specific statistics
const mean = df.mean("age");
const maximum = df.max("age");
const minimum = df.min("age");
// Percentiles
const median = df.percentile(0.5, "age");
const quartiles = df.percentile(0.25, ["age", "income"]);
```

### Correlation Analysis

```javascript
// Correlation between two columns
const correlation = df.corr("age", "income");
// Correlation matrix for all numeric columns
const corrMatrix = df.corrMatrix();
// Visualize correlation matrix
const plot = df.corrPlot({
width: 600,
height: 600,
marginTop: 100,
scheme: "blues", // Color scheme
decimals: 2, // Decimal places to display
});
```

## Advanced Features

### Concurrent Processing

```javascript
// Regular groupby
const regularResult = df.groupby(["state", "city"]).agg({
population: ["min", "mean", "max"],
income: "mean",
});
// Concurrent groupby for larger datasets
const concurrentResult = await df.concurrentGroupBy(["state", "city"], {
population: ["min", "mean", "max"],
income: "mean",
});
// Compare performance
const perfResults = await comparePerformance();
```

### Iteration Methods

```javascript
// Iterate row by row with index
for (const [index, row] of df.iterrows()) {
  console.log(`Row ${index}:`, row);
}
// Iterate with named tuples
for (const row of df.itertuples("Record")) {
  console.log(`${row.name} is ${row.age} years old`);
}
// Iterate column by column
for (const [column, values] of df.items()) {
  console.log(`Column ${column}:`, values);
}
// Direct iteration over rows
for (const row of df) {
console.log(row);
}
```

### Mathematical Operations

```javascript
// Add a scalar
const df2 = df.add(10); // Add 10 to all numeric columns
// Subtract a scalar
const df3 = df.sub(5); // Subtract 5 from all numeric columns
// Multiply
const df4 = df.mul(2); // Double all numeric columns
// Divide
const df5 = df.div(100); // Divide all numeric columns by 100
// DataFrame math - add two DataFrames
const dfSum = df.add(df2);
```

## Method Reference

- Basic Operations
  - constructor(data, options) - Create a new DataFrame
  - head(n) - Get first n rows
  - tail(n) - Get last n rows
  - select(columns) - Select specified columns
  - print(options) - Format data for printing
  - table(options) - Return an HTML table view
  - to_data() - Convert to array of objects
- Data Manipulation
  - filter() - Chain filtering operations
  - loc(condition) - Filter with a function
  - query(expr) - Filter with a string expression
  - setType(column, type) - Set column type
  - setTypes(typeMap) - Set multiple column types
  - with_columns(columnExpressions) - Add/modify columns
  - assign(columnName, values) - Add a new column
  - drop(columns) - Remove columns
  - dropna(options) - Remove rows with missing values
  - fillna(value) - Replace missing values
  - rename(columnMap) - Rename columns
  - apply(column, func) - Apply function to column
  - map(column, mapper) - Map values in column
  - sort_values(columns, ascending) - Sort data
  - merge(other, options) - Join two DataFrames
- Aggregation
  - groupby(columns, options) - Group by columns
  - groupby_dynamic(timeColumn, aggSpec, options) - Time-based grouping
  - concurrentGroupBy(columns, aggregations) - Parallel grouping
  - unique(columns) - Get unique values
  - mean(columns) - Calculate mean
  - sum(columns) - Calculate sum
  - max(columns) - Get maximum value
  - min(columns) - Get minimum value
  - value_counts(column) - Count unique values
- Statistics
  - describe() - Generate descriptive statistics
  - percentile(p, columns) - Calculate percentiles
  - corr(col1, col2) - Calculate correlation
  - corrMatrix() - Generate correlation matrix
  - corrPlot(options) - Visualize correlation matrix
- Math Operations
  - add(other) - Add scalar or DataFrame
  - sub(other) - Subtract scalar or DataFrame
  - mul(other) - Multiply by scalar or DataFrame
  - div(other) - Divide by scalar or DataFrame
- Iteration
  - iterrows() - Iterate over [index, row] pairs
  - itertuples(name) - Iterate over row tuples
  - items() - Iterate over [column, values] pairs
  - [Symbol.iterator]() - Direct iteration over rows
- Other
  - append(row) - Add a single row
  - extend(rows) - Add multiple rows
  - terminate() - Clean up workers
## Examples

### Example: state-level aggregation

```javascript
import { DataFrame } from "observable-dataframe";
// Load data
const va_data = await FileAttachment("./data/veterans_data.csv").csv();
const va_df = new DataFrame(va_data);
// Ensure numeric columns
va_df.setType("Veterans", "number");
// Create derivative columns
va_df.with_columns({
belle_targets: (row) => row.Veterans * 0.5,
remaining_vets: (row) => row.Veterans * 0.5,
});
// Aggregate by state
const state_summary = va_df.groupby(["Abbreviation"]).agg({
Veterans: "sum",
belle_targets: "sum",
remaining_vets: "sum",
});
// Get total for further calculations
const veterans_sum = state_summary.sum(["Veterans_sum"]) / 2;
```

### Example: filtering significant study results

```javascript
// Load and prepare study data
const study_df = new DataFrame(clinical_data);
// Calculate derived metric (population, at_risk_fraction, and
// engagement_rate are assumed to be defined elsewhere in the notebook)
const engaged_population =
  (((population * at_risk_fraction) / 100) * engagement_rate) / 100;
// Filter for significant results and add impact metrics
const significant_results = study_df
.filter()
.lte("P Value", 0.05) // Statistically significant
.lte("Absolute Prevalence DiD", 0) // Correctly signed effect
.notNull("Prevalence Measurement Period (Comparator)") // Valid data
.execute()
.with_columns({
// Calculate impact metrics
absolute_events: (row) =>
row["Prevalence Measurement Period (Comparator)"] * engaged_population,
absolute_events_with_intervention: (row) =>
row["Prevalence Measurement Period (Comparator)"] *
engaged_population *
(1 + row["Relative DiD"]),
events_prevented: (row) =>
row["Prevalence Measurement Period (Comparator)"] * engaged_population -
row["Prevalence Measurement Period (Comparator)"] *
engaged_population *
(1 + row["Relative DiD"]),
});
```