# Principal Components Analysis in JavaScript
`npm install pca-js`

**Bugfixes:** Closed the 4 open issues with chonky fixes, and closed all PRs since this was a full rewrite. Matmul ops are now optimized to prevent OOMs, and the build process was optimized for the TypeScript port. Feel free to open new issues/PRs as required.
How to use now: same as below, except for the changed CDN links. I've edited the older README inline so the links and examples accurately reflect modern ESM/UMD/CJS usage.
- 🛠️ Node.js CommonJS: `const PCA = require('pca-js')`
- 🛠️ Node.js ESM: `import PCA from 'pca-js'`
- 🛠️ Browser UMD: script tag → `window.PCA`
- 🛠️ Browser ESM: `import PCA from 'https://cdn.jsdelivr.net/npm/pca-js@2.0.6/+esm'`

You can use unpkg etc., but only jsDelivr is actively supported and tested by me. Open an issue for better CDN recommendations.
All methods are exposed through the `PCA` global variable.
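For example, a minimal Node.js ESM quick start (a sketch; the data and calls match the walkthrough below):

```js
import PCA from 'pca-js';

// Rows are observations (students), columns are variables (exams)
const data = [[40, 50, 60], [50, 70, 60], [80, 70, 90], [50, 60, 80]];

// Principal components with eigenvalues, largest (most variance) first
const vectors = PCA.getEigenVectors(data);
console.log(vectors[0].vector);
```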
Say you have marks for a class of 4 students across 3 examinations on the same subject:
```
Student 1: 40,50,60
Student 2: 50,70,60
Student 3: 80,70,90
Student 4: 50,60,80
```
You want to examine whether it is possible to come up with a single descriptive set of scores that explains performance across the class, or alternatively, whether it would make sense to replace the 3 exams with just one (and reduce stress on the students).
First, get the set of eigenvectors and eigenvalues (principal components and adjusted loadings):
```js
var data = [[40,50,60],[50,70,60],[80,70,90],[50,60,80]];
var vectors = PCA.getEigenVectors(data);
//Outputs
// [{
// "eigenvalue": 520.0992658908312,
// "vector": [0.744899700771276, 0.2849796479974595, 0.6032503924724023]
// }, {
// "eigenvalue": 78.10455398035167,
// "vector": [0.2313199078283626, 0.7377809866160473, -0.6341689964277106]
// }, {
// "eigenvalue": 18.462846795484058,
// "vector": [0.6257919271076777, -0.6119361208615616, -0.4836513702572988]
// }]
```
Now you need to find a set of eigenvectors that explains a decent amount of the variance across your exams (thus telling you whether 1 or 2 tests would suffice instead of three):
```js
var first = PCA.computePercentageExplained(vectors,vectors[0])
// 0.8434042149581044
var topTwo = PCA.computePercentageExplained(vectors,vectors[0],vectors[1])
// 0.9700602484397556
```
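If you'd rather automate this choice, a small helper along these lines (hypothetical, not part of pca-js) keeps adding components until a target threshold is reached:

```js
// Hypothetical helper: pick the fewest components whose cumulative
// explained variance reaches `threshold` (e.g. 0.95).
function selectComponents(vectors, threshold) {
  var selected = [];
  for (var i = 0; i < vectors.length; i++) {
    selected.push(vectors[i]);
    if (PCA.computePercentageExplained(vectors, ...selected) >= threshold) {
      break;
    }
  }
  return selected;
}

var chosen = selectComponents(vectors, 0.95); // here: the first two components
```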
So if you wanted 97% certainty that someone wouldn't just flunk out accidentally, you'd keep 2 exams. But let's say you wanted to keep just 1; explaining 84% of the variance is good enough. And instead of making everyone take the examination again, you just want a normalized score:
```js
var adData = PCA.computeAdjustedData(data,vectors[0])
// {
// "adjustedData": [
// [-22.27637101744241, -9.127781049780463, 31.316721747529886, 0.08743031969298887]
// ],
// "formattedAdjustedData": [
// [-22.28, -9.13, 31.32, 0.09]
// ],
// "avgData": [
// [-55, -62.5, -72.5],
// [-55, -62.5, -72.5],
// [-55, -62.5, -72.5],
// [-55, -62.5, -72.5]
// ],
// "selectedVectors": [
// [0.744899700771276, 0.2849796479974595, 0.6032503924724023]
// ]
// }
```
The `adjustedData` is centered (mean = 0), but you could always shift the mean to something like 50 to put the scores on a familiar scale; that's how well each of your students would have done, in student order.
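Using the `adjustedData` from above:

```js
// Shift the centered scores so the class average sits at 50
var scores = adData.adjustedData[0].map(score => Math.round(score + 50));
// [28, 41, 81, 50]
```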
#### Compression (lossy):
```js
var compressed = adData.formattedAdjustedData;
//[
// [-22.28, -9.13, 31.32, 0.09]
// ]
var uncompressed = PCA.computeOriginalData(compressed,adData.selectedVectors,adData.avgData);
//uncompressed.formattedOriginalData (lossy since 2 eigenvectors are removed)
// [
// [38.4, 56.15, 59.06],
// [48.2, 59.9, 66.99],
// [78.33, 71.43, 91.39],
// [55.07, 62.53, 72.55]
// ]
```
Compare this to the original data to understand just how lossy the compression was:
```js
//Original Data
[
[40, 50, 60],
[50, 70, 60],
[80, 70, 90],
[50, 60, 80]
]
//Uncompressed Data
[
[38.4, 56.15, 59.06],
[48.2, 59.9, 66.99],
[78.33, 71.43, 91.39],
[55.07, 62.53, 72.55]
]
```
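To put a number on the loss, you could compute the mean absolute reconstruction error per cell (a quick sketch using the variables from the snippets above):

```js
// Mean absolute error between the original and reconstructed marks
var reconstructed = uncompressed.formattedOriginalData;
var totalError = 0;
data.forEach(function (row, i) {
  row.forEach(function (value, j) {
    totalError += Math.abs(value - reconstructed[i][j]);
  });
});
var meanError = totalError / (data.length * data[0].length);
// ≈ 3.93 marks off per cell on average, for 3x compression
```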
### List of Methods

#### computeDeviationMatrix(data)
Find the centered matrix from the original data.
#### computeDeviationScores(centeredMatrix)
Find the deviation from the mean for the values in the matrix.
#### computeSVD(deviationScores)
Compute the Singular Value Decomposition of the matrix.
#### computePercentageExplained(allvectors, ...selected)
Find the cumulative percentage of variance explained by the selected vectors (select vectors accordingly to view a specific explained variance).
#### computeOriginalData(compressedData,selectedVectors,avgData)
Get the original data back from the adjusted data after selecting a few eigenvectors.
#### computeVarianceCovariance(devSumOfSquares,isSample)
Get the variance-covariance matrix from the data; set `isSample` to adjust n by one if the data is from a sample.
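The parameter names suggest how these lower-level methods chain together; here's a hedged sketch (the exact argument flow is my inference from the signatures, not documented behaviour):

```js
var centered = PCA.computeDeviationMatrix(data);      // subtract the column means
var devScores = PCA.computeDeviationScores(centered); // deviation sums of squares
var cov = PCA.computeVarianceCovariance(devScores, false); // true if data is a sample
var svd = PCA.computeSVD(devScores);                  // decompose to get the components
```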
#### computeAdjustedData(initialData, ...selectedVectors)
Get the adjusted data using the selected principal components.
#### getEigenVectors(initialData)
Get the principal components of the data using the steps outlined above.
#### analyseTopResult(initialData)
Same as `computeAdjustedData(initialData, vectors[0])`: selects only the top eigenvector, which explains the most variance.
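Per that description, the shortcut below should be equivalent to the full walkthrough above:

```js
// Adjusted data projected onto just the top principal component
var top = PCA.analyseTopResult(data);
// Same result as:
var same = PCA.computeAdjustedData(data, PCA.getEigenVectors(data)[0]);
```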