Fastest SIMD-Accelerated Vector Similarity Functions for x86 and Arm
- Zero-dependency header-only C99 library with bindings for Python and JavaScript.
- Targets ARM NEON, SVE, x86 AVX2, AVX-512 (VNNI, FP16) hardware backends.
- Zero-copy compatible with NumPy, PyTorch, TensorFlow, and other tensors.
- Handles f64 double-, f32 single-, and f16 half-precision, i8 integral, and binary vectors.
- __Up to 200x faster__ than [scipy.spatial.distance][scipy] and [numpy.inner][numpy].
- Used in USearch and several DBMS products.
__Implemented distance functions__ include:
- Euclidean (L2), Inner Product, and Cosine (Angular) spatial distances.
- Hamming (~ Manhattan) and Jaccard (~ Tanimoto) binary distances.
- Kullback-Leibler and Jensen–Shannon divergences for probability distributions.
[scipy]: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html#module-scipy.spatial.distance
[numpy]: https://numpy.org/doc/stable/reference/generated/numpy.inner.html
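As a reference for what these kernels compute, here is a minimal pure-Python sketch of three of the metrics above. This is only the math, not the library's implementation — the SIMD kernels are far faster and may differ in numeric conventions:

```py
import math

def cosine_distance(a, b):
    # Cosine (angular) distance: 1 - (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def sqeuclidean_distance(a, b):
    # Squared Euclidean (L2^2): sum of squared coordinate differences
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kl_divergence(p, q):
    # Kullback-Leibler divergence of distribution p from q
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetrized, smoothed KL against the mixture
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```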
__Technical Insights__ and related articles:
- Uses Horner's method for polynomial approximations, beating GCC 12 by 119x.
- Uses Arm SVE and x86 AVX-512's masked loads to eliminate tail for-loops.
- Uses AVX-512 FP16 for half-precision operations, which few compilers vectorize.
- Substitutes LibC's sqrt calls with bithacks using Jan Kadlec's constant.
- For Python, avoids slow PyBind11, SWIG, and even PyArg_ParseTuple for speed.
- For JavaScript, uses typed arrays and N-API for zero-copy calls.
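Horner's method, mentioned above, evaluates a degree-n polynomial with only n multiplies and n adds by nesting the terms. A minimal illustration of the technique (not the library's actual approximation code):

```py
def horner(coeffs, x):
    # Evaluate c0 + c1*x + c2*x^2 + ... as (...(cn*x + c(n-1))*x + ...)*x + c0.
    # coeffs are given in ascending order of degree.
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
```

The nested form keeps a single dependency chain of fused multiply-adds, which maps well onto SIMD FMA instructions.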
Given 1000 embeddings from OpenAI Ada API with 1536 dimensions, running on the Apple M2 Pro Arm CPU with NEON support, here's how SimSIMD performs against conventional methods:
| Kind | f32 improvement | f16 improvement | i8 improvement | Conventional method | SimSIMD |
| :------------- | ----------------: | ----------------: | ---------------: | :------------------------------------- | :-------------- |
| Cosine | __32 x__ | __79 x__ | __133 x__ | scipy.spatial.distance.cosine | cosine |
| Euclidean ² | __5 x__ | __26 x__ | __17 x__ | scipy.spatial.distance.sqeuclidean | sqeuclidean |
| Inner Product | __2 x__ | __9 x__ | __18 x__ | numpy.inner | inner |
| Jensen Shannon | __31 x__ | __53 x__ | | scipy.spatial.distance.jensenshannon | jensenshannon |
On the Intel Sapphire Rapids platform, SimSIMD was benchmarked against auto-vectorized code using GCC 12. GCC handles single-precision float well, but might not be the best choice for int8 and _Float16 arrays, which have been part of the C language since 2011.
| Kind | GCC 12 f32 | GCC 12 f16 | SimSIMD f16 | f16 improvement |
| :------------- | -----------: | -----------: | ------------: | ----------------: |
| Cosine | 3.28 M/s | _336.29 k/s_ | _6.88 M/s_ | __20 x__ |
| Euclidean ² | 4.62 M/s | _147.25 k/s_ | _5.32 M/s_ | __36 x__ |
| Inner Product | 3.81 M/s | _192.02 k/s_ | _5.99 M/s_ | __31 x__ |
| Jensen Shannon | 1.18 M/s | _18.13 k/s_ | _2.14 M/s_ | __118 x__ |
__Broader Benchmarking Results__:
- Apple M2 Pro.
- 4th Gen Intel Xeon Platinum.
- AWS Graviton 3.
```sh
pip install simsimd
```
```py
import simsimd
import numpy as np

vec1 = np.random.randn(1536).astype(np.float32)
vec2 = np.random.randn(1536).astype(np.float32)
dist = simsimd.cosine(vec1, vec2)
```
Supported functions include cosine, inner, sqeuclidean, hamming, and jaccard.
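Of those, hamming and jaccard operate on bit vectors (the b8 type, packing eight dimensions per byte). A pure-Python sketch of the underlying math — illustrative only, not the library's kernels:

```py
def hamming_distance(a, b):
    # Number of differing bits across two equal-length byte strings
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| of the set bits (a.k.a. Tanimoto distance)
    inter = sum(bin(x & y).count("1") for x, y in zip(a, b))
    union = sum(bin(x | y).count("1") for x, y in zip(a, b))
    return 1.0 - inter / union if union else 0.0
```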
```py
batch1 = np.random.randn(100, 1536).astype(np.float32)
batch2 = np.random.randn(100, 1536).astype(np.float32)
dist = simsimd.cosine(batch1, batch2)
```
If either batch has more than one vector, the other batch must have one or the same number of vectors.
If it contains just one, the value is broadcasted.
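The batching rule above can be summarized with a small helper (a hypothetical illustration, not part of the simsimd API):

```py
def broadcast_rows(rows1, rows2):
    # Batches must match in length, or one of them must contain a
    # single vector, which is then broadcast against the other batch.
    if rows1 == rows2 or rows1 == 1 or rows2 == 1:
        return max(rows1, rows2)
    raise ValueError("batch sizes must match, or one of them must be 1")
```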
For calculating distances between all possible pairs of rows across two matrices (akin to scipy.spatial.distance.cdist):
```py
matrix1 = np.random.randn(1000, 1536).astype(np.float32)
matrix2 = np.random.randn(10, 1536).astype(np.float32)
distances = simsimd.cdist(matrix1, matrix2, metric="cosine")
```
By default, computations use a single CPU core. To optimize and utilize all CPU cores on Linux systems, add the threads=0 argument. Alternatively, specify a custom number of threads:
```py
distances = simsimd.cdist(matrix1, matrix2, metric="cosine", threads=0)
```
To view a list of hardware backends that SimSIMD supports:
```py
print(simsimd.get_capabilities())
```
Want to use it in Python with USearch?
You can wrap the raw C function pointers of the SimSIMD backends into a CompiledMetric and pass it to USearch, similar to how it handles Numba's JIT-compiled code.
```py
from usearch.index import Index, CompiledMetric, MetricKind, MetricSignature
from simsimd import pointer_to_sqeuclidean, pointer_to_cosine, pointer_to_inner

metric = CompiledMetric(
    pointer=pointer_to_cosine("f16"),
    kind=MetricKind.Cos,
    signature=MetricSignature.ArrayArraySize,
)

index = Index(256, metric=metric)
```
To install, choose one of the following options depending on your environment:
- npm install --save simsimd
- yarn add simsimd
- pnpm add simsimd
- bun install simsimd
The package is distributed with prebuilt binaries for Node.js v10 and above for Linux (x86_64, arm64), macOS (x86_64, arm64), and Windows (i386, x86_64).
If your platform is not supported, you can build the package from source via npm run build. This will happen automatically unless you install the package with the --ignore-scripts flag or use Bun.
After you install it, you will be able to call the SimSIMD functions on various TypedArray variants:
```js
const { sqeuclidean, cosine, inner, hamming, jaccard } = require('simsimd');

const vectorA = new Float32Array([1.0, 2.0, 3.0]);
const vectorB = new Float32Array([4.0, 5.0, 6.0]);

const distance = sqeuclidean(vectorA, vectorB);
console.log('Squared Euclidean Distance:', distance);
```
For integration within a CMake-based project, add the following segment to your CMakeLists.txt:
```cmake
FetchContent_Declare(
    simsimd
    GIT_REPOSITORY https://github.com/ashvardanian/simsimd.git
    GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(simsimd)
```
If you're aiming to utilize the _Float16 functionality with SimSIMD, ensure your development environment is compatible with C11.
For other functionalities of SimSIMD, C99 compatibility will suffice.
A minimal usage example would be:
```c
#include <simsimd/simsimd.h>

int main() {
    simsimd_f32_t vector_a[1536];
    simsimd_f32_t vector_b[1536];
    simsimd_f32_t distance = simsimd_avx512_f32_cos(vector_a, vector_b, 1536);
    return 0;
}
```
All of the function names follow the same pattern: simsimd_{backend}_{type}_{metric}.
- The backend can be avx512, avx2, neon, or sve.
- The type can be f64, f32, f16, i8, or b8.
- The metric can be cos, ip, l2sq, hamming, jaccard, kl, or js.
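A tiny helper illustrating how such a symbol is composed from the pattern (a hypothetical convenience, not part of any SimSIMD binding):

```py
def kernel_symbol(backend, dtype, metric):
    # Compose a C symbol following the simsimd_{backend}_{type}_{metric} pattern
    backends = {"avx512", "avx2", "neon", "sve"}
    dtypes = {"f64", "f32", "f16", "i8", "b8"}
    metrics = {"cos", "ip", "l2sq", "hamming", "jaccard", "kl", "js"}
    if backend not in backends or dtype not in dtypes or metric not in metrics:
        raise ValueError("unknown backend, type, or metric")
    return f"simsimd_{backend}_{dtype}_{metric}"
```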
In case you want to avoid hard-coding the backend, you can use the simsimd_metric_punned_t to pun the function pointer, and the simsimd_capabilities function to get the available backends at runtime.
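The runtime-dispatch idea can be mimicked in a few lines of Python (illustrative only — the C library does this with simsimd_capabilities and punned function pointers, and the "serial" fallback name here is an assumption):

```py
def pick_backend(available, preference=("avx512", "avx2", "sve", "neon")):
    # Pick the most capable SIMD backend reported at runtime,
    # falling back to a plain scalar ("serial") implementation.
    for backend in preference:
        if backend in available:
            return backend
    return "serial"
```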
__To rerun experiments__ utilize the following command:
```sh
cmake -DCMAKE_BUILD_TYPE=Release -DSIMSIMD_BUILD_BENCHMARKS=1 -B ./build_release
cmake --build build_release --config Release
./build_release/simsimd_bench
./build_release/simsimd_bench --benchmark_filter=js
```
__To test and benchmark with Python bindings__:
```sh
pip install -e .
pytest python/test.py -s -x

pip install numpy scipy scikit-learn # for comparison baselines
python python/bench.py # to run default benchmarks
python python/bench.py --n 1000 --ndim 1000000 # batch size and dimensions
```
__To test and benchmark JavaScript bindings__:
```sh
npm install --dev
npm test
npm run bench
```
__To test and benchmark Go bindings__:
```sh
cd golang
go test # To test
go test -run=^$ -bench=. -benchmem # To benchmark
```