Fastest way to compute cosine similarities of ndarrays

Can someone help me with the fastest way to compute cosine similarities for pairs of ndarrays? Each ndarray has length 2048.
I have 10000 such pairs.

In Python, I can get this done in 2-3 minutes.
How can we achieve this in Scala?

Do you mean numpy.ndarray or mxnet.ndarray?

If you mean mxnet.ndarray, this might be helpful:
https://gluon-nlp.mxnet.io/_modules/gluonnlp/embedding/evaluation.html#CosineSimilarity

You can also run it on a GPU.

I meant mxnet.ndarray

How can we parallelise computing cosine similarities for pairs of ndarrays in Scala? Each ndarray has length 2048.
I have 10000 such pairs.

Here is a solution in Python.

You can reproduce it in Scala using the Scala API:
Here are some useful tutorials:

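import time

import mxnet as mx

ctx = mx.gpu()  # use mx.cpu() if no GPU is available

tic = time.time()
# 10000 pairs of vectors of length 2048, one vector per row
first_term = mx.nd.random.uniform(shape=(10000, 2048), ctx=ctx)
second_term = mx.nd.random.uniform(shape=(10000, 2048), ctx=ctx)

# Scale every row to unit L2 norm
first_term_normalized = first_term / mx.nd.norm(first_term, axis=1, keepdims=True)
second_term_normalized = second_term / mx.nd.norm(second_term, axis=1, keepdims=True)

# Row-wise dot product: (10000, 1, 2048) x (10000, 2048, 1) -> (10000, 1, 1) -> (10000,)
cosine_similarity = mx.nd.batch_dot(
    first_term_normalized.expand_dims(axis=1),
    second_term_normalized.expand_dims(axis=2),
).squeeze()
mx.nd.waitall()  # MXNet runs asynchronously; wait for all work to finish before stopping the timer
print(time.time() - tic)
print(cosine_similarity)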

(It takes about 10 ms on a GPU.)
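
If you want to sanity-check the result on the CPU, here is a minimal NumPy sketch of the same row-wise computation. It is not part of the MXNet snippet above: the helper name pairwise_cosine and the eps guard against zero-length vectors are my own additions for illustration, and it uses NumPy arrays rather than mxnet.ndarray.

```python
import numpy as np


def pairwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between a[i] and b[i] for every row i.

    Hypothetical helper for illustration; eps avoids division by zero
    when a row has zero norm.
    """
    # Row-wise dot products without building the full 10000 x 10000 matrix
    dots = np.einsum("ij,ij->i", a, b)
    # Product of the L2 norms of the matching rows
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / np.maximum(norms, eps)


rng = np.random.default_rng(0)
first_term = rng.random((10000, 2048), dtype=np.float32)
second_term = rng.random((10000, 2048), dtype=np.float32)

cosine_similarity = pairwise_cosine(first_term, second_term)
print(cosine_similarity.shape)
```

Dividing the dot product by the two norms is equivalent to normalizing each row first and then taking the dot product, which is what the batch_dot version does. The same idea (elementwise multiply, then sum along axis 1) should carry over to whichever NDArray API you end up using in Scala.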