A quick snapshot of a few highlights from an article written by our company intern (Alex) and published on Towards Data Science. The article goes into depth on how we achieved a 1700x speedup on the k-means algorithm written for our proprietary AI model.

<aside> 🔗 Article: bit.ly/cuda-kmeans

</aside>

Batched K-Means with Python Numba and CUDA C

How we took our k-means runtime from 1 day to less than 1 minute

  1. Utilize the thread space efficiently: we ran 32 seeds of k-means, exactly the number of threads in a warp, so every lane of the warp does useful work (see the sketch after this list).
  2. Use shared memory on the GPU whenever possible, rather than shuttling data back and forth between host and device. Keep in mind that shared memory is limited, though (typically tens of KB per block).
  3. Use C/C++ for explicit memory management and direct memory access. Our CUDA C code was 3.4x faster than our Python Numba code.
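
To make these three tips concrete, here's a minimal CUDA C sketch of a batched assignment step. It is not the article's actual code: the kernel name (`assign_points`) and the cluster count, dimensionality, and batch size are illustrative assumptions. Each block handles one point and each of its 32 threads (one full warp) owns one seed (tip 1), all seeds' centroids are staged in shared memory (tip 2), and the host manages device memory explicitly with `cudaMalloc`/`cudaFree` (tip 3).

```c
#include <cuda_runtime.h>

#define N_SEEDS    32  // one k-means seed per warp lane (tip 1)
#define N_CLUSTERS 8   // hypothetical cluster count
#define N_DIMS     4   // hypothetical feature dimensionality

// Assignment step: block b handles point b; thread t handles seed t.
__global__ void assign_points(const float *points, const float *centroids,
                              int *labels, int n_points)
{
    // All seeds' centroids staged once in fast on-chip shared memory
    // (tip 2): 32 * 8 * 4 floats = 4 KB, well within the per-block limit.
    __shared__ float s_centroids[N_SEEDS * N_CLUSTERS * N_DIMS];

    int seed  = threadIdx.x;  // lane id doubles as seed id (tip 1)
    int point = blockIdx.x;
    if (point >= n_points) return;  // uniform per block, so safe
                                    // before the barrier below

    // Cooperative load: each of the 32 threads copies a strided slice.
    for (int i = seed; i < N_SEEDS * N_CLUSTERS * N_DIMS; i += blockDim.x)
        s_centroids[i] = centroids[i];
    __syncthreads();  // every thread in the block reaches this barrier

    // Each thread finds the nearest centroid of ITS seed for this point.
    const float *c = &s_centroids[seed * N_CLUSTERS * N_DIMS];
    const float *p = &points[point * N_DIMS];
    int   best   = 0;
    float best_d = 3.4e38f;
    for (int k = 0; k < N_CLUSTERS; ++k) {
        float d = 0.0f;
        for (int j = 0; j < N_DIMS; ++j) {
            float diff = p[j] - c[k * N_DIMS + j];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = k; }
    }
    labels[point * N_SEEDS + seed] = best;
}

int main(void)
{
    const int n_points = 1024;  // hypothetical batch size
    float *d_points, *d_centroids;
    int   *d_labels;

    // Explicit device allocation and release (tip 3): in C we decide
    // exactly what lives on the GPU and when it moves.
    cudaMalloc((void **)&d_points,    n_points * N_DIMS * sizeof(float));
    cudaMalloc((void **)&d_centroids, N_SEEDS * N_CLUSTERS * N_DIMS * sizeof(float));
    cudaMalloc((void **)&d_labels,    n_points * N_SEEDS * sizeof(int));

    // ... cudaMemcpy real points/centroids to the device here ...

    // One block per point, one warp (32 threads) per block: every lane
    // owns a seed, so no thread in the warp sits idle (tip 1).
    assign_points<<<n_points, N_SEEDS>>>(d_points, d_centroids,
                                         d_labels, n_points);
    cudaDeviceSynchronize();

    cudaFree(d_points);
    cudaFree(d_centroids);
    cudaFree(d_labels);
    return 0;
}
```

Blocks of exactly one warp keep the launch simple, and the explicit `__syncthreads()` after the cooperative load stays correct under the independent thread scheduling introduced with Volta.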

<aside> 💁‍♀️ Bonus Tip: don't overuse the __syncthreads() function, and never call it from a branch that only some threads of a block take; a barrier the whole block can't reach will deadlock the kernel (see the sketch below)

</aside>
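
A hypothetical fragment (not from the article) illustrating the deadlock: `__syncthreads()` is a barrier for every thread in the block, so placing it inside a divergent branch leaves the threads that took the branch waiting on lanes that can never arrive.

```c
#include <cuda_runtime.h>

__global__ void broken(float *data)
{
    if (threadIdx.x < 16) {
        data[threadIdx.x] *= 2.0f;
        __syncthreads();  // BUG: lanes 16-31 never reach this barrier,
                          // so the block hangs (undefined behavior)
    }
}

__global__ void fixed(float *data)
{
    if (threadIdx.x < 16)
        data[threadIdx.x] *= 2.0f;
    __syncthreads();      // OK: every thread in the block reaches it
}
```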

<aside> 🌐 Link to this page: bit.ly/fast-gpu

</aside>