CUDA Stream Compaction & Radix Sort

I implemented the stream compaction and radix sort algorithms in CUDA, following GPU Gems 3, Chapter 39.

Results Showcase

Comparing Different Versions of Stream Compaction

The CPU implementation, for both power-of-two and non-power-of-two sizes, shows a predictable linear increase in runtime as the input size grows, which suggests it is computation-bound and scales directly with the size of the input array. For very large inputs, beyond 2^24 elements, the non-power-of-two runtimes are slightly slower than the power-of-two runtimes, which could indicate inefficiencies in memory handling or alignment for irregular sizes. Still, the primary bottleneck here is the algorithm's inherent O(n) work, which grows naturally with input size.
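
For context, a minimal sketch of a scan-free CPU compaction path (illustrative, not necessarily the exact code in the repo) shows why the runtime is linear: a single O(n) pass that copies the non-zero elements forward.

// One O(n) pass over the input: keep non-zero elements, in order.
int compactCPU(int n, int* odata, const int* idata) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (idata[i] != 0) {
            odata[count++] = idata[i];
        }
    }
    return count;  // number of elements kept
}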

For the Naive implementation, the performance bottlenecks are quite clear. The power-of-two version exhibits a steep increase in runtime as the input size grows, becoming particularly inefficient at large sizes (~919 ms at 2^30). This steep growth suggests the Naive implementation is computation-bound and inefficiently parallelized, with large per-element overhead. It likely also suffers from poor memory access patterns, since its performance worsens sharply with input size, indicating that both computation and memory I/O are bottlenecks. The non-power-of-two Naive runs perform better, which could be attributed to better handling of irregular data sizes, possibly through more efficient memory management. As input sizes increase, however, this advantage diminishes, and the overall computational inefficiency of the Naive approach remains evident.
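
For reference, one depth level of the naive (Hillis-Steele) scan from the chapter looks roughly like the kernel below (a sketch; the kernel name is illustrative). It is launched log2(n) times with ping-ponged global buffers, so every element is read and written at every level, giving the O(n log n) total work that matches the steep scaling described above.

// One level of a naive (Hillis-Steele) scan in global memory.
// Launched once per offset = 1, 2, 4, ..., swapping idata/odata
// (ping-pong) between launches.
__global__ void kernNaiveScanStep(int n, int offset,
                                  int* odata, const int* idata) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    odata[i] = (i >= offset) ? idata[i - offset] + idata[i]
                             : idata[i];
}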

The Work Efficient Slow implementation improves on the Naive one, particularly at larger input sizes. For power-of-two inputs it performs moderately well, and its runtime grows more slowly than the Naive version's, suggesting it organizes the computational workload better. However, the slowdown observed at larger sizes implies that memory I/O is still a bottleneck: efficient computation helps little if memory access patterns are not optimized. The non-power-of-two version performs slightly better, indicating better handling of irregular input sizes, but the gap narrows as inputs grow, suggesting that at larger scales both the computational and memory inefficiencies become more pronounced.
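
A plausible picture of where the Slow variant loses time, assuming the standard Blelloch up-sweep (a sketch; the kernel name is illustrative): a thread is launched for every element at every depth, but only the threads that pass the stride test do any work, so most warps retire immediately and occupancy is wasted even though total work is O(n).

// Up-sweep (reduce) step, "slow" style: n threads launched per depth,
// but only indices at the end of each 2*stride span do any work.
__global__ void kernUpSweepSlow(int n, int stride, int* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (i % (2 * stride)) == (2 * stride - 1)) {
        data[i] += data[i - stride];
    }
}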

The Work Efficient Fast implementation offers significant gains over both the Naive and Slow versions, particularly for larger inputs. For power-of-two sizes its scaling is far more controlled, and even at the largest sizes it remains competitive with the CPU, suggesting it has effectively optimized both computation and memory access: more efficient memory patterns reduce the I/O overhead, while a leaner computational workload keeps it fast as inputs grow. The non-power-of-two version performs similarly, with only marginal degradation for irregular sizes, showing that this implementation is well balanced and does not suffer the memory I/O bottlenecks apparent in the Naive and Slow approaches.
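
The standard fix, and a likely reading of what the Fast variant does (again a sketch; names are illustrative): launch only as many threads as will do work at each depth and remap the compact thread index to the element it owns, so warps stay full as the reduction tree narrows.

// "Fast" up-sweep: launch exactly numActive = n / (2 * stride)
// threads and map compact index k to the array slot it updates,
// so every launched thread does useful work.
__global__ void kernUpSweepFast(int numActive, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numActive) return;
    int i = (k + 1) * 2 * stride - 1;  // last element of the k-th span
    data[i] += data[i - stride];
}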

Finally, the Thrust implementation outperforms the other implementations for most input sizes, power-of-two and non-power-of-two alike, and particularly for small and medium arrays. It maintains a nearly constant runtime for smaller inputs, indicating a highly optimized workload in which memory I/O is not a bottleneck at those scales. As input sizes grow the runtime increases, but it remains the fastest or near-fastest implementation for most sizes, suggesting Thrust handles both computation and memory I/O well for either kind of input. At the largest sizes, however, the Work Efficient Fast implementation surpasses Thrust, indicating that while Thrust is optimized for general efficiency, very large data sets may still expose some memory bottlenecks.
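
For comparison, the Thrust baseline takes very little code; a minimal compaction sketch using thrust::copy_if (the function and functor names here are illustrative):

#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Keep the non-zero elements of `in`, writing them into `out`
// (which must be pre-sized to at least in.size()); returns the
// compacted count. Thrust performs the scan + scatter internally.
struct IsNonZero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

int compactThrust(const thrust::device_vector<int>& in,
                  thrust::device_vector<int>& out) {
    auto end = thrust::copy_if(in.begin(), in.end(),
                               out.begin(), IsNonZero());
    return static_cast<int>(end - out.begin());
}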

Radix Sort

For the easy case, the input array size is 8; for the other two tests, it is 2^20. The range of values is [0, 50) for the normal range test and [0, 2147483648) for the large range test. The Thrust implementation serves as the correctness reference. Below are test results for the implemented radix sort algorithm.

*****************************
** RADIX SORT TESTS **
*****************************
==== Input ====
    [   4   7   2   6   3   5   1   0 ]
==== radix sort thrust, easy case ====
    [   0   1   2   3   4   5   6   7 ]
==== radix sort with work-efficient scan, easy case ====
    [   0   1   2   3   4   5   6   7 ]
    passed 
==== Input ====
    [   0   0   1   1   0   3   0   1   0   2   1   2   0 ...   3   0 ]
==== radix sort thrust ====
    [   0   0   0   0   0   0   0   0   0   0   0   0   0 ...   3   3 ]
==== radix sort with work-efficient scan ====
    [   0   0   0   0   0   0   0   0   0   0   0   0   0 ...   3   3 ]
    passed 
==== Input ====
    [ 2030340076 367405552 870298961 958594477 1919490608 650582255 502155192 1038962173 812028104 979602990 45151013 1940063754 445481300 ... 1491375275 1238713711 ]
==== radix sort thrust, large range ====
    [ 128 9077 11470 13673 16336 20082 23386 25138 25834 27486 28982 30125 31576 ... 2147481763 2147483586 ]
==== radix sort with work-efficient scan, large range ====
    [ 128 9077 11470 13673 16336 20082 23386 25138 25834 27486 28982 30125 31576 ... 2147481763 2147483586 ]
    passed 
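
For reference, each bit-pass of the sort follows the split primitive from GPU Gems 3, Chapter 39. A sketch of the two kernels involved (the kernel names and the exclusive-scan helper that turns e into f are illustrative): flag the elements whose current bit is 0, exclusive-scan the flags to get each such element's destination f[i], and scatter the bit-1 elements to totalFalses + i - f[i].

// Flag pass: e[i] = 1 where bit `bit` of idata[i] is 0.
__global__ void kernComputeE(int n, int bit, int* e, const int* idata) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) e[i] = ((idata[i] >> bit) & 1) ? 0 : 1;
}

// Scatter pass: f is the exclusive scan of e, and
// totalFalses = e[n-1] + f[n-1]. Bit-0 elements go to f[i];
// bit-1 elements go to totalFalses + i - f[i]. Repeating this
// from the LSB to the MSB yields a stable LSD radix sort.
__global__ void kernScatter(int n, int bit, int totalFalses,
                            int* odata, const int* idata, const int* f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int d = ((idata[i] >> bit) & 1) ? (totalFalses + i - f[i]) : f[i];
    odata[d] = idata[i];
}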

View the Source Code & Full Performance Analysis

Below is a link to my GitHub repository; the full performance analysis and additional details are in its README.md.

References

  1. Harris, M., Sengupta, S., and Owens, J. D. "Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3, Chapter 39. https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
