Re: Where does my code spend time and how to improve this
Posted by Wade Walker on Jan 01, 2019; 11:35pm
URL: https://forum.jogamp.org/Where-does-my-code-spend-time-and-how-to-improve-this-tp4039358p4039362.html
There's a subtlety with the data-dependent branch. If the branch is taken, say, 10% of the time and your GPU's warp width is 32, then on average ~3 of the threads in a warp will take the branch and ~29 will not, so nearly every warp ends up with a mix of both. Since the GPU can't have two different instruction pointers for a single warp, it executes the two sides one after the other: first the threads that took the branch (with the others idle), then the ones that didn't. The warp therefore pays for both paths until they rejoin, which can come close to doubling the execution time of the divergent section if the two paths cost about the same.
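To fix this, you'd probably want to use the built-in OpenCL max()/min() functions. These operate component-wise and are supported directly by the hardware execution units, so they normally compile to plain arithmetic instructions with no branch. Every thread in the warp then runs the same instruction stream, and there's nothing to diverge on.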