I am talking about NVIDIA GPUs with compute capability 3.5 (GK110).
Each GPU core can hold at most 64 active warps. How does the hardware handle warp retirement? Can warps from different blocks of a grid, or even from different streams, run concurrently on a single core of the GPU? (By "core" I mean a multiprocessor, since multiprocessors are essentially the cores of a multi-core GPU.)
Can a core quickly retire finished warps from one stream and load new warps, possibly from another stream, in their place?
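To make the 64-warp limit concrete, here is a minimal host-side C++ sketch of how many blocks and warps can be resident on one multiprocessor for a given block size. It uses the documented compute capability 3.5 limits (32 threads per warp, 64 resident warps and 16 resident blocks per multiprocessor, at most 1024 threads per block) and deliberately ignores register and shared-memory limits, which can lower the result. The function and constant names are made up for illustration, and the sketch says nothing about how quickly the hardware recycles a slot once a warp finishes.

```cpp
#include <algorithm>
#include <cassert>

// Documented per-multiprocessor limits for compute capability 3.5.
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 64;
constexpr int kMaxBlocksPerSM = 16;

// Warps needed by one block; a partially filled warp still takes a full slot.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= 1024);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Upper bound on resident blocks, ignoring registers and shared memory.
int resident_blocks_per_sm(int threads_per_block) {
    return std::min(kMaxBlocksPerSM,
                    kMaxWarpsPerSM / warps_per_block(threads_per_block));
}

// Upper bound on resident warps under the same assumptions.
int resident_warps_per_sm(int threads_per_block) {
    return resident_blocks_per_sm(threads_per_block) *
           warps_per_block(threads_per_block);
}
```

Under these assumptions, blocks of 256 threads can fill all 64 warp slots, while blocks of 64 threads hit the 16-block limit first and leave half the warp slots unused.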
The reason I ask is that I have a hard decision to make:
(1) I can write code that launches many threads from different streams, in which case about 2/3 of the launched warps will do essentially nothing and retire quickly.
(2) I can write code that launches exactly the number of threads needed, but each thread then has to do very heavy index computation: it must solve several indexing equations, and the work needed just to compute the correct index is as much as, if not more than, the real computation in (1).
So if GK110 can retire empty warps quickly and replace them with new warps, (1) will be better than (2), because it avoids the unnecessary index computation completely.
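As a rough illustration of the trade-off between (1) and (2), here is a host-side C++ sketch on a stand-in domain: the upper triangle (j >= i) of an n x n grid. The domain and the names `is_active`, `useful_count`, `row_offset` and `unrank` are assumptions chosen for illustration, not the actual indexing problem; this domain leaves roughly half of the threads idle rather than 2/3. On the GPU, `is_active` would play the role of the early-return guard at the top of the kernel in (1), and `unrank` would be the per-thread index computation in (2).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Option (1): one thread per cell of the full n x n box.
// A thread whose cell is outside the domain would return immediately.
bool is_active(std::uint32_t i, std::uint32_t j) { return j >= i; }

// Number of cells in the domain, i.e. the thread count for option (2).
std::uint64_t useful_count(std::uint32_t n) {
    return static_cast<std::uint64_t>(n) * (n + 1) / 2;
}

// Linear index of the first cell of row i in row-major triangle order.
std::uint64_t row_offset(std::uint32_t i, std::uint32_t n) {
    const std::uint64_t r = i;
    return r * n - (r * r - r) / 2;
}

// Option (2): thread k recovers its (i, j) by inverting row_offset.
std::pair<std::uint32_t, std::uint32_t> unrank(std::uint64_t k,
                                               std::uint32_t n) {
    assert(k < useful_count(n));
    // Smaller root of i^2 - (2n + 1) i + 2k = 0, rounded down.
    const double b = 2.0 * n + 1.0;
    const double root =
        (b - std::sqrt(b * b - 8.0 * static_cast<double>(k))) / 2.0;
    std::uint32_t i = static_cast<std::uint32_t>(root);
    // Correct for floating-point rounding near row boundaries.
    while (i > 0 && row_offset(i, n) > k) --i;
    while (i + 1 < n && row_offset(i + 1, n) <= k) ++i;
    const std::uint32_t j =
        i + static_cast<std::uint32_t>(k - row_offset(i, n));
    return {i, j};
}
```

Even in this simple case, (2) pays a square root, a division and a fix-up loop per thread, whereas (1) pays one comparison per thread plus the cost of scheduling the idle warps, which is the part that depends on how the hardware retires them.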
So far, tests on trivial cases show that the two approaches perform about the same, but I am not sure whether (1) could be better in non-trivial cases.