Random Posts: Tweaking OpenCL PTX to Match CUDA

Tuesday, 12 June 2012

Tweaking OpenCL PTX to Match CUDA

As I've mentioned previously, I've been comparing OpenCL and CUDA in an attempt at a fair test. Here, I provide a broad overview of the differences in the PTX that the two emit.
(I know it would be useful to paste the source for this PTX; however, I can't, as it would make some coursework very easy for people in the future!)

Firstly, the chart above shows how the two frameworks generate differing counts of instructions: OpenCL yields a couple more adds, while CUDA gives us a few more movs.
So where do these differing instructions creep in? The source code is identical. (I'll mention at this point that I am using NO compiler flags with either.) When looking at the PTX generated by the NVIDIA OpenCL compiler, one notices the odd, slightly unexpected instruction cropping up here and there. The two outputs are almost identical for the most part except, in a couple of places, for things like:
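For anyone wanting to produce a similar tally from their own PTX dumps, here is a minimal sketch of one way to count instructions per opcode. It is not the script used for the chart; the function name `count_opcodes` and the `sample_ptx` text are made up for illustration, and the regular expression only aims to cover straightforward one-instruction-per-line PTX.

```python
import re
from collections import Counter

# Optional guard predicate (e.g. "@%p1 "), then the base opcode ("add" from
# "add.f32"), optional dotted suffixes, optional operands, terminating ';'.
# Directives such as ".reg" start with '.', so they are skipped.
_INSTR = re.compile(r"^\s*(?:@!?%\w+\s+)?([a-z][a-z0-9]*)(?:\.[\w.:]+)?(?:\s+[^;]*)?;")


def count_opcodes(ptx):
    """Return a Counter mapping base opcode -> number of occurrences."""
    counts = Counter()
    for line in ptx.splitlines():
        line = line.split("//", 1)[0]  # drop trailing comments
        match = _INSTR.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical PTX fragment, only here to exercise the counter.
sample_ptx = """
    .reg .f32 %f<4>;
    ld.param.u64    %rd1, [out];
    mov.f32         %f1, 0f3f800000;   // 1.0
    add.f32         %f2, %f1, 0f00000000;
    add.f32         %f3, %f2, %f1;
    st.global.f32   [%rd1+0], %f3;
    ret;
"""

counts = count_opcodes(sample_ptx)
print(counts["add"], counts["mov"])  # 2 1
```

Running the same function over the OpenCL and CUDA PTX files and comparing the two counters should show which opcodes differ.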

add.f32         %f201, %f164, 0f00000000;

Which seems a little odd: why not use a mov? I don't know enough about the low-level workings of GPUs, but this seems bizarre. Surely it is faster and simpler to move one value to another register using the following:

mov.f32 %f201, %f164;

This is a simple read and write, rather than two reads and a floating-point addition followed by a write. Very bizarre. These instructions account for the differences in the chart above.
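One small piece of arithmetic is worth checking before treating the two instructions as interchangeable. The PTX literal `0f00000000` is the bit pattern of single-precision +0.0, and under IEEE 754 round-to-nearest, adding +0.0 to -0.0 gives +0.0, whereas a mov would copy -0.0 unchanged. Whether this is why the compiler emits the add is pure speculation on my part; the sketch below only demonstrates the arithmetic, using Python floats packed down to 32 bits, and the helper names are made up for illustration.

```python
import struct


def f32_from_ptx(literal):
    """Decode a PTX single-precision hex literal such as 0f3F800000."""
    if literal[:2].lower() != "0f" or len(literal) != 10:
        raise ValueError("expected 0f followed by 8 hex digits")
    return struct.unpack(">f", bytes.fromhex(literal[2:]))[0]


def f32_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as 8 hex digits."""
    return struct.pack(">f", x).hex()


zero = f32_from_ptx("0f00000000")   # +0.0

print(f32_bits(1.5), f32_bits(1.5 + zero))    # 3fc00000 3fc00000
print(f32_bits(-0.0), f32_bits(-0.0 + zero))  # 80000000 00000000
```

For every ordinary value the add and the mov give the same bits; the sign of a negative zero is the one case shown here where they differ, which only matters if the kernel later depends on that sign.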

I went through the OpenCL PTX replacing all the excessive adds with movs and found that it improved the running time, knocking roughly 0.1 of a second off. Not too bad really for a little tweak. Though was it really worth the effort?
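The find-and-replace described above could be automated along these lines. This is a sketch under assumptions rather than the exact edit made for this post: the function name `add_zero_to_mov` is hypothetical, and it only rewrites the plain form `add.f32 dst, src, 0f00000000;`, leaving variants with extra modifiers or with the zero as the first source operand untouched.

```python
import re

# add.f32 <dst>, <src>, 0f00000000;   ->   mov.f32 <dst>, <src>;
# Group 1 keeps the original spacing after the opcode so columns stay aligned.
_ADD_ZERO = re.compile(r"\badd\.f32(\s+)(%\w+),\s*(%\w+),\s*0f00000000\s*;")


def add_zero_to_mov(ptx):
    """Return (rewritten_ptx, number_of_replacements)."""
    return _ADD_ZERO.subn(r"mov.f32\1\2, \3;", ptx)


before = "add.f32 %f201, %f164, 0f00000000;\nadd.f32 %f3, %f2, %f1;"
after, replaced = add_zero_to_mov(before)
print(replaced)  # 1
print(after)
```

Only the first line of `before` is rewritten; the second add has two register operands, so it is left alone.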

This shows (albeit not very scientifically) that if you are looking for the ultimate speedup with this combination of tools, it is worth having a peek at the OpenCL PTX binary. You can modify this and load it back in. Obviously this isn't ideal if your kernel changes a lot, but it is worth doing if your kernel is a write-once type of affair.

What I later tried to do was copy pretty much the entirety of the CUDA PTX into OpenCL. This, however, did not work for a variety of reasons, so I quickly left it: something to come back to in the future.

OpenCL's performance can be made to match CUDA's fairly easily and, furthermore, OpenCL destroys the LLVM-based compiler in CUDA when the sm_20 flag is used, in the same environment I used in my previous post. In the future, though, this will most likely be corrected. Still, as ever, be careful when you use the flags... always do a couple of sanity checks!

2 comments:

Unknown, 18 June 2012 at 00:14
Can you recommend some more reading material on PTX code, other than the PTX specs?

I am aware of GPUOcelot, which can perform PTX optimization, but I am more interested in how it is done (for instance, your example of mov being better than add).