Tuesday 12 June 2012

Tweaking OpenCL PTX to Match CUDA

As I've mentioned previously, I've been comparing OpenCL and CUDA in an attempt at a fair test. Here, I provide a broad overview of the differences in the emitted PTX of the two.
(I know it would be useful to paste the source for these PTX however I can't as it would make some coursework very easy for people in the future!)

Firstly, the chart above shows how the two frameworks generate differing counts of instructions. OpenCL yielding a couple more adds while CUDA giving us a few more movs.
So where do these differing instructions creep in? The source code is identical. (I'll mention at this point I am using NO compiler flags with either).  When looking at the PTX code generated by the NVIDIA OpenCL compiler, one notices the odd slightly unexpected instruction cropping up here there: the two codes are almost identical for the most part except, in a couple of places, for things like:
add.f32         %f201, %f164, 0f00000000;
Which seems a little odd, why not use a mov? I don't know enough about the low level workings of GPUs but this seems bizarre. Surely it is faster and simpler to move one value to another register using the following...
mov.f32 %f201, %f164;
This has a simple read and write rather than two reads, a floating point addition followed by a write. Very bizarre. This accounts for the differences in the chart above.

I go through the OpenCL code replacing all the excessive adds with movs and find that it improves the running time! Knocking roughly .1 of a second off. Not too bad really for a little tweak. Though was it really worth the effort?

This shows (albeit not very scientifically) that if you are looking for the ultimate speedup with this combination of tools it is worth having a peek a the OpenCL PTX binary. You can modify this and load it back in. Obviously this isn't ideal if you kernel changes a lot but worth doing if your kernel is a write once type affair.

What I later tried to do was copy pretty much the entirety of the CUDA PTX into OpenCL. This, however, did not work for a variety of reasons so I quickly left it: something to come back to in the future.

The performance of OpenCL can be matched with the CUDA performance fairly easily and, furthermore, OpenCL destroys the LLVM compiler implemented in CUDA when using sm_20 flag with the same particular environment I used in my previous post. Though, in the future, this will most likely be corrected. Still, as ever, be careful when you use the flags...always do a couple of sanity checks!

2 comments:

  1. Can you recommend some more reading material regarding PTX code other than the one provided by PTX specs?

    I am aware of the GPUOcelot which can perform PTX optimization, but I am more interested in the "how" it is done. (For instance, your example of mov better than add)

    ReplyDelete
    Replies
    1. Hi Omar,

      I might post about this in more detail later but you can emit and load 'binary' using OpenCL during the compilation of the kernel. In practice, using the NVIDIA SDK at least, this involves the emission of PTX code.
      Thus, once you have emitted your PTX code from a compilation and had a look at it (and edited it), you can then load it back in at your next run using clCreateProgramWithBinary (in place of clCreateProgramWithSource) which will read your new PTX and allow you to build a kernel object as per usual. ( http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clCreateProgramWithBinary.html )

      Note, having just reread my example scenario with the mov and add, I have updated the registers so that they actually match up correctly!

      Delete

Leave a comment!

Can we just autofill city and state? Please!

Coming from a country that is not the US where zip/postal codes are hyper specific, it always drives me nuts when you are filling in a form ...