Recently I was working on a fair comparison between CUDA and OpenCL on a Tesla M2050 GPU. For this, I was looking at the intermediate PTX generated in the two environments.
Having become pretty familiar with OpenCL, I decided to port some code to CUDA to see how it compares performance-wise.
Environment Spec:
Processor: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Memory: 23.5GB
Card: Tesla M2050
NVIDIA Driver Version: 285.05.33
Background
A Closer Look at the PTX
PTX is the intermediate code you can have OpenCL or CUDA (nvcc) spit out during compilation of raw source. It lets you have a look at what the compiler is doing. You can also modify the PTX by hand and load it back in. I might write about this later.
Here I'm looking at the PTX output from nvcc, and only at the header for now.
Target Architecture
First, let's see what the compiler does automatically with no flags...
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with /usr/local/gpu/cuda-toolkit-4.1.28/cuda/open64/lib//be
// nvopencc 4.1 built on 2012-01-12
//-----------------------------------------------------------
// Compiling /tmp/tmpxft_00003e20_00000000-9_d2q9.cpp3.i (/tmp/ccBI#.FdyONv)
//-----------------------------------------------------------
//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
// Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
// -O3 (Optimization level)
// -g0 (Debug level)
// -m2 (Report advisories)
//-----------------------------------------------------------
The second line tells us something interesting: the compiler appears to have chosen the sm_10 architecture, which is not correct for this card! We're using an M2050 GPU which, unless I'm mistaken, is of compute capability 2.0. (Note also the map_f64_to_f32 option on the same line: it maps double-precision instructions to single precision, which matters if the code uses doubles.) So we need to be more careful about the compiler flags! Let's remedy that.
Initially, with the basic compilation, my simulation was running in 2.57s. Now, when I switch to sm_20 (-arch=sm_20), it slows down to 3.03s! That is about 18% more runtime, or equivalently about 15% lower throughput.
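For anyone wanting to reproduce the comparison, here is a minimal sketch of how the two PTX files could be generated. The -ptx and -o options are standard nvcc flags and -arch=sm_20 is the one used above; the helper name and the output file names are made up for illustration, and the source name d2q9.cu is taken from the PTX listing.

```python
import shlex

def nvcc_ptx_command(source, output, arch=None):
    """Build an nvcc command line that emits PTX instead of a binary."""
    cmd = ["nvcc", "-ptx", source, "-o", output]
    if arch is not None:
        # e.g. arch="sm_20" adds -arch=sm_20; leave it out to get nvcc's default target
        cmd.insert(1, "-arch=" + arch)
    return cmd

# Hypothetical output file names, one per build being compared.
default_cmd = nvcc_ptx_command("d2q9.cu", "d2q9_default.ptx")
fermi_cmd = nvcc_ptx_command("d2q9.cu", "d2q9_sm20.ptx", arch="sm_20")

print(" ".join(shlex.quote(a) for a in default_cmd))
print(" ".join(shlex.quote(a) for a in fermi_cmd))
```

The sketch only builds and prints the command lines, so it runs without a CUDA toolkit installed. To actually compile, each list could be passed to subprocess.run(cmd, check=True) on a machine with nvcc on the PATH.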
Let's have a closer look at the code to see what's changed between the default sm_10 build and the sm_20 override.
The two codes look completely different. The original code comes with the header above whilst the sm_20 code looks like this...
//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Thu Jan 12 22:46:01 2012 (1326408361)
// Cuda compilation tools, release 4.1, V0.2.1221
//
.version 3.0
.target sm_20
.address_size 64
.file 1 "/tmp/tmpxft_000021b9_00000000-9_d2q9.cpp3.i"
.file 2 "d2q9.cu"
.file 3 "/usr/local/gpu/cuda-toolkit-4.1.28/cuda/bin/../include/device_functions.h"
.file 4 "/usr/local/gpu/cuda-toolkit-4.1.28/cuda/nvvm/ci_include.h"
// __cuda_local_var_17168_35_non_const_sdata has been demoted
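Despite using the same nvcc, a different compiler is at work: the first header names nvopencc (the Open64-based compiler) while this one says it was generated by the NVVM compiler. The .version line has also gone from 1.4 to 3.0; note that this is the PTX ISA version, not the compiler version. It looks like if you target sm_20, you are pointed to CUDA's new LLVM-based compiler, where the performance (for my code, at least) is not as good.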
See below for the pastebins of the two PTXs.
The last post in this thread sheds a bit more light on the issue. However, from what I've found above, I would caution you to check with and without the compiler flags! Either test the performance or at least have a look at what the compiler is spitting out.
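As a quick way of "having a look", here is a rough sketch that pulls the .version and .target directives and the generator comment out of a PTX header, so the two builds can be compared at a glance. The function name is hypothetical, and the generator check simply looks for the two strings seen in the headers above (nvopencc and NVVM); other toolkit versions may word their headers differently.

```python
import re

def ptx_header_info(ptx_text):
    """Return the PTX ISA version, target options and generator comment lines."""
    info = {"version": None, "target": [], "generator": []}
    for line in ptx_text.splitlines():
        line = line.strip()
        m = re.match(r"\.version\s+(\S+)", line)
        if m:
            info["version"] = m.group(1)
            continue
        m = re.match(r"\.target\s+(.+)", line)
        if m:
            # ".target sm_10, map_f64_to_f32" -> ["sm_10", "map_f64_to_f32"]
            info["target"] = [t.strip() for t in m.group(1).split(",")]
            continue
        # Assumption: the compiler identifies itself in a header comment.
        if line.startswith("//") and re.search(r"nvopencc|NVVM", line):
            info["generator"].append(line.lstrip("/ ").strip())
    return info

# Excerpts from the two headers shown in this post.
old = ptx_header_info(""".version 1.4
.target sm_10, map_f64_to_f32
// nvopencc 4.1 built on 2012-01-12
""")
new = ptx_header_info("""// Generated by NVIDIA NVVM Compiler
.version 3.0
.target sm_20
""")

print(old)
print(new)
```

In practice you would read the whole .ptx file with open(path).read() and pass that in rather than a pasted excerpt.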
For example, even in the faster version (without compiler flags) you still see things such as the following, which seem rather interesting. Can you not load directly into an even-numbered f32 register?
ld.global.f32 %f1, [%rd4+0];
mov.f32 %f2, %f1;
ld.global.f32 %f3, [%rd4+4];
mov.f32 %f4, %f3;
ld.global.f32 %f5, [%rd4+8];
mov.f32 %f6, %f5;
ld.global.f32 %f7, [%rd4+12];
mov.f32 %f8, %f7;
ld.global.f32 %f9, [%rd4+16];
mov.f32 %f10, %f9;
ld.global.f32 %f11, [%rd4+20];
mov.f32 %f12, %f11;
ld.global.f32 %f13, [%rd4+24];
mov.f32 %f14, %f13;
ld.global.f32 %f15, [%rd4+28];
mov.f32 %f16, %f15;
ld.global.f32 %f17, [%rd4+32];
mov.f32 %f18, %f17;
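To see how widespread this pattern is in a full PTX file, something like the following sketch could be used. It flags every ld.global.f32 that is immediately followed by a mov.f32 copying the freshly loaded register. The function name and the regular expressions are my own and only cover the exact instruction forms shown above. Bear in mind that PTX registers are virtual and the final register allocation happens later in ptxas, so copies like these need not survive into the machine code.

```python
import re

LOAD = re.compile(r"ld\.global\.f32\s+(%f\d+),")
MOVE = re.compile(r"mov\.f32\s+(%f\d+),\s*(%f\d+);")

def load_then_copy_pairs(ptx_text):
    """List (loaded_reg, copy_reg) for each load immediately followed by a copy of it."""
    lines = [ln.strip() for ln in ptx_text.splitlines() if ln.strip()]
    pairs = []
    # Walk over adjacent instruction pairs.
    for first, second in zip(lines, lines[1:]):
        ld = LOAD.match(first)
        mv = MOVE.match(second)
        # The mov must read the register the load just wrote.
        if ld and mv and mv.group(2) == ld.group(1):
            pairs.append((ld.group(1), mv.group(1)))
    return pairs

# First two load/copy pairs from the listing above.
sample = """
ld.global.f32 %f1, [%rd4+0];
mov.f32 %f2, %f1;
ld.global.f32 %f3, [%rd4+4];
mov.f32 %f4, %f3;
"""
pairs = load_then_copy_pairs(sample)
print(pairs)
```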