Friday, 29 June 2012

Memory Access Ordering in OpenCL

Just a short one today. When I was modifying the NVIDIA PTX emitted by OpenCL for my code, I noticed the following behaviour.

In each work-item I read a single t_speed (see the struct below) from an array of t_speeds.

typedef struct 
{
    float speeds[9];
} t_speed;
 
As you might expect, there are 9 reads from global memory (one for each float). What is interesting, however, is the order of the addresses being read. The PTX reads from the highest address first, [%r14+32], and then works its way down to [%r14]. Very bizarre.

If anyone could explain this I would be very interested to know.

ld.global.f32   %f190, [%r14+32];
ld.global.f32   %f8, [%r14+28];
ld.global.f32   %f7, [%r14+24];
ld.global.f32   %f187, [%r14+20];
ld.global.f32   %f5, [%r14+16];
ld.global.f32   %f4, [%r14+12];
ld.global.f32   %f194, [%r14+8];
ld.global.f32   %f195, [%r14+4];
ld.global.f32   %f200, [%r14]; 
 
(I know PTX isn't actual NVIDIA assembly language but just an intermediary before SASS, so perhaps this is just a quirk of the intermediate representation?)
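
For reference, the byte offsets in the PTX above line up with the layout of the struct. The sketch below assumes a 4-byte float (as in OpenCL C); speed_offset is a hypothetical helper used only for illustration. It does not explain the descending order. Since the nine loads are independent of each other, the order is presumably just a product of how the compiler schedules them, but that is an assumption rather than something confirmed here.

```c
#include <stddef.h>  /* offsetof, size_t */

/* Same layout as the t_speed struct in the post. */
typedef struct
{
    float speeds[9];
} t_speed;

/* Byte offset of speeds[i] from the start of a t_speed.
   Hypothetical helper, for illustration only. */
static size_t speed_offset(int i)
{
    return offsetof(t_speed, speeds) + (size_t)i * sizeof(float);
}
```

With a 4-byte float, speed_offset(0) is 0 and speed_offset(8) is 32, matching [%r14] and [%r14+32] in the listing.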

Emitting and Reading OpenCL Binary

As briefly explained in a previous comment, it is possible to modify the 'binary' emitted by the OpenCL runtime for your chosen platform.

For clarity, I've omitted quite a lot of error checking here (for example after compilation) so you'll need to add that in yourself!

Write a Binary

Firstly, let's open a normal kernel source file (in the variable srcFile) and load it into memory. Then build that as per usual. We then have to query the binary size and use clGetProgramInfo to write the binary into a variable. Finally we emit that to a file.
FILE *fIn = fopen(srcFile, "rb");

// Error check the fIn here
// get the size
fseek(fIn, 0L, SEEK_END);
size_t sz = ftell(fIn);    
rewind(fIn);
char *file = (char*)malloc(sz + 1);

fread(file, sizeof(char), sz, fIn);
file[sz] = '\0';
fclose(fIn);
const char* cfile = (const char*)file;
*m_cpProgram = clCreateProgramWithSource(*m_ctx, 1, &cfile, 
                                          &sz, &ciErrNum);
ciErrNum = clBuildProgram(*m_cpProgram, 1, (const cl_device_id*)m_cldDevices, 
                          compilerFlags, NULL, NULL);  

// Calculate how big the binary is (kernel_length is a size_t).
// This assumes the program is associated with a single device;
// otherwise one size_t per device is returned.
ciErrNum = clGetProgramInfo(*m_cpProgram, CL_PROGRAM_BINARY_SIZES, sizeof(size_t),
                            &kernel_length, 
                            NULL);

// CL_PROGRAM_BINARIES fills an array of pointers (one per device),
// so we pass the address and size of our single pointer
unsigned char* bin = (unsigned char*)malloc(kernel_length);
ciErrNum = clGetProgramInfo(*m_cpProgram, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);

// Print the binary out to the output file, named <srcFile>.bin
// (built in a separate buffer so srcFile itself is not modified or overrun)
char binFile[1024];
snprintf(binFile, sizeof(binFile), "%s.bin", srcFile);
fp = fopen(binFile, "wb");    
fwrite(bin, 1, kernel_length, fp);
fclose(fp);

Read a Binary

The following extract can be used to build a program from a binary file, emitted or modified from a previous OpenCL run...

// Based on the example given in the opencl programming guide
FILE *fp = fopen("custom_file.ptx", "rb");
if (fp == NULL) return -1;    
fseek(fp, 0, SEEK_END);
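size_t kernel_length = ftell(fp);   // must be size_t: passed by pointer as the lengths array
rewind(fp);
unsigned char *binary = (unsigned char*)malloc(kernel_length);
// Read the binary into memory (check the return value in real code)
fread(binary, 1, kernel_length, fp);
fclose(fp);   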
cl_int clStat;
*m_cpProgram = clCreateProgramWithBinary(*m_ctx, 1, (const cl_device_id*)m_cldDevices,
                                         &kernel_length, 
                                         (const unsigned char**)&binary, &clStat, &ciErrNum);
// Put an error check for ciErrNum here
ciErrNum = clBuildProgram(*m_cpProgram, 1, (const cl_device_id*)m_cldDevices, 
                          NULL, NULL, NULL); 
 
If you have an OpenCL program where the performance of the total program is important (rather than the kernel itself) then the use of precompiled kernels provides a small but noticeable performance advantage since fewer stages of compilation have to be performed at runtime.
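
Both extracts above load a whole file into memory and leave the error checking out. A minimal sketch of that step with the checks put back in might look like the following. read_file is a hypothetical helper name, not part of OpenCL, and the choice to return NULL on any failure is just one reasonable convention.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a freshly malloc'd, NUL-terminated buffer.
   Returns NULL on any failure. On success *out_size holds the byte
   count (excluding the terminator) and the caller frees the buffer.
   read_file is a hypothetical helper, not an OpenCL function. */
static char *read_file(const char *path, size_t *out_size)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(f, 0L, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = (char *)malloc((size_t)len + 1);
    if (buf == NULL) { fclose(f); return NULL; }

    /* A short read is treated as a failure. */
    size_t got = fread(buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) { free(buf); return NULL; }

    buf[len] = '\0';
    *out_size = (size_t)len;
    return buf;
}
```

The returned buffer and size can then be handed to clCreateProgramWithSource (as the source string) or, cast to unsigned char*, to clCreateProgramWithBinary.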

Wednesday, 20 June 2012

NVIDIA Talking about CARMA (Tegra 3) at ISC




Yesterday at ISC I sat in on another session from NVIDIA, this time on CARMA (aka CUDA for ARM or Tegra 3) by Don Becker (of Beowulf fame). This is NVIDIA looking to target developers with a very low power SoC. The board draws a peak of 48W running LINPACK, giving a single precision performance of 200GFLOPS. We were told that under normal loads the board only requires at most 25W, which is pretty impressive. In its current incarnation it consists of an ARM A9 CPU attached via x4 PCIe (< PCIe 2) to a Quadro 1000M, a Fermi based laptop graphics card with 96 CUDA cores, coupled with 2GB of RAM. Interestingly, the GPU is attached via an MXM connection, which means that in the future you will most likely be able to swap the GPU out for another, giving much more flexibility for your usage scenario. The board has an incredible array of connectors including 2x HDMI, 2x ethernet, SATA and video in, to name just a few. That gives you a lot of flexibility, though it does make the board itself relatively large. You can of course boot it from the network or from the SD card.


Initially it will be running CUDA 4.2 (downgraded from CUDA 5 recently for some reason) on an Ubuntu 11.04 based OS using the 3.1.10 Linux kernel.

It was murmured that in the future they might look at fusing the two memory regions of the CPU and GPU (related to Project Denver), which would be awesome. Along with this, they might look at supporting 64-bit ARMv8.

Basically we're looking at a much beefier Raspberry Pi. Despite the price, I think I would prefer this! Unfortunately, OpenCL support would have to be driven by the community and will not be done by NVIDIA (apparently).


Priced at $629 from SECO. You can sign up for one now.

Knights Corner becomes Phi...

GPU Science

Tuesday, 19 June 2012

NVIDIA Talking about Kepler at ISC

I am at the International Supercomputing Conference this year and have just sat in on an NVIDIA satellite event, "Inside Kepler", presented by Thomas Bradley.

The talk introduced some of the key features of Kepler and explained some extra things that I hadn't previously picked up on in the NVIDIA publicity nor the whitepaper at first glance.

Initially outlined were the SM->SMX change down to the core level and how the effect of halving the clock has helped performance/watt. All fairly standard stuff.

Secondly, the dynamic parallelism functionality was introduced allowing for 32 levels of recursion and work enqueuing from the device. This is the coolest feature I think with Kepler though I doubt it will make its way into OpenCL any time soon (since I am clearly an OpenCL fanboy). In fact throughout the entirety of the talk there was not a single mention of OpenCL which was kind of sad. Though I guess it is good that there is some competition in that space since if AMD wants to push into HPC they will be driving the development of OpenCL (or maybe HSA?) to match that of CUDA.

The hyperqueueing (Hyper-Q), which allows for multiple streams of commands (pretty much the same as OpenCL command queues), was pretty interesting too. It allows multiple MPI processes to operate on a single GPU.

Finally, and most excitingly, there were the new instructions available for shuffling, and also the improvements to the performance of the atomic functions. The shuffling allows for intra-warp communication between threads. This is very exciting since, perhaps using the "butterfly shuffle" form, you can essentially do halo exchange between threads in a warp. Obviously, you would still have to perform global synchronisation between iterations of your lattice Boltzmann simulation (for example) if it operates across warps. However, you can achieve this pretty easily now using dynamic parallelism!
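
To make the "butterfly" pattern concrete, here is a host-side simulation in plain C. It is not CUDA and calls no NVIDIA intrinsic. The array lanes stands in for one register per thread of a 32-thread warp, and butterfly_sum is a hypothetical name. At each step every lane exchanges with the lane whose index differs in exactly one bit (i XOR offset), which is the access pattern a shuffle-XOR instruction provides in hardware.

```c
#define WARP_SIZE 32

/* Host-side simulation of a butterfly (XOR) shuffle reduction across
   one warp. lanes[i] stands in for the register held by thread i.
   After the call, every lane holds the sum of all 32 inputs. */
static void butterfly_sum(float lanes[WARP_SIZE])
{
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        float received[WARP_SIZE];

        /* Every lane reads its partner's value "simultaneously"... */
        for (int i = 0; i < WARP_SIZE; ++i)
            received[i] = lanes[i ^ offset];

        /* ...and then adds it to its own. */
        for (int i = 0; i < WARP_SIZE; ++i)
            lanes[i] += received[i];
    }
}
```

Five exchange steps (offsets 16, 8, 4, 2, 1) are enough for all 32 lanes to end up with the full sum, without touching shared or global memory. A halo exchange would use neighbouring lanes instead of XOR partners, but the idea of reading another thread's register directly is the same.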


Lastly, it will be interesting to see whether the performance enhancements to the atomic operations (global reduction instructions), up to 5x for the slowest, have been extended into OpenCL, i.e. is it a new set of instructions or is it a lower level enhancement of the same old instructions? It should be clear which is the case as soon as I get my hands on Kepler.

Thursday, 14 June 2012

AMD SDK Beating Intel at its Own Game

Interesting post from Phoronix about how the AMD SDK running on Ivy Bridge actually beats the Intel implementation, by a third in some cases (see the third page for benchmark results). I have seen similar performance gaps in the past, but they don't often last long if you keep your drivers updated.

Tuesday, 12 June 2012

Tweaking OpenCL PTX to Match CUDA

As I've mentioned previously, I've been comparing OpenCL and CUDA in an attempt at a fair test. Here, I provide a broad overview of the differences in the emitted PTX of the two.
(I know it would be useful to paste the source for these PTX however I can't as it would make some coursework very easy for people in the future!)

Firstly, the chart above shows how the two frameworks generate differing counts of instructions, with OpenCL yielding a couple more adds and CUDA giving us a few more movs.
So where do these differing instructions creep in? The source code is identical. (I'll mention at this point that I am using NO compiler flags with either.) When looking at the PTX code generated by the NVIDIA OpenCL compiler, one notices the odd, slightly unexpected instruction cropping up here and there. The two codes are almost identical for the most part except, in a couple of places, for things like:
add.f32         %f201, %f164, 0f00000000;
This seems a little odd: why not use a mov? I don't know enough about the low level workings of GPUs, but this seems bizarre. Surely it is faster and simpler to move the value into the other register using the following...
mov.f32 %f201, %f164;
This has a simple read and write rather than two reads, a floating point addition followed by a write. Very bizarre. This accounts for the differences in the chart above.
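
A possible explanation, offered as an assumption rather than anything confirmed for NVIDIA's compiler: under IEEE 754 arithmetic, adding +0.0 is not quite the same operation as copying a value. The two differ for negative zero, so a compiler following strict floating point rules cannot always substitute one for the other. The sketch below shows the difference on the host in plain C; add_zero and move are hypothetical names standing in for the two PTX instructions.

```c
#include <math.h>   /* signbit */

/* What add.f32 %d, %s, 0f00000000 computes: s + (+0.0f).
   volatile stops the host compiler from simplifying the addition away. */
static float add_zero(float s)
{
    volatile float x = s;
    volatile float zero = 0.0f;
    return x + zero;
}

/* What mov.f32 %d, %s computes: a plain copy. */
static float move(float s)
{
    return s;
}
```

For every ordinary value the two agree, but move(-0.0f) keeps the sign bit while add_zero(-0.0f) gives +0.0f under the default rounding mode. If a kernel never depends on the sign of zero, swapping the add for a mov by hand is harmless, which is consistent with the tweak described here still producing correct results.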

I went through the OpenCL code replacing all the excessive adds with movs and found that it improves the running time, knocking roughly 0.1 of a second off. Not too bad really for a little tweak. Though was it really worth the effort?

This shows (albeit not very scientifically) that if you are looking for the ultimate speedup with this combination of tools, it is worth having a peek at the OpenCL PTX binary. You can modify this and load it back in. Obviously this isn't ideal if your kernel changes a lot, but it is worth doing if your kernel is a write once type affair.

What I later tried to do was copy pretty much the entirety of the CUDA PTX into OpenCL. This, however, did not work for a variety of reasons so I quickly left it: something to come back to in the future.

The performance of OpenCL can be matched with the CUDA performance fairly easily and, furthermore, OpenCL destroys the LLVM compiler implemented in CUDA when using the sm_20 flag in the same environment I used in my previous post. In the future, though, this will most likely be corrected. Still, as ever, be careful when you use the flags...always do a couple of sanity checks!

Weird NVCC Behaviour: Performance of NVVM

Recently I was working on doing a fair comparison between CUDA and OpenCL on a Tesla M2050 GPU. For this, I was looking at the intermediate PTX generated for the two environments.

(Image: NVIDIA Tesla M2050)
Having become pretty familiar with OpenCL I decided to port some code to CUDA to see how it compares performance-wise.

Environment Spec:

Processor: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
Memory: 23.5GB
Card: Tesla M2050
NVIDIA Driver Version: 285.05.33

Background

A Closer Look at the PTX

PTX is the intermediate code you can have OpenCL or CUDA (nvcc) spit out during compilation of raw source. It lets you have a look at what the compiler is doing. You can also modify the PTX by hand and load it back in. I might write about this later.
Here I'm looking at the PTX output from NVCC, and only at the header for now.

Target Architecture

First, let's see what the compiler does automatically with no flags...
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with /usr/local/gpu/cuda-toolkit-4.1.28/cuda/open64/lib//be
// nvopencc 4.1 built on 2012-01-12

//-----------------------------------------------------------
// Compiling /tmp/tmpxft_00003e20_00000000-9_d2q9.cpp3.i (/tmp/ccBI#.FdyONv)
//-----------------------------------------------------------

//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
//  Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
//  -O3 (Optimization level)
//  -g0 (Debug level)
//  -m2 (Report advisories)
//-----------------------------------------------------------

The second line tells us something interesting: the compiler appears to have chosen the sm_10 architecture. This, however, is not correct! We're using an M2050 GPU which, unless I'm mistaken, is of compute capability 2.0. In which case, we need to be more careful about the compiler flags! Let's remedy that.
Initially, with the basic compilation, my simulation was running in 2.57s. Now, when I switch to sm_20 (-arch=sm_20), it slows to 3.03s: about 18% more runtime, or roughly 15% lower throughput!
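
The 2.57s and 3.03s timings can be compared in two ways, which give slightly different percentages. A small sketch of the arithmetic (both function names are hypothetical):

```c
/* Percentage increase in runtime going from t_old to t_new seconds. */
static double runtime_increase_pct(double t_old, double t_new)
{
    return (t_new - t_old) / t_old * 100.0;
}

/* Percentage drop in throughput (work per second) for the same change,
   assuming the amount of work is identical in both runs. */
static double throughput_drop_pct(double t_old, double t_new)
{
    return (1.0 - t_old / t_new) * 100.0;
}
```

For 2.57s and 3.03s, the runtime increase is (3.03 - 2.57) / 2.57, a little under 18%, while the throughput drop is 1 - 2.57 / 3.03, a little over 15%.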

Let's have a closer look at the code to see what's changed between using the default sm_10 and overriding it with sm_20.

The two codes look completely different. The original code comes with the header above whilst the sm_20 code looks like this...

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Thu Jan 12 22:46:01 2012 (1326408361)
// Cuda compilation tools, release 4.1, V0.2.1221
//

.version 3.0
.target sm_20
.address_size 64

    .file    1 "/tmp/tmpxft_000021b9_00000000-9_d2q9.cpp3.i"
    .file    2 "d2q9.cu"
    .file    3 "/usr/local/gpu/cuda-toolkit-4.1.28/cuda/bin/../include/device_functions.h"
    .file    4 "/usr/local/gpu/cuda-toolkit-4.1.28/cuda/nvvm/ci_include.h"
// __cuda_local_var_17168_35_non_const_sdata has been demoted 
 
Despite using the same nvcc, the output now declares PTX ISA version 3.0 rather than the 1.4 above, and the header shows it was generated by the NVVM compiler rather than nvopencc (Open64). It looks like if you specify a particular architecture, you are pointed to CUDA's new LLVM based compiler, where the performance is not as good.

See below for the pastebins of the two PTXs.



The last post in this thread sheds a bit more light on the issue. However, from what I've found above, I would caution you to check with and without the compiler flags! Either test the performance or at least have a look at what the compiler is spitting out.

For example, even in the faster version (without compiler flags) you still see things such as the following which seem rather interesting. Can you not load directly into an even numbered f32?

ld.global.f32     %f1, [%rd4+0];
mov.f32     %f2, %f1;
ld.global.f32     %f3, [%rd4+4];
mov.f32     %f4, %f3;
ld.global.f32     %f5, [%rd4+8];
mov.f32     %f6, %f5;
ld.global.f32     %f7, [%rd4+12];
mov.f32     %f8, %f7;
ld.global.f32     %f9, [%rd4+16];
mov.f32     %f10, %f9;
ld.global.f32     %f11, [%rd4+20];
mov.f32     %f12, %f11;
ld.global.f32     %f13, [%rd4+24];
mov.f32     %f14, %f13;
ld.global.f32     %f15, [%rd4+28];
mov.f32     %f16, %f15;
ld.global.f32     %f17, [%rd4+32];
mov.f32     %f18, %f17;

Shell to Convert m4a to mp3

After having read some hilariously terrible attempts at shell scripts to convert m4a to mp3 I couldn't help but write something concise..


#!/bin/bash

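for file in *.m4a; do
    # Strip the extension only, so an "m4a" elsewhere in the name is left alone
    base="${file%.m4a}"
    mplayer -ao pcm:file="$base.wav" "$file"
    lame --alt-preset 160 "$base.wav" "$base.mp3"
    # Remove only the intermediate wav we just created
    rm "$base.wav"
done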
 
 
Doesn't require multiple shell scripts or extensive string replacement with sed :s

Only requires mplayer and lame. You can use faad instead of mplayer but I didn't have that installed. Simple.

Can we just autofill city and state? Please!

Coming from a country that is not the US where zip/postal codes are hyper specific, it always drives me nuts when you are filling in a form ...