Friday, 27 July 2012

Quickly and Nicely Compress Your JPG

I'm on my holidays and wanted to compress my JPGs under Linux before putting them on Dropbox. Google didn't yield anything quick 'n' dirty, so I settled on the following:

mogrify -quality 75 *.jpg

You can adjust the quality parameter to taste. Simply navigate to the directory containing your images and run the command. It overwrites every JPG in place with a lower-quality version, so remember to back them up first.

mogrify is part of the ImageMagick suite, so you might need to install that if the command is not found.

Wednesday, 25 July 2012

Where have you gone?

I'm sure many of you are on tenterhooks for my next blog post. To be brief, I have just graduated and am taking a (well earned?) break. I have been travelling around Eastern Europe by train and am currently in the south of Spain.

Therefore, I haven't had much of a chance to keep up with the news, though someone did send me this, which was definitely big news and an interesting comparison: video transcoding on the two services. Amazon wins, but I think the gap is likely to narrow shortly. Google competing with Amazon's EC2 is big! See my previous post for a look at the Amazon GPU infrastructure. When I get the time I will compare it with Google's GPU infrastructure, if that ever becomes available.

Monday, 2 July 2012

Getting Started with A GPU Cluster Instance On AWS

I recently received some free credit for Amazon Web Services (AWS), so I thought I may as well try it out! AWS is an absolutely incredible resource at a ridiculously low price. More specifically, I mean EC2 (Elastic Compute Cloud). The stats that Amazon come out with about it are insane; for example, this snippet from the HPC page:
...a 1064 instance (17024 cores) cluster of cc2.8xlarge instances was able to achieve 240.09 TeraFLOPS for the High Performance Linpack benchmark, placing the cluster at #42 in the November 2011 Top500 list.
Pretty impressive!

I decided that it would be interesting to have a play with the GPU instances that one can use on AWS EC2. Each of these gives you 2x Tesla M2050 GPUs for $2 per hour, which is very good value for money (though relatively expensive compared to other instance types).

So, I created my instance using the wizard; most of this is click-through, common-sense stuff. The only thing to be careful with is selecting the correct AMI on the first page: choose the one with GPU in the title!
Then the rest is just a matter of clicking through the wizard and downloading your *.pem key file for passwordless ssh login (you need to run chmod 400 file.pem on the newly downloaded file to be able to use it):
ssh -i file.pem ec2-user@server-ip

You can get your server IP from the EC2 management console interface under "Public DNS".

OK, so at this stage I'm logged into my GPU instance. The first thing I run is nvidia-smi, and I am greeted with the following message:
NVIDIA: could not open the device file /dev/nvidiactl (No such file or directory).
Nvidia-smi has failed because it couldn't communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.

Hmm, that's not very friendly!
After some digging, it turns out I had been using the wrong instance type! During my rapid clicking, I had selected cc1.4xlarge in one of the dialogs when I should have selected cg1.4xlarge.

Rookie error.
After that, everything seems to be running as normal. Now to run some code.

Compiling

So once we're logged in with nvidia-smi up and running it's time to start compiling some code. This requires a couple of things such as the headers and the libs!
After some scraping around, the CUDA headers and libs can be found here and here:

/opt/nvidia/cuda/include/ 

/opt/nvidia/cuda/lib64/

While the OpenCL headers and libs can be found here and here:

/opt/nvidia/cuda/include/CL/ 

/usr/lib64/

From there, it's plain sailing. The performance is as good as I've ever seen it. (I had wondered whether they were perhaps virtualising the GPUs across multiple users, but it seems not.)

Happy GPGPUing!

Sunday, 1 July 2012

Using Embedded Webcam on Sony VAIO VGN-SZ61MN

A non-GPU-related post for a change. I was having trouble using the embedded webcam in my VAIO laptop under Ubuntu. After a bit of googling I found this thread, which then leads to this page.

This only works if you have a Ricoh webcam. You can find this out by running lsusb in the terminal and seeing whether it emits any lines containing the word "Ricoh". I have no idea why Ricoh and Sony are using non-standard webcam APIs, but that's another story.

In short, run the following and you will have your Ricoh camera running on your VAIO:

sudo add-apt-repository ppa:r5u87x-loader/ppa
sudo apt-get update
sudo apt-get install r5u87x
sudo /usr/share/r5u87x/r5u87x-download-firmware.sh

Now I can finally get round to that Skype video call. Cheers, Linux.

Friday, 29 June 2012

Memory Access Ordering in OpenCL

Just a short one today. When I was modifying the NVIDIA PTX emitted by OpenCL for my code, I noticed the following behaviour.

In each work-item I am reading a single t_speed (see the struct below) from an array of t_speeds.

typedef struct 
{
    float speeds[9];
} t_speed;

As you might expect, there are 9 reads from global memory (one for each float). What is interesting is the order of the addresses of those reads. The PTX reads the highest address first, [%r14+32] (that is, speeds[8]), and then works its way down in 4-byte steps to [%r14] (speeds[0]). Very bizarre.

If anyone could explain this I would be very interested to know.

ld.global.f32   %f190, [%r14+32];
ld.global.f32   %f8, [%r14+28];
ld.global.f32   %f7, [%r14+24];
ld.global.f32   %f187, [%r14+20];
ld.global.f32   %f5, [%r14+16];
ld.global.f32   %f4, [%r14+12];
ld.global.f32   %f194, [%r14+8];
ld.global.f32   %f195, [%r14+4];
ld.global.f32   %f200, [%r14];

(I know PTX isn't actual NVIDIA assembly language but just an intermediary before SASS, so perhaps this is just a quirk of the intermediate representation?)
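
For reference, here is a small C sketch of how the byte offsets in that listing map back onto the struct. It assumes a 4-byte float and no padding in front of the array; speed_offset and speed_index are names made up for this illustration, not anything from OpenCL or PTX.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    float speeds[9];
} t_speed;

/* Byte offset of speeds[i] from the start of a t_speed
   (hypothetical helper, for illustration only). */
size_t speed_offset(int i)
{
    assert(i >= 0 && i < 9);
    return offsetof(t_speed, speeds) + (size_t)i * sizeof(float);
}

/* Array index touched by a load at [base + byte_offset]
   (hypothetical helper, for illustration only). */
int speed_index(size_t byte_offset)
{
    assert(byte_offset % sizeof(float) == 0);
    assert(byte_offset < sizeof(t_speed));
    return (int)((byte_offset - offsetof(t_speed, speeds)) / sizeof(float));
}
```

With 4-byte floats, speed_index(32) is 8 and speed_index(0) is 0, so the first load in the listing fetches speeds[8] and the last fetches speeds[0]. All nine elements are read, just in reverse index order.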

Emitting and Reading OpenCL Binary

As briefly explained in a previous comment, it is possible to modify the 'binary' emitted by the OpenCL runtime for your chosen platform.

For clarity, I've omitted quite a lot of error checking here (for example after compilation) so you'll need to add that in yourself!

Write a Binary

Firstly, let's open a normal kernel source file (named in the variable srcFile) and load it into memory. Then build that as per usual. We then have to query the size of the binary and use clGetProgramInfo to copy the binary into a buffer. Finally we write that out to a file.

FILE *fIn = fopen(srcFile, "rb");
if (fIn == NULL) return -1;

// Get the size of the source file
fseek(fIn, 0L, SEEK_END);
size_t sz = (size_t)ftell(fIn);
rewind(fIn);
char *file = (char*)malloc(sz + 1);

sz = fread(file, sizeof(char), sz, fIn);
file[sz] = '\0';
fclose(fIn);
const char* cfile = (const char*)file;
*m_cpProgram = clCreateProgramWithSource(*m_ctx, 1, &cfile,
                                          &sz, &ciErrNum);
ciErrNum = clBuildProgram(*m_cpProgram, 1, (const cl_device_id*)m_cldDevices,
                          compilerFlags, NULL, NULL);

// Ask how big the binary is (one size per device; this assumes
// the program is associated with a single device)
size_t kernel_length;
ciErrNum = clGetProgramInfo(*m_cpProgram, CL_PROGRAM_BINARY_SIZES, sizeof(size_t),
                            &kernel_length,
                            NULL);

// CL_PROGRAM_BINARIES fills an array of pointers (one per device),
// so the size passed in is that of the pointer array, not of the binary
unsigned char* bin = (unsigned char*)malloc(kernel_length);
ciErrNum = clGetProgramInfo(*m_cpProgram, CL_PROGRAM_BINARIES, sizeof(unsigned char*), &bin, NULL);

// Write the binary to <srcFile>.bin without modifying srcFile itself
char outName[1024];
snprintf(outName, sizeof(outName), "%s.bin", srcFile);
FILE *fp = fopen(outName, "wb");
fwrite(bin, 1, kernel_length, fp);
fclose(fp);
free(bin);
free(file);
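
Since the snippets in this post skip most of the error checking, here is a rough sketch of the file-loading step with the checks put back in. It is plain C with no OpenCL calls; read_whole_file is a name I've made up for the example, and it assumes the file is small enough to fit comfortably in memory.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a freshly malloc'd, NUL-terminated buffer.
   Returns NULL on any failure; on success stores the byte count in *len.
   The caller frees the buffer. (Hypothetical helper, not an OpenCL API.) */
char *read_whole_file(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    /* Find the file size, bailing out if any step fails */
    if (fseek(f, 0L, SEEK_END) != 0) { fclose(f); return NULL; }
    long sz = ftell(f);
    if (sz < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = (char *)malloc((size_t)sz + 1);
    if (buf == NULL) { fclose(f); return NULL; }

    /* A short read means something went wrong */
    size_t got = fread(buf, 1, (size_t)sz, f);
    fclose(f);
    if (got != (size_t)sz) { free(buf); return NULL; }

    buf[got] = '\0';   /* lets the buffer be used as a C string */
    *len = got;
    return buf;
}
```

The returned buffer and length can be handed to clCreateProgramWithSource in place of cfile and sz in the snippet above.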

Read a Binary

The following extract can be used to build a program from a binary file, emitted or modified from a previous OpenCL run...

// Based on the example given in the OpenCL Programming Guide
FILE *fp = fopen("custom_file.ptx", "rb");
if (fp == NULL) return -1;
fseek(fp, 0, SEEK_END);
size_t kernel_length = (size_t)ftell(fp);   // size_t, as the API expects
rewind(fp);
unsigned char *binary = (unsigned char*)malloc(kernel_length);
// Actually read the file contents into the buffer
if (fread(binary, 1, kernel_length, fp) != kernel_length) { fclose(fp); return -1; }
fclose(fp);
cl_int clStat;
*m_cpProgram = clCreateProgramWithBinary(*m_ctx, 1, (const cl_device_id*)m_cldDevices,
                                         &kernel_length,
                                         (const unsigned char**)&binary, &clStat, &ciErrNum);
// Put an error check for ciErrNum (and the per-device status in clStat) here
ciErrNum = clBuildProgram(*m_cpProgram, 1, (const cl_device_id*)m_cldDevices,
                          NULL, NULL, NULL);

If you have an OpenCL program where the performance of the total program is important (rather than the kernel itself) then the use of precompiled kernels provides a small but noticeable performance advantage since fewer stages of compilation have to be performed at runtime.

Wednesday, 20 June 2012

NVIDIA Talking about CARMA (Tegra 3) at ISC

Yesterday at ISC I sat in another session from NVIDIA, this time on CARMA (aka CUDA for ARM or Tegra 3) by Don Becker (of Beowulf fame). This is NVIDIA looking to target developers with a very low power SoC. The board peaks at 48W when running LINPACK, giving a single precision performance of 200 GFLOPS. We were told that under normal loads the board only requires at most 25W, which is pretty impressive. In its current incarnation it consists of an ARM A9 CPU attached via x4 PCIe (slower than PCIe 2) to a Quadro 1000M, a Fermi-based laptop graphics card with 96 CUDA cores, coupled with 2GB of RAM. Interestingly, the GPU is attached via an MXM connection, which means that in the future you will most likely be able to swap the GPU out for another, giving much more flexibility for your usage scenario. The board has an incredible array of connectors including 2x HDMI, 2x ethernet, SATA and video in, to name just a few. This gives you a lot of flexibility, though it does make the board itself relatively large. You can of course boot it from the network or from the SD card.

Initially it will be running CUDA 4.2 (downgraded from CUDA 5 recently for some reason) on an Ubuntu 11.04 based OS using the 3.1.10 Linux kernel.

It was murmured that in the future they might look at fusing the two memory regions of the CPU and GPU (related to Project Denver), which would be awesome. Along with this, they might look at supporting 64-bit ARMv8.

Basically, we are looking at a much beefier Raspberry Pi. Despite the price, I think I would prefer this! Unfortunately, OpenCL support would have to be driven by the community and will not be done by NVIDIA (apparently).

Priced at $629 from SECO; you can sign up for one now.