Tuesday 14 August 2012

Updating NVIDIA Drivers

Downloading and installing the NVIDIA binary blobs (or drivers, as they're known) from the website is an absolute nightmare on Linux. It usually involves killing the X server and making sure you have matching kernel modules.

Anyway, there is a simple way of keeping your NVIDIA display driver up to date: add the following PPA to your repositories:

sudo apt-add-repository ppa:ubuntu-x-swat/x-updates
sudo apt-get update
sudo apt-get install nvidia-current 
  

Algorithmic Trading in the Headlines

With the spectacular meltdown of Knight Capital this week (resulting in a $440 million loss) due to a piece of buggy 'dormant' software, the newswires have been alive with talk of HFT and algorithmic trading. Below is a selection of the best headlines that are worth a read. How long will this last?


High-frequency trading and the $440m mistake - Tim Harford (BBC News)
Raging Bulls: How Wall Street Got Addicted to Light-Speed Trading - Wired
History of Algorithmic Trading - Bloomberg News (Opinion)

And also a nice little video to explain HFT.

A Brief History of Supercomputers

Interesting infographic from hpc4energy.

AutoDesk Maya Fluid Simulation

Very, very cool footage of how Maya can now use OpenCL for fluid simulation - roughly 10x the performance on the GPU compared to the CPU!


Friday 27 July 2012

Quickly and Nicely Compress Your JPG

I'm on my holidays and wanted to compress my JPGs under Linux to put them on Dropbox. Google really didn't yield anything quick 'n' dirty, so I adopted the following:

mogrify -quality 75 *.jpg 
 
Clearly you can modify the quality parameter to your choosing. Simply navigate to the directory containing the images and run the command; it will convert all your JPGs to a lower quality in place (so remember to back them up first).

The mogrify command is part of the ImageMagick suite, so you may need to install that if the command is not found.

Wednesday 25 July 2012

Where have you gone?

I'm sure many of you are on tenterhooks waiting for my next blog post. To be brief, I have just graduated and am taking a (well earned?) break. I have been travelling around Eastern Europe by train and am currently in the south of Spain.

Therefore, I haven't had much of a chance to keep up with the news, though someone did send me this, which is definitely big news, along with an interesting comparison of video transcoding on the two services. Amazon wins for now, but I think it's likely the gap will narrow shortly. Google competing with Amazon's EC2 is big! See my previous post for a look at the Amazon GPU infrastructure. When I get the time, I will do a comparison with the Google GPU infrastructure, if it ever becomes available.

Monday 2 July 2012

Getting Started with A GPU Cluster Instance On AWS

I recently received some free credit for Amazon Web Services (AWS), so I thought I may as well try it out! AWS is an absolutely incredible resource at a ridiculously low price. More specifically, I mean EC2 (Elastic Compute Cloud). The stats that Amazon come out with about it are insane, for example this snippet from the HPC page:
...a 1064 instance (17024 cores) cluster of cc2.8xlarge instances was able to achieve 240.09 TeraFLOPS for the High Performance Linpack benchmark, placing the cluster at #42 in the November 2011 Top500 list.
Pretty impressive!

I decided that it would be interesting to have a play with the GPU instances on AWS EC2. With each of these you get 2x Tesla M2050 GPUs at $2 per hour - very good value for money, though relatively expensive compared to other instance types.

So, I created my instance using the wizard; most of this is click-through common sense stuff. The only thing to be careful with is to select the correct AMI on the first page: choose the one with GPU in the title!
Then the rest is just a matter of clicking through the wizard and downloading your *.pem file for passwordless ssh login (you need to chmod 400 file.pem on the newly downloaded file before you can use it):
ssh -i file.pem ec2-user@server-ip

You can get your server IP from the EC2 management console interface under "Public DNS".

OK, so at this stage I'm logged into my GPU instance. The first thing I run is nvidia-smi whereby I am greeted with the following message:
NVIDIA: could not open the device file /dev/nvidiactl (No such file or directory).
Nvidia-smi has failed because it couldn't communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.

Hmm, that's not very friendly!
After some digging, it turned out I had been using the wrong instance type! During my rapid clicking, I should have selected cg1.4xlarge rather than cc1.4xlarge in one of the dialogs:



Rookie error.
After that, everything seems to be running as normal. Now to run some code.

 

Compiling

So, once we're logged in with nvidia-smi up and running, it's time to start compiling some code. This requires a couple of things, namely the headers and the libs!
After some scraping around, the CUDA headers and libs can be found in:
/opt/nvidia/cuda/include/ 
/opt/nvidia/cuda/lib64/

While the OpenCL headers and libs can be found in:
/opt/nvidia/cuda/include/CL/
/usr/lib64/
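
As a quick sanity check that those headers and libs are in place, something like the following minimal host program should build and list the two Teslas (the compile line is only a sketch using the paths above; adjust to your AMI):

// sanity_check.c - list the GPU devices visible to the instance.
// Compile with something along the lines of:
//   gcc sanity_check.c -I/opt/nvidia/cuda/include -L/usr/lib64 -lOpenCL -o sanity_check
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[2];
    cl_uint num_devices = 0;
    cl_uint i;
    char name[256];

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, devices, &num_devices);

    for (i = 0; i < num_devices; i++)
    {
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("Device %u: %s\n", i, name);
    }
    return 0;
}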

From there, it's plain sailing. The performance is as good as I've ever seen (I did wonder whether they might be virtualising the GPUs across multiple users, but it seems not).
Happy GPGPUing!



Sunday 1 July 2012

Using Embedded Webcam on Sony VAIO VGN-SZ61MN

A non-GPU-related post for a change. I was having trouble using the embedded webcam in my VAIO laptop under Ubuntu. After a bit of googling I found this thread, which then leads to this page.

This only works if you have a Ricoh webcam. You can find out by running lsusb in the terminal and checking whether any of the output lines contain the word "Ricoh". I have no idea why Ricoh and Sony are using non-standard webcam APIs, but that's another story.

In short, run the following and you will have your Ricoh camera running on your VAIO:

sudo add-apt-repository ppa:r5u87x-loader/ppa
sudo apt-get update
sudo apt-get install r5u87x
sudo /usr/share/r5u87x/r5u87x-download-firmware.sh 
 
Now I can finally get round to that Skype video call. Cheers, Linux.

Friday 29 June 2012

Memory Access Ordering in OpenCL

Just a short one today. When I was modifying the NVIDIA PTX emitted by OpenCL for my code, I noticed the following behaviour.

What I'm doing is reading a single t_speed (see the struct below) from an array of t_speeds in each work-item.

typedef struct 
{
    float speeds[9];
} t_speed;
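
For context, the kernel is doing something along these lines - a stripped-down sketch rather than the real code, with t_speed declared in the kernel source as above (the kernel and argument names here are made up):

__kernel void sum_speeds(__global const t_speed* cells,
                         __global float* totals)
{
    const int gid = get_global_id(0);
    float total = 0.0f;
    for (int i = 0; i < 9; i++)
        total += cells[gid].speeds[i];   /* nine loads from global memory */
    totals[gid] = total;
}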
 
As you might expect, there are 9 reads from global memory (one for each float). However, what is interesting is the order of those reads: the PTX reads the data from the highest address first and works its way down, from [%r14+32] to [%r14]. Very bizarre.

If anyone could explain this I would be very interested to know.

ld.global.f32   %f190, [%r14+32];
ld.global.f32   %f8, [%r14+28];
ld.global.f32   %f7, [%r14+24];
ld.global.f32   %f187, [%r14+20];
ld.global.f32   %f5, [%r14+16];
ld.global.f32   %f4, [%r14+12];
ld.global.f32   %f194, [%r14+8];
ld.global.f32   %f195, [%r14+4];
ld.global.f32   %f200, [%r14]; 
 
(I know PTX isn't actual NVIDIA assembly language, just an intermediate representation before SASS, so perhaps this is simply a quirk of the IR.)

Emitting and Reading OpenCL Binary

As briefly explained in a previous comment, it is possible to modify the 'binary' emitted by the OpenCL runtime for your chosen platform.

For clarity, I've omitted quite a lot of error checking here (for example after compilation) so you'll need to add that in yourself!

Write a Binary

Firstly, let's open a normal kernel source file (whose path is in the variable srcFile) and load it into memory, then build it as per usual. We then ask clGetProgramInfo for the binary size, use it again to copy the binary into a buffer, and finally write that buffer out to a file.
FILE *fIn = fopen(srcFile, "r");
// Error check fIn here

// Get the size of the source file
fseek(fIn, 0L, SEEK_END);
size_t sz = ftell(fIn);
rewind(fIn);

char *file = (char*)malloc(sizeof(char)*(sz+1));
fread(file, sizeof(char), sz, fIn);
fclose(fIn);

const char* cfile = (const char*)file;
*m_cpProgram = clCreateProgramWithSource(*m_ctx, 1, &cfile, 
                                         &sz, &ciErrNum);
ciErrNum = clBuildProgram(*m_cpProgram, 1, (const cl_device_id*)m_cldDevices, 
                          compilerFlags, NULL, NULL);  

// Calculate how big the binary is
size_t kernel_length = 0;
ciErrNum = clGetProgramInfo(*m_cpProgram, CL_PROGRAM_BINARY_SIZES, sizeof(size_t),
                            &kernel_length, NULL);

// CL_PROGRAM_BINARIES wants an array of pointers (one per device);
// we only have one device here, so a single pointer will do
unsigned char *bin = (unsigned char*)malloc(sizeof(unsigned char)*kernel_length);
ciErrNum = clGetProgramInfo(*m_cpProgram, CL_PROGRAM_BINARIES,
                            sizeof(unsigned char*), &bin, NULL);

// Print the binary out to the output file
FILE *fp = fopen(strcat(srcFile, ".bin"), "wb");
fwrite(bin, 1, kernel_length, fp);
fclose(fp);

Read a Binary

The following extract can be used to build a program from a binary file, emitted or modified from a previous OpenCL run...

// Based on the example given in the OpenCL programming guide
FILE *fp = fopen("custom_file.ptx", "rb");
if (fp == NULL) return -1;

fseek(fp, 0, SEEK_END);
size_t kernel_length = ftell(fp);
rewind(fp);

unsigned char *binary = (unsigned char*)malloc(sizeof(unsigned char)*(kernel_length+1));
fread(binary, sizeof(unsigned char), kernel_length, fp);   // don't forget to actually read it in!
fclose(fp);

cl_int clStat;
*m_cpProgram = clCreateProgramWithBinary(*m_ctx, 1, (const cl_device_id*)m_cldDevices,
                                         &kernel_length,
                                         (const unsigned char**)&binary, &clStat, &ciErrNum);
// Put an error check for ciErrNum here
ciErrNum = clBuildProgram(*m_cpProgram, 1, (const cl_device_id*)m_cldDevices, 
                          NULL, NULL, NULL); 
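
From there, creating kernels from the binary-built program works exactly as it does for a source build, along these lines (the kernel name is purely illustrative):

// "my_kernel" is a placeholder for whatever kernel lives in the binary
cl_kernel kernel = clCreateKernel(*m_cpProgram, "my_kernel", &ciErrNum);
// check ciErrNum, then set arguments and enqueue as usual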
 
If you have an OpenCL program where the performance of the whole program matters (rather than just the kernel itself), then precompiled kernels give a small but noticeable advantage, since fewer stages of compilation have to be performed at runtime.

Wednesday 20 June 2012

NVIDIA Talking about CARMA (Tegra 3) at ISC




Yesterday at ISC I sat in on another session from NVIDIA, this time on CARMA (aka CUDA for ARM, built around Tegra 3) by Don Becker (of Beowulf fame). This is NVIDIA looking to target developers with a very low power SoC. The board peaks at 48W running LINPACK, giving a single precision performance of around 200 GFLOPS. We were told that under normal loads the board requires at most 25W, which is pretty impressive.

In its current incarnation it consists of an ARM A9 CPU attached via x4 PCIe (< PCIe 2) to a Quadro 1000M - a Fermi-based laptop graphics card with 96 CUDA cores - coupled with 2GB of RAM. Interestingly, the GPU is attached via an MXM connection, which means in the future you will most likely be able to hot-swap the GPU for another, giving much more flexibility for your usage scenario.

The board has an incredible array of connectors, including 2x HDMI, 2x ethernet, SATA and video-in, to name just a few. That gives you a lot of flexibility, though it does make the board itself relatively large. You can of course boot it from the network or from an SD card.


Initially it will be running CUDA 4.2 (downgraded from CUDA 5 recently for some reason) on an Ubuntu 11.04-based OS using the 3.1.10 Linux kernel.

It was murmured that in the future they might look at fusing the two memory regions of the CPU and GPU (related to Project Denver), which would be awesome. Along with this, they might look at supporting 64-bit ARMv8.

Basically, this is a much beefier Raspberry Pi... and despite the price I think I would prefer this! Unfortunately, OpenCL support would have to be driven by the community and will not be done by NVIDIA (apparently).


Priced at $629 from SECO - you can sign up for one now.

Knights Corner becomes Phi...

GPU Science

Tuesday 19 June 2012

NVIDIA Talking about Kepler at ISC

I am at the International Supercomputing Conference this year and have just sat in on an NVIDIA satellite event, "Inside Kepler", presented by Thomas Bradley.

The talk introduced some of the key features of Kepler and explained some extra things that I hadn't previously picked up on from the NVIDIA publicity or from a first glance at the whitepaper.

First they outlined the SM->SMX change down to the core level, and how halving the clock has helped performance per watt. All fairly standard stuff.

Secondly, the dynamic parallelism functionality was introduced, allowing for 32 levels of recursion and work enqueuing from the device. I think this is the coolest feature of Kepler, though I doubt it will make its way into OpenCL any time soon (and I am clearly an OpenCL fanboy). In fact, throughout the entire talk there was not a single mention of OpenCL, which was kind of sad. Then again, it is good that there is some competition in that space: if AMD wants to push into HPC, they will have to drive the development of OpenCL (or maybe HSA?) to match CUDA.

The Hyper-Q feature, allowing multiple streams of commands (pretty much the same as OpenCL command queues), was pretty interesting too: it lets multiple MPI processes operate on a single GPU.

Finally, and most excitingly, there are the new shuffle instructions and the improvements to the performance of the atomic functions. The shuffle instructions allow intra-warp communication between threads. This is very exciting since, perhaps using the "butterfly shuffle" form, you can essentially do halo exchange between threads within a warp. Obviously, you would still have to perform global synchronisation between iterations of your lattice Boltzmann simulation (for example) if it operates across warps, but you can achieve this pretty easily now using dynamic parallelism!
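
For a flavour of what the butterfly form enables, here is a sketch of a warp-wide sum using the shuffle intrinsic, based on what was presented (treat it as an illustration rather than tested code):

// Butterfly (XOR) shuffle reduction across a 32-thread warp on Kepler
// (compute capability 3.0+). Each step swaps values with the lane whose
// index differs in one bit; after log2(32) steps every lane holds the
// warp-wide sum, with no shared memory or __syncthreads() required.
__device__ float warpReduceSum(float val)
{
    for (int mask = 16; mask > 0; mask >>= 1)
        val += __shfl_xor(val, mask);
    return val;
}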


It will also be interesting to see whether the atomic operation (global reduction instruction) performance enhancements (up to 5x for the slowest) carry over into OpenCL, i.e. is it a new set of instructions or a lower-level enhancement to the same old instructions? It should be pretty clear which is the case as soon as I get my hands on Kepler.

Thursday 14 June 2012

AMD SDK Beating Intel at its Own Game

Interesting post from Phoronix about how the AMD SDK running on Ivy Bridge actually beats the Intel implementation - by a third in some cases (see the third page for benchmark results). I've seen similar results in the past, but the gap doesn't often last long once the drivers are updated.

Tuesday 12 June 2012

Tweaking OpenCL PTX to Match CUDA

As I've mentioned previously, I've been comparing OpenCL and CUDA in an attempt at a fair test. Here, I provide a broad overview of the differences in the PTX emitted by the two.
(I know it would be useful to paste the source for these PTX listings, but I can't, as it would make some coursework very easy for people in the future!)

Firstly, the chart above shows how the two frameworks generate differing instruction counts: OpenCL yields a few more adds, while CUDA gives us a few more movs.
So where do these differing instructions creep in? The source code is identical (and I'll mention at this point that I am using NO compiler flags with either). When looking at the PTX generated by the NVIDIA OpenCL compiler, one notices the odd, slightly unexpected instruction cropping up here and there: the two codes are almost identical for the most part except, in a couple of places, for things like:
add.f32         %f201, %f164, 0f00000000;
This seems a little odd - why not use a mov? I don't know enough about the low-level workings of GPUs, but it seems bizarre. Surely it is faster and simpler to move one value to another register with the following...
mov.f32 %f201, %f164;
This is a single read and a write, rather than two reads and a floating-point addition followed by a write. Very bizarre. It accounts for the differences in the chart above.

I went through the OpenCL code replacing all the excess adds with movs and found that it improves the running time, knocking roughly 0.1 of a second off. Not bad for a little tweak, though was it really worth the effort?

This shows (albeit not very scientifically) that if you are looking for the ultimate speedup with this combination of tools, it is worth having a peek at the OpenCL PTX binary. You can modify it and load it back in. Obviously this isn't ideal if your kernel changes a lot, but it is worth doing if your kernel is a write-once type of affair.

What I later tried to do was copy pretty much the entirety of the CUDA PTX into OpenCL. This, however, did not work for a variety of reasons so I quickly left it: something to come back to in the future.

The performance of OpenCL can be made to match CUDA fairly easily and, furthermore, OpenCL comfortably beats CUDA's LLVM-based compiler when the sm_20 flag is used in the particular environment from my previous post (though this will most likely be corrected in the future). Still, as ever, be careful with compiler flags - always do a couple of sanity checks!

Weird NVCC Behaviour: Performance of NVVM

Recently I was working on a fair comparison between CUDA and OpenCL on a Tesla M2050 GPU. For this, I was looking at the intermediate PTX generated in the two environments.

http://www.nvidia.in/docs/IO/92138/header_productshot1.png
NVIDIA Tesla M2050
Having become pretty familiar with OpenCL I decided to port some code to CUDA to see how it compares performance-wise.

Environment Spec:

Processor: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Memory: 23.5GB
Card: Tesla M2050
NVIDIA Driver Version: 285.05.33

Background

A Closer Look at the PTX

PTX is the intermediate code that you can have OpenCL or CUDA (nvcc) spit out while compiling raw source. It lets you have a look at what the compiler is doing, and you can also modify the PTX by hand and load it back in (I might write about this later).
Here I'm looking at the PTX output from nvcc, and only at the header for now.

Target Architecture

First, let's see what the compiler does automatically with no flags...
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with /usr/local/gpu/cuda-toolkit-4.1.28/cuda/open64/lib//be
// nvopencc 4.1 built on 2012-01-12

//-----------------------------------------------------------
// Compiling /tmp/tmpxft_00003e20_00000000-9_d2q9.cpp3.i (/tmp/ccBI#.FdyONv)
//-----------------------------------------------------------

//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
//  Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
//  -O3 (Optimization level)
//  -g0 (Debug level)
//  -m2 (Report advisories)
//-----------------------------------------------------------

The second line tells us something interesting: the compiler appears to have chosen the sm_10 architecture. This, however, is not correct! We're using an M2050 GPU which, unless I'm mistaken, is of compute capability 2.0, in which case we need to be more careful with the compiler flags. Let's remedy that.
Initially, with the basic compilation, my simulation ran in 2.57s. When I switch to sm_20 (-arch=sm_20), it slows down by around 15% to 3.03s!

Let's have a closer look at the code to see what's changed between the default sm_10 and the sm_20 override.

The two codes look completely different. The original comes with the header above, whilst the sm_20 code looks like this...

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Thu Jan 12 22:46:01 2012 (1326408361)
// Cuda compilation tools, release 4.1, V0.2.1221
//

.version 3.0
.target sm_20
.address_size 64

    .file    1 "/tmp/tmpxft_000021b9_00000000-9_d2q9.cpp3.i"
    .file    2 "d2q9.cu"
    .file    3 "/usr/local/gpu/cuda-toolkit-4.1.28/cuda/bin/../include/device_functions.h"
    .file    4 "/usr/local/gpu/cuda-toolkit-4.1.28/cuda/nvvm/ci_include.h"
// __cuda_local_var_17168_35_non_const_sdata has been demoted 
 
Despite using the same nvcc, the output is now PTX version 3.0 rather than the 1.4 above. It looks like, if you specify a specific architecture, you are routed through CUDA's new LLVM-based compiler (NVVM), where the performance is not as good.

See below for the pastebins of the two PTXs.



The last post in this thread sheds a bit more light on the issue. However, from what I've found above, I would caution you to check with and without the compiler flags: either test the performance or at least have a look at what the compiler is spitting out.

For example, even in the faster version (without compiler flags) you still see things such as the following, which seems rather interesting - can you not load directly into an even-numbered f32 register?

ld.global.f32     %f1, [%rd4+0];
mov.f32     %f2, %f1;
ld.global.f32     %f3, [%rd4+4];
mov.f32     %f4, %f3;
ld.global.f32     %f5, [%rd4+8];
mov.f32     %f6, %f5;
ld.global.f32     %f7, [%rd4+12];
mov.f32     %f8, %f7;
ld.global.f32     %f9, [%rd4+16];
mov.f32     %f10, %f9;
ld.global.f32     %f11, [%rd4+20];
mov.f32     %f12, %f11;
ld.global.f32     %f13, [%rd4+24];
mov.f32     %f14, %f13;
ld.global.f32     %f15, [%rd4+28];
mov.f32     %f16, %f15;
ld.global.f32     %f17, [%rd4+32];
mov.f32     %f18, %f17;

Shell to Convert m4a to mp3

After having read some hilariously terrible attempts at shell scripts to convert m4a to mp3, I couldn't help but write something concise...


#!/bin/bash

# Decode each m4a to wav with mplayer, then encode the wav to mp3 with lame
for file in *.m4a; do
    mplayer -ao pcm:file="${file/m4a/wav}" "$file";
    lame --alt-preset 160 "${file/m4a/wav}" "${file/m4a/mp3}";
done

# Clean up the intermediate wav files
rm *.wav
 
 
Doesn't require multiple shell scripts or extensive string replacement with sed :s

It only requires mplayer and lame. You could use faad instead of mplayer, but I didn't have that installed. Simple.

Tuesday 29 May 2012

AMD on Linux

Just a short one for once, to say that I couldn't agree more with this and this: AMD's support on Linux really isn't ideal compared to Windows. It's a shame when the software can't keep up with the hardware. Zero-copy buffers really should be in Linux! Please! While their Linux support lags behind Windows, I will not be buying an APU. At least they do still support Linux, unlike some other high-profile vendors.

EDIT 31/6/12: Possibly more dark clouds ahead for AMD on Linux. It looks like they're only going to be rolling out updates based on game necessity - potentially bad news if your OS doesn't have any games!


All via Phoronix.

Customising Emacs

The default emacs, I must admit, is pretty ugly and basic. As is common, I use a customisation script at each run to tailor it to my taste (similar in a way to your familiar .bashrc file). Customisations in emacs are done by way of a so-called '.emacs' file. This is basically a file in your home directory that gets run every time you boot up emacs. I use it to set a nice colour theme, line numbering and some handy shortcuts.

When I refer to the emacs include directory, I mean a directory in my home folder called .emacs.d which contains all the *.el files - elisp scripts for larger customisations written by others. The first thing I would recommend is adding the following to your .emacs file so that emacs looks in that directory for extensions:

(add-to-list 'load-path "~/.emacs.d/")

Next, I quite like having line numbering for the files I'm editing, so I can move around the file quickly.
Put the following file in your emacs include directory, then add these lines to your .emacs:
(require 'linum)
(global-linum-mode 1)

Next is to make it look pretty with some colo(u)r themes. I'm not going to rehash the excellent instructions provided here, as it pretty much follows the standard format. FWIW, I use color-theme-calm-forest, but there are plenty in there for you to explore.

Showing whitespace is a useful feature for keeping your code tidy. It can be enabled with a simple M-x command. Further instructions here.

To switch the cursor between the visible buffers (e.g. in a split screen) with a C-x <direction> command, add the following lines to your .emacs file; they remap the commands to some nice, intuitive combinations. Very useful if you never want to have to touch the mouse while editing text.

(global-set-key (kbd "C-x <up>") 'windmove-up)
(global-set-key (kbd "C-x <down>") 'windmove-down)
(global-set-key (kbd "C-x <right>") 'windmove-right)
(global-set-key (kbd "C-x <left>") 'windmove-left)

In a similar vein, you can set compile and recompile hotkeys (to save you typing M-x compile or M-x recompile each time)...
(global-set-key [(f9)] 'compile)
(global-set-key [(f10)] 'recompile)

When you first get into emacs you'll notice that the scrolling isn't as nice as it could be. You can use the following to soup it up so that it's much smoother and friendlier:
(defun smooth-scroll (increment)
  (scroll-up increment) (sit-for 0.05)
  (scroll-up increment) (sit-for 0.02)
  (scroll-up increment) (sit-for 0.02)
  (scroll-up increment) (sit-for 0.05)
  (scroll-up increment) (sit-for 0.06)
  (scroll-up increment))

(global-set-key [(mouse-5)] '(lambda () (interactive) (smooth-scroll 1)))
(global-set-key [(mouse-4)] '(lambda () (interactive) (smooth-scroll -1))) 
 
So you have line numbers from earlier; now you might want to jump to a line with the familiar Ctrl-l command, like in Eclipse for example. Simple: remap the goto-line command to C-l.

(global-set-key "\C-l" 'goto-line) 
 
I'm quite a stickler for nice indentation, even in my own code. More specifically, I prefer spaces rather than tabs (4 of them) and I prefer newlines before my curly braces. More can be found on this style, known as Allman style; it was used extensively in BSD, hence the "bsd" setting below.
 
(setq-default indent-tabs-mode nil)  ; indent with spaces, not tab characters
(setq-default default-tab-width 4)
(setq-default c-basic-offset 4)
(setq-default c-default-style "bsd")
 
 
And finally, for fun, you can set your frame title :)
(setq frame-title-format "%b - [Your name here]'s Emacs")


You might also find it useful to configure autocomplete, though I find this can be slightly irritating at times.

Hopefully that's enough to get you going with making emacs friendlier for you. As usual, you can often find what you want via Google, which generally takes you to the EmacsWiki.

Sunday 27 May 2012

Visualising MPI Communication With MPE

Visualising with MPE


A couple of weeks ago I had the task of accelerating some fluid dynamics code using MPI (the Message Passing Interface), which uses messages (as you might have guessed) to help parallelise code, versus OpenMP which uses a shared memory model. This was done across four nodes, each containing two dual-core CPUs (so 16 cores in total), and achieved a speedup of 11x, which I was fairly pleased with.

The most interesting part of the project was visualising the communication between the nodes. Doing this myself would have been nigh-on impossible: with disparate nodes across a large system, each with its own clock, it would be impossible to synchronise the communications and logging correctly for later collection and visualisation.

Fortunately for me there exists the MPE project which supplies performance visualisations for MPI.

Having configured MPE successfully on the cluster, I was able to add a couple of lines to my code (see below) to tell it that I wanted a visualisation.

There are two ways of doing this: the first is to have MPE log the images to disk for you to assemble later (using xwd), whilst the second is to have it spawn a window showing the communication in real time. Since the simulation would run for around 10,000 iterations and I didn't fancy stitching together that many images, I went with the second option: spawn a window via X forwarding and do a screen capture. Yes, hacky and horrible, I know!

Anyway, here is the outline of the code I had to add to get it up and running. As you can see, it's all fairly simple and doesn't interfere with any of the pre-existing MPI code. Hats off to ANL for this - it's very useful.

#include "mpe.h"
#include "mpe_graphics.h"

//...the start of your program here and initialise your MPI env. as per usual

MPE_XGraph graph;

// Open the MPE graphics window of size 400x400 at (600, -1)
int ierr = MPE_Open_graphics( &graph, MPI_COMM_WORLD, NULL,
                              600, -1, 400, 400, 0 );

// Alternatively you can set the capture file...
//ierr = MPE_CaptureFile(graph, "outputimage", 1);

if(ierr != MPE_SUCCESS)
{
    printf("Error Launching X world\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
    exit(1);
}

//...skip out some code here and we enter the main body of our MPI code.

MPI_Isend(&cells[ topload    ], NX, structTypeTop, (rank+1) % size,       1, MPI_COMM_WORLD, &request);
MPI_Isend(&cells[ bottomload ], NX, structTypeBot, (rank+size-1) % size,  2, MPI_COMM_WORLD, &request);        

// This forces our graph to update after having seen the most recent comms
ierr = MPE_Update( graph );    
MPI_Recv(&cells [ bottomloadR], NX, structTypeTop, (rank+size-1) %size,   1, MPI_COMM_WORLD, &status);
MPI_Recv(&cells [ toploadR   ], NX, structTypeBot, (rank+1)%size,         2, MPI_COMM_WORLD, &status);
// Again, after receiving let's update our graph
ierr = MPE_Update( graph ); 

// ... continue looping round. When we are finished with MPI let's sync our clocks..
MPE_Log_sync_clocks();
MPE_Finish_log(argv[0]);
// And also close our graphics
MPE_Close_graphics(&graph);

Friday 25 May 2012

Getting Started With emacs

Emacs is the text editor that I use day to day for the majority of my tasks, including programming, note taking and scripting.

It's simple, fast and powerful. Here I'm going to attempt a quickstart guide to give you a leg up over the learning wall that is emacs.
Yes, I accept it's not the most intuitive thing to begin with, but bear with it - you'll get used to it and never look back. Gone are the days of moving your hands back and forth to the mouse whilst editing text.

Emacs has so many features and addons that an oft-quoted phrase is "Emacs is a great operating system, shame about the text editor" - obviously tongue in cheek.

There are other text editors available, but let's not discuss them. Emacs does also have a good little built-in tutorial, but I thought I would write this one so you can browse at your leisure. It also lets me document the commands and functionality I use most.

OK, so assuming you have successfully installed and set up emacs (not going to cover that - we'll assume you're proficient with synaptic or whatever), the first thing you are greeted with is a fairly drab looking UI:

This basically shows the main text editing area, with a line along the bottom indicating the various modes that you might be in, followed by the so-called 'minibuffer' - an important area, as you'll see imminently. Otherwise, it looks pretty much like your regular text editor, right? The main area in the middle is called the main buffer.

So you want to start typing? Easy: click File -> Visit New File, which will prompt you to choose a name for it - choose one and click OK. Easy.

OK, maybe that was too easy and a bit cheaty. That opened up a new file of your choosing... but how do you do it without the mouse? When you clicked File, you might have noticed that next to the Visit New File text it said C-x C-f. That's emacs speak for Ctrl-x Ctrl-f. Try it now. When you type Ctrl and x together, followed by Ctrl and f, you'll see that the minibuffer is activated (the empty area at the bottom), prompting you to 'find file'. You can use this almost like a regular command prompt and tab-complete your way to the location of the (pre-existing or non-existent) file that you wish to edit or create.

That's the basics - you can open a file. From there, all you have to know is that M-x means Alt-x (or Esc then x on some systems).

Yes, it's a bit fiddly and makes your hands hurt after a while but, trust me, it gets easier.
The table below lists some of my most favoured emacs commands, which I use on a daily basis. They all follow the same form: either Ctrl-x (C-x) or Alt-x (M-x) followed by some other modifier.

The all-important one to remember is C-g. That cancels whatever you might be in the middle of. E.g. you're in the middle of opening a file and you realise you need to modify the buffer (the file that is currently open), so you C-g out of it to get back to the buffer. Phew.

Useful Commands

C-x C-f - Open a buffer
C-x C-s - Save the file
C-x C-w - Save the file with a new name
C-home - Move to the start of the buffer
C-end - Go to the end of the buffer
C-space - Begin highlighting
M-w - Copy
C-w - Cut
C-y - Paste
M-x replace-string - Find & replace; the minibuffer guides you through the replacement (yes, you literally type in replace-string)
M-x query-replace - Replace a string, with a prompt at each match
C-s - Search for a given string using the minibuffer
M-x reverse-region - Reverse the lines in a region
C-x C-c - Close emacs
C-x 3 - Split the screen vertically
C-x 2 - Split the screen horizontally
C-x 1 - Make this buffer fill the window
C-x 0 - Close the active window
C-x C-b - Select a buffer by name
M-x c++-mode - Switch to C++ highlighting (substitute your current language, e.g. python-mode); emacs generally does this automatically based on the file extension

Another handy thing to do is to left-click while holding Ctrl; this gives you a little menu for switching between the currently open buffers.

That's the basics...I will most likely post again with details of the customisations I use for smooth scrolling and pretty colour schemes.

Can we just autofill city and state? Please!

Coming from a country that is not the US where zip/postal codes are hyper specific, it always drives me nuts when you are filling in a form ...