There's quite a lot of uncertainty or ambiguity in OpenCL about the purpose and implementation of some of the functions. For example, the description of clCreateBuffer doesn't explicitly say whether the buffer is created on the host or on the device. I generally take it to mean that the data is created and stored on the host until clEnqueueWriteBuffer or clEnqueueUnmapMemObject is called. However, ambiguity can arise, especially on shared-memory systems such as AMD Fusion (although not under Linux yet), where the fused GPU and CPU share memory and so remove the need for a memory copy entirely. AMD calls this zero-copy.
Here I will be testing and benchmarking a little-known feature of OpenCL, memory mapping, specifically with a discrete GPU. Memory mapping is a feature whereby you create a host-side memory buffer, for example with the following command:
clCreateBuffer (ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*10000, NULL, &errCode);
which you can then implicitly write to the GPU, letting the implementation handle the underlying reads and writes for that buffer object.
With the NVIDIA SDK (version 295.40) on my GTX 560, this creates a pinned host-side buffer of 40KB. Pinned here means it won't be paged out by the OS, which also allows the GPU to use direct memory access (DMA) if it so wishes or requires. To read and write this pinned memory you have to use memory mapping: you map the OpenCL buffer into a regular C pointer, and then you can pass that pointer around freely without having to worry about getting the data into OpenCL buffers.
Classically, what we would do now is issue a clEnqueueWriteBuffer command to send this data over to the GPU. We can see below that, using this methodology, pinned memory already gives a serious advantage over unpinned memory, and the gap widens as the buffer size increases. (Yes, I know 'millions of integers' is a weird metric, but just multiply by 4 for megabytes.)
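For anyone who wants the sizes in bytes, the conversion is simple enough to sketch. This assumes 4-byte integers, as on the setup described here; the function names are made up for illustration.

```c
#include <stddef.h>

/* Bytes occupied by `count` integers, assuming 4-byte ints
   (true on the platform discussed here, not guaranteed everywhere). */
size_t ints_to_bytes(size_t count) {
    return count * 4u;
}

/* Decimal megabytes for a chart value given in millions of integers:
   1e6 ints * 4 bytes = 4e6 bytes = 4 MB. */
double million_ints_to_megabytes(double millions) {
    return millions * 4.0;
}
```

So the 10000-integer buffer from the clCreateBuffer call comes to 40000 bytes, the 40KB quoted above.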
OK, so now we know that pinned memory gives a performance advantage. That's not really very exciting since NVIDIA already told us that (page 47).
So what can we do from here? Well as I mentioned earlier, you can map memory to get the GPU to handle the reads and writes across PCIe (although in the NVIDIA guides they don't really talk about this too much).
The good thing about the example above is that you, as the developer, have full control over exactly what is and isn't written/read over the PCIe. When you want to append more data to a device side buffer you just do a clEnqueueWriteBuffer using the requisite offset and data. 
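As a rough sketch of what "the requisite offset" means when appending, the helper below works out the byte offset and byte count for new integers going on the end of a partly filled buffer. The struct and function names are hypothetical, not part of OpenCL, and it assumes the buffer holds plain ints.

```c
#include <stddef.h>

/* Hypothetical helper type: where an append lands inside a device buffer. */
typedef struct {
    size_t offset;  /* byte offset at which the new data starts */
    size_t size;    /* number of bytes to write */
    int    fits;    /* 1 if the write stays inside the buffer, else 0 */
} append_region;

/* Compute the region for appending `new_ints` integers to a buffer of
   `capacity_ints` integers that already holds `used_ints` integers. */
append_region plan_append(size_t capacity_ints, size_t used_ints, size_t new_ints) {
    append_region r;
    r.offset = used_ints * sizeof(int);
    r.size   = new_ints * sizeof(int);
    /* Written this way to avoid overflow in used_ints + new_ints. */
    r.fits   = used_ints <= capacity_ints &&
               new_ints <= capacity_ints - used_ints;
    return r;
}
```

The two values would then go into the offset and size parameters of the write, along the lines of clEnqueueWriteBuffer(queue, buf, CL_TRUE, r.offset, r.size, host_ptr, 0, NULL, NULL), where queue, buf and host_ptr are placeholder names.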
Using the fully mapped methodology, things aren't quite so clear and we (I) don't really know how it will perform: what happens when you do a read from global memory on the GPU to memory that is mapped over PCIe? Does it do a sequential read? Does it fetch it in chunks? 
Performance of Mapped Memory Objects
OK, so first things first: we do our clCreateBuffer as usual and then map it. Fill it with your data and now, critically, unmap it. (Previously I seem to have gotten away with not unmapping it!) The unmapping ensures consistency for the memory object and, it appears, enqueues a write of the just-unmapped data across to the device, since as we'll see later the kernel performs just as well as it does with regularly written buffers. The critical point is that there are no clEnqueueWriteBuffer calls in this process!
I've chosen the following (fairly naïve) scenario whereby you send the data to the device and never care about reading it off; we want some code that will be pretty intensive on the memory accesses to highlight exactly what is going on. 
Each kernel runs through a range of 0 -> x memory addresses in two buffers doing a pairwise addition and finally a division by some constant. This will tell us how fast the GPU is at reading memory for ever increasing values of x. It should demonstrate whether any chunking is in effect. For testing we'll just use 100 workitems in a single workgroup.
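A kernel of this shape might look something like the sketch below. This is a guess at the structure, not the actual benchmark code: the kernel name, the argument list, the integer element type and the choice to accumulate the pairwise sums before dividing are all assumptions. The kernel source is held in a C string the way host code would pass it to the OpenCL compiler, and a plain C reference function mirrors it so the arithmetic can be checked without a GPU.

```c
/* Assumed OpenCL C kernel: every work-item walks addresses 0..x-1 of both
   input buffers, adds the pairs, and divides the total by a constant. */
const char *add_div_kernel_source =
    "__kernel void add_div(__global const int *a,\n"
    "                      __global const int *b,\n"
    "                      __global int *out,\n"
    "                      const int x,\n"
    "                      const int divisor)\n"
    "{\n"
    "    int acc = 0;\n"
    "    for (int i = 0; i < x; ++i)\n"
    "        acc += a[i] + b[i];\n"
    "    out[get_global_id(0)] = acc / divisor;\n"
    "}\n";

/* Host-side reference of the same computation. `work_items` plays the role
   of the global size (100 in the test described above). */
void add_div_reference(const int *a, const int *b, int *out,
                       int x, int divisor, int work_items) {
    for (int gid = 0; gid < work_items; ++gid) {
        int acc = 0;
        for (int i = 0; i < x; ++i)
            acc += a[i] + b[i];
        out[gid] = acc / divisor;
    }
}
```

Every work-item reads the same x addresses here, so the memory traffic grows with x while the amount of arithmetic per read stays tiny, which is the point of the exercise.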
So the test is: does the kernel run faster using explicit (clEnqueueWriteBuffer) or implicit (clEnqueueUnmapMemObject)? And then, if they perform the same, how does the performance (time) of clEnqueueWriteBuffer compare to that of clEnqueueUnmapMemObject?
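One simple way to take timings like these is wall-clock time around a callback that does the whole transfer-plus-kernel step. The sketch below is only that, a sketch: it is not the harness behind the charts, the names are invented, and it assumes a POSIX system with clock_gettime. For the numbers to mean anything with OpenCL, the callback would have to block until the device has finished (for example by ending with clFinish).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict C modes */
#endif
#include <time.h>

/* Signature of the step being timed (hypothetical). */
typedef void (*timed_fn)(void *user);

/* Mean wall-clock seconds per call of `fn` over `iterations` calls. */
double mean_seconds(timed_fn fn, void *user, int iterations) {
    struct timespec start, end;
    if (iterations <= 0)
        return 0.0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; ++i)
        fn(user);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed / iterations;
}

/* Stand-in workload that just counts its calls; a real run would
   enqueue the transfer and kernel here and wait for completion. */
void count_call(void *user) {
    ++*(int *)user;
}
```

Averaging over several iterations smooths out one-off costs such as the first allocation, although it also hides them, so it is worth looking at the first iteration separately.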
The first chart, here, confirms my suspicions: clEnqueueUnmapMemObject simply writes the data across to the GPU, and thus kernels using a single clEnqueueUnmapMemObject or a single clEnqueueWriteBuffer at the beginning perform the same way.
The question now remains: does clEnqueueUnmapMemObject perform as well as clEnqueueWriteBuffer? How does the performance compare when we want to modify the data on the device?
Thus...
Map & Unmap or Read & Write?
We have two choices therefore, depending on our use case. Let's imagine we want to write some data, do some computation on it and then read it off and modify it. There are two ways of doing this...
- Map -> edit -> unmap -> compute (then rinse and repeat)
- Read -> edit -> write -> compute (rinse and repeat)
This gives us some pretty interesting results. It shows that, by manually doing reads and writes from the device you shave off a small yet perceptible amount of time for each iteration. Thus this performance difference must lie in the way that the underlying implementation handles the read and the remapping of the buffer (since we know the unmapping and the write perform identically from earlier on).
This might just be a quirk of the SDK that will be ironed out in the future. However, it does provide some interesting discussion points, even for this contrived example. For example, what's the point of mapping and unmapping when you can read and write faster? As mentioned previously, how will these interplay when you start having fused memories? Will you still be able to maintain portability for your code, knowing that the mapped method is slower?
This just highlights the importance of performing these little benchmarks for the architecture and setup you have at hand.



 
 
 