[Figure: AMD APU block diagram]
This test is pretty naïve since we should really be using zero-copy memory. However, since I'm a Linux fanboy, we're stuck with AMD's current Linux implementation of the APP SDK, which requires an explicit clEnqueueWriteBuffer to put data into device-visible system memory; that data then gets piped through the Radeon memory bus (among other things). There's an interesting write-up of this technology here.
So if we have to do an explicit clEnqueueWriteBuffer to get data from OpenCL host-side memory into device-side global memory, how do the CPU and the GPU compare?
Well, this chart shows a very basic test: how long a clEnqueueWriteBuffer command takes to execute (a CL_TRUE blocking write, obviously) for both the CPU and the GPU in an AMD A8-3850 APU, with interesting results...
[Chart: clEnqueueWriteBuffer timing for a small amount of data]
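To make the setup concrete, here's a minimal sketch of the kind of measurement described above: it loops over the devices the platform reports and wall-clock-times a blocking clEnqueueWriteBuffer on each. The buffer size, the clock_gettime timing and the bare-bones error handling are my own choices for illustration, not the post's actual benchmark code.

/* Sketch: time a blocking clEnqueueWriteBuffer on each device (CPU and GPU
 * on the APU). Assumptions: 1 MiB buffer, wall-clock timing, no error checks. */
#define _POSIX_C_SOURCE 199309L
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t bytes = 1 << 20;            /* 1 MiB of test data */
    char *host = calloc(1, bytes);

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id dev[2];
    cl_uint ndev = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 2, dev, &ndev);

    for (cl_uint i = 0; i < ndev; ++i) {
        char name[128];
        clGetDeviceInfo(dev[i], CL_DEVICE_NAME, sizeof name, name, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev[i], NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev[i], 0, NULL);
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* CL_TRUE: the call only returns once the data is in device-visible
         * memory, so the wall-clock delta is the latency being compared. */
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("%-32s %8.3f ms for %zu bytes\n", name, ms, bytes);

        clReleaseMemObject(buf);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
    }
    free(host);
    return 0;
}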
We can see that this trend becomes a lot more marked with 100x as much data: the CPU latency is much, much higher than the GPU's, presumably because the GPU has higher bandwidth and the path the data takes is a lot simpler. Bypassing something? Caching?
Anyway, this seems to prove a pretty nice point (getting data into GPU-visible memory is fast) and I look forward to zero-copy objects on Linux ;)
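For contrast, this is roughly what the zero-copy path looks like in standard OpenCL: allocate the buffer with CL_MEM_ALLOC_HOST_PTR and fill it through a map instead of a separate write. This is just the generic pattern, not whatever AMD eventually ships on Linux; the fragment reuses ctx, q, host and bytes from the sketch above and needs <string.h> for memcpy.

/* Zero-copy sketch: let the runtime allocate host-accessible memory and
 * fill it through a map, so no separate clEnqueueWriteBuffer is queued. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            bytes, NULL, &err);

/* Map the buffer into the host address space and write straight into it. */
void *ptr = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                               0, bytes, 0, NULL, NULL, &err);
memcpy(ptr, host, bytes);

/* Unmapping hands the data back to the device; on an implementation that
 * really backs the buffer with shared memory, no copy happens at all. */
clEnqueueUnmapMemObject(q, buf, ptr, 0, NULL, NULL);

Whether that actually avoids the copy is entirely up to the driver, which is exactly the gap on Linux at the time of writing.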