Friday, 29 June 2012

Memory Access Ordering in OpenCL

Just a short one today. While modifying the NVIDIA PTX that the OpenCL compiler emits for my code, I noticed the following behaviour.

In each work-item I read a single t_speed (see the struct below) from an array of t_speeds.

typedef struct 
{
    float speeds[9];
} t_speed;
 
As you might expect, there are 9 reads from global memory (one for each float). What is interesting is the order of the addresses being read. The PTX reads the highest address first, [%r14+32], and then works its way down to the base address held in %r14. Very bizarre.

If anyone could explain this I would be very interested to know.

ld.global.f32   %f190, [%r14+32];
ld.global.f32   %f8, [%r14+28];
ld.global.f32   %f7, [%r14+24];
ld.global.f32   %f187, [%r14+20];
ld.global.f32   %f5, [%r14+16];
ld.global.f32   %f4, [%r14+12];
ld.global.f32   %f194, [%r14+8];
ld.global.f32   %f195, [%r14+4];
ld.global.f32   %f200, [%r14]; 
 
(I know PTX isn't the actual NVIDIA machine code, just an intermediate representation that is later compiled to SASS, so perhaps this is simply a quirk of the intermediate representation?)
