Just a short one today. When I was modifying the NVIDIA PTX emitted by OpenCL for my code, I noticed the following behaviour.
In each work-item I am reading a single t_speed (see the struct below) from an array of t_speeds.
typedef struct { float speeds[9]; } t_speed;
As you might expect, there are 9 reads from global memory (one for each float). What is interesting, though, is the order of the addresses being read. The PTX loads from the highest address first and then works its way down, from [%r14+32] to [%r14], where %r14 is the register holding the base address of the struct. Very bizarre.
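As a sanity check on those offsets, here is a minimal host-side C sketch (not the kernel itself) that computes where each float sits inside a t_speed. The names NSPEEDS and speed_offset are made up for illustration, and it assumes a 4-byte float, which is what the f32 loads imply.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative name only; the original struct hard-codes 9. */
#define NSPEEDS 9

typedef struct { float speeds[NSPEEDS]; } t_speed;

/* Byte offset of speeds[i] from the start of a t_speed.
 * Assumes nothing beyond standard C layout rules for arrays. */
size_t speed_offset(int i)
{
    assert(i >= 0 && i < NSPEEDS);
    return offsetof(t_speed, speeds) + (size_t)i * sizeof(float);
}
```

With 4-byte floats, speed_offset(0) is 0 and speed_offset(8) is 32, so the nine loads cover exactly one 36-byte struct. Only the order in which they are issued is unusual.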
If anyone could explain this I would be very interested to know.
ld.global.f32 %f190, [%r14+32];
ld.global.f32 %f8, [%r14+28];
ld.global.f32 %f7, [%r14+24];
ld.global.f32 %f187, [%r14+20];
ld.global.f32 %f5, [%r14+16];
ld.global.f32 %f4, [%r14+12];
ld.global.f32 %f194, [%r14+8];
ld.global.f32 %f195, [%r14+4];
ld.global.f32 %f200, [%r14];
(I know PTX isn't the actual NVIDIA machine code, just an intermediate representation that gets compiled down to SASS, so perhaps this is only a quirk of the intermediate representation?)