cudaMalloc and cudaMallocPitch

Posted: 2014-12-15 23:17:11

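Out of curiosity, I tested the two functions named in the title. To get aligned data access we often reach for cudaMallocPitch, assuming it will be more efficient. Here is a small test program that you can run on your own machine; you will find that in some cases the two functions behave the same.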

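#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cuda_runtime_api.h>

int main(int argc, char **argv)
{
    // Device pointers.
    float *d_pitch = NULL;
    float *d_normal = NULL;
    // Matrix size.
    size_t cols = 63;
    size_t rows = 16;
    size_t pitch = 0;
    // Allocate the data from GPU memory and stop if either call fails.
    if (cudaMallocPitch((void **)&d_pitch, &pitch, cols * sizeof(float), rows) != cudaSuccess ||
        cudaMalloc((void **)&d_normal, rows * cols * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "device allocation failed.\n");
        return 1;
    }
    // Inspect the row pitch and the alignment of both base addresses.
    // The pointers go through uintptr_t so they are not truncated on 64-bit builds.
    fprintf(stdout, "row size(in bytes) = %.2f*128.\n", pitch / 128.0f);
    fprintf(stdout, "the head address of d_pitch mod 128 = %x.\n", (unsigned int)((uintptr_t)d_pitch % 128));
    fprintf(stdout, "the head address of d_normal mod 128 = %x.\n", (unsigned int)((uintptr_t)d_normal % 128));
    cudaFree(d_pitch);
    cudaFree(d_normal);
    getchar(); // Wait for a key press so the console window stays open.
    return 0;
}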

The program above produces the following output:

row size(in bytes) = 28.00*128.
the head address of d_pitch mod 128 = 0.
the head address of d_normal mod 128 = 0.
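
The first line of the output reports the pitch, which is the row stride in bytes. Element addresses in a pitched allocation are therefore computed from the pitch rather than from the number of columns. A minimal sketch of that arithmetic in plain C is given here; `pitched_elem` and `pitched_offset` are hypothetical helper names, not CUDA runtime functions, and the same expression would normally be applied to the device pointer inside a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the CUDA runtime): byte offset of
 * element (row, col) in a pitched 2D float array. */
size_t pitched_offset(size_t pitch, size_t row, size_t col)
{
    return row * pitch + col * sizeof(float);
}

/* Hypothetical helper: pointer to element (row, col). The pitch is a byte
 * count, so the row offset is applied to a char pointer before the result
 * is converted back to float*. */
float *pitched_elem(void *base, size_t pitch, size_t row, size_t col)
{
    assert(base != NULL);
    char *row_start = (char *)base + row * pitch;
    return (float *)row_start + col;
}
```

With an assumed pitch of 512 bytes, `pitched_offset(512, 1, 5)` is 512 + 5 * 4 = 532 bytes, whereas the same element in the tightly packed cudaMalloc layout with 63 columns sits at (63 + 5) * 4 = 272 bytes.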

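I ran this experiment many times. From the results, it appears that however the parameters are changed, the base address returned by both allocation functions is always a multiple of 128 or 256. My guess is that each compute unit on the GPU loads data from global memory in contiguous chunks whose size is a power of two, and that the starting address of each chunk must also be aligned to a power of two. So when a warp reads data from global memory, it should follow the alignment rules as closely as possible in order to get higher efficiency. Note that this test compares only the base addresses: cudaMallocPitch additionally pads every row out to the returned pitch, so each row, not just the first, starts at an aligned address. The alignment rules themselves are described in the CUDA programming guide.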
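
To make the alignment argument concrete, here is a small sketch in plain C that compares where rows start in a tightly packed layout and in a padded one. The helper names `round_up`, `is_aligned`, and `row_start` are hypothetical, and the 128-byte alignment is an assumption taken from the test above; the pitch actually chosen by cudaMallocPitch is decided by the runtime and may be larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds width_bytes up to the next multiple of alignment. */
size_t round_up(size_t width_bytes, size_t alignment)
{
    assert(alignment > 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}

/* Returns 1 if addr (an address or a byte offset) is a multiple of alignment. */
int is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment > 0);
    return addr % alignment == 0;
}

/* Byte offset at which a given row starts, for a row stride in bytes. */
size_t row_start(size_t row, size_t stride_bytes)
{
    return row * stride_bytes;
}
```

For the 63-column float matrix used in the test, a packed row is 252 bytes long, so row 1 starts at offset 252 and `is_aligned(row_start(1, 252), 128)` is 0. With the stride padded to `round_up(252, 128)`, which is 256, row 1 starts at offset 256 and the same check returns 1.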

Original article: http://www.cnblogs.com/liangliangdetianxia/p/4165946.html