Intro to Parallel Programming
How do you dig a hole faster?
The GPU philosophy:
Many, many simple compute units;
An explicitly parallel programming model;
Optimize for throughput, not latency;
CPU: HOST
GPU: DEVICE
A Typical GPU Program
1. CPU allocates storage on the GPU (cudaMalloc)
2. CPU copies input data from CPU to GPU (cudaMemcpy)
3. CPU launches kernel(s) on the GPU to process the data (kernel launch)
4. CPU copies results back to CPU from GPU (cudaMemcpy)
Ideally, copying data back from the GPU is the last step, so transfers are minimized.
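A minimal sketch of this four-step pattern (names like d_in/h_in are illustrative, a square kernel is assumed defined elsewhere, and error checking is omitted; a complete program appears below):

float h_in[64], h_out[64];                                        // host (CPU) arrays
float *d_in, *d_out;                                              // device (GPU) pointers
cudaMalloc((void **) &d_in, sizeof(h_in));                        // 1. allocate GPU storage
cudaMalloc((void **) &d_out, sizeof(h_out));
cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);     // 2. copy input to GPU
square<<<1, 64>>>(d_out, d_in);                                   // 3. launch kernel
cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);  // 4. copy results back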
DEFINING THE GPU COMPUTATION
BIG IDEA
KERNELS LOOK LIKE SERIAL PROGRAMS
WRITE YOUR PROGRAM AS IF IT WILL RUN ON ONE THREAD
THE GPU WILL RUN THAT PROGRAM ON MANY THREADS
MAKE SURE YOU UNDERSTAND THIS
THIS IS IMPORTANT
WHAT IS THE GPU GOOD AT?
1. EFFICIENTLY LAUNCHING LOTS OF THREADS
2. RUNNING LOTS OF THREADS IN PARALLEL
SIMPLE EXAMPLE:
IN:  FLOAT ARRAY [0 1 2 … 63]
OUT: FLOAT ARRAY [0 1*1 2*2 … 63*63]
               = [0 1 4 9 … ]
KERNEL: SQUARE
CPU CODE: SQUARE EACH ELEMENT OF AN ARRAY
for (int i = 0; i < 64; i++) {
  out[i] = in[i] * in[i];
}
1. ONLY ONE THREAD OF EXECUTION
("thread" = one independent path of execution through the code)
2. NO EXPLICIT PARALLELISM
GPU CODE: A HIGH-LEVEL VIEW
CPU:
  ALLOCATE MEMORY
  COPY DATA TO/FROM GPU
  LAUNCH KERNEL (SPECIFIES DEGREE OF PARALLELISM)
GPU:
  EXPRESS OUT = IN * IN
  SAYS NOTHING ABOUT THE DEGREE OF PARALLELISM
CPU CODE: square_kernel<<<64>>>(outArray, inArray)
BUT HOW DOES IT WORK IF I LAUNCH 64 INSTANCES OF THE SAME PROGRAM?
CPU LAUNCHES 64 THREADS
#include <stdio.h>

__global__ void cube(float *d_out, float *d_in) {
  // each thread reads one element, cubes it, and writes one element
  int idx = threadIdx.x;
  float f = d_in[idx];
  d_out[idx] = f * f * f;
}

int main(int argc, char **argv) {
  const int ARRAY_SIZE = 96;
  const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

  // generate the input array on the host
  float h_in[ARRAY_SIZE];
  for (int i = 0; i < ARRAY_SIZE; i++) {
    h_in[i] = float(i);
  }
  float h_out[ARRAY_SIZE];

  // declare GPU memory pointers
  float *d_in;
  float *d_out;

  // allocate GPU memory
  cudaMalloc((void **) &d_in, ARRAY_BYTES);
  cudaMalloc((void **) &d_out, ARRAY_BYTES);

  // transfer the array to the GPU
  cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

  // launch the kernel: 1 block of ARRAY_SIZE threads
  cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

  // copy back the result array to the CPU
  cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

  // print out the resulting array, 4 values per line
  for (int i = 0; i < ARRAY_SIZE; i++) {
    printf("%f", h_out[i]);
    printf(((i % 4) != 3) ? "\t" : "\n");
  }

  // free GPU memory
  cudaFree(d_in);
  cudaFree(d_out);

  return 0;
}
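To build and run with NVIDIA's nvcc compiler (assuming the file is saved as cube.cu):
nvcc cube.cu -o cube
./cube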
Configuring the kernel launch
SQUARE<<<1, 64>>>(d_out, d_in)
<<<number of blocks, threads per block>>>
1. Many blocks can run at once
2. There is a maximum number of threads per block: 512 on older GPUs, 1024 on newer GPUs
128 threads:  SQUARE<<<1, 128>>>(…)
1280 threads: SQUARE<<<10, 128>>>(…)
              or SQUARE<<<5, 256>>>(…)
KERNEL<<<GRID OF BLOCKS, BLOCK OF THREADS>>>(…)
Both the grid and the block can be 1-, 2-, or 3-dimensional, expressed with dim3:
dim3(x, y, z)
dim3(w, 1, 1) == dim3(w) == w
square<<<1, 64>>> == square<<<dim3(1,1,1), dim3(64,1,1)>>>
kernel<<<grid of blocks, block of threads, shared memory per block>>>(…)
square<<<dim3(bx, by, bz), dim3(tx, ty, tz), shmem>>>(…)
  a grid of bx * by * bz blocks, each block with tx * ty * tz threads, and shmem bytes of shared memory per block
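For example, a 2D launch over a 128x128 image could look like this (a sketch; the 16x16 block size is just a common choice, and imageKernel is a hypothetical kernel, assuming the image dimensions divide evenly):

dim3 blockSize(16, 16);              // 16 * 16 = 256 threads per block
dim3 gridSize(128 / 16, 128 / 16);   // 8 * 8 = 64 blocks cover the image
imageKernel<<<gridSize, blockSize>>>(d_out, d_in);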
Built-in variables available inside a kernel:
threadIdx: thread within block (threadIdx.x, threadIdx.y, threadIdx.z)
blockDim: size of a block
blockIdx: block within grid
gridDim: size of grid
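With more than one block, each thread combines these variables to find which element it owns. A sketch of the square kernel written for a multi-block launch such as SQUARE<<<10, 128>>> (the n parameter and bounds check are additions for safety, not from the notes):

__global__ void square(float *d_out, float *d_in, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index across all blocks
  if (idx < n) {                                    // guard if n isn't a multiple of blockDim.x
    float f = d_in[idx];
    d_out[idx] = f * f;
  }
}
// launched as: square<<<10, 128>>>(d_out, d_in, 1280);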
MAP
set of elements to process [64 floats]
function to run on each element ["square"]
map(elements, function)
GPUs are good at map:
-- GPUs have many parallel processors
-- GPUs optimize for throughput
struct uchar4 {
  unsigned char x;
  unsigned char y;
  unsigned char z;
  unsigned char w;
};
Converting color to black and white
I = (R + G + B) / 3 (naive average)
I = .299f*R + .587f*G + .114f*B (weighted average; matches perceived brightness better)
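A sketch of a map kernel applying the weighted formula to an RGBA image stored as an array of uchar4 (the kernel name and parameters are illustrative, assuming the x/y/z fields hold R/G/B):

__global__ void rgba_to_grey(unsigned char *d_grey, uchar4 *d_rgba,
                             int numRows, int numCols) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;  // column of this pixel
  int y = blockIdx.y * blockDim.y + threadIdx.y;  // row of this pixel
  if (x < numCols && y < numRows) {
    int idx = y * numCols + x;                    // 2D position -> 1D array index
    uchar4 p = d_rgba[idx];                       // assumed: p.x = R, p.y = G, p.z = B, p.w = A
    d_grey[idx] = (unsigned char)(.299f * p.x + .587f * p.y + .114f * p.z);
  }
}

It would be launched with a 2D grid sized to cover the image, e.g. rgba_to_grey<<<dim3((numCols + 15) / 16, (numRows + 15) / 16), dim3(16, 16)>>>(d_grey, d_rgba, numRows, numCols).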
Intro to Parallel Programming, course notes 001
Original post: http://blog.csdn.net/syzmatrix/article/details/42321729