
Intro to Parallel Programming: Course Notes 001

Date: 2015-01-02 09:50:16

Tags: parallel programming

Intro to Parallel Programming

 

How do you dig a hole faster?

 

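GPU design philosophy

Lots and lots of simple compute units;

A clear, explicitly parallel programming model;

Optimized for throughput rather than latency.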

 

 

CPU: HOST

GPU: DEVICE

 

 

 

A Typical GPU Program

         1. The CPU allocates storage on the GPU (cudaMalloc)

         2. The CPU copies input data from the CPU to the GPU (cudaMemcpy)

         3. The CPU launches kernel(s) on the GPU to process the data (kernel launch)

         4. The CPU copies results back to the CPU from the GPU (cudaMemcpy)

Ideally, data is copied back from the GPU only in the final step.

 

 

 

DEFINING THE GPU COMPUTATION

BIG IDEA

KERNELS LOOK LIKE SERIAL PROGRAMS

WRITE YOUR PROGRAM AS IF IT WILL RUN ON ONE THREAD

THE GPU WILL RUN THAT PROGRAM ON MANY THREADS

MAKE SURE YOU UNDERSTAND THIS

THIS IS IMPORTANT

 

WHAT IS GPU GOOD AT?

         1. EFFICIENTLY LAUNCHING LOTS OF THREADS

         2. RUNNING LOTS OF THREADS IN PARALLEL

 

SIMPLE EXAMPLE:

IN:      FLOAT ARRAY   [0 1 2 … 63]

OUT:     FLOAT ARRAY   [0 1x1 2x2 … 63x63]

                       [0 1 4 9 … ]

KERNEL:  SQUARE

 

CPU CODE: SQUARE EACH ELEMENT OF AN ARRAY

for (i = 0; i < 64; i++) {
         out[i] = in[i] * in[i];
}

 

1. ONLY ONE THREAD OF EXECUTION

("thread" = one independent path of execution through the code)

2. NO EXPLICIT PARALLELISM

 

GPU CODE:  A HIGH-LEVEL VIEW

CPU

ALLOCATE MEMORY

COPY DATA TO/FROM GPU

LAUNCH KERNEL

SPECIFIES DEGREE OF PARALLELISM

 

GPU

EXPRESS OUT = IN * IN

SAYS NOTHING ABOUT THE DEGREE OF PARALLELISM

 

CPU CODE: squareKernel<<<64>>>(outArray, inArray)

 

BUT HOW DOES IT WORK IF I LAUNCH 64 INSTANCES OF THE SAME PROGRAM?

 

CPU LAUNCHES 64 THREADS
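A minimal sketch of what each of the 64 threads runs, assuming a single block so that threadIdx.x alone identifies the element. The names square, d_in, d_out and the host wrapper square_on_gpu are illustrative, not taken from the course code, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread squares exactly one element; the body reads like serial code.
__global__ void square(float *d_out, const float *d_in) {
    int idx = threadIdx.x;          // valid because only one block is launched
    float f = d_in[idx];
    d_out[idx] = f * f;
}

// Hypothetical host wrapper: allocate, copy in, launch, copy out, free.
// Assumes 1 <= in.size() <= the device's threads-per-block limit.
std::vector<float> square_on_gpu(const std::vector<float> &in) {
    const int n = static_cast<int>(in.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    square<<<1, n>>>(d_out, d_in);  // 1 block of n threads
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}

int main() {
    std::vector<float> h_in(64);
    for (int i = 0; i < 64; i++) h_in[i] = float(i);
    std::vector<float> h_out = square_on_gpu(h_in);
    assert(h_out[63] == 63.0f * 63.0f);
    for (int i = 0; i < 64; i++)
        printf("%f%s", h_out[i], (i % 4 != 3) ? "\t" : "\n");
    return 0;
}
```

The kernel never mentions the number 64: the degree of parallelism is chosen entirely by the launch on the CPU side.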

 

#include <stdio.h>

// Kernel: each thread cubes one element.
// The original notes left this body as an exercise ("Fill in this function");
// the three lines below are one possible solution.
__global__ void cube(float *d_out, float *d_in) {
    int idx = threadIdx.x;   // a single block is launched, so threadIdx.x is the element index
    float f = d_in[idx];
    d_out[idx] = f * f * f;
}

int main(int argc, char **argv) {
    const int ARRAY_SIZE = 96;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // generate the input array on the host
    float h_in[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = float(i);
    }
    float h_out[ARRAY_SIZE];

    // declare GPU memory pointers
    float *d_in;
    float *d_out;

    // allocate GPU memory
    cudaMalloc((void **)&d_in, ARRAY_BYTES);
    cudaMalloc((void **)&d_out, ARRAY_BYTES);

    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel: 1 block of ARRAY_SIZE threads
    cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // print out the resulting array, four values per line
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }

    // free GPU memory
    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}

 

 

 

 

 

Configuring the kernel Launch

square<<<1, 64>>>(d_out, d_in)

<<<number of blocks, threads per block>>>

1. Many blocks can be run at once

2. There is a maximum number of threads per block (512 on older GPUs, 1024 on newer GPUs)
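The limit for a particular GPU can be read at run time instead of being assumed. A small sketch using cudaGetDeviceProperties from the CUDA runtime; the helper name max_threads_per_block and the choice of device 0 are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the threads-per-block limit of the given device, or -1 on error
// (for example when no CUDA device is present).
int max_threads_per_block(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.maxThreadsPerBlock;
}

int main() {
    int limit = max_threads_per_block(0);   // device 0 is an assumption
    if (limit < 0) {
        printf("no usable CUDA device found\n");
        return 1;
    }
    printf("max threads per block: %d\n", limit);
    return 0;
}
```

A launch that asks for more threads per block than this limit fails, so larger problems have to be split across several blocks.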

 

128 threads:     square<<<1, 128>>>( … )

1280 threads:    square<<<10, 128>>>( … )

                 or square<<<5, 256>>>( … )

KERNEL<<<GRID OF BLOCKS, BLOCK OF THREADS>>>( … )

KERNEL<<<1, 2, OR 3D GRID, 1, 2, OR 3D BLOCK>>>( … )

dim3(x, y, z)

dim3(w, 1, 1) == dim3(w) == w

square<<<1, 64>>> == square<<<dim3(1, 1, 1), dim3(64, 1, 1)>>>

 

kernel<<<grid of blocks, block of threads>>>( … )

square<<<dim3(bx, by, bz), dim3(tx, ty, tz), shmem>>>( … )

square<<<grid of blocks bx * by * bz, block of threads tx * ty * tz, shared memory per block in bytes>>>( … )
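To illustrate that the short and the explicit launch forms are equivalent, the sketch below runs the same kernel both ways and compares the results. The helper run_square and its explicit_dim3 flag are hypothetical names introduced for this example; the third launch argument is set to 0 because this kernel uses no dynamically allocated shared memory.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void square(float *d_out, const float *d_in) {
    int idx = threadIdx.x;
    d_out[idx] = d_in[idx] * d_in[idx];
}

// Hypothetical helper: squares `in` on the GPU using either launch spelling.
// Assumes 1 <= in.size() <= the device's threads-per-block limit.
std::vector<float> run_square(const std::vector<float> &in, bool explicit_dim3) {
    const unsigned int n = static_cast<unsigned int>(in.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    if (explicit_dim3) {
        size_t shmem = 0;  // bytes of dynamic shared memory per block
        square<<<dim3(1, 1, 1), dim3(n, 1, 1), shmem>>>(d_out, d_in);
    } else {
        square<<<1, n>>>(d_out, d_in);  // integers are promoted to dim3(w, 1, 1)
    }
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}

int main() {
    std::vector<float> h_in(64);
    for (int i = 0; i < 64; i++) h_in[i] = float(i);
    std::vector<float> short_form = run_square(h_in, false);
    std::vector<float> long_form = run_square(h_in, true);
    assert(short_form == long_form);
    printf("both launch forms produced identical results\n");
    return 0;
}
```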

 

threadIdx: thread within block

threadIdx.x, threadIdx.y

blockDim: size of a block

blockIdx: block within grid

gridDim: size of grid
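These built-in variables are usually combined into one global index per thread. The sketch below records, for a 1D launch of 10 blocks of 128 threads, which block and thread handled each global position; the kernel record_ids, the struct ThreadIds and the host wrapper are illustrative names, not part of the course material.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct ThreadIds {
    std::vector<int> block;   // blockIdx.x of the thread at each global index
    std::vector<int> thread;  // threadIdx.x of the thread at each global index
};

__global__ void record_ids(int *d_block, int *d_thread) {
    // Global index: blocks before this one, times block size, plus offset in block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_block[i] = blockIdx.x;
    d_thread[i] = threadIdx.x;
}

// Hypothetical host wrapper around a 1D launch.
ThreadIds record_ids_on_gpu(int num_blocks, int threads_per_block) {
    const int total = num_blocks * threads_per_block;  // gridDim.x * blockDim.x
    const size_t bytes = total * sizeof(int);
    ThreadIds ids{std::vector<int>(total), std::vector<int>(total)};
    int *d_block = nullptr, *d_thread = nullptr;
    cudaMalloc((void **)&d_block, bytes);
    cudaMalloc((void **)&d_thread, bytes);
    record_ids<<<num_blocks, threads_per_block>>>(d_block, d_thread);
    cudaMemcpy(ids.block.data(), d_block, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(ids.thread.data(), d_thread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_block);
    cudaFree(d_thread);
    return ids;
}

int main() {
    ThreadIds ids = record_ids_on_gpu(10, 128);  // 1280 threads in total
    // Global index 130 = 1 * 128 + 2, so it belongs to block 1, thread 2.
    assert(ids.block[130] == 1 && ids.thread[130] == 2);
    printf("index 130 -> block %d, thread %d\n", ids.block[130], ids.thread[130]);
    return 0;
}
```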

 

MAP

Set of elements to process          [64 floats]

Function to run on each element     ["square"]

map(elements, function)

GPUs are good at map

-- GPUs have many parallel processors

-- GPUs optimize for throughput
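One way to express map(elements, function) in CUDA is a templated kernel that applies a function object to every element. This is only a sketch under assumed names (map_kernel, gpu_map, Square) and an assumed block size of 128; it is not how the course itself writes the example.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// The "function" in map(elements, function): a device-callable functor.
struct Square {
    __device__ float operator()(float x) const { return x * x; }
};

// One thread per element; threads past the end of the array do nothing.
template <typename F>
__global__ void map_kernel(float *d_out, const float *d_in, int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = f(d_in[i]);
}

// Hypothetical host-side map: copies in, applies f on the GPU, copies out.
template <typename F>
std::vector<float> gpu_map(const std::vector<float> &elements, F f) {
    const int n = static_cast<int>(elements.size());
    std::vector<float> out(n);
    if (n == 0) return out;
    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, elements.data(), bytes, cudaMemcpyHostToDevice);
    const int threads = 128;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // round up to cover all n
    map_kernel<<<blocks, threads>>>(d_out, d_in, n, f);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}

int main() {
    std::vector<float> h_in(64);
    for (int i = 0; i < 64; i++) h_in[i] = float(i);
    std::vector<float> h_out = gpu_map(h_in, Square());
    assert(h_out[8] == 64.0f);
    printf("square of %f is %f\n", h_in[8], h_out[8]);
    return 0;
}
```

Because each output element depends only on its own input element, no thread ever needs to wait for another, which is why this pattern suits the GPU.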

 

struct uchar4 {
         unsigned char x;
         unsigned char y;
         unsigned char z;
         unsigned char w;
};

Converting color to black and white

I = (R + G + B) / 3                        (simple average)

I = .299f * R + .587f * G + .114f * B      (weighted by perceived brightness)
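The weighted formula is itself a map over pixels. A minimal sketch, assuming each pixel is stored as a uchar4 with x = red, y = green, z = blue and w = alpha (ignored), and assuming a block size of 256; the kernel name rgba_to_grey and the wrapper to_grey are illustrative.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per pixel. Assumed channel layout: x = R, y = G, z = B, w = alpha.
__global__ void rgba_to_grey(const uchar4 *d_rgba, unsigned char *d_grey, int num_pixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_pixels) {
        uchar4 p = d_rgba[i];
        // The float result is truncated when converted to unsigned char.
        d_grey[i] = (unsigned char)(.299f * p.x + .587f * p.y + .114f * p.z);
    }
}

// Hypothetical host wrapper for a flat array of pixels.
std::vector<unsigned char> to_grey(const std::vector<uchar4> &rgba) {
    const int n = static_cast<int>(rgba.size());
    std::vector<unsigned char> grey(n);
    if (n == 0) return grey;
    uchar4 *d_rgba = nullptr;
    unsigned char *d_grey = nullptr;
    cudaMalloc((void **)&d_rgba, n * sizeof(uchar4));
    cudaMalloc((void **)&d_grey, n * sizeof(unsigned char));
    cudaMemcpy(d_rgba, rgba.data(), n * sizeof(uchar4), cudaMemcpyHostToDevice);
    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;
    rgba_to_grey<<<blocks, threads>>>(d_rgba, d_grey, n);
    cudaMemcpy(grey.data(), d_grey, n * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    cudaFree(d_rgba);
    cudaFree(d_grey);
    return grey;
}

int main() {
    std::vector<uchar4> pixels;
    pixels.push_back(make_uchar4(0, 0, 0, 255));    // black
    pixels.push_back(make_uchar4(0, 255, 0, 255));  // pure green
    std::vector<unsigned char> grey = to_grey(pixels);
    assert(grey[0] == 0);
    assert(grey[1] == 149);  // .587 * 255 = 149.685, truncated
    printf("black -> %d, green -> %d\n", grey[0], grey[1]);
    return 0;
}
```

Green carries the largest weight, so a pure green pixel comes out noticeably brighter than a pure red or pure blue one.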

 


Original article: http://blog.csdn.net/syzmatrix/article/details/42321729
