
The GPU-Puzzles project teaches GPU programming and the concepts behind CUDA-style parallel execution. Through a series of small puzzles you learn how to write and launch kernels, allocate shared memory, and implement operations such as convolution and matrix multiplication. As a bonus, each solved puzzle rewards you with a short puppy video 😁

The project repository: https://github.com/srush/GPU-Puzzles

I have worked through all of the puzzles myself, and this write-up walks through my solutions for reference.
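
All of the snippets below are meant to run inside the GPU-Puzzles notebook, which provides the CudaProblem and Coord helpers. A minimal sketch of the setup they assume (the module name lib is my guess from the repo layout; adjust it to wherever the helpers live in your copy):

# Setup assumed by every snippet in this walkthrough.
# NOTE: `lib` is an assumption based on the GPU-Puzzles repo layout; the helpers
# CudaProblem and Coord are defined there and drive the simulated kernel launches.
import numpy as np
import numba
from lib import CudaProblem, Coord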

Puzzle 1: Map 

Implement a "kernel" (GPU function) that adds 10 to each position of vector a and stores it in vector out. You have 1 thread per position.

The goal is for out to hold every element of a plus 10.

def map_spec(a):
    return a + 10

def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
        out[local_i] = a[local_i] + 10

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
)
problem.show()

Don't read this with a purely sequential Python mindset: the body of call runs once per thread, all in parallel, and cuda.threadIdx.x gives each thread its own index, so every thread writes exactly one element of out. A sequential sketch of the same effect follows, and the visualization of the run comes after it.
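
A plain-Python sketch (not CUDA code) of what the launch does; each loop iteration stands in for one thread:

import numpy as np

SIZE = 4
a = np.arange(SIZE)
out = np.zeros((SIZE,))
# One iteration per thread: thread local_i writes exactly one element of out.
for local_i in range(SIZE):
    out[local_i] = a[local_i] + 10
print(out)  # [10. 11. 12. 13.]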

# Map

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             1 |             1 |             0 |             0 |

Puzzle 2 - Zip

Implement a kernel that adds together each position of a and b and stores it in out. You have 1 thread per position.

Each element of out is the sum of the corresponding elements of a and b. No extra work is needed: local_i serves directly as the index for each position.

def zip_spec(a, b):
    return a + b

def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
        out[local_i] = a[local_i] + b[local_i]

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Zip", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec
)
problem.show()

Visualization:


# Zip

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             2 |             1 |             0 |             0 |

Puzzle 3 - Guards 

Implement a kernel that adds 10 to each position of a and stores it in out. You have more threads than positions.

This is an upgraded version of Map: there are more CUDA threads than positions, but that is fine; an if guard ensures local_i is a valid index before writing.

def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
        if local_i < size:
            out[local_i] = a[local_i] + 10

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Guard",
    map_guard_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(8, 1),
    spec=map_spec,
)
problem.show()

Visualization:

# Guard

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             1 |             1 |             0 |             0 |

Puzzle 4 - Map 2D

Implement a kernel that adds 10 to each position of a and stores it in out. Input a is 2D and square. You have more threads than positions.

This goes a step further and shows that CUDA threads can be laid out in two dimensions. Since we launch more threads than there are positions, both coordinates need bounds checks; a small sketch of how the 3×3 thread block covers the 2×2 array follows.
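
To make the guard concrete, here is a plain-Python sketch (not part of the puzzle) of which of the 9 threads in the (3, 3) block actually write, given SIZE = 2:

# Each of the 9 threads gets a distinct (threadIdx.x, threadIdx.y) pair; only the pairs
# with both coordinates < SIZE pass the guard and write one element of out.
SIZE = 2
for local_i in range(3):
    for local_j in range(3):
        active = local_i < SIZE and local_j < SIZE
        print((local_i, local_j), "writes" if active else "masked out")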

def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, local_j] + 10

    return call

SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
problem = CudaProblem(
    "Map 2D", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec
)
problem.show()

Visualization:

# Map 2D

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             1 |             1 |             0 |             0 |

Puzzle 5 - Broadcast

Implement a kernel that adds a and b and stores it in out. Inputs a and b are vectors. You have more threads than positions.

This is really a 2D version of the earlier Zip: a column vector and a row vector are broadcast against each other, so out[i, j] = a[i, 0] + b[0, j]. The thread block is larger than the output matrix, so bounds checks are still needed; the sketch below shows what the spec computes.
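
As a sanity check (separate from the kernel), NumPy broadcasting produces the expected output for the shapes used below:

import numpy as np

# zip_spec(a, b) = a + b; for a (2, 1) column and a (1, 2) row, NumPy broadcasting
# gives exactly out[i, j] = a[i, 0] + b[0, j], which is what the kernel reproduces.
a = np.arange(2).reshape(2, 1)   # [[0], [1]]
b = np.arange(2).reshape(1, 2)   # [[0, 1]]
print(a + b)                     # [[0 1]
                                 #  [1 2]]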

def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, 0] + b[0, local_j]

    return call

SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE).reshape(SIZE, 1)
b = np.arange(SIZE).reshape(1, SIZE)
problem = CudaProblem(
    "Broadcast",
    broadcast_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    spec=zip_spec,
)
problem.show()

Visualization:

# Broadcast

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             2 |             1 |             0 |             0 |

Puzzle 6 - Blocks

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

Unlike the previous puzzles, each block now has fewer threads than the array has elements. That is fine: blocks group threads, and the grid launches enough blocks that together they cover the whole array. Each thread computes its global index from its block index plus its index within the block, and a guard masks out the threads that fall past the end of the array. The sketch below shows how the global indices are formed.
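
A plain-Python sketch (not part of the puzzle) of how the launch configuration used below, 3 blocks of 4 threads, maps onto global indices for SIZE = 9:

# i = blockIdx.x * blockDim.x + threadIdx.x
# Block 0 covers i = 0..3, block 1 covers i = 4..7, block 2 covers i = 8..11;
# with SIZE = 9 the threads with i >= SIZE are masked out by the guard.
SIZE = 9
blockDim_x = 4
for blockIdx_x in range(3):
    for threadIdx_x in range(blockDim_x):
        i = blockIdx_x * blockDim_x + threadIdx_x
        status = "masked out" if i >= SIZE else "writes out[%d]" % i
        print(f"block {blockIdx_x}, thread {threadIdx_x} -> i = {i} ({status})")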

def map_block_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
        if i < size:
            out[i] = a[i] + 10

    return call

SIZE = 9
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Blocks",
    map_block_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(4, 1),
    blockspergrid=Coord(3, 1),
    spec=map_spec,
)
problem.show()

Visualization:

# Blocks

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             1 |             1 |             0 |             0 |

Puzzle 7 - Blocks 2D 

Implement the same kernel in 2D. You have fewer threads per block than the size of a in both directions.

Here each block of threads is smaller than the matrix in both directions, but that is fine: compute a global index for each dimension, guard the bounds, and write.

def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 4 lines)
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        if i < size and j < size:
            out[i, j] = a[i, j] + 10

    return call

SIZE = 5
out = np.zeros((SIZE, SIZE))
a = np.ones((SIZE, SIZE))

problem = CudaProblem(
    "Blocks 2D",
    map_block2D_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    blockspergrid=Coord(2, 2),
    spec=map_spec,
)
problem.show()

Visualization:

# Blocks 2D

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             1 |             1 |             0 |             0 |

Puzzle 8 - Shared 

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

This puzzle introduces shared memory. In CUDA (and in numba.cuda), a shared-memory array must be declared with a compile-time constant size, which is why TPB is a module-level constant. cuda.syncthreads() is a barrier: every thread in the block waits there until all of them have reached it, so the shared data is fully written before it is read back.

TPB = 4
def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # FILL ME IN (roughly 2 lines)
        if i < size:
            out[i] = shared[local_i] + 10

    return call

SIZE = 8
out = np.zeros(SIZE)
a = np.ones(SIZE)
problem = CudaProblem(
    "Shared",
    shared_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(2, 1),
    spec=map_spec,
)
problem.show()

Visualization:

# Shared

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             1 |             1 |             1 |             1 |

Puzzle 9 - Pooling

Implement a kernel that sums together the last 3 positions of a and stores it in out. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

The lesson here is to keep the number of global reads and writes down: values that would otherwise be read repeatedly from global memory are staged once into shared memory, which is much faster to re-read. A quick check of what the spec computes follows.
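
Before looking at the kernel, a plain-NumPy check of what pool_spec produces for the input used below (a = np.arange(8)):

import numpy as np

# out[i] sums the (up to) three most recent elements a[i-2], a[i-1], a[i].
a = np.arange(8)
expected = [int(a[max(i - 2, 0) : i + 1].sum()) for i in range(8)]
print(expected)  # [0, 1, 3, 6, 9, 12, 15, 18]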

def pool_spec(a):
    out = np.zeros(*a.shape)
    for i in range(a.shape[0]):
        out[i] = a[max(i - 2, 0) : i + 1].sum()
    return out

TPB = 8
def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)
        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()
            if i == 0:
                out[i] = shared[local_i]
            elif i == 1:
                out[i] = shared[local_i] + shared[local_i - 1]
            else:
                out[i] = shared[local_i] + shared[local_i - 1] + shared[local_i - 2]

    return call

SIZE = 8
out = np.zeros(SIZE)
a = np.arange(SIZE)
problem = CudaProblem(
    "Pooling",
    pool_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(1, 1),
    spec=pool_spec,
)
problem.show()

Visualization:

# Pooling

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             1 |             1 |             3 |             1 |

Puzzle 10 - Dot Product

Implement a kernel that computes the dot-product of a and b and stores it in out. You have 1 thread per position. You only need 2 global reads and 1 global write per thread.

Here we implement the dot product by hand. Each thread first stages its elementwise product into shared memory; after the barrier, a single thread (local_i == 0) sums the shared buffer and writes the result. This keeps every thread to 2 global reads and 1 global write. A sequential sketch of the two phases follows.
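
A sequential plain-Python sketch (not the kernel itself) of the two phases, assuming a single block of TPB threads and the arange inputs used below:

import numpy as np

TPB = 8
a = np.arange(TPB)
b = np.arange(TPB)
out = np.zeros(1)

shared = [0.0] * TPB
for local_i in range(TPB):      # phase 1: each thread stages one product into shared memory
    shared[local_i] = a[local_i] * b[local_i]
# cuda.syncthreads() sits between the two phases in the real kernel
out[0] = sum(shared)            # phase 2: only the thread with local_i == 0 accumulates
print(out)                      # [140.]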

def dot_spec(a, b):
    return a @ b

TPB = 8
def dot_test(cuda):
    def call(out, a, b, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 9 lines)
        if i < size:
            shared[local_i] = a[i] * b[i]
            cuda.syncthreads()
        if local_i == 0:
            total = 0.0
            for j in range(TPB):
                total += shared[j]
            out[0] = total

    return call

SIZE = 8
out = np.zeros(1)
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Dot",
    dot_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(SIZE, 1),
    blockspergrid=Coord(1, 1),
    spec=dot_spec,
)
problem.show()

Visualization:

# Dot

Score (Max Per Thread):
|  Global Reads | Global Writes |  Shared Reads | Shared Writes |
|             2 |             1 |             8 |             1 |

Puzzles 11-14 remain; since they are harder and their solutions are more involved, they are covered in the next article:

👉🏻 GPU Puzzles讲解(二) - CSDN博客
