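Given my limited ability and time, this article surely contains quite a few errors. Corrections are welcome at yixiangrong@hotmail.com, and I look forward to the discussion. Most of the content is original, and wherever material is copied the source is cited (please point out any omissions), so please credit the source when reposting: http://www.cnblogs.com/e-shannon/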
http://www.cnblogs.com/e-shannon/p/7495618.html
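As mentioned in the first section, to meet acceleration needs the industry has been defining open standards for a high-performance coherent interface to the CPU. In 2016 three such open standards appeared: OpenCAPI, Gen-Z, and CCIX.
The three standards have similar goals with slightly different emphases, and their memberships overlap; some members even belong to all three groups (Intel, of course, belongs to none of them). It is worth noting that these acceleration interfaces do more than accelerate the CPU: they also provide a forward-looking high-speed interface for attaching future high-speed memory, high-speed network storage, high-speed networking, and so on, to build efficient compute clusters.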
Their websites: www.ccixconsortium.com
http://genzconsortium.org/
www.opencapi.org
Excerpted from CCIX,Gen-Z,OpenCAPI_Overview&Comparison.pdf
CCIX uses PCIe 3.0 as its physical medium and provides full cache coherence between processors and accelerators. It targets low-latency memory expansion, CPU acceleration, and network storage.
Gen-Z, by contrast, focuses on coherent acceleration for interconnect between chassis. It also supports the PCIe physical layer and a modified 802.3 electrical layer, and of course claims support for memory, network devices, and so on. “Gen-Z’s primary focus is a routable, rack-level interconnect to give access to large pools of memory, storage, or accelerator resources,”
OpenCAPI is backed by IBM's POWER9 and uses the BlueLink high-speed interface at about 25 Gb/sec per lane, with low latency and very high bandwidth that matches main memory bandwidth. OpenCAPI will be concerned primarily with attaching various kinds of compute to each other and to network and storage devices that have a need for coherent access to memory across a hybrid compute complex. With OpenCAPI, I/O bandwidth can be proportional to main store bandwidth, and with very low 300 nanosecond to 400 nanosecond latencies, you can put storage devices out there, or big pools of GPU or FPGA accelerators and let them have access to main store and just communicate to it seamlessly. OpenCAPI is a ground-up design that enables extreme bandwidth that is on par with main memory bandwidth.
Gen-Z material is easy to obtain; CCIX is the hardest, since it requires membership.
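Yxr's note: I have not studied the differences and relationships among the three in detail, so I simply copied the material here for reference only. My impression is that the author of the source did not dig deeply either.
Still, this article focuses on OpenCAPI, because it is backed by POWER9.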
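First, IBM's original intent was to design an open acceleration interface that is independent of CPU architecture, so it separated OpenCAPI from OpenPOWER so that other CPU vendors could join too (I do not know whether Intel has joined; after all, the two once cooperated on InfiniBand). The article https://www.nextplatform.com/2016/10/17/opening-server-bus-coherent-acceleration/ even calls OpenCAPI "CAPI 3.0", which left me speechless.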
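OpenCAPI's layering resembles PCIe's, with three layers: the PHY layer, the DL (data link layer), and the TL (transaction layer). Unlike PCIe, however, and in keeping with the principle that "Open CAPI is agnostic to processor architecture", it does not define the PHY layer; that is left to the implementer. IBM's POWER9 uses BlueLink, which can be shared with NVLink (see the POWER9 figure). OpenCAPI defines only the DL and the TL. The TL also uses credits for flow control, and OpenCAPI uses virtual addresses. Compared with PCIe, its biggest advantages are lower latency and a simpler design, and it beats PCIe on both power and area.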
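To make the idea of credit-based flow control concrete, here is a minimal sketch of a sender-side credit counter. It is only an illustration of the general technique: the class name, method names, and command strings are hypothetical and are not taken from the OpenCAPI TL specification.

```python
class CreditedLink:
    """Toy sender-side credit counter (hypothetical names, not from the OpenCAPI spec)."""

    def __init__(self, initial_credits):
        self.credits = initial_credits  # receiver buffer slots the sender may still use
        self.sent = []                  # commands actually put on the link
        self.pending = []               # commands waiting for a credit

    def send(self, command):
        """Send if a credit is available, otherwise queue the command."""
        if self.credits > 0:
            self.credits -= 1
            self.sent.append(command)
            return True
        self.pending.append(command)
        return False

    def return_credits(self, count):
        """Receiver freed `count` buffer slots; drain queued commands in order."""
        self.credits += count
        while self.pending and self.credits > 0:
            self.credits -= 1
            self.sent.append(self.pending.pop(0))


link = CreditedLink(initial_credits=2)
results = [link.send(c) for c in ("rd_a", "rd_b", "wr_c")]  # third send must wait
link.return_credits(1)                                      # now "wr_c" goes out
```

The point of the scheme is that the sender never transmits more than the receiver has advertised room for, so nothing has to be dropped or retried because of buffer overflow.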
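To address the limitations of the PCIe architecture (high latency, bandwidth that still cannot keep up with memory bandwidth, and the lack of coherency), OpenCAPI on POWER9 introduces the BlueLink physical interface, 25 Gbps x 48 lanes, which can also carry NVLink 2.0 to support NVIDIA GPU acceleration. This is also why Google and the server vendor Rackspace put OpenCAPI ports on the "Zaius[dream1]" server, and why Xilinx has released IP that supports it. The PCI-Express stack is a limiter in terms of latency, bandwidth, and coherence. This is why Google and Rackspace are putting OpenCAPI ports on their co-developed Power9 system, and why Xilinx will add them to its FPGAs, Mellanox to its 200 Gb/sec InfiniBand cards, and Micron to its flash and 3D XPoint storage.
Yxr's note: because OpenCAPI does not define the PHY layer, other CPU vendors such as ARM, AMD, and Intel could also define their own PHY and run NVLink 2.0 and OpenCAPI over it.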
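The following are OpenCAPI's comparative advantages. Keep in mind that PCIe round-trip latency is on the order of 100 ns or more (the excerpt below gives "~100s ns"); Gen-Z seems to have a similar target.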
1. Server memory latency is critical TCO factor
Differential solution must provide ~equivalent effective latency of DDR standards
POWER8 DMI round trip latency: 10ns
Typical PCIe round trip latency: ~100s ns
Why is DMI so low?
DMI designed from ground up for minimum latency due to ld/str requirements
Open CAPI key concept
Provide DMI like latency, but with enhanced command set of CAPI
2. Virtual address based cache
Eliminates kernel and device driver software overhead
Improves accelerator performance
Allows device to operate directly on application memory without kernel-level data copies or pinned pages
Simplifies programming effort to integrate accelerators into applications
The Virtual-to-Physical Address Translation occurs in the host CPU
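Yxr's note: some sources mention that a drawback of OpenCAPI is that cache coherence will not arrive until OpenCAPI 4.0, and that what it offers now is memory coherence. I do not fully understand this.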
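The OpenCAPI data link layer supports serial data at 25 Gbps per lane. The basic configuration is 8 lanes, each running at 25.78125 Gbps. On the host side this layer is called the DL; on the OpenCAPI (device) side it is called the DLX.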
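The two per-lane figures above can be reconciled with a little arithmetic. The sketch below assumes that 25.78125 is the per-lane line rate in Gbit/s and that the gap to 25 Gbps is the 64b/66b encoding overhead mentioned later in this post; the function and constant names are hypothetical, and flit or protocol overhead is ignored.

```python
from fractions import Fraction

LINE_RATE_GBPS = Fraction("25.78125")   # per-lane line rate quoted in the text (assumed Gbit/s)
ENCODING_EFFICIENCY = Fraction(64, 66)  # 64b/66b: 64 payload bits per 66 line bits


def payload_gbps(lanes, line_rate=LINE_RATE_GBPS, efficiency=ENCODING_EFFICIENCY):
    """Payload bandwidth in Gbit/s for one direction, before flit/protocol overhead."""
    return lanes * line_rate * efficiency


x8 = payload_gbps(8)
print(float(x8), "Gbit/s =", float(x8 / 8), "GB/s per direction")
```

Under these assumptions one lane carries exactly 25 Gbit/s of payload, so the basic x8 link carries 200 Gbit/s, or 25 GB/s, in each direction. Calling the same function with lanes=48 gives the corresponding figure for the 48-lane POWER9 configuration mentioned earlier.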
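Link training starts with a reset driven by the out-of-band signal OCDE. Training a link has three parts: PHY training, PHY initialization, and DL training.
Training accomplishes speed matching, clock matching, link synchronization, and the exchange of lane information.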
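The training sequence described above can be pictured as a small state machine. This is a rough sketch only: the state names are hypothetical, and the rule that any failed phase falls back to reset is my own simplifying assumption, not something stated in the source.

```python
from enum import Enum


class TrainState(Enum):
    RESET = "reset"                # out-of-band OCDE reset asserted
    PHY_TRAINING = "phy_training"
    PHY_INIT = "phy_init"
    DL_TRAINING = "dl_training"
    LINK_UP = "link_up"


ORDER = [TrainState.RESET, TrainState.PHY_TRAINING, TrainState.PHY_INIT,
         TrainState.DL_TRAINING, TrainState.LINK_UP]


def next_state(state, step_ok):
    """Advance one phase on success; any failure falls back to RESET (assumption)."""
    if not step_ok:
        return TrainState.RESET
    if state is TrainState.LINK_UP:
        return state
    return ORDER[ORDER.index(state) + 1]


def train(outcomes):
    """Run the state machine over a sequence of per-phase success flags."""
    state = TrainState.RESET
    for ok in outcomes:
        state = next_state(state, ok)
    return state


final = train([True, True, True, True])  # reset released, then three phases succeed
```

Four successful steps take the link from reset through PHY training, PHY initialization, and DL training to the link-up state; a failure at any step restarts the sequence.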
DL flow control works on flit packets, presumably through an ACK and replay mechanism.
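Since the exact mechanism is only a guess here, the following is a generic sketch of how an ACK/replay scheme usually works, not the OpenCAPI DL algorithm. The class and method names are hypothetical, and sequence-number wraparound is deliberately left out.

```python
from collections import OrderedDict


class ReplaySender:
    """Generic ACK/replay sender (illustrative only, not the OpenCAPI DL)."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = OrderedDict()  # seq -> flit, kept until acknowledged

    def send(self, flit):
        """Transmit a flit and keep a copy in the replay buffer."""
        seq = self.next_seq
        self.unacked[seq] = flit
        self.next_seq += 1
        return seq, flit

    def ack(self, seq):
        """Cumulative ACK: everything up to and including `seq` is released."""
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]

    def replay(self):
        """On a NAK or timeout, resend every unacknowledged flit in order."""
        return list(self.unacked.items())


sender = ReplaySender()
for flit in ("flit_a", "flit_b", "flit_c"):
    sender.send(flit)
sender.ack(0)               # receiver confirmed flit_a
to_resend = sender.replay() # flit_b and flit_c are still outstanding
```

The sender keeps each flit until it is acknowledged, so a corrupted or lost flit can be retransmitted without involving the transaction layer.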
The DL uses 64b/66b encoding with LFSR scrambling (the exact polynomial is still to be looked up).
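Because the actual OpenCAPI polynomial is not known here, the sketch below borrows the self-synchronizing scrambler polynomial 1 + x^39 + x^58 used with 64b/66b in 10GBASE-R Ethernet, purely as an assumption to show how this kind of LFSR scrambling works. OpenCAPI may use a different polynomial or a different scrambler structure, and the seed value is arbitrary.

```python
MASK58 = (1 << 58) - 1  # the scrambler remembers the last 58 line bits


def scramble(bits, state=0):
    """Self-synchronizing scrambler, polynomial 1 + x^39 + x^58 (assumed)."""
    out = []
    for b in bits:
        s = b ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(s)
        state = ((state << 1) | s) & MASK58  # shift in the scrambled bit
    return out, state


def descramble(bits, state=0):
    """Inverse operation; its state is built from the received bits."""
    out = []
    for s in bits:
        out.append(s ^ ((state >> 38) & 1) ^ ((state >> 57) & 1))
        state = ((state << 1) | s) & MASK58  # shift in the received bit
    return out, state


data = [(i * 7 + i // 3) % 2 for i in range(200)]  # arbitrary test pattern
seed = 0x123456789ABCDE                            # arbitrary transmitter state
line_bits, _ = scramble(data, seed)
same_seed, _ = descramble(line_bits, seed)         # receiver knows the seed
cold_start, _ = descramble(line_bits, 0)           # receiver starts from zero
```

With a matching seed the descrambler recovers the data exactly. Starting from the wrong state it still recovers everything after the first 58 bits, because its state depends only on received line bits; that is what "self-synchronizing" means.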
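This is the core part and is fairly complex.
As the figure below shows, OpenCAPI moves the PSL to the CPU side. The benefit is that OpenCAPI is no longer tied to the OpenPOWER CPU architecture, which makes it easier for other CPU vendors to adopt; the cache and coherence are both encapsulated in the CPU. The physical layer uses BlueLink, which cuts latency relative to PCIe and raises bandwidth. PCIe's advantage, of course, is its wide vendor adoption; in CAPI the PSL (including the cache) was placed on the AFU side precisely to overcome PCIe's limitations.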
While CAPI was governed by IBM and metered across the OpenPOWER Consortium, OpenCAPI is completely open, governed by the OpenCAPI Consortium led by the companies I listed above. The OpenCAPI consortium says they plan to make the OpenCAPI specification fully available to the public at no charge before the end of the year. Mellanox Technologies, Micron, and Xilinx were CAPI supporters, OpenPOWER members, and are now part of OpenCAPI. NVIDIA and Google were part of OpenPOWER and are now OpenCAPI members
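These questions are mainly doubts I had while studying, together with my own guesses at the answers; I share them here.
1) Q: Since OpenCAPI is so good, is there no need to upgrade CAPI any further?
A by yxr: My guess is that CAPI is led by IBM and tied to OpenPOWER, whereas OpenCAPI is independent of the CPU ISA. IBM can therefore develop CAPI on its own, focused on the OpenPOWER architecture, without too many outside concerns. Then again, some people call OpenCAPI "CAPI 3.0", so CAPI may end up being replaced.
2) Q: Why does CAPI place the PSL on the accelerator side, and even package it as IP? The PSL should contain a cache and, together with the CAPP, is responsible for cache coherence. So the essential difference from OpenCAPI is whether the cache and address translation sit on the CPU side or on the accelerator side. Why did CAPI not take the same approach, that is, put the PSL on the CPU side?
A by yxr: My guesses: first, putting the PSL on the CPU side would enlarge the chip because of the cache; second, viewing the accelerator as a peer of the CPU, a cache should sit right next to the processor it serves for the efficiency gain to be significant (that is the whole point of a cache). If the accelerator accesses only its local cache, it also sidesteps PCIe's large round-trip latency. Of course, once the access latency of the OpenCAPI physical link (BlueLink) is low enough, placing the cache on the CPU side no longer hurts performance.
This leads to the third question, about CCIX: it uses PCIe as its physical link, and I do not know how it avoids the high latency.
3) Q: CAPI and CCIX both use PCIe as the physical link, so high latency seems unavoidable. How does CCIX overcome this? On which side does CCIX put the cache? What are the differences between the two, and what are the advantages of each?
4) Q: Does OpenCAPI 3.0 implement only memory coherence and not cache coherence? What is memory coherence?
A by yxr: This surprised me; I had always assumed that coherent meant cache coherent. The so-called memory coherence here may refer to how memory is kept consistent across a multi-chassis compute cluster (Hadoop, for example).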
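Future plans
1) Get familiar with the OpenCAPI protocol layers, especially the TL, focusing on how cache coherency is achieved with the PSL placed on the host side
Look at how its protocol interface differs from CAPI's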
[dream1]Zaius is a dual-socket platform based on the IBM POWER9 Scale Out CPU. It supports a host of new technologies including DDR4 memory, PCIE Gen4 and the OpenCAPI interface. It’s designed with a highly efficient 48V-POL power system and will be compatible with the 48v Open Rack V2.0 standard. The Zaius BMC software is being developed using Open BMC, the framework for which we’ve released on GitHub. Additionally, Zaius will support a PCIe Gen4 x16 OCP 2.0 mezzanine slot NIC
Original article: http://www.cnblogs.com/e-shannon/p/7496194.html