BTB Unit
Branch prediction in z processors is performed “asynchronously” to instruction processing
– The branch prediction logic can find/locate/predict future occurrences of branch-type instructions (including calls and returns) and their corresponding directions (taken or not taken) and targets (where to go next) on its own, without requiring / waiting for the downstream pipeline to actually decode / detect a branch instruction
Annotation: this covers call/return prediction; whether a return address stack (RAS) is included is not stated.
– The branch prediction logic tries its best to predict the program path much further into the program code than where the instruction fetching unit is currently delivering instructions (and should be way ahead of where the execution engines are executing)
Annotation: this is the basic operating mechanism of the BTB unit; how many cycles ahead of instruction fetch it runs is not stated.
The branch prediction logic adopts many advanced algorithms / structures for maintaining and predicting branching behavior in program code, as seen in Figure 3, including
– First level branch target buffer (BTB1) and branch (direction) history table (BHT1)
Annotation: the BHT stores a PC tag and the corresponding 2-bit saturating counter.
– Second level target and history buffers (BTB2 and BHT2) (introduced since zEC12) with a pre-buffer (BTBP) used as a transient buffer to filter out unnecessary histories
  • Note: BHT2 is only used in zEC12
Annotation: this is the second-level BHT.
– Accelerators for improving prediction throughput (ACC) by “predicting the prediction” (since zEC12) so it can make a prediction every cycle (for a limited subset of branches)
– Pattern based direction and target predictors (PHT and CTB) to predict based on “how the program gets here” branch history (that represents the program flow), e.g. for predicting the end of a branch-on-count loop, or a subroutine return that has multiple callers
Annotation: what exactly does ACC stand for? The source material describes it as a “Column Predictor”.
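The annotation on BHT1 mentions a PC tag plus a 2-bit saturating counter per entry. A minimal sketch of that idea follows. The table size, the direct-mapped indexing, the initial counter values, and the class name `BHT` are all assumptions made for illustration; the slides do not describe the real z-processor layout.

```python
class BHT:
    """Toy direct-mapped branch history table (hypothetical layout)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.tags = [None] * entries   # PC tag per entry (full PC here, for simplicity)
        self.counters = [1] * entries  # 2-bit counters: 0,1 = not taken; 2,3 = taken

    def _index(self, pc):
        # Instructions start on halfword boundaries, so bit 0 carries no information.
        return (pc >> 1) % self.entries

    def predict(self, pc):
        """Return True/False for a known branch, or None on a tag miss."""
        i = self._index(pc)
        if self.tags[i] != pc:
            return None
        return self.counters[i] >= 2

    def update(self, pc, taken):
        """Train the entry with the resolved direction of the branch at pc."""
        i = self._index(pc)
        if self.tags[i] != pc:
            # Assumed replacement policy: install in a weak state matching the outcome.
            self.tags[i] = pc
            self.counters[i] = 2 if taken else 1
            return
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
```

The point of the two bits is hysteresis: once a branch has reached the strong state, a single outcome in the other direction does not flip the prediction.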
• The branch prediction logic communicates its prediction results to the instruction fetching logic through an overflow queue (BPOQ), so that it can always search ahead of where instructions are being fetched
Annotation: this is the essence of how the BTB unit reduces the branch penalty.
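The search-ahead behavior can be pictured as a walk over BTB contents that never looks at instruction bytes. The sketch below is a simplified model under stated assumptions: the BTB is a plain dictionary from branch address to `(taken, target)`, a not-taken branch resumes the search 2 bytes later (the smallest instruction size), and the function name `search_ahead` and the use of a `deque` as the BPOQ are hypothetical.

```python
from bisect import bisect_left
from collections import deque


def search_ahead(btb, start, max_predictions):
    """Follow the predicted path from `start` using only BTB entries.

    btb maps branch address -> (taken, target). Returns a queue of
    (branch_address, taken, target) tuples in predicted program order.
    """
    branch_addrs = sorted(btb)
    bpoq = deque()  # stands in for the prediction queue read by instruction fetch
    addr = start
    while len(bpoq) < max_predictions:
        # Find the next known branch at or after the current search address.
        i = bisect_left(branch_addrs, addr)
        if i == len(branch_addrs):
            break  # no further branch known on this path
        branch = branch_addrs[i]
        taken, target = btb[branch]
        bpoq.append((branch, taken, target))
        # Taken: continue at the target. Not taken: continue past the branch.
        addr = target if taken else branch + 2
    return bpoq
```

Because the walk depends only on previously learned branches, it can run as far ahead as the queue capacity allows, and the fetch logic consumes the queued predictions when it reaches those addresses.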
Instruction Delivery
• Since z/Architecture instructions are of variable lengths of 2, 4 and 6 bytes, an instruction can start at any halfword (integral 2-byte) boundary
• Instruction fetching fetches “chunks” of storage-aligned data from the instruction cache, starting at a disruption point, e.g. after a taken branch (including subroutine calls and returns) or a pipeline flush
– Up to 2 16-byte chunks for z196 and zEC12; up to 4 8-byte chunks for z13
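These “chunks” of data are then written into an instruction buffer (as a “clump”), where instructions are extracted (or parsed) into individual z-instructions in program order
Annotation: each fetch brings in up to 32 bytes of instruction text (2 × 16-byte or 4 × 8-byte chunks) rather than a fixed number of instructions, so sequential instructions accumulate in the instruction buffer. Because the instruction set is variable-length, the following instructions have to be picked out of this buffer at unaligned (halfword) offsets, which is why the slides say “extracted” and “parsed”.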
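Extraction from the instruction buffer can be sketched in a few lines. In z/Architecture the two leftmost bits of the first opcode byte give the instruction length (00 means 2 bytes, 01 or 10 means 4 bytes, 11 means 6 bytes). The buffer handling below, including the function names and the choice to stop at an incomplete trailing instruction, is an illustrative assumption and not a description of the hardware parser.

```python
def instruction_length(first_byte):
    """Length in bytes, derived from the top two bits of the first opcode byte."""
    top = first_byte >> 6
    if top == 0b00:
        return 2
    if top == 0b11:
        return 6
    return 4  # 0b01 and 0b10


def parse_instructions(buffer, start=0):
    """Split buffered bytes into (offset, instruction_bytes) pairs in program order.

    Returns the parsed list and the offset of the first unconsumed byte, so a
    caller can resume once the next chunk has been appended to the buffer.
    """
    parsed = []
    pos = start
    while pos < len(buffer):
        n = instruction_length(buffer[pos])
        if pos + n > len(buffer):
            break  # instruction spills past the buffered data; wait for more
        parsed.append((pos, bytes(buffer[pos:pos + n])))
        pos += n
    return parsed, pos


# Example buffer: a 2-byte, a 4-byte and a 6-byte instruction, followed by
# the first half of a 4-byte instruction that has not been fully fetched yet.
example = bytes([0x07, 0xFE,
                 0x41, 0x10, 0x20, 0x00,
                 0xD2, 0x07, 0x10, 0x00, 0x20, 0x00,
                 0xA7, 0xF4])
instructions, next_pos = parse_instructions(example)
```

Each instruction boundary depends on the length of the instruction before it, which is why parsing has to begin at a known starting point such as a branch target.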
Instruction Grouping
• As instructions (and μops) are grouped, they are subject to various grouping rules, which prevent certain instructions from being grouped with others
• In z196 and zEC12, one group of up to 3 instructions can be formed at a time, while z13 allows two groups of up to 3 instructions at a time
• Once instructions are dispatched (or written) into the issue queue as a group, they are tracked in the global completion table (GCT) until every instruction in the group has finished processing; then the group is completed and retired
Annotation: this resembles the grouping mechanism of VLIW processors.
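A toy model of grouping and group-level completion tracking might look like the following. It is only a sketch: the single “must be alone” rule stands in for the real (unlisted) grouping rules, retirement is assumed to happen in program order, instructions are identified by unique names, and `form_groups` and `GlobalCompletionTable` are hypothetical names.

```python
from collections import deque


def form_groups(instructions, solo=frozenset(), max_size=3):
    """Pack instructions into groups of up to max_size, in program order.

    `solo` is a stand-in grouping rule: those instructions get a group of their own.
    """
    groups, current = [], []
    for ins in instructions:
        if ins in solo:
            if current:
                groups.append(current)
                current = []
            groups.append([ins])
            continue
        current.append(ins)
        if len(current) == max_size:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


class GlobalCompletionTable:
    """Tracks dispatched groups until every instruction in a group has finished."""

    def __init__(self):
        self._groups = deque()  # oldest group first

    def dispatch(self, group):
        self._groups.append({ins: False for ins in group})

    def finish(self, ins):
        for group in self._groups:
            if ins in group:
                group[ins] = True
                return
        raise KeyError(ins)

    def retire(self):
        """Retire fully finished groups from the oldest end (assumed in-order)."""
        retired = []
        while self._groups and all(self._groups[0].values()):
            retired.append(list(self._groups.popleft()))
        return retired
```

Tracking completion per group rather than per instruction means a younger group that finishes early still waits behind an older, unfinished group in this model.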
Original article: http://www.cnblogs.com/smile-at-desert/p/7750772.html