sandy bridge

Posted: 2016-06-11 18:45:27


SANDY BRIDGE SPANS GENERATIONS

Intel Focuses on Graphics, Multimedia in New Processor Design

By Linley Gwennap {9/27/10-01}


Intel’s processor clock has tocked, delivering a next-generation architecture for PCs and servers. At the recent Intel Developer Forum (IDF), the company unveiled its Sandy Bridge processor architecture, the next tock in its tick-tock roadmap. The new CPU is an evolutionary improvement over its predecessor, Nehalem, tweaking the branch predictor, register renaming, and instruction decoding. These changes will slightly improve performance on traditional integer applications, but we may be reaching the point where the CPU microarchitecture is so efficient that few ways remain to improve performance.

The big changes in Sandy Bridge target multimedia applications such as 3D graphics, image processing, and video processing. The chip is Intel’s first to integrate the graphics processing unit (GPU) on the processor itself. This integration not only eliminates an external chip, but it improves graphics performance by more closely coupling the GPU and the CPU. Sandy Bridge introduces the Advanced Vector Extensions (AVX), which double peak floating-point throughput. AVX will accelerate many 3D-graphics and imaging applications. The new processor also adds hard-wired video encoding.

Sandy Bridge will first appear in desktop and notebook processors that will be announced in early 2011 and branded as “2nd generation Intel Core” processors. It will later roll into a family of server processors. Because these chips use the same 32nm process as the current Westmere-based designs, increases in top speed will be modest. A focus on power efficiency, however, enables Sandy Bridge to achieve more performance within the same TDP limits as its predecessor. Improvements in turbo mode enable even greater performance for short time periods. For notebook computers, these improvements can significantly extend battery life by completing tasks more quickly and allowing the system to revert to a sleep state.

Integration Boosts Graphics Performance

Intel had a false start with integrated graphics: the ill-fated Timna project, which was canceled in 2000. More recently, Nehalem-class processors known as Arrandale and Clarkdale “integrated” graphics into the processor, but these products actually used two chips in one package, as Figure 1 shows. By contrast, Sandy Bridge includes the GPU on the processor chip, providing several benefits. The GPU is now built in the same leading-edge manufacturing process as the CPU, rather than an older process, as in earlier products. This change alone provides a huge improvement in performance per watt, increasing the speed of the shader engines while reducing their power.

Figure 1. Arrandale versus Sandy Bridge. Arrandale combines a Westmere-based processor chip and a separate north-bridge chip in a single package. Sandy Bridge combines the processor and the north bridge, including graphics, on a single chip.

The GPU can now access the processor’s large level-three (L3) cache (which Intel now calls the last-level cache, or LLC). When graphics functions executed in a simple pipeline, caching was irrelevant. A modern shader-based architecture, however, can access the same data over and over as it layers one effect over another. The L3 cache short-circuits the path to main memory, providing its information much more quickly. This extra speed improves graphics performance while reducing pressure on the limited DRAM bandwidth.

Instead of carrying forward a GPU from an earlier chipset, Intel’s graphics team in Folsom, California, totally redesigned the Sandy Bridge GPU. The biggest change is the amount of fixed-function logic. Although programmable shaders (which Intel calls execution units, or EUs) are area efficient, performing many different graphics functions in a single unit, fixed-function blocks are more power efficient for any given task. Taking advantage of the 32nm technology’s greater transistor budget, Intel added more fixed-function units for common tasks such as texturing, vertex processing, rasterization, and Z-buffering. This approach, which is widely used by other GPU vendors, allows the shaders to focus on shading; they also perform unusual tasks that are not worth implementing in dedicated hardware.

Intel also redesigned the shaders themselves. Because these units have a proprietary instruction set, the company can make radical changes to the architecture as long as it rewrites its drivers to compensate. For Sandy Bridge, Intel expanded the shader’s register set, improved the parallel branching, and added a transcendental math (e.g., sine and cosine) unit to speed up those functions. Each shader is multithreaded and can execute two instructions per cycle. The new GPU has up to 12 shader engines that operate at one-third of the CPU frequency. Each shader can generate eight single-precision floating-point operations per cycle, producing total GPU performance of 96 FP operations per GPU clock cycle, or 109 gigaflops at a CPU speed of 3.4GHz. By comparison, the shaders in Nvidia’s high-end GeForce GTX 480 GPU are rated at a total of 1,345 gigaflops, albeit at a much higher price and power level.
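The peak-throughput figure above is simple arithmetic, and a short sketch makes the steps explicit. The function and parameter names are illustrative only; the default values are the ones quoted in the text (12 shaders, eight SP operations per shader per GPU clock, GPU clock at one-third of the CPU frequency).

```python
def gpu_peak_gflops(cpu_ghz, shaders=12, flops_per_shader_per_clock=8,
                    gpu_clock_ratio=1 / 3):
    """Peak single-precision GFLOPS of the GPU, using the article's figures."""
    # The GPU runs at a fraction of the CPU clock.
    gpu_ghz = cpu_ghz * gpu_clock_ratio
    # Operations per GPU clock across all shaders, times the GPU clock rate.
    return shaders * flops_per_shader_per_clock * gpu_ghz


if __name__ == "__main__":
    # 96 operations per GPU clock at roughly 1.13GHz; the article rounds to 109.
    print(f"{gpu_peak_gflops(3.4):.1f} GFLOPS")
    # A six-shader configuration delivers half of that peak.
    print(f"{gpu_peak_gflops(3.4, shaders=6):.1f} GFLOPS")
```

The same function shows why a six-shader model has half the peak shader throughput at the same clock.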

As Figure 2 shows, the GPU also contains a multiformat codec engine (MFX) that performs video decoding for MPEG2, VC1, and H.264 AVC at up to 1080p (HD) resolution. Although current platforms also have a video-decoding engine, this unit does not perform motion compensation and in-loop deblocking; these critical functions are dumped on the programmable shaders, resulting in greater power consumption. Like most competing hardware video decoders, Sandy Bridge’s MFX performs the entire decode process in hardware, cutting power in half compared with the previous platform.

Unlike current platforms, Sandy Bridge also performs hardware video encoding. The encoder handles the H.264 AVC and MPEG2 codecs, again at 1080p resolution. Other codecs are encoded in software on the main CPU. Shifting video encoding to the GPU’s media engine not only greatly reduces power but also improves performance. This speedup is important during video transcoding (converting from one codec to another), which can be processed faster than real-time video capture. Using the new hardware engine, Sandy Bridge can encode AVC video much faster than current platforms; Intel did not disclose how fast, but it claimed at IDF that “tasks that once took minutes will now take seconds.”

AVX Widens FPU to 256 Bits

As announced by Intel in 2008, AVX is a new set of x86 instruction-set extensions that logically follows SSE4. AVX increases the width of the 16 XMM registers from 128 bits to 256 bits; the wider registers have the tasty moniker YMM. Each register can hold eight single-precision (SP) floating-point values or four double-precision (DP) floating-point values that can be operated on in parallel using SIMD (single-instruction, multiple-data) instructions. AVX also adds three-register instructions (e.g., c=a+b), whereas previous instructions could only use two registers (a=a+b).

Figure 2. Sandy Bridge graphics unit. The GPU contains a number of programmable shaders (EUs) that are assisted by several fixed-function blocks that perform common graphics and video tasks.

To implement the AVX instructions, Intel extended the width of the FP multiply unit and the FP add unit to accommodate the wider YMM registers. To save die area (and thus leakage current), instead of just adding more FP transistors, the lower half of the calculation uses the existing 128-bit SIMD integer data path. Of course, this approach still adds some new transistors to implement the new capability, but it adds fewer than separate units would require.

The new FPU has a 256-bit FP multiply (FMUL) unit and a separate 256-bit FP add (FADD) unit. Together, these two units can perform 16 SP operations per cycle. AMD’s Bulldozer CPU (see MPR 8/30/10-01, “AMD Bulldozer Plows New Ground”) achieves the same level of peak performance in a different way: it provides two 128-bit FP multiply-add (FMA) units that can be combined to execute one 256-bit FMA instruction per cycle. Sandy Bridge does not implement FMA (which will likely debut in the 2013 Haswell CPU), but it can achieve its peak performance on code that uses separate FADD and FMUL instructions. Note that Bulldozer’s peak performance is cut in half when using FADD and FMUL instead of FMA.
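The peak-rate comparison can be worked through with a small sketch. It assumes 32-bit SP lanes and counts a fused multiply-add as two operations per lane; the helper name and its parameters are hypothetical, chosen only for this illustration.

```python
def sp_ops_per_cycle(unit_width_bits, units, ops_per_lane):
    """Peak single-precision operations per cycle for a set of SIMD FP units."""
    lanes = unit_width_bits // 32  # one SP value occupies 32 bits
    return units * lanes * ops_per_lane


# Sandy Bridge: one 256-bit FMUL plus one 256-bit FADD, one operation per lane each.
sandy_bridge = sp_ops_per_cycle(256, units=2, ops_per_lane=1)

# Bulldozer: two 128-bit FMA units; a fused multiply-add counts as two operations.
bulldozer_fma = sp_ops_per_cycle(128, units=2, ops_per_lane=2)

# Bulldozer on code that issues separate FADD and FMUL instructions.
bulldozer_no_fma = sp_ops_per_cycle(128, units=2, ops_per_lane=1)

if __name__ == "__main__":
    print(sandy_bridge, bulldozer_fma, bulldozer_no_fma)
```

Under these assumptions the two designs tie at peak only when Bulldozer's code uses FMA, which is the point the paragraph makes.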

Register Files Get Physical

To produce the Sandy Bridge CPU in just two years, Intel’s Haifa (Israel) team borrowed heavily from the Nehalem CPU design, following the principle of “if it ain’t broke, don’t fix it.” Implementing AVX, however, broke the register-renaming method used in Nehalem (and its predecessor, Merom, which was also designed by the Haifa team). In these previous CPUs, each instruction stores its source values in the scheduler and its result in the reorder buffer (ROB), which copies the result to the physical register file once the instruction is retired. In-flight instructions must check the ROB (using a content-addressable memory) to locate the most recent copy of any source registers that they require.

Implementing AVX doubles the potential operand size, which could greatly increase the size of both the scheduler and the ROB. Instead, Intel decided to revert to the physical-register scheme used in the old Pentium 4 (and by most RISC processors that implement register renaming). In this method, the CPU implements a rename register file that is much larger than the logical register file. When an instruction is issued, its result is assigned to an unused entry in the rename file. Its source operands are taken from the appropriate rename registers defined by the current mapping. If an exception occurs, the CPU simply replaces the register map with the most recent committed mapping.

This approach eliminates the copying of data from the ROB to the physical register file, saving power. The rename entries for each register file are the same width as that file, so only the YMM rename registers need to be 256 bits wide; using the previous method, every entry in the scheduler and the ROB would have to be able to hold a 256-bit result. As Table 1 shows, this simplification also enabled the team to increase the number of scheduler and ROB entries by 50% and 30%, respectively, increasing the size of the reorder window.
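The physical-register scheme described above can be illustrated with a toy model. This is a minimal sketch of the general technique, not Intel's implementation: the class, its method names, and the in-order retirement queue are all assumptions made for the example. Note that results are never copied; only the logical-to-physical map changes.

```python
class Renamer:
    """Toy physical-register-file renamer (illustrative only)."""

    def __init__(self, num_logical, num_physical):
        assert num_physical > num_logical
        # Initially, logical register i lives in physical register i.
        self.speculative_map = list(range(num_logical))
        self.committed_map = list(range(num_logical))
        self.free_list = list(range(num_logical, num_physical))
        self.in_flight = []  # (logical_dest, new_phys, old_phys) in program order

    def rename(self, dest, sources):
        """Rename one instruction; returns (dest_phys, [source_phys, ...])."""
        # Sources read the current mapping before the destination is remapped.
        src_phys = [self.speculative_map[s] for s in sources]
        if not self.free_list:
            raise RuntimeError("no free rename register; the pipeline would stall")
        new_phys = self.free_list.pop(0)
        old_phys = self.speculative_map[dest]
        self.speculative_map[dest] = new_phys
        self.in_flight.append((dest, new_phys, old_phys))
        return new_phys, src_phys

    def retire_oldest(self):
        """Commit the oldest instruction and recycle the register it superseded."""
        dest, new_phys, old_phys = self.in_flight.pop(0)
        self.committed_map[dest] = new_phys
        self.free_list.append(old_phys)

    def flush(self):
        """On an exception, drop speculative work and restore the committed map."""
        for _dest, new_phys, _old_phys in self.in_flight:
            self.free_list.append(new_phys)
        self.in_flight.clear()
        self.speculative_map = list(self.committed_map)


if __name__ == "__main__":
    r = Renamer(num_logical=4, num_physical=8)
    print(r.rename(1, [2, 3]))  # first writer of logical register 1
    print(r.rename(2, [1, 1]))  # reads the renamed copy of register 1
    r.retire_oldest()
    r.flush()
    print(r.speculative_map)
```

Recovering from an exception is a map copy rather than a data copy, which is the power saving the text attributes to this approach.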

Branch Predictors Get Small

The Sandy Bridge team focused on a few other areas to improve the CPU. Because of Nehalem’s long pipeline, accurate branch prediction is critical to its performance. Mispredicted branches not only waste many CPU cycles, they also waste the power used to execute the unneeded instructions. The team wanted to improve the accuracy of the branch predictor, but without taking a brute-force approach that required more memory and thus more power.

To do so, the team discarded the long-standing wisdom that a 2-bit predictor is better than a 1-bit predictor. Most branches are, in fact, monotonic, so the 2-bit predictor provides a relatively small improvement in accuracy—not enough, the team reasoned, to justify the cost and power of the second bit. Put another way, a 2-bit predictor is less accurate than a 1-bit predictor with twice as many entries, yet both use about the same die area and power. Sandy Bridge actually implements slightly more than 1 bit per entry, sharing a single “confidence” bit across several entries to save space and power.
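To see what the second bit buys, consider a toy model of a single predictor entry implemented as a saturating counter. This sketch is an assumption-laden illustration: it models neither Sandy Bridge's shared confidence bit nor the effect of table size and aliasing, and the function name is hypothetical.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions for one saturating-counter predictor entry.

    bits=1 remembers only the last outcome; bits=2 is the classic two-bit
    counter. The counter starts at 0, so the first prediction is "not taken".
    """
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)  # predict taken when state >= threshold
    state = 0
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            misses += 1
        # Move toward the observed outcome, saturating at both ends.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return misses


# A monotonic branch that is always taken.
monotonic = [True] * 1000
# A loop branch: taken nine times, then not taken once, repeated 100 times.
loop = ([True] * 9 + [False]) * 100

if __name__ == "__main__":
    for name, pattern in (("monotonic", monotonic), ("loop", loop)):
        print(name, mispredictions(pattern, 1), mispredictions(pattern, 2))
```

On the monotonic branch the two predictors differ only during warm-up; the second bit helps only on patterns such as the loop exit, where it avoids one misprediction per loop.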

The team then looked at the branch target buffer (BTB), which used 64-bit entries to hold the target addresses. Most branches, however, have very short displacements, looping back or jumping forward several instructions or so. The Sandy Bridge BTB has mostly small entries to hold these offsets, with a few wide entries for long-displacement branches. This change saved so much space that the team was able to double the number of BTB entries and still have room left over. This extra space is used to extend the global branch history, further improving prediction accuracy. As a result, accuracy is improved within about the same die area and power for the branch-prediction unit.

Like Nehalem, Only Better

Like Nehalem, Sandy Bridge uses a 32KB instruction cache and decodes four x86 instructions per cycle, converting them into simpler RISC-like micro-ops. A new addition is the micro-op (L0) cache, which can hold 1.5K micro-ops. Assuming an average of 3.5 bytes per x86 instruction, this cache is the equivalent of a 5.25KB instruction cache. Intel claims that the typical hit rate for this cache is about 80%. The L0 cache replaces Nehalem’s loop buffer, which also stores micro-ops but has only 28 entries.
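The equivalent-size estimate can be reproduced in a few lines. The sketch takes 1.5K to mean 1,536 micro-ops and assumes one micro-op per x86 instruction, which is what the 5.25KB figure implies; the function name and parameters are illustrative.

```python
def equivalent_icache_kb(micro_ops=1536, bytes_per_x86_instruction=3.5,
                         micro_ops_per_instruction=1.0):
    """Instruction-cache capacity (KB) holding the same number of instructions."""
    instructions = micro_ops / micro_ops_per_instruction
    return instructions * bytes_per_x86_instruction / 1024


if __name__ == "__main__":
    print(equivalent_icache_kb())  # the article's 5.25KB estimate
    # If instructions averaged more than one micro-op, the equivalent shrinks.
    print(equivalent_icache_kb(micro_ops_per_instruction=1.2))
```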

Because instructions already stream from the instruction cache at the pipeline’s full rate, the new design saves cycles only when execution can be restarted from the L0 cache after a mispredicted branch. The overall performance improvement is thus minimal. More importantly, instructions whose micro-ops are already in the L0 cache do not need to be fetched, predecoded, decoded, and converted to micro-ops. Thus, the new cache reduces the power used for these complex tasks by about 80%.

	Nehalem	Sandy Bridge
Load Buffers	48 entries	64 entries
Store Buffers	32 entries	36 entries
Scheduler Entries	36 micro-ops	54 micro-ops
Integer Rename File	Not applicable	160 registers
FP Rename File	Not applicable	144 registers
Reorder Buffer	128 micro-ops	168 micro-ops

Table 1. Nehalem versus Sandy Bridge. The new CPU design replaces the old physical register file, which contained a single copy of each logical register, with a new larger set of rename registers. Sandy Bridge also expands the size of the reorder buffer, the scheduler, and the load and store buffers, providing the CPU with greater opportunities to reorder instructions. (Source: Intel)

Aside from the AVX changes, Sandy Bridge uses the same execution core as Nehalem. Its 32KB data cache is backed by a 256KB level-two (L2) cache, as Figure 3 shows. Nehalem’s data cache, however, supports one 128-bit load and one 128-bit store per cycle. Sandy Bridge can perform two loads and one store per cycle, doubling the load bandwidth. This change helps support the AVX instructions, enabling the cache to perform one 256-bit AVX load per cycle. AVX stores require two cycles to complete.

Ring Around the Cores

Sandy Bridge contains a new component, the system agent, that controls what was previously called the north bridge: the memory controller, PCI Express, display interfaces, and the DMI connection to the external south-bridge chip (PCH). The exact system interfaces will vary depending on the target platform, but the initial products will support two DDR3 SDRAM channels and a single ×16 PCI Express 2.0 interface that can also be configured as two ×8 ports. The chip uses the Embedded DisplayPort interface to connect to the internal display in a notebook computer. Other display and system interfaces are in the south bridge.

Instead of the single L3 cache used in Nehalem and Westmere, Sandy Bridge divides the L3 cache into four blocks. Intel has not disclosed details, but we expect each block to be 2MB in size, totaling 8MB of L3 cache in a four-CPU processor. Although each CPU is physically adjacent to one of the cache blocks, there is no association of data between the CPU and its neighboring cache block. Instead, each CPU simply sees a single, large L3 cache. Dividing this cache into four blocks, however, allows each block to service requests simultaneously, quadrupling the total cache bandwidth. At 3.4GHz, the L3 cache can deliver 435GB/s.
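The bandwidth figure follows from the block count and the clock rate. The sketch below assumes each cache block delivers 32 bytes (256 bits) per cycle; that width is not stated in this paragraph but is the value that reproduces the 435GB/s figure. The function name is illustrative.

```python
def l3_bandwidth_gbs(clock_ghz, blocks=4, bytes_per_block_per_cycle=32):
    """Aggregate L3 bandwidth in GB/s when every block services a request each cycle."""
    return blocks * bytes_per_block_per_cycle * clock_ghz


if __name__ == "__main__":
    print(f"{l3_bandwidth_gbs(3.4):.1f} GB/s")            # four blocks
    print(f"{l3_bandwidth_gbs(3.4, blocks=2):.1f} GB/s")  # a two-block variant
```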

To improve bus bandwidth and simplify scalability, Sandy Bridge implements a ring interconnect instead of the usual internal bus. Intel used a ring in Nehalem-EX and in its ill-fated Larrabee (see MPR 9/29/08-01, “Intel’s Larrabee Redefines GPUs”), but it has not previously used this structure in its mainstream processors. Other processors use rings, notably NetLogic’s XLR and XLP (see MPR 7/26/10-01, “NetLogic Broadens XLP Family”).

Because each station in the ring connects only to the next station, the ring segments are short enough to operate at the full CPU speed. Furthermore, each ring segment can potentially carry different data. (The Sandy Bridge design actually includes four separate rings: one for data and three for address, coherency, and control information.) The downside of a ring is that data may need to traverse multiple hops to get to its destination. These extra hops add latency, but given the faster speed of the ring, two or even three hops can take as little time as a single transfer on a traditional multidrop bus.

Figure 3. Complete block diagram of Sandy Bridge CPU. Important changes from Nehalem include an improved branch predictor, a new micro-op cache, new AVX registers and execution units, and the ability to load two 128-bit values per cycle from the data cache.

Sandy Bridge implements a unique type of ring that minimizes the number of hops. As Figure 4 shows, the ring has 10 stations, but each CPU/cache block shares two stations, one connected to the portion of the ring going “up” and the other connected to the portion going “down.” For each transfer, the ring interface chooses the direction, up or down, that will get the data to its destination the soonest. In this configuration, the worst-case number of hops to or from any CPU or cache is four. Because a CPU can access its associated cache block without using the ring (i.e., zero hops), the average trip from a CPU to cache is only 1.25 hops. The average trip from a CPU/cache to or from the GPU or the system agent is 2.5 hops. As a result, the average L3 cache latency is 26 to 31 cycles compared with about 36 in Nehalem.
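The hop counts quoted above can be reproduced with a simple model. The sketch assumes the stops are laid out in a line, with the system agent at one end, the four CPU/cache pairs in the middle, and the GPU at the other end, and that the up and down stations let every transfer head straight toward its destination. The ordering and all names are assumptions for illustration.

```python
from itertools import product

# Assumed physical order of the ring stops.
STOPS = ["agent", "cpu0", "cpu1", "cpu2", "cpu3", "gpu"]
POSITION = {name: index for index, name in enumerate(STOPS)}
CPUS = ["cpu0", "cpu1", "cpu2", "cpu3"]  # each CPU shares its stop with one cache block


def hops(src, dst):
    """Hops between two stops when the shorter direction is always chosen."""
    return abs(POSITION[src] - POSITION[dst])


# Any CPU to any cache block, including its own (zero hops).
avg_cpu_to_cache = sum(hops(a, b) for a, b in product(CPUS, CPUS)) / len(CPUS) ** 2
# Any CPU/cache pair to the GPU or to the system agent.
avg_cpu_to_gpu = sum(hops(c, "gpu") for c in CPUS) / len(CPUS)
avg_cpu_to_agent = sum(hops(c, "agent") for c in CPUS) / len(CPUS)
# Worst case for any transfer that starts or ends at a CPU/cache pair.
worst_case = max(hops(c, other) for c, other in product(CPUS, STOPS))

if __name__ == "__main__":
    print(avg_cpu_to_cache, avg_cpu_to_gpu, avg_cpu_to_agent, worst_case)
```

Under this layout the model yields the same averages and worst case as the text, which suggests the quoted figures assume uniformly distributed traffic.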

Assuming that 25% of the cache traffic passes directly through to its neighboring CPU without getting on the ring, the ring itself must handle three transactions per cycle to sustain the remaining cache bandwidth. That means, on average, either the up or down leg must handle two transactions at once, which may be difficult to schedule. Thus, the ring may not be able to sustain the full bandwidth of the caches, but it will provide much greater bandwidth than Nehalem’s internal bus.

The architecture easily scales downward by removing two of the CPUs and their associated cache blocks. This change reduces both the size and bandwidth of the L3 cache to match the reduction in CPU count. With only two cache blocks, the ring can sustain the full cache bandwidth. Alternatively, the architecture can be extended to eight or more CPUs. In this case, Intel could double the ring width to 512 bits to help support the greater bandwidth of the additional cache blocks.

Figure 5 shows a die photo of the chip in which the six ring stations can clearly be seen. Intel did not disclose the die size of the chip, but it is said to be about 225mm². The graphics and video unit at the bottom of the chip takes 20% of the total die area. This unit appears to consist mainly of synthesized logic, whereas the CPUs and caches use custom circuit design to maximize performance. The north-bridge logic, not counting the I/O drivers, is only 8% of the die. The four CPUs together use 33% of the die, making them just under 20mm² each. The 8MB of L3 cache (L3$) covers 20% of the die. Most of the I/O drivers are for the two 64-bit channels of DDR3 SDRAM. Note that removing two of the CPUs and two of the cache blocks would reduce the total die area by about 30%.

Revving Up Turbo Mode

As with Intel’s other multicore processors, the frequency of each CPU core can be adjusted individually, so it can be turned on or off as needed. Even when its clock is turned off, however, the CPU burns leakage power. So, when the processor is truly idle, the voltage to the CPUs can be turned off, completely eliminating their power dissipation. The graphics unit is on a separate voltage plane, so it can be disabled for applications that do not require graphics. The system agent and north-bridge logic are on a third power plane, which allows the north bridge to refresh the display while the rest of the processor is asleep.

The system agent contains the power manager—a programmable microcontroller that controls the power state of all units and controls reset functions. This approach allows power-management decisions to be made more quickly than by running a driver on one of the CPUs.

If only one Nehalem CPU is active, it is allowed to exceed its rated speed, since even at a higher speed, one CPU will not exceed the multicore chip’s total TDP budget. Intel calls this feature turbo boost. Sandy Bridge takes a different view of turbo boost, focusing on die temperature rather than instantaneous power dissipation.

Consider a processor that has been sitting idle for several seconds; perhaps the user is considering what digital effect to apply to a photograph. The processor and its thermal system (e.g., heat sink) will cool down. If the user then begins a CPU-intensive activity, the entire processor can operate well above its rated TDP for quite some time (up to 25 seconds, Intel claims) while the thermal system heats back up. Once the chip reaches its maximum operating temperature, it must throttle back to its rated TDP level to avoid overheating.

Figure 4. High-level block diagram of Sandy Bridge. The L3 cache is divided into four blocks, which are connected to the four CPU cores, the GPU, and the system agent through a ring interconnect.

Figure 5. Sandy Bridge die photo with overlay. The red boxes show the ring stations. The 12 shaders (EUs) can be seen at the lower left. (Source: photo by Intel, overlay by The Linley Group)

Sandy Bridge’s power manager estimates the chip’s temperature and calculates the degree and duration of the turbo boost. In this design, the maximum CPU speed in turbo mode is limited only by the physical speed of the transistors and the current that can be supplied to the chip. Turbo mode should be particularly effective in mobile Sandy Bridge chips, since these chips have more headroom. Because the chip does not physically measure its own temperature, the power manager estimates the temperature on the basis of recent workloads and the thermal characteristics of the device and its cooling system. Presumably, Intel has built in a safety factor, but any unauthorized or even unintentional modification of the cooling system could wreak havoc with the power manager’s calculations.
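Intel has not disclosed how the power manager performs this estimate. Purely as an illustration of the idea, the sketch below uses a first-order thermal model (a single thermal resistance and time constant) to estimate how long a boost can last before the die reaches its limit. The model, the function name, and every numeric value are hypothetical and are not Intel's algorithm or data.

```python
import math


def boost_seconds(idle_temp_c, max_temp_c, ambient_c, boost_watts,
                  thermal_resistance_c_per_w, time_constant_s):
    """Estimated boost duration under a first-order thermal model.

    Temperature follows T(t) = T_ss + (T_0 - T_ss) * exp(-t / tau), where
    T_ss = ambient + power * thermal_resistance is the steady-state value.
    """
    if idle_temp_c >= max_temp_c:
        return 0.0  # already at the limit, so no headroom remains
    steady_state_c = ambient_c + boost_watts * thermal_resistance_c_per_w
    if steady_state_c <= max_temp_c:
        return math.inf  # this power level never reaches the limit
    # Solve T(t) = max_temp_c for t.
    ratio = (steady_state_c - idle_temp_c) / (steady_state_c - max_temp_c)
    return time_constant_s * math.log(ratio)


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the model.
    common = dict(max_temp_c=87.5, ambient_c=40.0,
                  thermal_resistance_c_per_w=0.5, time_constant_s=20.0)
    print(boost_seconds(idle_temp_c=50.0, boost_watts=130.0, **common))
    print(boost_seconds(idle_temp_c=45.0, boost_watts=130.0, **common))
```

Even this crude model shows the behavior the text describes: the cooler the chip starts, the longer it can run above its rated power, and a misjudged thermal resistance would skew the estimate.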

The integration of GPU and CPUs on one chip creates additional turbo opportunities. If the GPU is not in use or is operating at a low speed, the power manager can assign some of its thermal budget to the CPUs, allowing them to run at a higher clock speed. Conversely, on a graphics-intensive application, the power manager can overclock the GPU while slowing down the CPUs.

Turbo mode should provide a noticeable speedup to PC users—up to 40% in some situations. Users who often leave their system idle or in a low-power state before launching a brief burst of computation will particularly benefit. Gamers or professional users who execute long CPU-intensive jobs will not see as much gain. Benchmarking a Sandy Bridge processor could generate widely different results depending on how the test is run. Intel recommends allowing the processor to rest quietly for a moment before running each benchmark, but this approach may not be indicative of real-world usage.

Sandy Bridge Transition in 2011

As with previous generations, Sandy Bridge will sweep through Intel’s entire product line, but the transition will take some time. The initial products, which are due to appear early next year, will be in the Core i3, Core i5, and Core i7 product lines. Although Intel has not announced details, leaked roadmap slides provide the initial product configurations. According to this information, these products will offer two or four CPUs at speeds of up to 3.4GHz. The high-end 3.4GHz four-core version will top out at 95W TDP. Mobile versions with four cores at 2.5GHz will use 55W TDP, with a dual-core 2.7GHz version rated at just 35W TDP. Some models (probably the dual-CPU versions) will use 6 shaders instead of 12, reducing graphics performance.

Together, these Sandy Bridge products will quickly displace the dual-core Arrandale (mobile) and Clarkdale (desktop) processors, as Figure 6 shows. Some high-end PCs are still using the quad-core Clarksfield (mobile) and Lynnfield (desktop) processors, which are built in 45nm technology; these products can convert to the quad-core Sandy Bridge processors.

Sandy Bridge will cross into the server market in 2H11 using a platform called Romley. At IDF, Intel demonstrated Romley-EP, a dual-socket (2P) platform that succeeds Tylersburg-EP, using a version of Sandy Bridge with eight CPUs. An eight-CPU version uses more die area than the six-CPU Westmere chip, but Sandy Bridge’s lower power per CPU should allow it to fit into a similar thermal budget (TDP).

For 4P and 8P systems, Intel is preparing the Romley-EX platform, which follows Westmere-EX (see MPR 9/20/10-01, “Intel Unveils Atom E6xx, Westmere-EX”) and the Boxboro-EX platform. The new high-end platform includes the Sandy Bridge-EX processor, which we expect to have 12 CPUs.

Figure 6. Intel processor roadmap. In early 2011, PC users will convert from Intel’s Arrandale and Clarkdale processors to Sandy Bridge. In 2H11, servers will convert to the Romley platform, which supports new versions of Sandy Bridge.

Intel has disclosed little about Romley but says it will use a new version of Intel’s QuickPath Interconnect (QPI) to connect the processors to each other. Sandy Bridge-EP will include two QPI v1.1 links for this purpose. Intel has not specified the speed of the new version, but QPI v1.0 delivers 12.8GB/s in each direction at a clock rate of 3.2GHz. We expect the new links to provide a modest increase, perhaps to 3.8GHz. The processor will still use DMI to connect to the south bridge, which in the Romley platform is code-named Patsburg.

We expect the server versions of Sandy Bridge to have some minor changes in addition to the greater number of CPUs. The on-die graphics unit will probably be omitted. Whereas Westmere-EP has three DRAM channels for six CPUs, Sandy Bridge-EP could use four channels to provide a better match for its eight CPUs. Although the initial Sandy Bridge processors will integrate PCI Express 2.0, the server versions provide a good opportunity for Intel to debut the faster PCI Express 3.0. The new specification, which is due to be completed in late 2010, doubles PCI Express throughput to 8Gbps per lane, or 16GB/s for a 16-lane slot.
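The PCI Express figures are raw signaling rates. A short sketch shows how line encoding affects usable per-direction bandwidth; the encoding schemes used here (8b/10b for PCI Express 2.0, 128b/130b for PCI Express 3.0) come from the published specifications rather than from this article, and the function name is illustrative.

```python
def pcie_gbytes_per_s(gigatransfers_per_s, lanes, payload_bits, encoded_bits):
    """Usable per-direction bandwidth in GB/s after line-encoding overhead."""
    usable_gbps_per_lane = gigatransfers_per_s * payload_bits / encoded_bits
    return usable_gbps_per_lane * lanes / 8  # 8 bits per byte


# PCI Express 2.0: 5GT/s per lane with 8b/10b encoding.
gen2_x16 = pcie_gbytes_per_s(5.0, lanes=16, payload_bits=8, encoded_bits=10)
# PCI Express 3.0: 8GT/s per lane with 128b/130b encoding.
gen3_x16 = pcie_gbytes_per_s(8.0, lanes=16, payload_bits=128, encoded_bits=130)

if __name__ == "__main__":
    print(f"Gen2 x16: {gen2_x16:.2f} GB/s, Gen3 x16: {gen3_x16:.2f} GB/s")
```

The leaner encoding is what lets a 60% increase in signaling rate nearly double the usable throughput; the 16GB/s quoted above is the raw rate before the small 128b/130b overhead.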

Intel has already taped out its first 22nm processor. Following the tick-tock plan, Ivy Bridge is a 22nm shrink of Sandy Bridge that is scheduled to enter production in late 2011. As a “tick,” Ivy Bridge is not expected to add many new features to the platform, but the 22nm process will significantly reduce power per CPU and enable Intel to put more CPUs on a chip. Following Ivy Bridge is Haswell, the next tock. Little is known about Haswell, but it is said to be a completely new CPU microarchitecture.

Intel Steamrolls Bulldozer

AMD “coincidentally” disclosed its next-generation Bulldozer architecture just one month before Intel disclosed Sandy Bridge. Thus, it is natural to compare the two designs. Although AMD has beefed up its microarchitecture, Bulldozer matches up better against Nehalem/Westmere than it does against Sandy Bridge, as Table 2 shows. Intel has again upped the ante by expanding the reorder capabilities of its CPU. Sandy Bridge’s micro-op cache is a unique feature that offers performance and power-savings opportunities. We also suspect Intel’s branch prediction is superior, although neither vendor has disclosed enough information to be sure. Bulldozer’s 16KB data cache may also hinder it on certain single-thread applications.

On dual-threaded applications, Bulldozer’s unique two-core module offers clear benefits over Intel’s Hyper-Threading design. Sandy Bridge must share its execution units and reorder buffer between the two threads, whereas Bulldozer offers each thread a separate set of resources. But AMD counts each Bulldozer module as two cores, so all we can say is that two Bulldozer cores are better than one Sandy Bridge.

	AMD Bulldozer	Intel Westmere	Intel Sandy Bridge
Building Block	Dual CPU	CPU core	CPU core
Integrated Graphics?	Not disclosed	In package	On chip
Instruction Extensions	SSE1–SSE4.2, AES, AVX	SSE1–SSE4.2, AES	SSE1–SSE4.2, AES, AVX
Number of Threads	2 threads	2 threads	2 threads
Instr Decode Rate	4 per cycle	4 per cycle	4 per cycle
Micro-Op Issue Rate	4 per cycle*	4 per cycle	4 per cycle
Instr Reordering	128 micro-ops*	128 micro-ops	168 micro-ops
FP Operations (SP)	16 per cycle	8 per cycle	16 per cycle
L0 (micro-op) Cache	None	None	1.5K micro-ops
L1 Cache (I/D)	64KB/16KB†	32KB/32KB	32KB/32KB
L2 Cache	Undisclosed	256KB	256KB
IC Process	32nm	32nm	32nm
First Production	Mid-2011‡	Early 2010	Early 2011

Table 2. Comparison of AMD Bulldozer and Intel Sandy Bridge CPUs. Greater instruction reordering and an L0 cache should improve Sandy Bridge’s performance over Bulldozer’s. *Dedicated amount for each thread; †L1 data cache dedicated for each thread. (Source: vendors, except ‡The Linley Group estimate)

Although AMD has been talking about integrating graphics for years, Intel has actually been doing it, first with the two-chip module of Arrandale and Clarkdale and then with the single-die Sandy Bridge. AMD’s first Fusion chip, code-named Ontario (see MPR 8/30/10-02, “Bobcat Snarls at Atom”), is scheduled to debut in 4Q10, only slightly before Sandy Bridge. We do not expect the first Bulldozer processors to integrate graphics. Sandy Bridge should negate any gains AMD might have achieved from its Fusion project.

Sandy Bridge products will appear in early 2011, whereas we expect the first Bulldozer products to ship in mid-2011. Although this schedule gives Sandy Bridge a head start of several months in the desktop and mobile markets, Bulldozer is headed straight for the server market in the Valencia and Interlagos processors. Thus, Bulldozer may actually appear in servers before Sandy Bridge does (in the Romley platform).

Whereas Sandy Bridge is the clear winner on a core-to-core basis, a Bulldozer core will use much less power, allowing AMD to operate Valencia at a higher clock speed within the same thermal envelope. Sandy Bridge-EP should handily outperform Valencia on SPECint per watt, although the Bulldozer chip may narrow the gap from current products.

Rising to the Next Level

Sandy Bridge brings several architectural innovations to Intel’s platforms. PC users will see the biggest improvements in multimedia performance, particularly video encoding but also graphics and image processing. On these types of applications, which are becoming increasingly popular, users will see an improvement of 2× or better compared with Intel’s previous graphics solutions.

The CPU microarchitecture improvements are fairly subtle, given the evolutionary design approach. We expect performance per megahertz to improve very modestly, perhaps by 10% on most applications. Many of the design changes, particularly the L0 micro-op cache, aim to reduce power more than to increase performance. Intel has not quantified these gains, but we estimate they could trim CPU power by 10% at the same clock speed. Because most of the company’s products are power limited, Intel can turn power efficiency into higher clock speeds at the same TDP levels, providing an effective performance boost of 20% for Sandy Bridge compared with Westmere. In certain situations, the improved turbo boost will magnify this gain by as much as 40%, which can translate directly into longer battery life for notebook computers.

The integration of graphics has several effects. It increases performance while greatly reducing power: we expect a performance gain of about 2× at one-fifth the power compared with the Arrandale/Clarkdale graphics unit. Integration is likely to reduce manufacturing cost compared with the current two-chip module (or with an external GPU chip). Graphics integration also makes PC makers less likely to choose a third-party graphics solution. Although the performance of Intel’s graphics has generally been modest at best, the improvement in Sandy Bridge makes it acceptable for most PC users. Gamers and others who care about maximizing graphics performance will still opt for an external graphics card.

Intel’s rollout schedule brings Sandy Bridge to the PC market in early 2011, but it will not reach the server market until late that year. In targeting the PC market first, Intel satisfies the bulk of its customer base and should extend its lead over AMD in PCs. But AMD’s focus on the server market may allow Bulldozer to beat Sandy Bridge to that market. Furthermore, AMD’s strategy of making Bulldozer a smaller, lighter core may prove more power efficient than Intel’s bulkier Sandy Bridge. AMD has been successful with server customers by undercutting Intel on price; Bulldozer may provide it with an additional advantage in that market.

Intel’s track record of bringing new processor architectures to market is impressive and unrivaled. No other processor vendor can completely remake its product line on an annual basis, yet Intel has done just that in each of the past four years. With Sandy Bridge, the company is poised to go five for five. The new processor shows that Intel is thinking outside of the CPU box, delivering new graphics and multimedia capabilities to support emerging usage models. Taking advantage of the growing transistor budget provided by its industry-leading fab technology, the company is integrating new features along with additional system functions, improving efficiency. Sandy Bridge will carry Intel into the next generation of PCs. ♦

For More Information

Intel has not yet announced any products using the Sandy Bridge architecture. We expect the first Sandy Bridge products to ship in early 2011. For more information on Sandy Bridge, download Intel’s IDF presentations from www.intel.com/go/idfsessions (search for Sandy Bridge). For complete information on the AVX extensions, access http://software.intel.com/en-us/avx.

To subscribe to Microprocessor Report, access www.MPRonline.com or phone us at 408-945-3943.


Original source: http://www.cnblogs.com/ruxiaobai/p/5575621.html
