AMD Bulldozer Exclusive
David Sarmiento, 02 April 2010
The best prank of this April Fools' Day
Today is a very special day in the microprocessor industry. AMD has allowed us a first look at its upcoming Bulldozer core.
Bulldozer Architecture Overview
The following schematic view provides a more detailed look at the upcoming Bulldozer architecture than what was shown at the Analyst Day event last November. Among its architectural improvements, the cache hierarchy stands out: a 16 kB (4-way) L1 data cache per core with 1-cycle latency and a 128 kB (4-way) L1 instruction cache per module; a full-speed 2 MB (8-way) L2 cache per module (shared between its two cores); and an 8 MB L3 cache shared among all cores, with a latency of 24 cycles and the ability to serve up to 2 simultaneous requests per clock cycle.
The processor will use a 32 nm manufacturing process on the new 1591-pin Socket AF1, and these chips will feature a quad-channel DDR3 memory controller that can also operate in triple-, dual-, and single-channel modes. Bulldozer will come in two variants aimed at different market segments: Interlagos, aimed at servers (Opteron), and Zambezi, aimed at consumers (Phenom).
AMD EDO: Turbo on Steroids
With Bulldozer, AMD introduces the EDO Performance Accelerator, which offers two levels of dynamic power adjustment, subsequently increasing the power and frequency of the cores that remain active. AMD guarantees a minimum 30%-35% performance increase over the stock frequency. Examples:
2.0 GHz > 2.6 GHz (Base multiplier [200 MHz] of 10 + 3 for a 30% increase)
2.4 GHz > 3.2 GHz (Base multiplier [200 MHz] of 12 + 4 for a 33% increase)
2.8 GHz > 3.8 GHz (Base multiplier [200 MHz] of 14 + 5 for a 35% increase)
3.2 GHz > 4.2 GHz (Base multiplier [200 MHz] of 16 + 5 for a 31% increase)
3.6 GHz > 4.8 GHz (Base multiplier [200 MHz] of 18 + 6 for a 33% increase)
4.0 GHz > 5.2 GHz (Base multiplier [200 MHz] of 20 + 6 for a 30% increase)
But it doesn't end there. AMD has also recognized enthusiasts and offers a second level of overclocking. When the processor is operating below its thermal and electrical limits and the user's workload demands additional performance, the processor's clock frequency will dynamically increase in 200 MHz steps at short, regular intervals. Examples:
2.0 GHz > 3.2 GHz (Base multiplier [200 MHz] of 10 + 3 + 3 for a 60% increase)
2.4 GHz > 3.8 GHz (Base multiplier [200 MHz] of 12 + 4 + 3 for a 58% increase)
2.8 GHz > 4.4 GHz (Base multiplier [200 MHz] of 14 + 5 + 3 for a 57% increase)
3.2 GHz > 4.8 GHz (Base multiplier [200 MHz] of 16 + 5 + 3 for a 50% increase)
3.6 GHz > 5.4 GHz (Base multiplier [200 MHz] of 18 + 6 + 3 for a 50% increase)
4.0 GHz > 5.8 GHz (Base multiplier [200 MHz] of 20 + 6 + 3 for a 45% increase)
Having covered the technical details, now comes the part many have been waiting for: the first tests showing the performance of this new family of AMD CPUs.
Sandra Synthetic Benchmarks
As you can see, the Interlagos chip dominates the field thanks to its quad-channel DDR3 and the new design of its floating-point unit. Zambezi beats the older Phenom II, but still can't overcome the i7.
SYSmark Synthetic Benchmarks
Gaming Tests
Power Consumption
We have reached the end, and the question is: "How much power does all this amazing performance require?"
As you can see, Bulldozer sets a new standard for very low power consumption at idle.
As we have seen, AMD is firing on all cylinders, dominating in almost every benchmark despite how early this revision of the core is.
Unfortunately we have had to condense much of the article, but on the next page we provide AMD's full note in English.
AMD Bulldozer Exclusive
Today is a very special day in the microprocessor industry. AMD has given us here at AMDZone an unprecedented first look at the upcoming Bulldozer core, addressing a lot of the misinformation out there on the net about it. The first question we asked was: why us, and why announce now, when AMD has a history of not releasing any information about upcoming products until they are more or less ready to ship?
The first part of the question: why us? That was answered by John Fruehe of AMD's server marketing division: "AMDZone has a long track record of helping AMD owners with general questions and of getting the truth out about their [AMD] products, as well as being the second largest AMD self-help forum, second only to AMD's own forums. We find it only fitting that we should announce the upcoming Bulldozer and Bobcat cores on AMDZone."
When asked why now, John Fruehe answered: "Simply put, Bulldozer is so far ahead of the performance curve that we feel confident that even Intel, with its next-generation chip, will not be able to match the Bulldozer design."
AMD was gracious enough to provide us with test boxes of the upcoming system; the chips are Bulldozer A6 stepping.
Bulldozer Architecture Overview
The following schematic view was given to us to provide an even more detailed look at the upcoming Bulldozer module architecture than what was shown at the Financial Analyst Day last November.
So as you can see, there are 4 full integer pipelines per core, capable of executing up to 4 micro-ops per cycle; in the case of only 1-2 micro-ops, they can run both paths of a branch, eliminating branch misprediction penalties.
The instruction fetch unit (IFU) can fetch from several threads (program pointers) alternatingly, including possible branch targets. For that to work, the branch prediction unit (BPU) tries to identify branches and their targets and controls the working of the IFU. If the instruction queues of the units to be fed are already well filled, the IFU/BPU pair tries to prefetch code to avoid idle cycles. Having prefetched the right code bytes in 50% of all fetches is still better than having no code ready at all; in reality this number is even better.
After a block of 32 code bytes is fetched and queued in an instruction fetch queue, the decode unit receives such a packet each cycle for decoding. To decode it quickly, it has four dedicated decode subunits, each of which can decode most x86 and SIMD instructions on its own at a rate of 1 per cycle per subunit. Rarely used or complex instructions are decoded using microcode storage (ROM and SRAM), which can happen in parallel with the decoding of the "simple" instructions. There are no inefficiencies like in K10. XOP and AVX instructions are decoded either one per subunit (if the operand width is <= 128 bit) or one per two subunits (256 bit, similar to the double decode of SSE and SSE2 instructions in K8). The results are "double mops" (pairs of micro-ops, similar to the former MacroOps).
After decoding finishes, the double mops (which can have one unused slot) are sent to the dispatch unit, which prepares packets of up to four double mops (dispatch packets) and dispatches them to the cores or the FPU depending on their scheduler fill status. Already decoded mops are also written to the corresponding trace cache to be used later if the code has to be executed again (e.g. in loops). Thanks to these caches, the actual decode units are freed up and can be used to decode code bytes further down the program path. If a needed dispatch packet is already in the cache, the dispatcher can dispatch that packet to the core needing it and, in parallel, dispatch another packet (from the decoders or the other trace cache) to the second core. So there won't be any bottleneck here.
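The decode-width rules described above can be sketched as a toy model: four subunits per cycle, where a <=128-bit instruction occupies one subunit and a 256-bit AVX instruction occupies two. This is purely illustrative; the greedy in-order accounting is our assumption, not a documented mechanism.

```python
# Toy model of the described decode stage: four subunits per cycle;
# a <=128-bit instruction uses one subunit, a 256-bit instruction uses
# two (like the double decode of SSE/SSE2 on K8). Illustrative only.
SUBUNITS = 4

def cycles_to_decode(widths):
    """Greedy, in-order: cycles needed to decode a stream of operand widths."""
    cycles, free = 0, 0
    for w in widths:
        cost = 2 if w > 128 else 1
        if free < cost:      # start a new cycle with 4 fresh subunits
            cycles += 1
            free = SUBUNITS
        free -= cost
    return cycles

# Four 128-bit ops fit in one cycle; four 256-bit ops need two.
print(cycles_to_decode([128] * 4))  # 1
print(cycles_to_decode([256] * 4))  # 2
```

Under this model, a stream of pure 256-bit AVX code halves the effective decode rate, which is consistent with the "one per two subunits" statement in the text.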
The schedulers in the cores and the FPU select the mops ready for execution by the four pairs of ALUs and AGUs per core, depending on available execution resources and operand dependencies. Here there is more flexibility than in the K10, with its separate lanes and the inability of µOps to switch lanes to find a free execution resource. To save power, the execution units are only activated if mops needing them become ready for execution; this is called wakeup.
The integer execution units - arithmetic logic units (ALUs) and address generation units (AGUs) - are organized in four pairs, one per instruction pipeline. They can execute x86 integer code and memory ops (also for FP/SIMD ops) and, which is the biggest change, can be combined to execute SSE or AVX integer code. This increases throughput significantly and frees the FP units somewhat. The general purpose register file (GPRF) has been widened to 128 bit to allow for this. Registers will be copied between the GPRF and the floating point register file (FPRF) if an architectural SIMD register (one of the registers specified by the ISA) is used for integer first and floating point later, or vice versa. Since this doesn't happen often, it has practically no impact on performance. With the option to use the integer units for integer SIMD code (SSE, XOP and AVX), the overall throughput of SIMD code increases dramatically.
The FPU contains the already known two 128 bit wide FMAC units. These can execute either one of the new fused multiply-add (FMA) instructions or, alternatively, a floating point add and a mul operation (or other types of operations covered by the internal fpadd and fpmul units). This ability provides both lower energy consumption and higher throughput for the simpler operations. As AMD already stated, the two 128 bit units will usually be used in parallel by the two threads running on the integer cores, but in cycles where one core doesn't need the FPU, both can be used by a single thread, increasing its FP throughput. This happens on a per-cycle basis and resembles some form of SMT. The FPU scheduler communicates with the cores so that they can track the state of each instruction belonging to the threads running on them.
Both the integer and the floating point units need data to work with. This is provided by the two 16 kB L1 data caches: each core has its own data cache and load store unit (LSU). The load store unit handles all memory requests (loads and stores) of the thread running on the same core and of the shared FPU. It is able to serve two loads and one store per cycle, each of them up to 128 bit wide. This results in a load bandwidth of 32 B/cycle and a store bandwidth of 16 B/cycle, per core. A big change compared to the LSU of the K10 is the ability to do data and address speculation. Even without knowing the exact address of a memory operation (which isn't known until after the mop executes in an AGU), the unit uses access patterns and other hints to speculate whether some data is the same as other data whose address is already known. Finally, the LSU is also able to execute all memory operations out of order, not only loads. To make all this possible without too much effort, the engineers at AMD added the ability to create checkpoints at any point in time, go back to such a point, and replay the instruction stream in case of a misspeculation.
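The bandwidth figures above follow directly from the issue widths, assuming every load and store moves a full 128 bits (16 bytes):

```python
# Quick check of the stated per-core LSU figures: two loads plus one
# store per cycle, each up to 128 bits (16 bytes) wide.
OP_BYTES = 128 // 8  # 16 bytes per memory operation

load_bw = 2 * OP_BYTES   # two loads per cycle
store_bw = 1 * OP_BYTES  # one store per cycle

print(load_bw, store_bw)  # 32 16
```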
To reduce the number of mispredicted branches and the latency of the resulting fetch operations, the branch predictors have been improved. They are able to predict multiple branches per cycle and can issue prefetches of code bytes which might be needed soon. Together with the trace caches, it is often possible that even after a branch misprediction (which is only known after executing the branch instruction), the correct dispatch packets are already in the trace cache and can be dispatched from there with low latency.
One big feature which improves performance a lot is the ability to clock units at different frequencies (provided by flexible and efficient clock generators), to power off any idle subunit, and to adapt the sizes of caches, TLBs, and some buffers and queues according to the needs of the executed code. A power controller keeps track of the load and power consumption of each of the many subunits and adapts clocks and units as needed. Further, it increases the throughput and power consumption of heavily loaded units as long as the processor doesn't exceed its power consumption and temperature limits. For example, if the queues and buffers of core 0 are filled and the FPU is idle, the power controller will switch off the FPU (until it is woken up to execute FP code) and increase the clock frequency of core 0. If core 0 doesn't have that many memory operations (less pressure on the cache), the cache might be downsized to 8 kB, 2-way, by switching off 2 of its 4 ways. This way the power the processor is allowed to use will be directed to where it is needed, not to driving idle units. This is called Application Power Management, as you might have heard in some rumors on the net.
Finally, if all else fails, AMD's Bulldozer has an aggressive cache system.
L0 cache: 4 kB (8-way associative) trace cache for each thread (or core)
L1 cache: 16 kB (4-way) data cache per core with 1-cycle latency, and 128 kB (4-way) instruction cache per module
L2 cache: 2 MB (8-way) per module (shared between two cores), full-speed
L3 cache: 8 MB shared between all cores; with a latency of 24 cycles, the L3 will be able to serve up to 2 requests per (NB) clock cycle simultaneously and transfer data at 16 B/clock to each of the recipients.
L4 cache: AMD has also announced that all Black Edition and Opteron models will come with 32/64 MB of L4 cache, made possible through chip stacking.
One more detail: the instruction fetch unit (IFU) fetches code from the L1 instruction cache at 32 bytes/cycle.
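The cache list above can be written down as data. The set counts below assume a 64-byte cache line, which the note never states, so treat them as illustrative only:

```python
# Sketch of the described cache hierarchy. The 64-byte line size is an
# assumption (the note never states it); sets = size / (ways * line_size).
LINE = 64  # bytes, assumed

caches = {  # name: (capacity in bytes, associativity)
    "L0 trace (per thread)": (4 * 1024, 8),
    "L1 data (per core)": (16 * 1024, 4),
    "L1 instr (per module)": (128 * 1024, 4),
    "L2 (per module)": (2 * 1024 * 1024, 8),
}

sets = {name: size // (ways * LINE) for name, (size, ways) in caches.items()}
for name, n in sets.items():
    print(f"{name}: {n} sets x {caches[name][1]} ways")

# Application Power Management can switch off 2 of the L1 data cache's
# 4 ways: capacity halves to 8 kB, but the set count (indexing) is unchanged.
halved_size, halved_ways = 16 * 1024 // 2, 2
assert halved_size // (halved_ways * LINE) == sets["L1 data (per core)"]
```

The final assertion illustrates why way-shutdown (as described for Application Power Management) is an attractive downsizing mechanism: the index function stays the same, so no lines need to be re-mapped.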
End of an Era: The Last of the AM Sockets, and a Half-Node Lithography Process?
One thing that has been confirmed is that the Bulldozer core has proven incredibly resilient in early manufacturing samples, so much so that AMD told us it is experimenting with the 28 nm process with only small changes to the die. Although AMD has not confirmed that it will use the half-node step, it didn't deny it either.
32 nm SOI with Immersion Lithography
28 nm SOI with Immersion Lithography half node process
One of the first things AMD confirmed for us was that the upcoming AM3+ will be the last pin grid array (PGA) socket and that all upcoming processors will be land grid array (LGA); more on that later.
The current AM3 Phenom II uses only 938 pins; the upcoming Bulldozer and Bobcat cores will use all 941 pins. One of the benefits of AM3+ is that all AMD chips will be able to use DDR3-1866 (PC3-15000) for a staggering 60,000 MB/s, as well as having advanced power savings.
Going forward, AMD will implement AMD's Future socket, or AMD's Fusion socket (Socket AF1), a massive 1591-pin socket that will be the first of AMD's next-generation sockets, supporting DisplayPort 1.2, full 32-lane PCI Express 3.0, and the addition of two more DDR channels allowing for quad-channel memory.
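The "staggering 60,000 MB/s" figure quoted earlier works out if all four channels of the new socket are counted. A back-of-envelope check (a 64-bit DDR3 channel moves 8 bytes per transfer):

```python
# Rough check of the ~60,000 MB/s figure for quad-channel DDR3-1866.
# A 64-bit DDR3 channel moves 8 bytes per transfer.
transfer_rate = 1866.67e6  # transfers/s for DDR3-1866
bytes_per_transfer = 8
channels = 4

per_channel = transfer_rate * bytes_per_transfer / 1e6  # MB/s, ~14,933
total = per_channel * channels                          # ~59,733 MB/s
print(f"per channel: {per_channel:.0f} MB/s, quad channel: {total:.0f} MB/s")
```

For what it's worth, the JEDEC module name for DDR3-1866 is PC3-14900; the article's "PC3-15000" label is a round-up of the same per-channel figure.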
AMD’s EDO: Turbo on Steroids
Much has been said about why AMD, up until the release of the Phenom II X6 "Thuban", did not include some sort of automatic speed-boosting technology. AMD has often commented that Intel's approach is mediocre at best because there was no way to guarantee that the chips would be able to provide the speed boost; it was a crapshoot, depending on such things as general thermal and power limitations.
The introduction of the EDO Performance Accelerator with the Bulldozer CPUs is the next logical step for AMD. It is much akin to Intel's Turbo Boost technology, but with a consistent, guaranteed ability to overclock. How can AMD guarantee consistent overclocking? It turns out the secret is how each Bulldozer module handles power. But before we discuss that, we need to understand how Intel accomplishes its Turbo, and its ultimate limitations.
The Core i7-975 Extreme Edition @ 3333 MHz has a Turbo value of (1/1/1/2), meaning the maximum speed per core is 3600 MHz if only one core is active and 3466 MHz if any additional cores are active. This in turn gives Intel a maximum 8% performance boost on a single core, or only a 4% boost if more than one core is running, and this assumes the chip is running within the maximum general thermal and power limitations for the design. Obviously, if the system's cooling or power budget is exceeded, Intel's Turbo Boost will fail to kick in.
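The 8% and 4% figures can be reproduced from the 133 MHz base clock. We read the "(1/1/1/2)" value as extra multiplier bins for 4/3/2/1 active cores; that reading is an interpretation on our part:

```python
# Reproduce the quoted Turbo gains for a Core i7-975 (3333 MHz stock,
# multiplier 25 x 133.33 MHz BCLK). The (1/1/1/2) value is read here as
# extra multiplier bins for 4/3/2/1 active cores (our interpretation).
BCLK = 133.33  # MHz
STOCK_MULT = 25
bins = {4: 1, 3: 1, 2: 1, 1: 2}  # active cores -> extra bins

base = STOCK_MULT * BCLK  # ~3333 MHz
gains = {n: round(100 * extra * BCLK / base) for n, extra in bins.items()}
for n in sorted(gains, reverse=True):
    print(f"{n} active core(s): +{gains[n]}% (to ~{base + bins[n] * BCLK:.0f} MHz)")
```

One extra 133 MHz bin on a 3333 MHz base is 4%; two bins are 8%, matching the article's numbers.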
The Bulldozer design offers two levels of dynamically adjusting power and subsequently increasing the power and frequency of the remaining active cores.
The processor will automatically disable the second, idle core in a module. The boost is activated when the operating system requests the highest performance state of the processor. The technology is completely hardware-based and will work transparently on any operating system.
The first level is within each Bulldozer module. Each module is capable of dynamically adjusting the power to either of the integer pipelines; this allows one side of the module to be powered down into its rest state and the remaining power budget to be transferred to the active core. Because the total power across the module remains constant, it never exceeds its power and thermal limits.
AMD guarantees a minimum 30%-35% overclock per core (depending on the model) when the second core within the module is in its halt state.
2.0 GHz > 2.6 GHz (Base multiplier [200 MHz] of 10 + 3 for a 30% increase)
2.4 GHz > 3.2 GHz (Base multiplier [200 MHz] of 12 + 4 for a 33% increase)
2.8 GHz > 3.8 GHz (Base multiplier [200 MHz] of 14 + 5 for a 35% increase)
3.2 GHz > 4.2 GHz (Base multiplier [200 MHz] of 16 + 5 for a 31% increase)
3.6 GHz > 4.8 GHz (Base multiplier [200 MHz] of 18 + 6 for a 33% increase)
4.0 GHz > 5.2 GHz (Base multiplier [200 MHz] of 20 + 6 for a 30% increase)
But it doesn't end there. AMD has also recognized the enthusiasts and is offering a second level of overclocking. When the processor is operating below its thermal and electrical limits and the user's workload demands additional performance, the processor clock frequency will dynamically increase by 200 MHz at short, regular intervals until the upper limit is met or the maximum possible upside for the number of active cores is reached. Unlike the guaranteed module speed boost, this level is more in line with Intel's Turbo and works on a per-module basis: if an entire module is placed into its lowest power state, the other modules can be overclocked by up to three additional speed grades. This level of power management overclocks the remaining modules to the maximum level permitted by the general thermal and power design limits.
Conversely, when any of the limits is reached or exceeded, the processor frequency will automatically decrease by 200 MHz until the processor is again operating within its limits.
Examples (assuming one module is in its lowest power state and the others are each running only one thread):
2.0 GHz > 3.2 GHz (Base multiplier [200 MHz] of 10 + 3 + 3 for a 60% increase)
2.4 GHz > 3.8 GHz (Base multiplier [200 MHz] of 12 + 4 + 3 for a 58% increase)
2.8 GHz > 4.4 GHz (Base multiplier [200 MHz] of 14 + 5 + 3 for a 57% increase)
3.2 GHz > 4.8 GHz (Base multiplier [200 MHz] of 16 + 5 + 3 for a 50% increase)
3.6 GHz > 5.4 GHz (Base multiplier [200 MHz] of 18 + 6 + 3 for a 50% increase)
4.0 GHz > 5.8 GHz (Base multiplier [200 MHz] of 20 + 6 + 3 for a 45% increase)
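Both boost levels above reduce to simple multiplier arithmetic on the 200 MHz base clock. A minimal sketch (the function name is ours; percentages are truncated, matching how the article quotes minimums):

```python
# Both EDO levels reduce to: frequency = 200 MHz x (base multiplier +
# module boost steps [+ 3 more steps when a whole other module is parked]).
BASE_CLOCK = 200  # MHz

def edo_boost(base_mult, boost_steps, extra_steps=0):
    """Return (boosted MHz, percent gain over stock, truncated)."""
    stock = BASE_CLOCK * base_mult
    turbo = BASE_CLOCK * (base_mult + boost_steps + extra_steps)
    return turbo, (turbo - stock) * 100 // stock  # integer math, no rounding slip

# Level 1: second core in the module halted.
print(edo_boost(10, 3))     # (2600, 30)
print(edo_boost(20, 6))     # (5200, 30)
# Level 2: an entire other module in its lowest power state adds 3 steps.
print(edo_boost(10, 3, 3))  # (3200, 60)
print(edo_boost(20, 6, 3))  # (5800, 45)
```

Running the function over every (multiplier, steps) pair in the two tables reproduces the listed frequencies and percentages exactly.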
A very interesting article; congratulations to everyone who has made it this far. All that remains to add is that this is the best and most elaborate April Fools' prank I have ever seen... Before you start forming angry mobs with whatever weapons you can find, I regret to inform you that at this very moment I am setting off on a trip to a country I will not reveal.
Link: AMD Bulldozer Exclusive (AMD)