This document presents a brief study of the Intel processor microarchitecture named Sandy Bridge. It describes the processor families and the type of application each line is intended for. It also covers the architecture, the processor pipeline and the operation of the cache memory.
Sandy Bridge[1] is the name of the microarchitecture that Intel began using for its processors in 2011. It is an evolution of the Nehalem microarchitecture, used in Core i7 processors and some of the Core i5 and Core i3 of the so-called first generation Intel Core.
The main features of the Sandy Bridge microarchitecture are:
Processors with the Sandy Bridge microarchitecture can be divided in two different ways depending on their application. We can divide processors into four groups if we think about the types of devices where they will be installed. We have the Portable processor line, the Desktop line, the Server line and the Embedded line.
Alternatively, depending on the type of use, we can divide them into the families Intel Core i7 Extreme (for extreme high-performance use), Intel Core i7 and Intel Core i5 (for high performance), and Intel Core i3, Intel Pentium and Intel Celeron (for everyday activities). In addition to these, there is the Intel Xeon family for use in servers.
The Intel Celeron is a low-cost, low-performance processor for users with modest needs, who run just the basics without demanding much from the machine. The Intel Pentium is slightly superior to the Celeron, but it is still aimed at those who just browse the internet and perform routine operations and do not need high performance. The Intel Core i3 is a low-cost, medium-performance processor aimed at basic users, who use the computer for simple day-to-day tasks such as browsing the internet, accessing social networks, opening photos and videos and checking email. The Core i5 is aimed at intermediate use, for users who need to perform lighter image and video editing, in addition to running games. The Core i7 is aimed at users who run heavier programs, such as professional video, photo and vector editing software, who play high-quality media, or who run games with advanced graphics.
In the Sandy Bridge microarchitecture there are four instruction decoders, making the processor capable of decoding up to four instructions per clock cycle. These decoders are responsible for decoding IA-32 instructions into RISC-like microinstructions to be used by the processor's execution units. As with other Intel processors, Sandy Bridge supports macro-fusion (instruction fusion) and micro-fusion (microinstruction fusion). Through instruction fusion, the processor is able to decode two related instructions into a single microinstruction, increasing performance.
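The effect of instruction fusion can be illustrated with a toy decoder. This is only a sketch: the rule used here (a compare or test immediately followed by a conditional jump is fused) and the mnemonic sets are simplifying assumptions, not Intel's exact fusion rules, and the `decode` function is hypothetical.

```python
# Toy model of macro-fusion: a compare/test followed directly by a
# conditional jump is emitted as a single fused micro-op.
# The mnemonic sets below are simplified assumptions, not Intel's exact rules.
FUSIBLE_FIRST = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jge", "jg", "jle"}


def decode(instructions):
    """Return a list of micro-ops, fusing eligible adjacent pairs."""
    uops = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if current in FUSIBLE_FIRST and nxt in CONDITIONAL_JUMPS:
            uops.append(f"{current}+{nxt}")  # two instructions, one micro-op
            i += 2
        else:
            uops.append(current)
            i += 1
    return uops


program = ["mov", "add", "cmp", "jne", "mov", "test", "je"]
fused = decode(program)
print(len(program), "instructions ->", len(fused), "micro-ops:", fused)
```

In this example seven instructions become five micro-ops, so fewer entries travel through the rest of the pipeline.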
A decoded microinstruction cache (L0 cache) capable of storing 1536 microinstructions (6 kB) was added. When a program needs to execute the same group of instructions several times, the processor does not need to decode them again, because they are already stored in decoded form in this cache, saving time and increasing performance. The cache is used about 80% of the time. When the microinstruction cache is used, the processor does not need to use the L1 instruction cache and the decoders, saving energy and dissipating less heat.
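The figures quoted for the microinstruction cache can be checked with a little arithmetic, and a very rough loop model shows why repeated code benefits. This is a sketch under stated assumptions: the `hit_rate` function and its all-or-nothing fit rule are hypothetical simplifications, and the bytes-per-microinstruction value is just the ratio of the two quoted figures, not a statement about the real encoding.

```python
UOP_CACHE_ENTRIES = 1536      # capacity stated in the text
UOP_CACHE_BYTES = 6 * 1024    # "6 kB" as stated in the text

# Ratio of the two quoted figures.
bytes_per_uop = UOP_CACHE_BYTES / UOP_CACHE_ENTRIES


def hit_rate(loop_uops, iterations, capacity=UOP_CACHE_ENTRIES):
    """Fraction of micro-ops served from the cache for a simple loop.

    Simplifying assumption: if the loop body fits, only the first
    iteration goes through the decoders; if it does not fit, every
    iteration is decoded again.
    """
    if loop_uops > capacity:
        return 0.0
    total = loop_uops * iterations
    return (total - loop_uops) / total


print(bytes_per_uop)            # bytes per cached micro-op
print(hit_rate(100, 5))         # small loop, five iterations
print(hit_rate(2000, 5))        # loop body larger than the cache
```

Under this model, a loop that runs only five times is already served from the cache for four of its five iterations.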
The branch prediction unit has been redesigned and the size of the branch target buffer has been doubled compared to the Nehalem architecture. A new compression technique was used, allowing even more data to be stored.
The scheduler used in the Sandy Bridge architecture is similar to the one used in Nehalem, with six dispatch ports: three for execution units and three for memory units. Despite this, Sandy Bridge has 15 execution units, three more than the previous version, and they have been redesigned to increase performance in floating-point operations. Each execution unit is connected to the scheduler by a 128-bit bus. To execute 256-bit instructions, instead of adding 256-bit units and buses, two execution units are used at the same time.
After an instruction is executed, its result is no longer copied to the reorder buffer. Instead, the processor simply marks the instruction as complete in a list, saving bits and increasing efficiency. Another architectural difference is in the memory ports: the load-address and store-address units can each be used as either a load unit or a store-address unit. This allows twice as much data to be loaded from the L1 cache per clock cycle (using two 128-bit units instead of just one), for a total of 256 bits per clock cycle.
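The idea of executing a 256-bit instruction on two 128-bit units can be sketched as follows. This is an illustrative model only: `add_128` and `add_256` are hypothetical names, and the lanes are modeled as Python lists of eight 32-bit values rather than real vector registers.

```python
def add_128(a, b):
    """Hypothetical 128-bit unit: adds four 32-bit lanes."""
    assert len(a) == len(b) == 4
    return [x + y for x, y in zip(a, b)]


def add_256(a, b):
    """256-bit add (eight lanes) carried out as two 128-bit halves."""
    assert len(a) == len(b) == 8
    low = add_128(a[:4], b[:4])    # lower 128 bits on one unit
    high = add_128(a[4:], b[4:])   # upper 128 bits on a second unit
    return low + high              # concatenate the two halves


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [0.5] * 8
print(add_256(a, b))
```

In the model the two halves run one after the other; in the hardware described above they would be processed at the same time by two units.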
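The load-bandwidth gain is simple arithmetic, worked through below. The 128-bit path width comes from the text; the 3.0 GHz clock is an assumption chosen only to turn bits per cycle into GB/s, and the function names are hypothetical.

```python
LOAD_PATH_WIDTH_BITS = 128   # width of each load path, from the text


def l1_load_bits_per_cycle(load_units):
    """Bits that can be loaded from the L1 cache per clock cycle."""
    return load_units * LOAD_PATH_WIDTH_BITS


def l1_load_gb_per_s(load_units, clock_ghz):
    """Peak load bandwidth in GB/s for an assumed core clock."""
    bytes_per_cycle = l1_load_bits_per_cycle(load_units) // 8
    return bytes_per_cycle * clock_ghz


ASSUMED_CLOCK_GHZ = 3.0  # assumption, not stated in the text

print(l1_load_bits_per_cycle(1), l1_load_bits_per_cycle(2))
print(l1_load_gb_per_s(1, ASSUMED_CLOCK_GHZ), l1_load_gb_per_s(2, ASSUMED_CLOCK_GHZ))
```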
Sandy Bridge processors use a ring architecture for communication between the processor's internal components. When a component needs to communicate with another, it places the information on the ring so that it reaches its destination. This way, the components do not communicate directly, as all communication is done through the ring. The components that use the ring are the processing cores, the L3 memory caches, the system agent (integrated memory controller, PCI Express bus controller, 2D video and power control unit), and the 3D video processor. An L3 cache is not tied to a particular processing core: any core can use any of the caches. There are four communication rings: the data ring, the request ring, the acknowledgment ring and the snoop ring. They are based on the QPI protocol and run at the same clock as the processor cores.
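A small model can make the ring idea concrete: every component is a stop on the ring, and a message travels stop by stop until it reaches its destination. This is a sketch only. The stop ordering, the `hops` function and the assumption that a message takes the shorter of the two directions are illustrative choices, not details given in the text.

```python
# Hypothetical ordering of ring stops; the real physical layout may differ.
RING_STOPS = ["core0", "core1", "core2", "core3", "gpu", "system_agent"]


def hops(src, dst, stops=RING_STOPS):
    """Number of ring stops a message crosses from src to dst.

    Assumes the message may travel in either direction and takes
    the shorter one.
    """
    n = len(stops)
    i, j = stops.index(src), stops.index(dst)
    clockwise = (j - i) % n
    return min(clockwise, n - clockwise)


print(hops("core0", "core1"))         # neighbours
print(hops("core0", "core3"))         # opposite side of the ring
print(hops("core0", "system_agent"))  # adjacent through the wrap-around
```

No pair of components needs a dedicated link; adding a stop lengthens the ring instead of adding wires between every pair.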
Turbo Boost is a technology that automatically overclocks the processor when it needs more processing power. In the Sandy Bridge architecture, the technology was revised to allow the processor to exceed its TDP for up to 25 seconds, dissipating more heat than its official rating. Additionally, the TDP is shared between the processor cores and the video processor. If one of them is not dissipating much heat, it gives its spare TDP to the other, allowing it to run at a higher clock when applications demand more processing power.
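The sharing of the thermal budget between the cores and the video processor can be sketched with a toy allocation rule. Everything here is a hypothetical illustration: the `allocate` function, the nominal shares and the wattages are invented for the example, and the real power control unit's policy is not described in the text. The temporary excursion above the TDP is not modeled.

```python
def allocate(cpu_nominal, gpu_nominal, cpu_demand, gpu_demand):
    """Split a shared package budget (cpu_nominal + gpu_nominal watts).

    Each side may draw up to its nominal share plus whatever the
    other side leaves unused. Returns (cpu_watts, gpu_watts).
    """
    cpu_spare = max(0, cpu_nominal - cpu_demand)
    gpu_spare = max(0, gpu_nominal - gpu_demand)
    cpu_watts = min(cpu_demand, cpu_nominal + gpu_spare)
    gpu_watts = min(gpu_demand, gpu_nominal + cpu_spare)
    return cpu_watts, gpu_watts


# Hypothetical 95 W package: 75 W nominal for the cores, 20 W for graphics.
print(allocate(75, 20, cpu_demand=90, gpu_demand=5))   # idle GPU lends headroom
print(allocate(75, 20, cpu_demand=90, gpu_demand=20))  # both busy: no lending
```

With this rule the total never exceeds the sum of the two nominal shares, while an idle graphics unit lets the cores draw more than their own share.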
The video processor integrated into Sandy Bridge processors is physically on the same silicon chip as the processor cores and has up to 12 graphics execution units, depending on the model. 2D and 3D are in separate parts of the processor, which helps save power by turning off 3D when it is not needed. Furthermore, the graphics engine can use the L3 cache to store data and textures, increasing 3D performance since the graphics engine does not always need to go to RAM to fetch the data.
Both the Core i7 and i5 support Turbo Boost, with the 2600 and 2500 being able to raise their frequency by up to 400 MHz with a single core active, 300 MHz with two cores, 200 MHz with three cores and (a first for Sandy Bridge) a modest 100 MHz with all four cores active, for short periods, if temperature and processor TDP allow.
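The boost steps listed above map directly to a lookup table. In this sketch the boost values come from the text, while the base clocks of 3400 MHz (Core i7-2600) and 3300 MHz (Core i5-2500) are not given in the text and are supplied here as assumptions; the function name is hypothetical.

```python
# Boost in MHz by number of active cores, as listed in the text.
TURBO_BOOST_MHZ = {1: 400, 2: 300, 3: 200, 4: 100}

# Base clocks are assumptions added for the example (not from the text).
ASSUMED_BASE_MHZ = {"i7-2600": 3400, "i5-2500": 3300}


def turbo_frequency_mhz(base_mhz, active_cores):
    """Maximum clock for a given number of active cores."""
    return base_mhz + TURBO_BOOST_MHZ[active_cores]


for model, base in ASSUMED_BASE_MHZ.items():
    clocks = [turbo_frequency_mhz(base, n) for n in (1, 2, 3, 4)]
    print(model, clocks)
```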
To simplify the design, Intel adopted a ring-shaped bus, which uses a single track circuit (forming 4 independent rings) to interconnect the four cores, the four L3 cache blocks, the GPU and the System Agent (the chipset's north bridge, now included inside the processor). Although it brings some technical advantages, a ring bus usually increases power consumption and the area used within the chip (ATI adopted a ring bus without success in the R600).
It is not possible to speak with certainty about the impact on the chip's electrical consumption (since it is not possible to reliably measure the individual consumption of each component), but the way it was implemented by Intel brought important advantages.
Each ring is capable of transferring 32 bytes per cycle, which at a 3 GHz clock results in 96 GB/s of bandwidth, the same bandwidth available to each core in Nehalem. However, as 4 independent rings were implemented, in practice each core has 96 GB/s available when all of them are using the cache simultaneously, but can use up to 384 GB/s in certain circumstances, when the rest of the ring is idle. The same goes for the GPU, in circumstances where the cores are idle.
Intel combined the ring bus with a low-latency L3 cache, which resulted in a latency of just 31 cycles, versus 36 cycles in Nehalem. The L3 cache also now runs at the same frequency as the processor cores, eliminating the "uncore" concept used in Nehalem. In Sandy Bridge, the chipset's north bridge integrated into the processor is called the "System Agent", and its frequency is no longer tied to the cache frequency.
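The bandwidth figures can be reproduced with a short calculation. The quoted 96 GB/s works out if each ring moves 32 bytes per cycle at 3.0 GHz; the clock is an assumption inferred from those two numbers rather than a value given in the text, and the function name is hypothetical.

```python
RING_WIDTH_BYTES = 32       # bytes moved per ring per cycle
ASSUMED_CLOCK_GHZ = 3.0     # assumption: clock implied by 96 GB/s / 32 bytes


def ring_bandwidth_gb_s(clock_ghz, rings=1):
    """Peak bandwidth in GB/s for a number of rings at a given clock."""
    return RING_WIDTH_BYTES * clock_ghz * rings


per_ring = ring_bandwidth_gb_s(ASSUMED_CLOCK_GHZ)           # one ring
aggregate = ring_bandwidth_gb_s(ASSUMED_CLOCK_GHZ, rings=4)  # all four rings
print(per_ring, aggregate)
```

Because the ring runs at the core clock, these peak figures scale with the processor frequency.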