This chapter is a short introduction to the different system aspects of energy management. While aiming to reduce consumption, everything in system design is impacted, from the algorithm to the elementary logic gate. Entire books have been dedicated to this subject. In this chapter, we will describe only some of the important approaches in integrated circuit architecture. A common characteristic of all these developments is an increased flexibility in dynamic choice of electrical functioning parameters. The technological aspects regarding the electrical and optical interconnects will be given at the end of this chapter.
This section is a review of the techniques implemented for reducing dynamic power. Some of these are generic and others depend on the technology; parallelism and supply voltage adaptation (voltage scaling) are among the more general.
Parallelization techniques are possible at all levels. At the architectural level of a multiprocessor system, they consist of increasing the number of computing and memory blocks. However, they can also be used at the more basic level of a data path in a logical function.
Figure 5.1 compares two solutions on a global system level. The first of these uses a couple made up of a processor and a memory, both of which are relatively fast and operate at a frequency of f. The second uses four memory-processing blocks, each of which operates at a frequency of f/4. The two architectures are assumed to have the same computing capacity.
Figure 5.1. Parallelism and active power
If the frequency were the only difference between the two architectures, the dissipated dynamic power would be the same in both cases: C·V²·f for the single-processor structure and 4·C·V²·(f/4) = C·V²·f for the four-processor structure. In order to explain the benefit of the multiprocessor system, we need to be aware that the supply voltage can be reduced by a factor of k (which is still less than four), since each block runs at a quarter of the frequency, and that the average capacitance of a gate can be reduced (the lines between the computing and memory blocks can be shorter). On the whole, there is a potential gain, which explains the large-scale development of multiprocessor systems since 2005.
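The argument can be made concrete with a short numerical sketch (all parameter values below are illustrative assumptions, not figures from the text):

```python
# Sketch of the parallelism argument of Figure 5.1.
# All numerical values are illustrative assumptions, not figures from the text.
a = 0.15   # average activity rate, assumed
C = 1e-9   # switched capacitance of one processor+memory pair (F), assumed
V = 1.2    # supply voltage (V), assumed
f = 1e9    # clock frequency (Hz), assumed
k = 1.5    # voltage reduction allowed by the relaxed f/4 timing, assumed (< 4)

p_single = a * C * V**2 * f                         # one fast block
p_quad_same_v = 4 * (a * C * V**2 * (f / 4))        # four slow blocks, same voltage
p_quad_scaled = 4 * (a * C * (V / k)**2 * (f / 4))  # four slow blocks at V/k

print(round(p_quad_same_v / p_single, 3))   # -> 1.0 : frequency alone changes nothing
print(round(p_quad_scaled / p_single, 3))   # -> 0.444 = 1/k**2, the potential gain
```

The gain comes entirely from the voltage term: at equal voltage the two architectures dissipate the same power, and the shortened wires (lower C) would add a further reduction on top of the 1/k² factor.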
This same principle can be implemented in the data path, as shown in Figure 5.2. In the same way, a reduction in power is explained by a possible reduction in voltage supply, while maintaining the same data rate.
The same parallelism principle can be applied to data transfers between memory and computing unit, under the same hypothesis that the data corresponding to successive instructions are stored in different memories. Dissipated power can therefore be reduced by a protocol that interleaves exchanges at a reduced frequency.
Similar techniques can also be applied to improve the energy performance of shift registers and flip-flops.
Figure 5.2. Parallelization in a data path
The general principle here is to adapt the supply voltage to the specific needs of a set of logic gates. For example, in a logical system that uses two different supply voltages, it is possible to apply the highest voltage to the gates placed on a critical path between two registers. The other gates can operate at a lower voltage with a tolerable increase in propagation delay. We must exercise caution when applying this technique, in order to ensure that static currents do not circulate between nodes with different voltages.
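A minimal sketch of such a dual-supply assignment, using a hypothetical alpha-power-law delay model and invented path and voltage values:

```python
# Hypothetical dual-supply assignment: every register-to-register path that can
# tolerate the slower low-voltage gates is moved to VDD_L; the critical path
# keeps VDD_H. The delay model and every number here are assumptions.
VDD_H, VDD_L, VT = 1.2, 0.9, 0.35   # supply and threshold voltages (V), assumed
T_CLOCK = 1.0                        # clock period, normalized, assumed

def gate_delay(vdd, d0=0.08):
    # alpha-power-law-style delay model with alpha = 1.5 (assumed)
    return d0 * vdd / (vdd - VT) ** 1.5

paths = {"critical": 8, "short": 4}  # gates per register-to-register path, assumed

assignment = {}
for name, n_gates in paths.items():
    if n_gates * gate_delay(VDD_L) <= T_CLOCK:
        assignment[name] = VDD_L     # the path's slack absorbs the extra delay
    else:
        assignment[name] = VDD_H     # the critical path stays at the high voltage

print(assignment)
```

With these numbers the short path meets timing at VDD_L while the critical path does not, which reproduces the assignment rule stated above; a real flow would also insert level shifters where a low-voltage gate drives a high-voltage one, for the reason given in the text.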
This technique is mainly suited to transporting signals over long distances. If the equivalent line capacitance is very high, it may be advantageous to reduce the voltage difference between the two states by inserting specific circuits capable of performing voltage conversion. The power dissipated in these circuits obviously needs to be less than the power gained from reducing the voltage difference.
Pre-calculation techniques enable us to perform a first screening of the input data of a combinational logic block and potentially to block the other data, which allows the combinational logic’s activity rate to be reduced. This principle is illustrated in Figure 5.3.
Figure 5.3. Predicting and reducing consumption
Functions g1 and g2 guarantee the complete determination of the function F: if g1 is in the “on” state, F is in the “on” state, and if g2 is in the “on” state, F is in the “off” state. The variables Ak+1 to AN can then be frozen: in the following cycle, the outputs of the register associated with these variables remain unchanged. Other methods are related to optimizing finite state machines.
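The screening idea can be illustrated with a toy magnitude comparator, a classic precomputation example; the MSB test playing the role of g1/g2 is an illustrative choice, not the chapter's own circuit:

```python
# Toy illustration of the screening principle of Figure 5.3, using an n-bit
# magnitude comparator (a classic precomputation example, assumed here).
# The MSB test plays the role of g1/g2: when it is decisive, the remaining
# inputs stay frozen in their register and the main logic sees no activity.
from itertools import product

def greater_with_screening(a_bits, b_bits, stats):
    if a_bits[0] > b_bits[0]:        # g1 "on": result is known to be "on"
        stats["frozen"] += 1
        return True
    if a_bits[0] < b_bits[0]:        # g2 "on": result is known to be "off"
        stats["frozen"] += 1
        return False
    stats["full"] += 1               # MSBs equal: full evaluation is needed
    return a_bits > b_bits           # lexicographic = numeric for MSB-first bits

# Exhaustive check over all pairs of 4-bit vectors
stats = {"frozen": 0, "full": 0}
for a in product((0, 1), repeat=4):
    for b in product((0, 1), repeat=4):
        assert greater_with_screening(a, b, stats) == (a > b)
print(stats)   # half of the cycles never activate the main logic
```

Over all 256 input pairs, the screening functions settle the result in exactly half of the cases, so the activity rate of the main combinational block is halved on uniformly random inputs.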
Synthesis methods that help equalize the different propagation times in combinational logic also improve energy performance, because they eliminate the transient signals (glitches) caused by differences in propagation time and thus the resulting dissipation. Other synthesis techniques consist of connecting gates with high activity rates to weak capacitive loads. This example shows the importance of routing for optimizing the energy performance of logic circuits.
This is the second family of design techniques aimed at reducing dissipation. They have a particular importance in advanced silicon technology development, because static power keeps growing, reaching approximately 40% of total dissipation in some configurations. Chapter 4 laid the groundwork for understanding the origins of this value.
This technique is in fact the same as that presented in the section on dynamic power: it consists of taking maximum advantage of a technology able to create transistors with different thresholds. For dynamic power management, choosing a low threshold voltage leads to low dynamic dissipation, because the supply voltage can then be reduced. In order to minimize static energy, on the contrary, we need to increase the threshold voltage.
Certain techniques are specific to silicon-on-insulator (SOI) technology. Threshold voltages can be modified simply by applying a voltage to the back gate of the transistor. This is made possible by the thinness of the silicon film: the electrostatic effect of a reasonable voltage is enough to modify the threshold voltage significantly. Chapter 3 explained this phenomenon, called the “body effect”; SOI technology makes it very efficient to exploit.
The control voltage may be applied “statically”, meaning over a long time compared with the system’s computing time. A simple analysis allows us to choose the transistors whose threshold voltage and sub-threshold current need to be modified. This operation can also be done “dynamically”, according to the circuit’s logical activity, which refines the optimization. This technique has produced some remarkable performances in signal processors operating at a frequency of 400 MHz [BEI 15], where some gates can function at voltages as low as 400 mV.
The supply voltage effect is generally presented in the context of dynamic power management, but taking short-channel effects such as DIBL into account shows that static power can also vary significantly with the supply voltage. Voltage variation can likewise be applied statically or dynamically.
Some configurations can be particularly beneficial, such as those that make the gate–source voltage of an NMOS transistor negative. This is the case when several NMOS transistors are placed in series. Figure 5.4 shows this, using a four-input NAND function with all inputs in the “off” state.
Different voltages appear on the transistor sources, leading to negative gate–source voltages depending on the number of transistors in the “off” state. When some inputs are in the “on” state, these voltages change and the sub-threshold current increases. In this case, therefore, there is a major interest in designing logic that allows as many NMOS transistors as possible to be in the “off” state.
Figure 5.4. Transistor chain and sub-threshold current
The following table indicates the sub-threshold current of different gates for different input vectors.

| Type of gate | Input vector | Current (nA) | Comment |
| 4-input NAND | 0 0 0 0 | 0.6 | Good |
| 4-input NAND | 1 1 1 1 | 24 | Bad |
| 3-input NOR | 1 1 1 | 0.13 | Good |
| 3-input NOR | 0 0 0 | 29 | Bad |
| Adder | 1 1 1 | 8 | Good |
| Adder | 0 0 1 | 63 | Bad |
It is therefore useful to design logic in such a way that, on average, the number of series NMOS transistors in the “off” state is as high as possible for NAND gates and as low as possible for NOR gates (for a NOR gate, inputs at “1” switch off the series PMOS stack instead).
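The stack effect behind the table can be reproduced with a hedged numerical model (all device constants below are assumed values, not measurements from the text):

```python
# Hedged numerical model of the stack effect of Figure 5.4: two series "off"
# NMOS transistors leak far less than a single one, because the intermediate
# node rises and drives the top device's gate-source voltage negative.
# All device constants are assumed values.
import math

PHI_T = 0.026   # thermal voltage kT/q at 300 K (V)
N = 1.4         # sub-threshold slope factor, assumed
VT0 = 0.35      # threshold voltage (V), assumed
ETA = 0.1       # DIBL coefficient, assumed
I0 = 1e-7       # pre-exponential current (A), assumed
VDD = 1.0       # supply voltage (V), assumed

def i_sub(vgs, vds):
    # sub-threshold current with DIBL and the drain saturation term
    return I0 * math.exp((vgs - VT0 + ETA * vds) / (N * PHI_T)) \
              * (1.0 - math.exp(-vds / PHI_T))

def two_stack_leakage():
    # bisect on the intermediate node voltage vx until the two series
    # devices (both gates at 0 V) carry the same current
    lo, hi = 0.0, VDD
    for _ in range(60):
        vx = 0.5 * (lo + hi)
        top = i_sub(-vx, VDD - vx)   # top device: source at vx, so vgs = -vx
        bot = i_sub(0.0, vx)         # bottom device: vgs = 0
        if top > bot:
            lo = vx
        else:
            hi = vx
    return i_sub(0.0, 0.5 * (lo + hi))

single = i_sub(0.0, VDD)       # one "off" transistor
stacked = two_stack_leakage()  # two series "off" transistors
print(single / stacked)        # roughly an order of magnitude less leakage
```

With these assumed constants the two-transistor stack leaks about an order of magnitude less than a single “off” device, the same trend as the NAND rows of the table.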
This is a very popular technique for managing the standby mode. This particular mode consists of abstaining from using certain logic blocks for determined periods of time. A possible solution is to cut the supply to the blocks. The other solution using “sleep transistors” consists of adding one or two isolation transistors, as shown in Figure 5.5.
In the “normal” mode, the supplementary transistors are conductive and the logic in the middle of Figure 5.5 is operational. In the “standby” mode, these transistors are cut off and the logic is no longer in service. If the supplementary transistors’ threshold is high, the sub-threshold current passing through the circuit is reduced without any need to cut the supply voltage. When the standby mode is inactive, the supplementary transistors conduct, but they reduce the current that can be delivered during a transition with respect to a design in which these transistors are not present. This technique should therefore be reserved for gates that are not on a critical path of the logic. This technique, which combines gates with different threshold voltages, is called MTCMOS.
Figure 5.5. MTCMOS architecture
This is a logic style that uses the body effect to change threshold voltages. It requires a supplementary bias distribution grid, which increases the circuit area.
Circuits do not always need to have a very fast speed. During these periods, it is worth reducing the supply voltage and increasing the threshold voltage. Particular techniques have been developed to implement this principle in real time.
Dynamic voltage scaling (DVS) is a more flexible, yet also more elaborate, solution than being restricted to two voltages, as in the voltage scaling technique. Performing it typically requires an adjustable supply regulator, a clock generator capable of following the voltage and a control unit that selects the operating point according to the workload.
Similarly, it is also possible to modify the threshold voltages dynamically, which has an impact on the dissipated power. This technique is called dynamic Vth scaling (DVTS).
By choosing an operating frequency equal to 30% of the maximum frequency, it is possible to reduce the static power by a factor of 10, which shows the power of this optimization method. It is fairly easy to compare the two techniques (DVS and DVTS), which have very similar effects on the reduction of static dissipation. The advantages of the DVTS technique lie more at the implementation level: the circuits used to control the applied voltages are charge pumps, which are fairly easy to implement. However, the substrate does generate a noise that could upset the logic. Also note that a CMOS technology compatible with well biasing is more complex and incurs a greater cost.
This last section illustrates the techniques already discussed by presenting the development of circuits in 90, 65 and 45 nm technology for mobile phones. It explains Texas Instruments’ choices in integrated circuit development for telephony and multimedia.
The increase in sub-threshold currents at the 90 nm node has been a real problem for circuit designers, because the static current of a gate rose from 100 pA to 1 nA. This is all the more of a nuisance as the demand for low-consumption circuits has increased greatly, particularly in mobile phone development. Batteries have also made progress, but not as fast as the demand for energy efficiency. Three techniques have been used to combat this problem: power gating in standby mode, energy management of SRAM and the use of long-channel transistors.
The principle of power gating has already been discussed previously in general techniques. When a group of gates is not in use, not only does the clock need to be interrupted but the conduction paths from supply to the ground do as well. Very low leakage isolation transistors are chosen for this. This principle is not so simple when it comes to implementing it, because it requires specific grids for distributing the supply voltage and reference voltages.
The same principle can be applied to SRAM, which contributes significantly to consumption as the volume of stored bits is continually increasing. In certain SRAM, however, it is crucial that no information is lost from the memory. Consequently, in the standby mode, power is not cut but lowered and the sub-threshold current is reduced by modifying the threshold voltages. This is generally done using a body-polarization technique.
Logic performance can be optimized by simultaneously using short- and long-channel transistors. The short-channel transistors are fast but have a high sub-threshold current and the long-channel transistors are slower and have a lower sub-threshold current. Fast transistors are assigned to critical paths.
The techniques described for 90 nm technology are also used here, but there are also some others which have been introduced. These include using different supply voltages and particular management of the standby mode.
Unlike the techniques implemented in 90 nm node technology with two supply voltage values, the techniques for power management in 65 nm technology make use of a whole voltage range. Each logic block corresponds to an optimal operating point associated with a particular supply value. Also take note of the effects of temperature in supply management.
Finally, circuits are generally designed to include an “off” mode. A restricted part of the circuit remains supplied and operational, capable of waking up the entire circuit.
Two new techniques have been introduced to further increase the flexibility: the adaptive body bias technique and the Retention Till Access technique.
In classic circuits, the backside gate is connected to the ground for NMOS transistors and to the supply voltage for PMOS transistors. Modulating these voltages allows us to change the threshold voltages, as shown in Chapter 3. This technique was discussed in the first section. It is efficient and avoids using transistors that are manufactured differently. More often than not, it is applied in a selective, rather than global, manner.
SRAM supply technologies have also been improved. Memories are divided into separately supplied blocks. Each block is maintained in a state in which the sub-threshold current is weak. This is called the retention state. If the reading or writing function requires it, this block is placed in the standard state.
Sub-threshold circuits have three main characteristics: they operate at a low voltage, they consume little and their maximum operating frequency is very low. They are therefore applied in very specific contexts (watch electronics, autonomous sensors, etc.). Specific rules need to be applied when designing them; for example, the transistor sizes are chosen such that the rise and fall times are equal. Also remember that the threshold voltage and the supply voltage have optimal values (as shown in Chapter 4). The optimization balances the dissipation against the need for a reasonable speed. It is possible to obtain an operating frequency on the order of a megahertz, for example.
When the frequency is limited, pipelined and parallel architectures are well adapted to this regime. However, parallelization duplicates hardware and increases the total sub-threshold current, which runs contrary to what is intended; an optimization is therefore necessary.
It is, no doubt, in the area of SRAM that sub-threshold techniques have brought the biggest gains to dissipated power. They have also led to the move from six-transistor architectures to eight-transistor architectures.
Figure 5.6. Classic SRAM architecture
The architecture of basic functions (NAND and NOR gates, DRAM and SRAM cells) has not evolved much since the beginning of CMOS technology. As such SRAM developments are rare events, they deserve to be highlighted. A classic SRAM architecture is shown in Figure 5.6.
The value “1” is written in the cell when the two signals Bit and its complement are placed at 0 and 1 V, considered to be the “off” and “on” states. The PMOS on the right is then conductive and the output of the right-hand inverter takes a value equal to the supply voltage. This level is maintained if the line enabling the writing mode moves to “0”, as shown in the “memorized state” diagram.
This circuit can be difficult to stabilize when the transistor sizes diminish as the noise sensitivity increases. This situation is even more critical in subthreshold regimes. This has led circuit designers to propose an eight-transistor model, instead of a six-transistor model, separating the reading and writing functions. Little by little, this model has established itself in the world of low-consumption electronics.
When the speed constraints are moderate, circuits operating slightly above the threshold become particularly interesting. In fact, the trade-off between dissipation and speed is resolved rather well in this operating mode. To fix ideas, accepting a 20% increase in dissipation can be rewarded by a performance increase by a factor of 10.
Figure 5.8 illustrates the optimization problem in separating the space into two regions: that of possible operating points and that of impossible operating points. The border is the combination of these optimal points.
Figure 5.8. Constrained optimum
It is possible to choose any point along this line, depending on the application’s constraints. It is easy to verify that minimizing the product of the energy per operation and the delay (the energy-delay product) leads to an optimum close to that obtained when the only aim is to minimize the operation time. This criterion is not necessarily the best when the dissipation needs to be low. Another approach, inspired by methods used in economics, consists of optimizing one quantity under a constraint on the other, such as minimizing the energy for a given computing time. This approach is called the relative sensitivity method and enables us to determine an operating point on the border of possibilities (as indicated in Figure 5.8).
If the parameters that we can choose are noted as x, y and z (for example, the transistor size W, the supply voltage VDD, the threshold voltage VT, etc.), the relative sensitivities can be defined as:
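A form consistent with the sensitivity-based optimization literature that this section follows (the precise expression is an assumption here) is, with E the energy per operation and D the delay of the register-to-register gate chain:

```latex
S_x \;=\; \left.\frac{\partial E/E}{\partial D/D}\right|_{x},
\qquad
S_y \;=\; \left.\frac{\partial E/E}{\partial D/D}\right|_{y},
\qquad
S_z \;=\; \left.\frac{\partial E/E}{\partial D/D}\right|_{z}
```

each sensitivity measuring the relative energy change bought by a relative delay change when only that parameter is varied.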
With the optimum thus defined, we can show that the relative sensitivities are equal. Also note that the quantity retained is not simply one gate’s delay, but the delay in a chain of gates between two registers, considered to be the minimal operating period.
When applied to the optimization problem in a near-threshold regime, these relations can define the operating point, provided that there are models for the delay in a chain of gates and for the dissipated energy. The following calculations are quite complex, but they illustrate the near-threshold regime, which is not very well covered in the literature. They also explain how to implement the relative sensitivities method, which is widely used.
The energy per operation can be defined as the ratio between the energy dissipated along the gate path and the average activity rate α, which is much smaller than one since only a fraction of the gates are active. This is a very useful quantity for estimating energy efficiency.
The expressions for energy and delay are derived from the EKV model, which was specifically developed for the near-threshold regime. The drain current is written as:
The different technological terms have been defined in Chapter 3, except for the inversion coefficient IC and the fitting coefficient k:
In this term, we recognize the factor expressing the sub-threshold slope n0 and the DIBL coefficient η.
The “off” current, meaning for a gate at absolute zero, in this model is written as:
The propagation time in a gate is calculated as the necessary time for charging the output capacitance. The parameter kʹ is a fitting coefficient:
The charging capacitance of a gate i must include the input capacitance of connected gates and parasitic capacitances:
The variables Wi are the transistor widths and γi is a parameter defined in the previous relation to express the parasitic capacitance values. The delay along the chain of N gates is therefore:
with
The dissipated energy on the same path can be calculated:
Note that the static energy is calculated during a time that is equal to the sum of delays of all the gates crossed. This time is the minimal propagation time. The dissipated energy is also written as:
The following quantities are defined as:
These relations allow us to calculate the relative sensitivities. Calculations which have not been detailed in this paragraph allow the following to be obtained:
The value Eswi is the dissipated dynamic energy in the gate i and fi is the effective fan-out of the gate i defined in [MAR 10] by Wi+1/Wi. The parameter N0 is defined by:
Figure 5.9 uses the 32-bit adder to show the different relative sensitivities according to the normalized delay (the delay divided by the minimum delay). The three parameters are the gate width, the supply voltage and the threshold voltage.
Figure 5.9. Examples of relative sensitivity depending on the energy (from MAR [MAR 10])
The optimizations in the normal regime differ from those in the near-threshold regime. In the usual regime (near the optimal point corresponding to the minimal delay), the most efficient parameter is the gate width. In the near-threshold regime, the most efficient parameter is the supply voltage. This behavior is naturally explained by the fact that the subthreshold current is exponentially dependent on the voltage.
This section has not only introduced the possibilities for circuits that operate near the threshold, but has also described a general method for energy optimization at the system level.
There are two types of electrical connections in a circuit: the wires between gates at a short distance and the longer distance links (wires between blocks or bus lines transmitting bit groups to every point in the circuit). Also note that the distribution of the clock is more complex in the circuits than the distribution of other signals.
If all of these wires are taken into account, the consumption to be assigned to the interconnect is the entire dynamic power; in fact, the dynamic power is directly dependent on the value of the gate load capacitance. Generally, what is attributed to the interconnect is the dynamic consumption coming from the voltage transitions travelling long distances (wires between blocks, bus lines and clock distribution), while the energy dissipated in the short wires between gates is attributed to the logic. Dissipation assigned to the interconnect can represent 60% of the total dynamic consumption. The interconnect corresponds to the upper levels of a set of connection levels in an integrated circuit. Throughout transistor scaling, the number of levels has kept growing. Included in these levels are those serving the supply voltage and ground grids.
There is a fairly simple model that gives the equivalent capacitance of a conductive wire of length L located at a distance h from an equipotential plane:
The equivalent capacitances translating the coupling with the other wires are much more difficult to calculate, as they are dependent on geometry. They increase the capacitance by the following relation.
When the interconnect is long, the resistance R corresponding to the ohmic losses in the material needs to be added. Remember that the electromagnetic wave travels along the interconnect at a velocity that depends only on the insulating material’s relative permittivity εr. With c0 the velocity of light in vacuum, the velocity is v = c0/√εr.
The characteristic impedance of the interconnect Z0 and the attenuation coefficient β can be defined by:
The characteristic impedance depends on the geometric parameters of the line, and it allows us to estimate the reflection rate of a signal travelling along a wire loaded by an impedance Z: Γ = (Z − Z0)/(Z + Z0).
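These standard line relations can be gathered in a short numerical sketch (every line parameter below is an assumed value):

```python
# Standard transmission-line relations for a long interconnect.
# Every numerical line parameter below is an assumed value.
import math

C0 = 3.0e8       # velocity of light in vacuum (m/s)
EPS_R = 3.9      # relative permittivity of the insulator (SiO2), assumed
c = 2.0e-10      # capacitance per unit length (F/m), i.e. the ~2 pF/cm quoted later
r = 2.0e5        # series resistance per unit length (ohm/m), assumed

v = C0 / math.sqrt(EPS_R)     # propagation velocity: depends only on eps_r
l = 1.0 / (c * v * v)         # inductance per unit length, from v = 1/sqrt(l*c)
z0 = math.sqrt(l / c)         # characteristic impedance of the line
beta = r / (2.0 * z0)         # attenuation coefficient of a weakly lossy line

def reflection(z_load):
    # reflection coefficient seen by a wave arriving on a load impedance Z
    return (z_load - z0) / (z_load + z0)

print(round(v / 1e8, 2), round(z0, 1))   # velocity (x1e8 m/s) and impedance (ohm)
print(reflection(z0))                     # an adapted line gives no reflection
```

With these assumptions the velocity comes out near half the speed of light and the impedance in the tens of ohms, consistent with the “relatively weak” impedance values mentioned below; a high-impedance (open) load, by contrast, reflects almost the whole wave.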
When the line has been adapted, there is no reflection. When the line is short enough for the propagation time to be smaller than the rise time of the travelling pulse, the reflection effects do not need to be included. This is the case for signals travelling from gate to gate and between blocks a short distance apart. In this case, the interconnect is equivalent to an RC circuit; one gate connected to another can be represented as shown in Figure 5.11.
Figure 5.11. Links between gates
This simple diagram brings about the following relations:
– Equivalent capacitance: C + CL
– Propagation time:
When the interconnect length increases, the capacitance increases and so does the dissipation. A solution is to insert repeaters, as shown in Figure 5.12.
Figure 5.12. Interconnect with repeaters
Delay and total consumption will be quantified and compared with those of an interconnect without repeaters. The terms CL and rg are ignored. The terms r and c are the resistance and capacitance per unit length. The time tp is the repeater’s delay:
– Delay with m repeaters:
– Delay without repeaters:
From this, an optimal amount of repeaters and optimal length can be easily deduced:
The delay is then multiplied by a factor that is generally less than 1 when L is large.
Note that this optimization affects only the delay and not the dissipated energy:
– Energy without repeater:
– Energy with repeaters:
The dissipated energy with repeaters is greater than that dissipated without repeaters by the same link distance. In this case, we can clearly see how the trade-off between speed and dissipation is managed.
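The trade-off can be sketched numerically with a simple Elmore-style delay model (all line and repeater parameters below are illustrative assumptions):

```python
# Sketch of the repeater trade-off of Figure 5.12, using an Elmore-style
# distributed-RC delay model. All parameter values are assumptions.
r = 2.0e3       # line resistance per unit length (ohm/cm), assumed
c = 2.0e-12     # line capacitance per unit length (F/cm), assumed
L = 2.0         # total line length (cm), assumed
tp = 2.0e-11    # intrinsic delay of one repeater (s), assumed
c_rep = 1.0e-13 # input capacitance of one repeater (F), assumed
VDD = 1.0       # supply voltage (V), assumed

def delay_without():
    # distributed RC line: Elmore delay ~ 0.38 * r * c * L^2 (quadratic in L)
    return 0.38 * r * c * L * L

def delay_with(m):
    # m equal segments plus m repeater delays
    seg = L / m
    return m * (0.38 * r * c * seg * seg + tp)

best_m = min(range(1, 60), key=delay_with)
print(best_m, delay_with(best_m) / delay_without())  # ratio < 1 for a long line

# The energy balance goes the other way: the repeaters add their own switched
# capacitance, so the dissipated energy can only increase.
e_without = c * L * VDD**2
e_with = (c * L + best_m * c_rep) * VDD**2
print(e_with / e_without)   # > 1
```

The optimizer lands on a repeater count close to the analytic optimum L·√(0.38·r·c/tp), the delay drops by nearly an order of magnitude because it becomes linear rather than quadratic in L, and the energy ratio comes out above one, illustrating the speed-versus-dissipation trade-off stated above.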
When the lines are adapted, the dissipated power is very different. It is calculated according to the gate’s input and output impedances. The output impedance is part of the signal generator, while the input impedance is part of the loading gate. When the transmission line is adapted, meaning when the input and output impedances are Z0, the connection’s characteristic impedance, and when we assume that the probability in the “on” state is equal to that in the “off” state, the total average dissipated power is:
The dissipated power is no longer dependent on the frequency. It no longer depends on the transmission line capacitance. The values obtained for the characteristic impedance are relatively weak (from 50 Ω to a few hundred Ω). Consequently, the dissipated powers are high when compared with those dissipated using non-adapted interconnects.
When the transmission line is not adapted and when the load is at high impedance, the calculation of the average dissipated power is more complex. It is given by the following relations, which depend on the duration ts of a signal that is associated with logic data and on the delay td due to the interconnect:
These results are shown in Figure 5.13.
Figure 5.13. Dissipated power in an adapted or unadapted link
Three techniques will be introduced here: reducing the voltage excursion; reducing the activity rate and reducing the interconnect’s capacitances.
This is the most efficient technique, due to the dissipation’s quadratic dependence on the voltage. Two solutions have been proposed. The first is to insert circuits that reduce the voltage excursion between the two states below VDD. The second is to simply insert a regulator with no losses in order to lower the supply voltage below VDD, by placing a series diode, for example.
In the first case, the power becomes:
In the second case, the following equation is obtained:
This gain in dissipated power is paid for by a decrease in the possible data flow and a surface overhead. The consumption of supplementary circuits should be restricted.
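The two solutions can be compared with a back-of-the-envelope power model (all values below are assumptions; VDD/2 is used only as an example target level, not the text's figure):

```python
# Back-of-the-envelope comparison of the two swing-reduction solutions.
# All values are assumptions; VDD/2 is only an example target level.
a, C, f = 0.2, 5e-12, 5e8   # activity rate, line capacitance (F), frequency (Hz), assumed
VDD = 1.2                    # nominal supply (V), assumed

p_full = a * C * VDD**2 * f              # full-swing reference power

def p_reduced_swing(vs):
    # first solution: the driver is still supplied at VDD but the line swing is
    # limited to vs; each transition draws a charge C*vs from VDD (assumed model)
    return a * C * VDD * vs * f

def p_lowered_supply(vdd2):
    # second solution: the whole driver runs from the lowered supply vdd2
    return a * C * vdd2**2 * f

print(round(p_reduced_swing(VDD / 2) / p_full, 3))   # -> 0.5  (linear gain)
print(round(p_lowered_supply(VDD / 2) / p_full, 3))  # -> 0.25 (quadratic gain)
```

Halving the swing alone gains only linearly, while lowering the supply itself gains quadratically; in both cases the data-rate penalty and the consumption of the added conversion circuits, mentioned above, must be deducted from the gain.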
To reduce the activity rate on the long lines, it is possible to change the data coding at the link’s input and output. This technique is generally applied to the data bus and can reap gains of 20%.
The interconnect’s capacitance per unit length depends on the ratio W/h, which is difficult to reduce for technological reasons, and on the insulating material’s permittivity, which cannot be lower than that of vacuum. The so-called “low-k” technologies using porous materials have attained a level close to the optimum. The obtained value of 2 pF/cm is valid for all the interconnect levels in integrated circuits. Therefore, the only remaining solution is to reduce the interconnect distances. This principle leads to favoring parallel structures in which the data buses connecting the blocks are short compared with the circuit dimension.
Two promising technologies have been introduced for increasing the interconnect performance: Network on Chip architectures and optical interconnects. Their interest is still the subject of various debates and their industrial application is rather limited.
The aim of the Network on Chip architecture is to replace the classic data bus with links inspired by network-type architectures, such as the Internet. The blocks are no longer all directly interconnected, which, in principle, restricts the connection lengths. Special circuits need to be envisioned to sort and route the data; these complex circuits can introduce a supplementary consumption.
Optical links have proven their efficiency at establishing high data rate links over significant distances. The main characteristic of an optical link, compared with an electrical link, is that the attenuation and the dissipation depend only weakly on the distance. Comparing electrical and optical technologies over shorter distances can only be done by taking into account the possibility of reducing the voltage excursion on the electrical lines. High-performance optical modulators still need to be developed for optical links. This comparison leads to the definition of a critical distance, dependent on the data rate, beyond which the optical solution is indisputably more efficient. This distance keeps getting shorter with time and is around a few centimeters for high data rates. Although this technology is today reserved for computing servers, tomorrow its use could be generalized to commonly used circuits.