Gated Clocks in RT-Synthesis and Simulation

HDL&VUFE: High Level and Logic Synthesis – Developing of Standards

Wolfgang Ecker, Andre Windisch, Jochen Mades, Thomas Schneider, Ke Yang

Contact: Wolfgang.Ecker@infineon.com, Phone +49 89 234 45334, Fax +49 89 234 44950
Mail: Infineon Technologies – CPD DAT ADM, Otto-Hahn-Ring 6, 81730 München

Introduction

Due to the low-power requirements of an increasing number of products, such as mobile phones, laptops, and personal organizers, more and more designs use clock gating to turn off complete parts of a chip and thereby reduce power consumption. Gated clocks are also used in designs that are not low-power designs, to guarantee that the maximum allowed power dissipation of a chip is not exceeded. Alternative design methods for low power concentrate on asynchronous design styles and the synthesis of asynchronous circuits.

Currently, clock gating in synthesis is done by explicitly modeling the gates that control the clock. This causes semantic differences between the VHDL RT model and the synthesized netlist, as shown next after some introductory words about synchronous design in VHDL. The following sections briefly present workarounds based on clock balancing and on delayed signal assignments. Afterwards, alternatives based on extended wait statements and guard signals are shown. Finally, a proposal for a VHDL extension is discussed: allowing general expressions in port maps, which we call port expressions. The semantics of these port expressions shall be based on implicit signals.

Synchronous RT-Designs in VHDL

From the modeling point of view, RT-VHDL relies on the idea that timing semantics are described only in terms of the clock. Many additional restrictions have to be considered, which go far beyond the scope of this section. Details can be found in the manuals of VHDL RT-synthesis tools or in documents relating to the VHDL synthesis standard subset.
The following example shows VHDL code that synthesis tools interpret as a single-bit flipflop without reset.

    entity dff is
      port( clk : in bit;
            d   : in bit;
            q   : out bit );
    end dff;

    architecture simple of dff is
    begin
      process
      begin
        wait until clk = '1';
        q <= d;
      end process;
    end simple;

Combining three of these flipflops into a shift register, as in the VHDL code below, gives a good example of how the VHDL standard simulation cycle relates to hardware.

    entity shift_register is
      port( clk : in bit;
            d   : in bit;
            q   : out bit );
    end shift_register;

    architecture structural of shift_register is
      signal t1, t2 : bit;
    begin
      ff1: entity WORK.dff(simple) port map(clk=>clk, d=>d,  q=>t1);
      ff2: entity WORK.dff(simple) port map(clk=>clk, d=>t1, q=>t2);
      ff3: entity WORK.dff(simple) port map(clk=>clk, d=>t2, q=>q);
    end structural;

If VHDL did not separate signal computation from signal update, i.e. if assignments were performed immediately, the result of the code above would depend on the order in which the flipflops are evaluated and would therefore be undetermined. In particular, if the computation sequence were ff3, ff2 and then ff1, the simulation result would match the corresponding hardware consisting of three serially connected flipflops. In contrast, if the computation sequence were reversed, i.e. ff1, ff2 and then ff3, the simulation semantics would be those of one single flipflop, i.e. they would not match the semantics of the synthesized hardware.

Fortunately, VHDL possesses its sometimes disliked and misunderstood simulation semantics, which ensure a short delay of the signal assignment, often called "delta delay". In the example, this causes all new flipflop output values to be computed first from the input values, and all output values to be assigned one simulation cycle later. This corresponds loosely to the real propagation delay through a flipflop. It is the basis for synchronous, technology-independent modeling.
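The effect of the delta delay on the shift register can be checked with a small self-checking testbench. The following is a minimal sketch under stated assumptions, not part of the original design: the names tb_shift_register and clk_period and the stimulus are chosen for illustration only. The file repeats the dff and shift_register units so that it is self-contained, holds d at '1', and asserts that q stays '0' until three rising clock edges have occurred.

```vhdl
-- Sketch only: tb_shift_register, clk_period and the stimulus are assumptions.
entity dff is
  port (clk : in bit; d : in bit; q : out bit);
end dff;

architecture simple of dff is
begin
  process
  begin
    wait until clk = '1';
    q <= d;  -- q takes its new value one delta cycle after the clock edge
  end process;
end simple;

entity shift_register is
  port (clk : in bit; d : in bit; q : out bit);
end shift_register;

architecture structural of shift_register is
  signal t1, t2 : bit;
begin
  ff1 : entity work.dff(simple) port map (clk => clk, d => d,  q => t1);
  ff2 : entity work.dff(simple) port map (clk => clk, d => t1, q => t2);
  ff3 : entity work.dff(simple) port map (clk => clk, d => t2, q => q);
end structural;

entity tb_shift_register is
end tb_shift_register;

architecture test of tb_shift_register is
  constant clk_period : time := 10 ns;
  signal clk, d, q : bit := '0';
begin
  dut : entity work.shift_register(structural)
    port map (clk => clk, d => d, q => q);

  stimulus : process
  begin
    d <= '1';
    for i in 1 to 3 loop
      wait for clk_period / 2;
      -- Checked just before the i-th rising edge: the '1' has not arrived yet.
      assert q = '0'
        report "q changed before the third rising clock edge"
        severity error;
      clk <= '1';
      wait for clk_period / 2;
      clk <= '0';
    end loop;
    wait for clk_period / 2;
    -- After three rising edges the '1' has passed through ff1, ff2 and ff3.
    assert q = '1'
      report "q did not follow d after three rising clock edges"
      severity error;
    wait;  -- end of stimulus
  end process;
end test;
```

With delta-delayed signal updates, neither assertion should fire, regardless of the order in which the simulator evaluates ff1, ff2 and ff3.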
Simulation and Synthesis Mismatch for Gated Clocks

The advantage of the VHDL signal semantics for clock-related descriptions, especially RT descriptions, holds only if the clock changes in the first simulation cycle (also called first delta cycle) of a simulation time and all other signals (especially the inputs of the flipflops) change either at another simulation time (which may occur if they are driven by flipflops of another clock domain) or, at the earliest, one simulation cycle after the clock. It also works if all clocks change in the same delta cycle and the combinational signals change thereafter.

Gating a clock in a structural way and modeling this with zero-delay gates, as is usual in RT descriptions, causes clock signals to change in different delta cycles, i.e. one clock edge occurs one or more delta cycles after another clock edge. This implies that the data input of a flipflop might change before, or at the same time as, the clock event. This would result in the feed-through simulation semantics of a wire, which contradicts the synthesis semantics of a flipflop. Gating only one flipflop of the shift register, as shown below, is a good example to illustrate this scenario:

    entity shift_register_with_gate is
      port( clk    : in bit;
            d      : in bit;
            ff2_en : in bit;
            q      : out bit );
    end shift_register_with_gate;

    architecture structural of shift_register_with_gate is
      signal t1, t2 : bit;
      signal ff2_en_clk : bit;
    begin
      clock_gate: entity WORK.and2 port map( clk, ff2_en, ff2_en_clk );
      ff1: entity WORK.dff(simple) port map(clk=>clk, d=>d, q=>t1);
      ff2: entity WORK.dff(simple) port map(clk=>ff2_en_clk, d=>t1, q=>t2);
      ff3: entity WORK.dff(simple) port map(clk=>clk, d=>t2, q=>q);
    end structural;

Here, the gated clock for ff2 is derived from clk using an and-gate, which is assumed to be a zero-delay gate, i.e. its output changes exactly one delta after the input change.
This means that the clock signal at the ff2 flipflop changes one delta after the clock signal at the ff1 and ff3 flipflops. This implies that t2, the output of ff2, changes two deltas after the clock clk, which is sufficiently late to guarantee a correct simulation of ff3. However, because a potential edge at ff2_en_clk occurs only one delta after an edge at clk, t1, the output of ff1 and the input of ff2, changes in the same delta cycle as ff2_en_clk. As a result, ff2 takes the new value of t1 immediately and not one clock cycle later. In simulation, ff1 is effectively superfluous. A simulation/synthesis mismatch occurs.

Balanced Clock for Mismatch Resolution

One simple solution for this simulation/synthesis mismatch is the insertion of delay elements for all clock signals such that all clocks have their edges in the same delta cycle. The modification of the example is shown below.

    architecture balanced_clock of shift_register_with_gate is
      signal t1, t2 : bit;
      signal ff2_en_clk : bit;
      signal ff1_3_del_clk : bit;
    begin
      clock_gate:  entity WORK.and2 port map( clk, ff2_en, ff2_en_clk );
      clock_delay: entity WORK.buf port map( clk, ff1_3_del_clk );
      ff1: entity WORK.dff(simple) port map(clk=>ff1_3_del_clk, d=>d, q=>t1);
      ff2: entity WORK.dff(simple) port map(clk=>ff2_en_clk, d=>t1, q=>t2);
      ff3: entity WORK.dff(simple) port map(clk=>ff1_3_del_clk, d=>t2, q=>q);
    end balanced_clock;

It is obvious that a huge effort has to be spent to balance all clocks and to ensure that all clock edges occur in the same delta cycle. It also has to be ensured that the instance of the buffer buf does not force the synthesis tool to infer a buffer in the clock tree.

Delayed Assignments for Mismatch Resolution

Another solution is to delay the output of the flipflops by a specific amount of simulation time. The following code shows the modified flipflop primitive, which has a default offset but also allows a specific one to be set.
    entity dff_with_offset is
      generic( t_offset : time := 1 ns );
      port( clk : in bit;
            d   : in bit;
            q   : out bit );
    end dff_with_offset;

    architecture simple of dff_with_offset is
    begin
      process
      begin
        wait until clk = '1';
        q <= d after t_offset;
      end process;
    end simple;

Using these flipflops in conjunction with gated clocks ensures that all flipflop outputs change at a different simulation time than the clock does. No delta race can occur in this case. This solution, however, has several drawbacks that make it hard to use. One is that propagation delays are not supported by synthesis; they are, however, mostly ignored. More important, it must be assured that the derived clock edge is not itself delayed by t_offset. This can partially be achieved by using flipflops without offset to drive the clock gate logic; however, this does not work if a flipflop controlling a derived clock is itself driven by a derived clock.

Using wait-statements for Clock Gating

To achieve a clean solution for clock control, clock gating must be possible without the loss of one delta cycle, i.e. a logic function must be applied to the clock signal without delay. One possibility is to include the enable mechanism in the wait statement, as shown in the flipflop model below.

    entity dff_with_enable is
      port( clk : in bit;
            d   : in bit;
            en  : in bit;
            q   : out bit );
    end dff_with_enable;

    architecture simple of dff_with_enable is
    begin
      process
      begin
        wait on clk until clk = '1' and en = '1';
        q <= d;
      end process;
    end simple;

The shift register can then be modified to:

    architecture with_enable_flipflop of shift_register_with_gate is
      signal t1, t2 : bit;
    begin
      ff1: entity WORK.dff(simple) port map(clk=>clk, d=>d, q=>t1);
      ff2: entity WORK.dff_with_enable(simple) port map(clk=>clk, d=>t1, en=>ff2_en, q=>t2);
      ff3: entity WORK.dff(simple) port map(clk=>clk, d=>t2, q=>q);
    end with_enable_flipflop;

This model simulates clock gating correctly. From a modeling standpoint, however, effort must be spent to "gate" each wait statement.
Gating a complete block requires the modification of each wait statement and of the port lists. This is not acceptable, especially for reusable blocks (or IP). It must also be noted that this form of the wait statement is currently not synthesizable.

Using Guarded Blocks for Clock Gating

Another possibility for clock gating is to use the guard expression of a block to perform the required logical operation. To be able to use the code without modification, the guard must be converted (in our case to bit) so that it can be connected to a clock signal. Because the guard signal is an implicit signal, this form of clock gating does not need an additional delta cycle either. The solution works fine and allows clock gating of complete, possibly reused, blocks. Unfortunately, this solution is currently not synthesizable.

    architecture with_guard of shift_register_with_gate is
      signal t1, t2 : bit;
    begin
      ff1: entity WORK.dff(simple) port map(clk=>clk, d=>d, q=>t1);
      -- the guard expression must be of type boolean
      clock_gate: block( (clk and ff2_en) = '1' )
      begin
        -- to_bit: boolean-to-bit conversion function, assumed to be user-defined
        ff2: entity WORK.dff(simple) port map(clk=>to_bit(guard), d=>t1, q=>t2);
      end block clock_gate;
      ff3: entity WORK.dff(simple) port map(clk=>clk, d=>t2, q=>q);
    end with_guard;

Extending Type Conversion Functions in Port Maps

The example above still contains some overhead for clock gating, namely the block statement and the type conversion function. For that reason, we propose to allow more flexible expressions in port maps, as shown in the final example. The expression is then responsible for creating an implicit signal in VHDL, which is evaluated and updated before all explicit signals are updated.
    architecture with_port_expression of shift_register_with_gate is
      signal t1, t2 : bit;
    begin
      ff1: entity WORK.dff(simple) port map(clk=>clk, d=>d, q=>t1);
      ff2: entity WORK.dff(simple) port map(clk=>(clk and ff2_en), d=>t1, q=>t2);
      ff3: entity WORK.dff(simple) port map(clk=>clk, d=>t2, q=>q);
    end with_port_expression;

Summary

We showed discrepancies between synthesis and simulation that arise when gated clocks need to be modeled. We proposed and discussed several alternatives for solving that problem, which need some extension of the currently supported synthesizable VHDL subset. Finally, we proposed port expressions, a VHDL extension that helps to model clock gating of complete blocks efficiently and consistently between simulation and synthesis.

Thoughts, Comments, and Suggestions

A similar problem arises when derived clocks with a multiple or a fraction of the frequency, or with a phase shift, need to be generated.
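The derived-clock case can be illustrated in the same way as the gated clock. The following is a minimal sketch, assuming a divide-by-two clock generated by a process; the names tb_divided_clock, clk_div and divider, as well as the stimulus, are hypothetical and not taken from the design above. The divided clock clk_div toggles one delta after clk, i.e. in the same delta cycle in which the output t1 of ff1 changes, so ff2 samples the new value of t1 on the very same clock edge.

```vhdl
-- Sketch only: tb_divided_clock and all signal names are illustrative assumptions.
entity dff is
  port (clk : in bit; d : in bit; q : out bit);
end dff;

architecture simple of dff is
begin
  process
  begin
    wait until clk = '1';
    q <= d;
  end process;
end simple;

entity tb_divided_clock is
end tb_divided_clock;

architecture test of tb_divided_clock is
  signal clk, clk_div : bit := '0';
  signal d, t1, q     : bit := '0';
begin
  -- Divide-by-two clock: clk_div toggles one delta after each rising edge of clk.
  divider : process
  begin
    wait until clk = '1';
    clk_div <= not clk_div;
  end process;

  ff1 : entity work.dff(simple) port map (clk => clk,     d => d,  q => t1);
  ff2 : entity work.dff(simple) port map (clk => clk_div, d => t1, q => q);

  stimulus : process
  begin
    d <= '1';
    wait for 5 ns;
    clk <= '1';  -- first rising edge of clk, which also raises clk_div
    wait for 5 ns;
    -- With the intended register semantics, q would still be '0' here,
    -- because ff2 should capture the value t1 had before the edge.
    assert q = '0'
      report "delta race: ff2 sampled the new value of t1 on the same clock edge"
      severity warning;
    wait;  -- end of stimulus
  end process;
end test;
```

Under the standard delta-cycle semantics the warning is expected to be reported, because t1 and clk_div are both updated one delta after clk. This is the same feed-through effect as described for the gated clock.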