Rehan Ahmed
Department of Electrical and Computer Engineering,
University of Wisconsin-Madison, Madison, WI,
Email: rahmed3@wisc.edu
A design methodology for superscalar architecture optimization for DSP operations is proposed. The optimization considers both performance and power consumption metrics.
Two basic approaches are initially evaluated. The first approach follows a heuristic recursive algorithm based on the general working of the superscalar architecture. The second approach uses simulated annealing to converge to an optimized configuration. A hybrid approach, which uses an initial configuration obtained from simulated annealing and further improves it using architecture heuristics, has also been developed. Over 200% improvement over the initial configuration has been achieved through the hybrid method.
The results have been validated through various DSP benchmark applications in the
MiBench Suite [1].
Index Terms— Superscalar, Search and optimization, Simplescalar, Wattch
Power consumption has become a primary concern in modern processor design. With the advent of mobile processing platforms, it is becoming increasingly critical to design efficient architectures that take power consumption into account at the initial design stages.
This paper presents an approach for guided design space exploration of superscalar architecture. The methodology used is a combination of random search and optimization algorithm and a heuristic approach based on the functioning of superscalar architecture.
Previous approaches in the area of superscalar optimization use simulated annealing [2], random search [3], genetic algorithms [4] and multidimensional optimization algorithms [5].
However, in these approaches, only a limited set of architectural parameters has been considered. This results in an exploration of a small portion of the actual design space.
Furthermore, the existing approaches do not consider power consumption and performance concurrently. Other methods used for superscalar optimization improve individual functional blocks such as the cache [6, 7], which can result in an overall reduction in the efficiency and performance of a given architecture. The methodology followed in this paper performs evaluations and optimizations at the global architectural level. This paper extends our previous work in superscalar architecture design and
optimization [8]. In our prior work, a methodology based on heuristics was proposed, and an automated design and optimization tool (SSOPT) was developed based on it. The optimized architecture specifically targeted sample rate conversion operations for software radio applications. In this paper, architectural improvements have been conducted for generic DSP benchmark applications. Furthermore, the optimization approach has been extended by evaluating two new approaches. Simulated annealing has been used for conducting global architectural optimization, and these optimization results have been compared with those of SSOPT [8]. Based on these results, a hybrid approach has been proposed which utilizes the strong points of both of its constituent approaches.
During all optimizations, performance, power consumption and complexity of architectures have been considered to validate the feasibility of optimized configurations.
In the remainder of this paper, Section 2 explains the simulation framework developed and used throughout this work. Section 3 describes the working of SSOPT, the optimization tool based on the functioning of the superscalar architecture. Section 4 covers the operation of the simulated annealing based optimization approach. This is followed by a brief explanation of the benchmark applications and the optimization results of both approaches in Section 5.
The hybrid approach based on the simulation results of Section 5 is given in Section 6, followed by the conclusion and references.
2.1. Simulation Tools
All optimization methodologies evaluated in this paper consider the power consumption, performance and complexity of superscalar architectures when validating processor configurations. Power and performance measures have been obtained using the SimpleScalar architectural modeling tool [9] and its power extension Wattch [10]. All power estimates are given at an operating frequency of 1 GHz, assuming 100 nm CMOS technology. Conditional clock gating (cc3) has been assumed for all power estimates; this assumes full power consumption for active units and 10% power consumption for inactive architectural units. Although complexity and area estimates have not been considered in the optimization's objective function, they have been estimated using the cost model given in [11], for which an automated tool has been developed. The complexity estimates of the optimized architectural configurations have been compared with commercial architectures.
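As an illustration of how a candidate configuration is handed to the simulator, a minimal sketch is given below. The flag names follow the standard SimpleScalar 3.0 sim-outorder options (Wattch retains the same command-line interface); the particular values, the benchmark invocation and the output path are placeholders, not the actual command line used by the tool.

    /* Sketch: issuing one Simplescalar/Wattch simulation for a single
     * candidate configuration, in the spirit of the automation script
     * described later in the paper. Flag values, the benchmark binary
     * and the output path are placeholders. */
    #include <stdlib.h>

    int main(void)
    {
        return system("./sim-outorder "
                      "-fetch:ifqsize 4 -decode:width 4 -issue:width 4 -commit:width 4 "
                      "-ruu:size 64 -lsq:size 32 "
                      "-cache:il1 il1:256:32:4:l -cache:dl1 dl1:256:32:4:l "
                      "-res:ialu 4 -res:imult 1 -mem:width 8 "
                      "./benchmark > stats.txt 2>&1");
    }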
2.2. Benchmark Applications
This paper targets optimizations for a range of complex DSP operations. For this reason, a subset of the MiBench [1] benchmark suite covering communication operations has been used.
The applications for which optimizations have been conducted are: (i) FFT, (ii) IFFT, (iii) CRC32, (iv) ADPCM Encode, and (v) ADPCM Decode. FFT and IFFT perform the Fast Fourier Transform and its inverse respectively. CRC32 performs a cyclic redundancy check on given data. ADPCM encode and decode functions implement Adaptive Differential Pulse
Code Modulation.
2.3. Gain Evaluation
As mentioned in Section 2.1, the optimization approaches proposed in this paper consider power consumption, performance and complexity estimates for validating configuration changes. All methodologies effect architectural changes iteratively and compare the changed configuration to the former configuration. For such a comparison, a parameter, 'Efficiency-Gain', is evaluated, which combines performance and power-efficiency measures. The basic expression for the efficiency gain is given in equation 1.
The parameter is a weighted sum of the IPC (Instructions per Cycle) gain and the APPI (Average Power per Instruction) gain. The relative precedence given to APPI and IPC can be set before the start of a given optimization by changing the value of 'scale' [8].
Eff_Gain = (IPC_new - IPC_old) / IPC_old + scale * (APPI_old - APPI_new) / APPI_old        (1)
The optimizations covered in this paper target architectures with a high level of efficiency. For this reason, APPI has been given much higher precedence than IPC, and the value of scale has been set to 50 to give it the higher weight.
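As a concrete illustration, the sketch below evaluates this gain in C. The functional form follows equation 1 as stated above (fractional IPC and APPI gains, with the APPI term weighted by scale); the numeric values in the example are illustrative only.

    /* Sketch of the efficiency-gain evaluation of equation 1: fractional
     * IPC and APPI gains, with the APPI term weighted by 'scale' (= 50). */
    #include <stdio.h>

    #define SCALE 50.0

    static double eff_gain(double ipc_old, double appi_old,
                           double ipc_new, double appi_new)
    {
        double ipc_gain  = (ipc_new - ipc_old) / ipc_old;    /* higher IPC is better */
        double appi_gain = (appi_old - appi_new) / appi_old; /* lower APPI is better */
        return ipc_gain + SCALE * appi_gain;
    }

    int main(void)
    {
        /* A 20% increase in APPI with unchanged IPC gives a gain of -10,
         * consistent with the discussion of equation 2 below. */
        printf("Eff_Gain = %.2f\n", eff_gain(1.0, 4.0, 1.0, 4.8));
        return 0;
    }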
An automated superscalar optimization tool, SSOPT [8], has been previously developed for architectural optimizations. The tool is based on the general working of the superscalar architecture and considers both power consumption and performance for evaluating architectural configurations.
SSOPT conducts optimization by progressing through various stages. Each stage corresponds to a change in the configuration of a certain architectural parameter or parameters. For instance, the first stage covers the optimization of the Branch Table. In each stage, a given set of simulation results is analyzed. If the results show improvement, the configuration change for that particular stage is repeated. In case of deterioration, the configuration change is discarded and SSOPT moves to the next stage. The details of each stage are given in Table 1. To increase the effectiveness of the optimization methodology, the configuration changes in each stage are bidirectional (the dir = 0 and dir = 1 columns of Table 1).
Table 1: SSOPT Stage Parameters

Stage  Configuration Change     Monitored Result           dir = 0  dir = 1
1      Branch Table             Branch_Misses              Incr     Dec
2      BTB                      IPC/APPI                   Incr     Dec
3      Return Address Stack     IPC/APPI                   Incr     Dec
4      IFQ, Exec Win, I ALU     IFQ_full, Eff_gain, IPB    Incr     Dec
5      I ALU                    Eff_gain                   Incr     Dec
6      I Mul/Div                Eff_gain                   Incr     Dec
7      FP ALU                   Eff_gain                   Incr     Dec
8      FP Mul/Div               Eff_gain                   Incr     Dec
9      RUU                      Eff_gain                   Dec      Incr
10     LSQ                      Eff_gain                   Dec      Incr
11     I-Compress               Eff_gain                   En       En
12     I-Cache                  Eff_gain                   Dec      Incr
13     D-Cache                  Eff_gain                   Dec      Incr
14     Instruction TLB          Eff_gain                   Dec      Incr
15     Data TLB                 Eff_gain                   Dec      Incr
16     Bus Width                Eff_gain                   Incr     Dec
17     Memory To System Ports   Eff_gain                   Incr     Dec
18     Exit Stage               Eff_gain                   Nil      Nil
Another important feature of SSOPT is its recursive nature. After completion of all 18 stages (an optimization sub-cycle), SSOPT compares the simulation results of the final architectural configuration with those of the initial configuration and calculates the efficiency gain parameter 'Eff_Gain'. If the Eff_Gain turns out to be positive, this implies that further improvement may be possible. In that case, SSOPT takes the final configuration of the completed optimization sub-cycle as the new initial configuration and goes through all of the 18 stages again. This gives the algorithm its recursive nature and reduces the dependence of the entire optimization process on the initial configuration.
However, due to its heuristic nature, this optimization approach has a tendency to converge to local optima. This effect is reduced to some extent by selecting a scalable initial configuration. To ensure scalability, the sizes of the caches, RUU and LSQ are kept high in the initial configurations.
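The overall control flow of SSOPT can be summarised by the sketch below. Here apply_stage(), undo_stage() and simulate() are placeholders for the corresponding parts of the actual tool, and the loop structure reflects the repeat-on-improvement and sub-cycle behaviour described above.

    /* Sketch of the SSOPT control flow: 18 stages per optimization
     * sub-cycle, each stage repeated while it keeps improving, and the
     * whole sub-cycle repeated while improvement is still being found. */
    #define NUM_STAGES 18

    extern int    apply_stage(int stage);  /* effect the stage's configuration change */
    extern void   undo_stage(int stage);   /* revert the stage's configuration change */
    extern double simulate(void);          /* run Simplescalar/Wattch, return Eff_Gain
                                              relative to the previous configuration  */

    static void ssopt(void)
    {
        double cycle_gain;                 /* approximates the sub-cycle's Eff_Gain */
        do {
            cycle_gain = 0.0;
            for (int stage = 1; stage <= NUM_STAGES; stage++) {
                for (;;) {                           /* repeat stage while it improves */
                    if (!apply_stage(stage))         /* no further change possible     */
                        break;
                    double gain = simulate();
                    if (gain > 0.0) {
                        cycle_gain += gain;          /* keep the change, try it again  */
                    } else {
                        undo_stage(stage);           /* discard and move to next stage */
                        break;
                    }
                }
            }
        } while (cycle_gain > 0.0);                  /* recurse while the sub-cycle helps */
    }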
In this approach, a search and optimization algorithm based on simulated annealing [12, 2] has been developed. The flowchart followed by the simulated annealing algorithm is given in figure 1. For a given configuration change, the algorithm randomly selects an architectural parameter and its value from a defined set (Table 2).
The configuration change is effected and the Eff_Gain value is evaluated according to equation 1. If the configuration change yields a positive gain, the change is finalized and the process is repeated. In case of negative gain, the configuration is accepted based on the probability given by equation 2.
Figure 1: Implementation Flowchart of the Simulated Annealing Optimization (the initial configuration is simulated and its results recorded; a configuration change is selected per Table 2 and simulated; the gain is evaluated per equation 1; positive-gain changes are finalized, while negative-gain changes are accepted probabilistically per equation 2 or reverted; the process continues until 250 simulations have been completed and the configuration is then finalized).
p = exp( (sim_num / s) * (Eff_Gain / 10) )        (2)
In equation 2, 'sim_num' is the simulation index (the number of simulations conducted so far), s is a scaling constant, and p is the probability of acceptance of a configuration. As an additional clause, the probability of acceptance of a configuration having negative gain can never exceed the value 0.5. The probability function is plotted against simulation index for various values of gain in figure 2. As illustrated in the figure, the probability of selecting configurations yielding negative gain at the beginning of the optimization process is higher than at the concluding stages. This feature ensures that a large portion of the design space is explored and that the algorithm more or less converges to an optimized configuration at the end of the optimization process. It is important to note that, since the value of scale in equation 1 has been set to 50, a gain of -10 corresponds to a 20% increase in the APPI value while the effect of IPC is negligible. During the entire optimization process, a total of 250 simulations are conducted and the configuration yielding maximum gain is selected.
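A minimal C sketch of this acceptance test is given below; it follows equation 2 together with the 0.5 cap, and rand_uniform() is a hypothetical stand-in for the random number source used by the tool.

    /* Sketch of the simulated-annealing acceptance test described above. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static double rand_uniform(void)             /* uniform value in [0, 1) */
    {
        return (double)rand() / ((double)RAND_MAX + 1.0);
    }

    /* Returns 1 if a configuration change with the given gain is accepted. */
    static int accept_change(double gain, int sim_num, double s)
    {
        if (gain > 0.0)                          /* improvements are always kept   */
            return 1;

        double p = exp((sim_num / s) * (gain / 10.0));  /* equation 2              */
        if (p > 0.5)                             /* acceptance is capped at 0.5    */
            p = 0.5;
        return rand_uniform() < p;               /* probabilistic acceptance       */
    }

    int main(void)
    {
        /* Example: a change with gain -5 considered at simulation 100, s = 50. */
        printf("accepted: %d\n", accept_change(-5.0, 100, 50.0));
        return 0;
    }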
These simulations were run on Pentium-4 2.0 GHz machines with 512 MB RAM and
Ubuntu (Linux) operating system. The time required to run 250 simulations varied between 6 and 12 hours depending upon the benchmark. The set of parameters and their values that simulated annealing considers while effecting configuration changes is given in Table 2.
As is evident from the table, a very wide range of configuration parameters and values has been selected. This is in contrast to the approaches presented in [2][3][4][5], and contributes to a relatively higher level of optimization.
Figure 2: Probability function (acceptance probability versus simulation index for gain values of -2, -3, -5 and -10, with the 0.5 reference line shown).
Table 2: Parameter Values

Sr No  Parameter      Configuration
1      IFQ            1, 2, 4, 16, 32
2      Branch Table   16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
3      RAS            16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
4      BTB            16 4, 32 4, 64 4, 128 4, 256 4, 512 4, 1024 4, 2048 4, 4096 4, 8192 4
5      Decode Width   1, 2, 4, 16, 32
6      Issue Width    1, 2, 4, 16, 32
7      Commit Width   1, 2, 4, 16, 32
8      RUU            8, 16, 32, 64, 128, 256, 512, 1024
9      LSQ            8, 16, 32, 64, 128, 256, 512, 1024
10     I Cache        4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l, 8192:32:4:l
11     D Cache        4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l
12     Bus Width      4, 8, 16, 32, 64
13     I TLB          1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
14     D TLB          1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
15     I ALU          1, 2, 4, 8
16     I Mul/Div      1, 2, 4, 8
17     Memory Ports   1, 2, 4, 8
18     FP ALU         1, 2, 4, 8
19     FP Mul/Div     1, 2, 4, 8
The optimization process has been automated through the development of a software tool. The tool comprises a script file and an ANSI C file. The primary function of the script file is issuing commands to Simplescalar and transferring control to the C file and to the software module which gives area estimates. All other operations of the simulated annealing algorithm are managed by the C file. These include: (a) analyzing simulation results; (b) calculating the probability of acceptance in case of negative gain; (c) selecting a given configuration change randomly from the set defined in Table 2; (d) finalizing a configuration, either as a result of positive gain or probabilistically in case of negative gain; and finally (e) identification of the optimization termination point.
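Step (c), the random selection of a configuration change, can be illustrated as follows; only a few of the Table 2 parameters are shown and the data structures are simplified placeholders, not those of the actual tool.

    /* Sketch of step (c): randomly selecting a parameter and one of its
     * allowed values from the sets of Table 2 (abbreviated here). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct param {
        const char *name;
        const int  *values;
        int         count;
    };

    static const int ifq_vals[]  = { 1, 2, 4, 16, 32 };
    static const int ruu_vals[]  = { 8, 16, 32, 64, 128, 256, 512, 1024 };
    static const int ialu_vals[] = { 1, 2, 4, 8 };

    static const struct param params[] = {
        { "IFQ",   ifq_vals,  5 },
        { "RUU",   ruu_vals,  8 },
        { "I ALU", ialu_vals, 4 },
    };

    int main(void)
    {
        srand((unsigned)time(NULL));
        int n = (int)(sizeof params / sizeof params[0]);
        const struct param *p = &params[rand() % n];       /* pick a parameter       */
        int value = p->values[rand() % p->count];          /* pick one of its values */
        printf("change %s to %d\n", p->name, value);
        return 0;
    }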
4.1. Optimization Results
The results of the optimizations are given in figure 3 through figure 7. The results show the evolution of 'gain' against simulation index for both the SSOPT and simulated annealing algorithms. In equation 1, since the value of scale has been set to 50, the APPI gain has a larger contribution to the overall gain value. For the results given in this section, gain has been evaluated using equation 3.
(3)
It is evident that in most cases, simulated annealing performs much better than the methodology followed by SSOPT. The primary reason for this is that, due to the heuristic nature of its algorithm, SSOPT tends to converge to local optima. The final configuration is dependent on the initial configuration, and hence only a small portion of the actual design space is explored. Simulated annealing does not have this limitation, and the possibility of probabilistically selecting configurations which give negative gain ensures that a large portion of the design space is explored. However, the random nature of the simulated annealing algorithm means that further optimization of its final configuration often remains possible; this is the main reason why SSOPT outperforms simulated annealing in the FFT results.
Figure 3: FFT Results (gain versus simulation index for SA and SSOPT)

Figure 4: IFFT Results (gain versus simulation index for SA and SSOPT)

Figure 5: CRC Results (gain versus simulation index for SA and SSOPT)

Figure 6: PCM Decode Results (gain versus simulation index for SA and SSOPT)

Figure 7: PCM Encode Results (gain versus simulation index for SA and SSOPT)
Based on the simulation results of the last section, a hybrid approach has been developed which ensures better optimization than both of the approaches presented in the preceding sections.
The objective is to select the architectural configuration which gives maximum gain in the simulated annealing results. The selected configuration is then used as the initial configuration for the SSOPT algorithm. The flowchart for the hybrid approach is given in figure 8. This approach utilizes the strong points of both optimization algorithms. The first phase, comprising the simulated annealing algorithm, helps converge to a more or less globally optimal point. The later stage, comprising SSOPT, fine-tunes the configuration given by the simulated annealing algorithm and provides further performance improvement.
Results of this approach are given in figure 9 through figure 13 and illustrate the evolution of gain versus simulation index. The value of gain is calculated according to equation 3.
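In code, the hybrid flow reduces to the sketch below; the configuration type and the two phase functions are placeholders for the actual tools described in the preceding sections.

    /* Sketch of the hybrid flow: 250 simulated-annealing simulations,
     * followed by SSOPT fine-tuning of the best configuration found. */
    struct config;                                  /* architectural configuration (opaque) */

    extern void simulated_annealing(struct config *cfg, int num_sims); /* keeps max-gain config */
    extern void ssopt_refine(struct config *cfg);                      /* fine-tunes in place   */

    void hybrid_optimize(struct config *cfg)
    {
        simulated_annealing(cfg, 250);   /* phase 1: global search                    */
        ssopt_refine(cfg);               /* phase 2: heuristic fine-tuning with SSOPT */
    }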
Figure 8: Hybrid Methodology Flowchart (simulated annealing is applied to the initial configuration, the architectural configuration with maximum gain is selected, and SSOPT is then applied to finalize the configuration).
Figure 9: FFT Results (gain versus simulation index, hybrid approach)

Figure 10: IFFT Results (gain versus simulation index, hybrid approach)

Figure 11: CRC Results (gain versus simulation index, hybrid approach)

Figure 12: PCM Encode Results (gain versus simulation index, hybrid approach)

Figure 13: PCM Decode Results (gain versus simulation index, hybrid approach)
It is evident from the results depicted in figures 9 - 13 that the proposed hybrid optimization algorithm converges to architectural configurations having a greater level of optimization than either the simulated annealing approach or SSOPT. It displays performance improvements in all the considered benchmark applications, with the exception of PCM decode, in which case the final configuration acquired through the simulated annealing algorithm already has a very high level of optimization. The performance measures of the optimized architectural configurations are summarized in Table 4.
The architectural configuration with the maximum transistor count is the optimized configuration corresponding to the PCM encode operation. It is estimated at 16,258,447 transistors, which is less than the transistor count of the Pentium IV processor (approximately 42,000,000 transistors). This makes the practical implementation of our final configurations feasible. The final architectural configurations for the different DSP benchmarks are given in Table 5.
Table 4: Performance Measures

Bench   IPC     APPI    Perf (GIPs)  MIPS/W    Transistors
FFT     1.4528  4.4478  1.4528       224.8255  10,802,573
IFFT    1.1156  4.8753  1.1156       205.115   3,312,533
CRC     1.4312  3.6791  1.4312       271.8122  4,194,699
PCM E   0.9895  4.1501  0.9895       240.9653  16,258,447
PCM D   1.0516  3.8413  1.0516       260.3357  1,494,591
Table 5: Architectural Configurations

Configuration Parameter    FFT        IFFT       CRC        PCM E      PCM D
Instruction Fetch Queue    16         4          32         32         2
Branch Table Size          16384      32768      4096       65536      2048
Return Address Stack       8          16         8          64         4
Branch Target Buffer       1024       1024       256        256        128
Instruction Decode Width   32         16         32         32         32
Instruction Issue Width    2          2          2          2          2
Instruction Commit Width   4          32         4          32         16
Register Update Unit       16         16         16         8          8
Load Store Queue           16         8          8          4          8
D Cache                    512 Bytes  2 KB       1 KB       8 KB       512 Bytes
I Cache                    2 KB       4 KB       512 Bytes  512 Bytes  512 Bytes
Memory Bus Width           64 Bytes   64 Bytes   64 Bytes   64 Bytes   8 Bytes
Instruction TLB            32 KB      32 KB      16 KB      16 KB      8 KB
Data TLB                   8 KB       16 KB      16 KB      4 KB       64 KB
Integer ALUs               2          4          2          2          2
Integer Mul/Div            1          1          2          4          1
Memory to System Ports     4          2          4          2          1
Floating Point ALU         1          1          1          1          1
Floating Point Mul/Div     4          4          1          1          1
It should be noted that the results reported in this section cannot be directly compared with the optimization results of previous approaches. This is primarily because none of the optimizations covered in [2 - 5] have been conducted on the MiBench benchmark suite [1]. Furthermore, the initial configuration over which the value of gain is computed is different from the starting points of the former approaches. Similarly, Simplescalar and Wattch have been used in this work for the determination of performance, whereas in other cases (e.g. [2]) in-house customized tools have been used. As far as performance improvements are concerned, a 60% increase in performance along with a 50% increase in power consumption was reported in [2]. Our work targets high-efficiency designs, and in all cases an improvement of more than 200% in processor efficiency over the initial architectural configuration has been observed. A performance versus area tradeoff has also been reported in some prior work (e.g. [3]). In our approach, complexity estimates do not form part of the cost function used for validating configuration changes. The transistor counts of the final configurations have, however, been provided to validate the practical implementation aspects.
Work is presently being carried out to minimize the optimization time of this tool.
For this, several measures have been taken. A new feature, called the "configuration identifier", has been added to the optimization tool. It records the architectural parameters of a given configuration along with its performance results. If the configuration is encountered again during the optimization process, it is not re-simulated; the previously recorded results are used instead. The simulated annealing stage of this framework has a high tendency of re-simulating a given architectural configuration, since configurations are selected on a random basis. Given this, the configuration identifier should save significant optimization time. Furthermore, the configuration parameters considered by the simulated annealing approach (Table 2) have been limited based on the optimization results. This should help decrease the number of simulation cycles required by simulated annealing to converge to an optimal architectural configuration. However, the probability function given in equation 2 has to be changed to ensure convergence at a simulation index < 250. Both of these measures have been implemented; however, their results are presently pending.
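A sketch of such a configuration identifier is given below; the fixed-size parameter vector, the result record and the linear lookup are simplifications, not the data structures of the actual tool.

    /* Sketch of the "configuration identifier": previously simulated
     * configurations are remembered so they are not simulated again. */
    #include <string.h>

    #define NUM_PARAMS   19      /* number of architectural parameters (Table 2) */
    #define MAX_RECORDS  250     /* at most one record per simulation            */

    struct record {
        int    params[NUM_PARAMS];     /* the configuration                */
        double ipc, appi;              /* its recorded simulation results  */
    };

    static struct record cache[MAX_RECORDS];
    static int num_records = 0;

    /* Returns the stored results if this configuration has been simulated
     * before, or NULL if it still has to be simulated. */
    static struct record *lookup(const int params[NUM_PARAMS])
    {
        for (int i = 0; i < num_records; i++)
            if (memcmp(cache[i].params, params, sizeof cache[i].params) == 0)
                return &cache[i];
        return NULL;
    }

    static void store(const int params[NUM_PARAMS], double ipc, double appi)
    {
        if (num_records < MAX_RECORDS) {
            memcpy(cache[num_records].params, params, sizeof cache[num_records].params);
            cache[num_records].ipc  = ipc;
            cache[num_records].appi = appi;
            num_records++;
        }
    }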
In this paper, two superscalar optimization approaches have been compared. The first approach systematically changes architectural parameters based on heuristics. The second approach is based on the simulated annealing algorithm, randomly selecting architectural configurations from a large configuration set. Based on the optimization results of these approaches, a hybrid approach has been proposed which provides major performance improvements. DSP benchmarks from the MiBench benchmark suite have been used for conducting the optimizations. An efficiency gain of more than 200% over the initial processor configuration is observed for all benchmark applications. The resultant architectures have an efficiency of up to 271 MIPS/W while giving a performance of up to 1.4312 GIPs. The complexity of all optimized configurations is less than the transistor count of the Pentium IV processor.
[1] M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, R.B. Brown, "MiBench: A free, commercially representative embedded benchmark suite", IEEE International Workshop on Workload Characterization, 2001, pp. 3-14.
[2] T.M. Conte, K.M. Menezes, S.W. Sathaye, "A technique to determine power-efficient, high-performance superscalar processors", Proceedings of the 28th Hawaii
International Conference on System Sciences, pp. 324-333, Vol 1, 1995.
[3] S.V. Haastregt and P.M.W. Knijnenburg, “Feasibility of Combined Area and
Performance Optimization for Superscalar Processors Using Random Search”,
Proceedings of Design Automation and Test in Europe, 2007, pp. 1-6.
[4] Mauro Olivieri, “A genetic approach to the design space exploration of superscalar microprocessor architectures”, Proceedings of the 2001 IEEE International Symposium on Circuits and Systems, pp. 69-72, Vol 5.
[5] V. Zyuban and P. Kogge, “Optimization of high-performance superscalar architectures for energy efficiency”, Proceedings of the 2000 International Symposium on Low Power Electronics and Design, 2000, pp. 84-89.
[6] A. Asaduzzaman, I. Mahgoub, P. Sanigepalli, H. Kalva, R. Shankar, and B. Furht,
“Cache Optimization for Mobile Devices Running Multimedia Applications”,
Proceedings of the IEEE Sixth International Symposium on Multimedia Software
Engineering, 2004, pp. 499-506.
[7] A. Ghosh, T. Givargis, “Cache Optimization for Embedded Processor Cores: An
Analytical Approach”, International Conference on Computer Aided Design, 2003, pp.
342-347.
[8] F. Sheikh, S. Masud, R. Ahmed, "Superscalar Architecture Design for High
Performance DSP Operations", Microprocessors and Microsystems, Elsevier
(accepted for publication) http://dx.doi.org/10.1016/j.micpro.2008.10.002
[9] T. Austin, E. Larson, D. Ernst, “SimpleScalar: An infrastructure for Computer
System Modeling", IEEE Computer, Feb. 2002, pp. 59-68.
[10] D. Brooks, V. Tiwari, M. Martonosi, “Wattch: A Framework for Architectural-Level
Power Analysis and Optimization”, Proceedings of 27th International Symposium on
Computer Architecture, 2000, pp. 83-94.
[11] M. Steinhaus, R. Kolla, J. Larriba-Pey, T. Ungerer, and M. Valero “Transistor count and chip-space estimation of simplescalar-based microprocessor models”, Proceedings of the Workshop on Complexity-Effective Design, 2001.
[12] S. Kirkpatrick, C. D. Gelatt, Jr., M. P. Vecchi, “Optimization by Simulated
Annealing”, Science, May 1983, Volume 220, Number 4598, pp 671 - 680.