

SUPERSCALAR DESIGN SPACE EXPLORATION AND OPTIMIZATION FRAMEWORK FOR DSP OPERATIONS

Rehan Ahmed

Department of Electrical and Computer Engineering,

University of Wisconsin-Madison, WI,

Email: rahmed3@wisc.edu

Abstract

A design methodology for superscalar architecture optimization for DSP operations is proposed. The optimization considers both performance and power consumption metrics.

Two basic approaches are evaluated first. The first follows a heuristic, recursive algorithm based on the general working of a superscalar architecture. The second uses simulated annealing to converge to an optimized configuration. A hybrid approach has also been developed, which takes the configuration obtained from simulated annealing as its starting point and improves it further using the architectural heuristics. An improvement of over 200% over the initial configuration has been achieved through the hybrid method.

The results have been validated through various DSP benchmark applications in the MiBench suite [1].

Index Terms— Superscalar, Search and optimization, Simplescalar, Wattch

1. INTRODUCTION

Power consumption has become a primary concern in modern processor design. With the advent of mobile processing platforms, it is increasingly critical to design efficient architectures that take power consumption into account at the initial design stages.

This paper presents an approach for guided design space exploration of superscalar architectures. The methodology combines a random search and optimization algorithm with a heuristic approach based on the functioning of superscalar architectures.

Previous approaches in the area of superscalar optimization use simulated annealing [2], random search [3], genetic algorithms [4] and multidimensional optimization algorithms [5]. However, these approaches consider only a limited set of architectural parameters, which results in the exploration of a small portion of the actual design space.

Furthermore, power consumption and performance have not been considered concurrently in the existing approaches. Other methods used for superscalar optimization improve individual functional blocks, such as the cache [6, 7], which can result in an overall reduction in the efficiency and performance of the architecture as a whole. The methodology followed in this paper, in contrast, performs evaluation and optimization at the global architectural level.

This paper extends our previous work in superscalar architecture design and optimization [8]. In our prior work, a methodology based on architectural heuristics was proposed, and an automated design and optimization tool (SSOPT) was developed on top of it. The optimized architecture specifically targeted sample rate conversion operations for software radio applications. In this paper, architectural improvements are conducted for generic DSP benchmark applications, and the optimization approach is extended by evaluating two new approaches: simulated annealing is used to conduct global architectural optimization, and its results are compared with those of SSOPT [8]. Based on these results, a hybrid approach is proposed which utilizes the strong points of both of its constituent approaches.

During all optimizations, performance, power consumption and complexity of architectures have been considered to validate the feasibility of optimized configurations.

In the remainder of the paper, Section 2 explains the simulation framework developed and used throughout this work. Section 3 describes the working of SSOPT, the optimization tool based on the functioning of superscalar architecture. Section 4 covers the operation of the simulated annealing based optimization approach. Section 5 presents the optimization results of both approaches for the selected benchmark applications. The hybrid approach, based on the simulation results of Section 5, is given in Section 6, followed by future work, the conclusion and references.

2. SIMULATION FRAMEWORK

2.1. Simulation Tools

All optimization methodologies evaluated in this paper consider the power consumption, performance and complexity of superscalar architectures when validating processor configurations. Power and performance measures are obtained using the SimpleScalar architectural modeling tool [9] and its power extension, Wattch [10]. All power estimates are given at an operating frequency of 1 GHz, and a 100 nm CMOS technology is assumed. Conditional clock gating (cc3) is assumed for all power estimates; that is, active units consume full power while inactive architectural units consume 10% of their full power. Although complexity and area estimates are not part of the optimization's objective function, they have been obtained with an automated tool that implements the cost model given in [11], and the complexity of the optimized architectural configurations has been compared with that of commercial architectures.

2.2. Benchmark Applications

This paper targets optimizations for a range of complex DSP operations. For this reason, a subset of the MiBench benchmark suite [1] covering communication operations has been used. The applications for which optimization has been conducted are: (i) FFT, (ii) IFFT, (iii) CRC32, (iv) ADPCM Encode and (v) ADPCM Decode. FFT and IFFT perform the Fast Fourier Transform and its inverse, respectively. CRC32 performs a Cyclic Redundancy Check on given data. The ADPCM encode and decode functions implement Adaptive Differential Pulse Code Modulation.

2.3. Gain Evaluation

As mentioned in Section 2.1, the optimization approaches proposed in this paper consider power consumption, performance and complexity estimates when validating configuration changes. All methodologies effect architectural changes iteratively and compare the changed configuration with the previous configuration. For this comparison, a parameter called 'Efficiency Gain' (Eff_Gain) is evaluated, which combines performance and power-efficiency measures. The basic expression for the efficiency gain is given in equation 1.

The parameter is a weighted sum of the IPC (Instructions Per Cycle) gain and the APPI (Average Power Per Instruction) gain. The relative precedence given to APPI and IPC can be set before the start of a given optimization by changing the value of 'scale' [8].

IPC_Gain = %IPC_increase

APPI_Gain = %APPI_decrease

Eff_Gain = IPC_Gain + scale × APPI_Gain    (1)

The optimizations covered in this paper target architectures with a high level of efficiency. For this reason, APPI has been given much higher precedence than IPC: the value of scale has been set to 50 to give APPI the higher weight.
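As an illustration of how this objective is evaluated, a minimal ANSI C sketch is given below. The structure layout, field names and example numbers are assumptions made for exposition; in the actual framework the IPC and APPI values come from the SimpleScalar/Wattch runs described in Section 2.1.

#include <stdio.h>

/* Hypothetical record of the two metrics taken from a SimpleScalar/Wattch run. */
struct sim_result {
    double ipc;   /* instructions per cycle                */
    double appi;  /* average power per instruction (watts) */
};

/* Efficiency gain of a new configuration over the previous one, following
 * equation 1: the percentage IPC increase plus 'scale' times the percentage
 * APPI decrease. With scale = 50 the APPI term dominates.                   */
double eff_gain(struct sim_result prev, struct sim_result next, double scale)
{
    double ipc_gain  = 100.0 * (next.ipc  - prev.ipc)  / prev.ipc;   /* %IPC_increase  */
    double appi_gain = 100.0 * (prev.appi - next.appi) / prev.appi;  /* %APPI_decrease */
    return ipc_gain + scale * appi_gain;
}

int main(void)
{
    struct sim_result prev = { 1.00, 5.0 };  /* previous configuration */
    struct sim_result next = { 1.05, 4.5 };  /* changed configuration  */
    printf("Eff_Gain = %.2f\n", eff_gain(prev, next, 50.0));
    return 0;
}

In this example a 5% IPC increase combined with a 10% APPI decrease yields an efficiency gain of 505, showing how strongly the power term is weighted.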

3. ARCHITECTURAL OPTIMIZATION THROUGH SSOPT

An automated superscalar optimization tool, SSOPT [8], has previously been developed for architectural optimizations. The tool is based on the general working of superscalar architecture and considers both power consumption and performance when evaluating architectural configurations.

SSOPT conducts the optimization by progressing through a series of stages. Each stage corresponds to a change in the configuration of a certain architectural parameter or group of parameters; for instance, the first stage covers the optimization of the branch table. In each stage, a given set of simulation results is analyzed. If the results show improvement, the configuration change for that particular stage is repeated; in case of deterioration, the configuration change is discarded and SSOPT moves to the next stage. The details of each stage are given in Table 1. To increase the effectiveness of the optimization methodology, the configuration changes in each stage are bidirectional.

Table 1: SSOPT Stage Parameters

Stage | Configuration Change   | Monitored Result         | dir=0 | dir=1
1     | Branch Table           | Branch_Misses            | Incr  | Dec
2     | BTB                    | IPC/APPI                 | Incr  | Dec
3     | Return Address Stack   | IPC/APPI                 | Incr  | Dec
4     | IFQ, Exec Win, I ALU   | IFQ_full, Eff_gain, IPB  | Incr  | Dec
5     | I ALU                  | Eff_gain                 | Incr  | Dec
6     | I Mul/Div              | Eff_gain                 | Incr  | Dec
7     | FP ALU                 | Eff_gain                 | Incr  | Dec
8     | FP Mul/Div             | Eff_gain                 | Incr  | Dec
9     | RUU                    | Eff_gain                 | Dec   | Inc
10    | LSQ                    | Eff_gain                 | Dec   | Inc
11    | I-Compress             | Eff_gain                 | En    | En
12    | I-Cache                | Eff_gain                 | Dec   | Inc
13    | D-Cache                | Eff_gain                 | Dec   | Inc
14    | Instruction TLB        | Eff_gain                 | Dec   | Inc
15    | Data TLB               | Eff_gain                 | Dec   | Inc
16    | Bus Width              | Eff_gain                 | Inc   | Dec
17    | Memory To System Ports | Eff_gain                 | Inc   | Dec
18    | Exit Stage             | Eff_gain                 | Nil   | Nil

Another important feature of SSOPT is its recursive nature. After completion of all 18 stages (an optimization sub-cycle), SSOPT compares the simulation results of the final architectural configuration with those of the initial configuration and calculates the efficiency gain parameter Eff_Gain. If Eff_Gain turns out to be positive, further improvement may be possible; in that case, SSOPT takes the final configuration of the completed optimization sub-cycle as the new initial configuration and goes through all 18 stages again. This gives the algorithm its recursive nature and reduces the dependence of the entire optimization process on the initial configuration.
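To make the staged, recursive control flow concrete, a self-contained C sketch is given below. The integer knob array, the ±1 steps and the synthetic eff_gain() stub are assumptions chosen only to keep the example runnable; the real tool applies the bidirectional changes of Table 1 to SimpleScalar parameters and evaluates equation 1 from simulation results.

#include <stdio.h>
#include <stdlib.h>

#define NUM_STAGES 18

/* Toy configuration: one integer knob per SSOPT stage. In the real tool each
 * stage maps to an architectural parameter (branch table, RUU, caches, ...). */
struct config { int knob[NUM_STAGES]; };

/* Stand-in for "simulate both configurations and evaluate equation 1": here
 * the optimum of knob i is simply (i + 4), so the sketch runs without a
 * simulator and still rewards moves towards the optimum.                    */
static double eff_gain(const struct config *prev, const struct config *next)
{
    double g = 0.0;
    for (int i = 0; i < NUM_STAGES; i++)
        g += abs(prev->knob[i] - (i + 4)) - abs(next->knob[i] - (i + 4));
    return g;
}

/* Stage's bidirectional change: increment (dir = 0) or decrement (dir = 1). */
static struct config apply_stage(struct config c, int stage, int dir)
{
    c.knob[stage] += (dir == 0) ? 1 : -1;
    return c;
}

int main(void)
{
    struct config cfg = { {0} };
    double cycle_gain;

    do {                                        /* optimization sub-cycles          */
        cycle_gain = 0.0;
        for (int stage = 0; stage < NUM_STAGES; stage++) {
            for (int dir = 0; dir < 2; dir++) {
                for (;;) {                      /* repeat the change while it helps */
                    struct config trial = apply_stage(cfg, stage, dir);
                    double g = eff_gain(&cfg, &trial);
                    if (g <= 0.0)
                        break;                  /* deterioration: discard change    */
                    cfg = trial;                /* improvement: keep and repeat     */
                    cycle_gain += g;
                }
            }
        }
    } while (cycle_gain > 0.0);                 /* another sub-cycle while net gain is positive */

    printf("knob[0] = %d, knob[17] = %d\n", cfg.knob[0], cfg.knob[17]);
    return 0;
}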

However, due to its heuristic nature, this optimization approach has a tendency to converge to locally optimal points. This effect is reduced to some extent by selecting a scalable initial configuration: to ensure scalability, the sizes of the caches, RUU and LSQ are kept large in the initial configuration.

4. ARCHITECTURE OPTIMIZATION USING SIMULATED ANNEALING

In this approach, a canonical search and optimization algorithm based on simulated annealing [12, 2] has been developed. The flowchart followed by the simulated annealing algorithm is given in Figure 1. For a given configuration change, the algorithm randomly selects an architectural parameter and its value from a defined set (Table 2).

The configuration change is effected and Eff_Gain value is evaluated according to equation 1. If the configuration change yields a positive gain, the configuration change is finalized and the process is repeated. In case of negative gain, the configuration is accepted based on the probability given by equation 2.

Figure 1: Implementation flowchart of the simulated annealing optimization (configuration changes drawn from Table 2, gain evaluated by equation 1, probabilistic acceptance by equation 2, 250 simulations in total).

log p = (Eff_Gain / 10) × exp(sim_num / s)    (2)

In equation 2, 'sim_num' is the number of simulations conducted so far and p is the probability of acceptance of a configuration. As an additional clause, the probability of acceptance of a configuration having negative gain can never exceed 0.5. The probability function is plotted against the simulation index for various values of gain in Figure 2. As illustrated in the figure, the probability of selecting configurations that yield negative gain is higher at the beginning of the optimization process than at its concluding stages. This feature ensures that a large portion of the design space is explored while the algorithm, more or less, converges to an optimized configuration by the end of the optimization process. It is important to note that, since the value of scale in equation 1 has been set to 50, a gain of -10 corresponds to a 20% increase in the APPI value while the effect of IPC is negligible. During the entire optimization process, a total of 250 simulations are conducted and the configuration yielding the maximum gain is selected.
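A minimal C sketch of this probabilistic acceptance step is given below. The scaling constant s = 100 and the exact functional form are assumptions of the sketch; it is only meant to reproduce the qualitative behaviour described above, namely an acceptance probability that never exceeds 0.5 and that shrinks both with more negative gain and with the simulation index.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define TOTAL_SIMS 250

/* Decide whether to accept a configuration change with negative Eff_Gain.
 * The probability is capped at 0.5 and decays as the optimization proceeds. */
static int accept_negative(double eff_gain, int sim_num, double s)
{
    double log_p = (eff_gain / 10.0) * exp((double)sim_num / s);  /* in the spirit of equation 2 */
    double p = exp(log_p);
    if (p > 0.5)
        p = 0.5;                            /* acceptance probability never exceeds 0.5 */
    return ((double)rand() / RAND_MAX) < p;
}

int main(void)
{
    srand((unsigned)time(NULL));
    for (int sim = 0; sim < TOTAL_SIMS; sim += 50)
        printf("sim %3d: accept(gain = -10) -> %d\n",
               sim, accept_negative(-10.0, sim, 100.0));
    return 0;
}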

These simulations were run on Pentium 4 2.0 GHz machines with 512 MB of RAM running the Ubuntu (Linux) operating system. The time required to run 250 simulations varied between 6 and 12 hours, depending upon the benchmark. The set of parameters and their values that simulated annealing considers while effecting configuration changes is given in Table 2.

As is evident from the table, a very wide range of configuration parameters and their values has been selected. This is in contrast to the approaches presented in [2][3][4][5], and it contributes to a relatively higher level of optimization.

Figure 2: Probability of acceptance plotted against simulation index (0 to 250) for gain values of -2, -3, -5 and -10, with a reference line at probability 0.5.

Table 2: Parameter Values

Sr No | Parameter    | Configuration
1     | IFQ          | 1, 2, 4, 16, 32
2     | Branch Table | 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
3     | RAS          | 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
4     | BTB          | 16×4, 32×4, 64×4, 128×4, 256×4, 512×4, 1024×4, 2048×4, 4096×4, 8192×4
5     | Decode Width | 1, 2, 4, 16, 32
6     | Issue Width  | 1, 2, 4, 16, 32
7     | Commit Width | 1, 2, 4, 16, 32
8     | RUU          | 8, 16, 32, 64, 128, 256, 512, 1024
9     | LSQ          | 8, 16, 32, 64, 128, 256, 512, 1024
10    | I Cache      | 4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l, 8192:32:4:l
11    | D Cache      | 4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l
12    | Bus Width    | 4, 8, 16, 32, 64
13    | I TLB        | 1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
14    | D TLB        | 1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
15    | I ALU        | 1, 2, 4, 8
16    | I Mul/Div    | 1, 2, 4, 8
17    | Memory Ports | 1, 2, 4, 8
18    | FP ALU       | 1, 2, 4, 8
19    | FP Mul/Div   | 1, 2, 4, 8

The optimization process has been automated through the development of a software tool. The tool comprises a script file and an ANSI C file. The primary functions of the script file are issuing commands to SimpleScalar and transferring control to the C file and to the software module which gives area estimates. All other operations of the simulated annealing algorithm are managed by the C file. These include: (a) analyzing simulation results; (b) calculating the probability of acceptance in case of negative gain; (c) selecting a given configuration change randomly from the set defined in Table 2; (d) finalizing a configuration, either as a result of positive gain or probabilistically in case of negative gain; and finally (e) identifying the optimization termination point.
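A minimal sketch of step (c), the random selection of a configuration change, is shown below. The parameter names and candidate values are a small illustrative subset in the spirit of Table 2; they are not the tool's actual tables.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Candidate value sets in the spirit of Table 2 (illustrative subset only). */
struct param_set {
    const char *name;
    const int  *values;
    int         count;
};

static const int ifq_vals[]  = { 1, 2, 4, 16, 32 };
static const int ruu_vals[]  = { 8, 16, 32, 64, 128, 256, 512, 1024 };
static const int ialu_vals[] = { 1, 2, 4, 8 };

static const struct param_set table2[] = {
    { "IFQ",   ifq_vals,  5 },
    { "RUU",   ruu_vals,  8 },
    { "I ALU", ialu_vals, 4 },
};

int main(void)
{
    srand((unsigned)time(NULL));
    /* Random configuration change: pick a parameter, then one of its values. */
    int p = rand() % (int)(sizeof table2 / sizeof table2[0]);
    int v = table2[p].values[rand() % table2[p].count];
    printf("change %s to %d\n", table2[p].name, v);
    return 0;
}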

5. RESULTS AND COMPARISONS

5.1. Optimization Results

The results of the optimizations are given in figures 3 through 7. The results show the evolution of gain for both the SSOPT and simulated annealing (SA) algorithms against the simulation index. Since the value of scale in equation 1 has been set to 50, the APPI gain has a larger contribution to the overall gain value. For the results given in this section, gain has been evaluated using equation 3.

gain = APPI_gain × 1.96 + IPC_gain × 0.04    (3)

It is evident that in most cases simulated annealing performs much better than the methodology followed by SSOPT. The primary reason is that, due to the heuristic nature of its algorithm, SSOPT tends to converge to locally optimal points; the final configuration is dependent on the initial configuration, and hence only a small portion of the actual design space is explored. Simulated annealing does not have this limitation, and the possibility of probabilistically selecting configurations that give negative gain ensures that a large portion of the design space is explored. However, the random nature of the simulated annealing algorithm means that its final configuration often leaves room for further optimization; this is the main reason why SSOPT outperforms simulated annealing in the FFT results.

Figure 3: FFT results (gain versus simulation index for SA and SSOPT).

Figure 4: IFFT results (gain versus simulation index for SA and SSOPT).

Figure 5: CRC results (gain versus simulation index for SA and SSOPT).

Figure 6: PCM Decode results (gain versus simulation index for SA and SSOPT).

Figure 7: PCM Encode results (gain versus simulation index for SA and SSOPT).

6. HYBRID METHODOLOGY

Based on the simulation results of the last section, a hybrid approach has been developed which achieves better optimization than either of the approaches presented in Sections 3 and 4. The idea is to select the architectural configuration that gives the maximum gain in the simulated annealing results; this configuration is then used as the initial configuration for the SSOPT algorithm. The flowchart of the hybrid approach is given in Figure 8, and a toy sketch of the two-phase flow is given after it. The approach utilizes the strong points of both optimization algorithms: the first phase, comprising the simulated annealing algorithm, helps converge to a more or less globally optimal point, while the second phase, comprising SSOPT, fine-tunes the configuration produced by simulated annealing and yields further improvement. Results of this approach are given in figures 9 through 13 and illustrate the evolution of gain versus simulation index; the value of gain is calculated according to equation 3.

Figure 8: Hybrid methodology flowchart (simulated annealing is applied to the initial configuration, the architectural configuration with maximum gain is selected, and SSOPT is then applied to finalize the configuration).
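To make the two-phase structure concrete, the following self-contained C sketch composes a random global search with a greedy local refinement over a toy one-dimensional objective. The objective f(), the search range and the iteration count are illustrative assumptions; in the actual framework the first phase is the simulated annealing tool of Section 4 and the second phase is SSOPT of Section 3, both driving SimpleScalar/Wattch simulations.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy objective standing in for "efficiency gain of a configuration";
 * here the optimum is simply x = 37.                                  */
static double f(int x) { return -(double)abs(x - 37); }

/* Phase 1: random global search (stand-in for simulated annealing). */
static int phase_sa(int start, int sims)
{
    int best = start;
    for (int i = 0; i < sims; i++) {
        int trial = rand() % 128;
        if (f(trial) > f(best))
            best = trial;
    }
    return best;
}

/* Phase 2: greedy local refinement (stand-in for SSOPT fine tuning). */
static int phase_ssopt(int x)
{
    for (;;) {
        if      (f(x + 1) > f(x)) x++;
        else if (f(x - 1) > f(x)) x--;
        else return x;
    }
}

int main(void)
{
    srand((unsigned)time(NULL));
    int after_sa    = phase_sa(0, 250);
    int after_ssopt = phase_ssopt(after_sa);
    printf("after SA: %d, after SSOPT refinement: %d\n", after_sa, after_ssopt);
    return 0;
}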

Figure 9: FFT results (gain versus simulation index for the hybrid approach).

Figure 10: IFFT results (gain versus simulation index for the hybrid approach).

Figure 11: CRC results (gain versus simulation index for the hybrid approach).

Figure 12: PCM Encode results (gain versus simulation index for the hybrid approach).

Figure 13: PCM Decode results (gain versus simulation index for the hybrid approach).

It is evident from the results depicted in figures 9 through 13 that the proposed hybrid optimization algorithm converges to architectural configurations with a greater level of optimization than either the simulated annealing approach or SSOPT alone. It displays improvements for all the considered benchmark applications, with the exception of PCM decode, in which case the final configuration acquired through the simulated annealing algorithm already has a very high level of optimization. The performance measures of the optimized architectural configurations are summarized in Table 4.

The architectural configuration with the maximum transistor count is the optimized configuration corresponding to the PCM encode operation. It is estimated at 16,258,447 transistors, which is less than the transistor count of a Pentium 4 processor (approximately 42,000,000 transistors). This makes the practical implementation of the final configurations feasible. The final architectural configurations for the different DSP benchmarks are given in Table 5.

Table 4: Performance Measures

Bench | IPC    | APPI   | Perf (GIPS) | MIPS/W   | Transistors
FFT   | 1.4528 | 4.4478 | 1.4528      | 224.8255 | 10,802,573
IFFT  | 1.1156 | 4.8753 | 1.1156      | 205.115  | 3,312,533
CRC   | 1.4312 | 3.6791 | 1.4312      | 271.8122 | 4,194,699
PCM E | 0.9895 | 4.1501 | 0.9895      | 240.9653 | 16,258,447
PCM D | 1.0516 | 3.8413 | 1.0516      | 260.3357 | 1,494,591

Table 5: Architectural Configurations

Configuration Parameter  | FFT       | IFFT     | CRC       | PCM E     | PCM D
Instruction Fetch Queue  | 16        | 4        | 32        | 32        | 2
Branch Table Size        | 16384     | 32768    | 4096      | 65536     | 2048
Return Address Stack     | 8         | 16       | 8         | 64        | 4
Branch Target Buffer     | 1024      | 1024     | 256       | 256       | 128
Instruction Decode Width | 32        | 16       | 32        | 32        | 32
Instruction Issue Width  | 2         | 2        | 2         | 2         | 2
Instruction Commit Width | 4         | 32       | 4         | 32        | 16
Register Update Unit     | 16        | 16       | 16        | 8         | 8
Load Store Queue         | 16        | 8        | 8         | 4         | 8
D Cache                  | 512 Bytes | 2 KB     | 1 KB      | 8 KB      | 512 Bytes
I Cache                  | 2 KB      | 4 KB     | 512 Bytes | 512 Bytes | 512 Bytes
Memory Bus Width         | 64 Bytes  | 64 Bytes | 64 Bytes  | 64 Bytes  | 8 Bytes
Instruction TLB          | 32 KB     | 8 KB     | 16 KB     | 16 KB     | 8 KB
Data TLB                 | 32 KB     | 16 KB    | 16 KB     | 4 KB      | 64 KB
Integer ALUs             | 2         | 1        | 2         | 2         | 2
Integer Mul/Div          | 4         | 1        | 2         | 4         | 1
Memory to System Ports   | 4         | 2        | 4         | 2         | 1
Floating Point ALU       | 1         | 1        | 1         | 1         | 1
Floating Point Mul/Div   | 4         | 4        | 1         | 1         | 1

It should be noted that the results reported in this section cannot be directly compared with the optimization results of previous approaches. This is primarily because none of the optimizations covered in [2 - 5] were conducted on the MiBench benchmark suite [1]. Furthermore, the initial configuration over which the value of gain is computed differs from the starting points used in the former approaches. Similarly, SimpleScalar and Wattch have been used in this work for determining performance, whereas in other cases (e.g. [2]) in-house customized tools were used. As far as performance improvements are concerned, a 60% increase in performance along with a 50% increase in power consumption was reported in [2]. Our work targets high-efficiency designs, and in all cases an improvement of more than 200% in processor efficiency over the initial architectural configuration has been observed. A performance versus area tradeoff has also been reported in some prior work (e.g. [3]). In our approach, complexity estimates do not form part of the cost function used for validating configuration changes; the transistor counts of the final configurations have, however, been provided to validate the practical implementation aspects.

7. FUTURE WORK

Work is presently under way on minimizing the optimization time of this framework. For this, several measures have been taken. A new feature, called the "configuration identifier", has been added to the optimization tool. It records the architectural parameters of a given configuration together with its performance results. If the same configuration is encountered again during the optimization process, it is not re-simulated; instead, the previously recorded results are reused. The simulated annealing stage of this framework has a high tendency to re-simulate a given architectural configuration, since configurations are selected randomly, so the configuration identifier should save significant optimization time. Furthermore, the configuration parameters considered by the simulated annealing approach (Table 2) have been pruned based on the optimization results. This should decrease the number of simulation cycles required by simulated annealing to converge to an optimal architectural configuration; however, the probability function given in equation 2 has to be changed to ensure convergence at a simulation index of less than 250. Both of these measures have been implemented, but their results are presently pending.
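A minimal sketch of such a configuration identifier is given below. The fixed-size history array, the linear search and the field layout are illustrative assumptions rather than the feature's actual implementation.

#include <stdio.h>
#include <string.h>

#define MAX_RECORDS 1024
#define NUM_PARAMS  19

/* One previously simulated configuration and its recorded results. */
struct record {
    int    params[NUM_PARAMS];   /* one entry per Table 2 parameter */
    double ipc, appi;            /* recorded simulation results     */
};

static struct record history[MAX_RECORDS];
static int n_records = 0;

/* Return the cached record if this exact configuration was simulated before. */
static const struct record *lookup(const int params[NUM_PARAMS])
{
    for (int i = 0; i < n_records; i++)
        if (memcmp(history[i].params, params, NUM_PARAMS * sizeof(int)) == 0)
            return &history[i];
    return NULL;
}

/* Record a newly simulated configuration and its results. */
static void remember(const int params[NUM_PARAMS], double ipc, double appi)
{
    if (n_records < MAX_RECORDS) {
        memcpy(history[n_records].params, params, NUM_PARAMS * sizeof(int));
        history[n_records].ipc  = ipc;
        history[n_records].appi = appi;
        n_records++;
    }
}

int main(void)
{
    int cfg[NUM_PARAMS] = { 16, 16384, 8, 1024, 32 };  /* remaining entries default to 0 */
    remember(cfg, 1.45, 4.45);
    printf("already simulated: %s\n", lookup(cfg) ? "yes" : "no");
    return 0;
}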

8. CONCLUSION

In this paper, two superscalar optimization approaches have been compared. The first approach systematically changes architectural parameters based on heuristics. The second is based on the simulated annealing algorithm and randomly selects architectural configurations from a large configuration set. Based on the optimization results of these approaches, a hybrid approach has been proposed which provides major improvements. DSP benchmarks from the MiBench benchmark suite have been used for conducting the optimizations. An efficiency gain of more than 200% over the initial processor configuration is observed for all benchmark applications. The resulting architectures have an efficiency of up to 271 MIPS/W while delivering a performance of up to 1.4312 GIPS. The complexity of all optimized configurations is less than the transistor count of a Pentium 4 processor.

9. REFERENCES

[1] M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, R.B. Brown, "MiBench: A free, commercially representative embedded benchmark suite", IEEE International Workshop on Workload Characterization, 2001, pp. 3-14.

[2] T.M. Conte, K.M. Menezes, S.W. Sathaye, "A technique to determine power-efficient, high-performance superscalar processors", Proceedings of the 28th Hawaii International Conference on System Sciences, Vol. 1, 1995, pp. 324-333.

[3] S.V. Haastregt and P.M.W. Knijnenburg, "Feasibility of Combined Area and Performance Optimization for Superscalar Processors Using Random Search", Proceedings of Design Automation and Test in Europe, 2007, pp. 1-6.

[4] M. Olivieri, "A genetic approach to the design space exploration of superscalar microprocessor architectures", Proceedings of the 2001 IEEE International Symposium on Circuits and Systems, Vol. 5, pp. 69-72.

[5] V. Zyuban and P. Kogge, "Optimization of high-performance superscalar architectures for energy efficiency", Proceedings of the 2000 International Symposium on Low Power Electronics and Design, 2000, pp. 84-89.

[6] A. Asaduzzaman, I. Mahgoub, P. Sanigepalli, H. Kalva, R. Shankar, and B. Furht, "Cache Optimization for Mobile Devices Running Multimedia Applications", Proceedings of the IEEE Sixth International Symposium on Multimedia Software Engineering, 2004, pp. 499-506.

[7] A. Ghosh, T. Givargis, "Cache Optimization for Embedded Processor Cores: An Analytical Approach", International Conference on Computer Aided Design, 2003, pp. 342-347.

[8] F. Sheikh, S. Masud, R. Ahmed, "Superscalar Architecture Design for High Performance DSP Operations", Microprocessors and Microsystems, Elsevier (accepted for publication), http://dx.doi.org/10.1016/j.micpro.2008.10.002.

[9] T. Austin, E. Larson, D. Ernst, "SimpleScalar: An Infrastructure for Computer System Modeling", IEEE Computer, Feb. 2002, pp. 59-68.

[10] D. Brooks, V. Tiwari, M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimization", Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 83-94.

[11] M. Steinhaus, R. Kolla, J. Larriba-Pey, T. Ungerer, and M. Valero, "Transistor count and chip-space estimation of SimpleScalar-based microprocessor models", Proceedings of the Workshop on Complexity-Effective Design, 2001.

[12] S. Kirkpatrick, C.D. Gelatt, Jr., M.P. Vecchi, "Optimization by Simulated Annealing", Science, Vol. 220, No. 4598, May 1983, pp. 671-680.
