Integrated On-chip Storage Evaluation in ASIP Synthesis

Manoj Kumar Jain, M. Balakrishnan and Anshul Kumar
Department of Computer Science and Engineering
Indian Institute of Technology Delhi, India
{manoj,mbala,anshul}@cse.iitd.ernet.in
ABSTRACT
An Application Specific Instruction Set Processor (ASIP) exploits special characteristics of the given application(s) to meet the desired performance, cost and power requirements. Performance estimation, which drives the design space exploration, is usually done by simulation. With increasing dimensions of the design space, simulator based approaches become too time consuming. In the ASIP domain, this problem can be addressed by approaches which perform only scheduling for performance estimation and avoid code generation. However, existing scheduler based approaches do not help in exploring the on-chip storage organization.
We present a scheduler based technique for exploring the register file size, the number of register windows and the cache configuration in an integrated manner. Performance for different register file sizes is estimated by predicting the number of memory spills and their delay. The technique employed does not require explicit register assignment. The number of context switches leading to spills is estimated to evaluate the time penalty due to a limited number of register windows, and a cache simulator is used to estimate cache performance.
1. INTRODUCTION AND OBJECTIVES
An Application Specific Instruction Set Processor (ASIP) is a
processor designed for one particular application or for a set of
specific applications. An ASIP exploits special characteristics of
the given application(s) to meet the desired performance, cost and
power requirements. ASIPs are a balance between two extremes: Application Specific Integrated Circuits (ASICs) and general programmable processors. ASIPs offer the required flexibility (which ASICs do not provide) at lower cost and power than general programmable processors. Thus, ASIPs can be used efficiently in many embedded systems, such as servo-motor control, automotive control, game devices, network routers, avionics and cellular phones.
A typical ASIP design flow includes key steps such as application analysis, design space exploration, instruction set generation, code generation for software, and hardware synthesis [1]. Design space exploration is driven by performance estimates, which are generated using a simulator based [2, 3] or scheduler based [4, 5] framework. A simulator based technique needs a retargetable compiler to generate code for the different processor configurations to be explored. Further, simulating the generated code is slow. There is also a well known trade-off between retargetability and code quality, in terms of performance and code size, compared to hand optimized code. Therefore, in our opinion, the simulation based approach is not suitable for early design space exploration.
On the other hand, in the scheduler based approach, the application code is scheduled on an abstract model of the processor being explored to get an estimate of the execution time. It is therefore much faster and quite suitable for early design space exploration. However, on-chip storage, which includes register files and cache, is not explored by the scheduler based approaches reported so far. Our previous study [6], using a retargetable code generator and a standard simulator, indicated that the choice of an appropriate number of registers has a significant impact on performance and energy in certain cases. In ASIP synthesis, suitable hardware is chosen for the given applications. As on-chip space is limited and storage consumes a significant part of the total chip area, it is important to select appropriate on-chip storage. For example, the trade-off between register file size and cache size can be evaluated. Similarly, for a given register file size, the trade-off between the number of register windows and the window size can be evaluated.
The main focus of our work is to include on-chip storage exploration as part of the design space exploration. To do this, we need to take into account the influence of the storage architecture on the performance, i.e. the execution time. Our approach is to first find the execution time ignoring the influence of storage constraints, and then add the overheads due to register spills caused by a limited register file, window spills and restores caused by limited register windows, and cache misses.
In a scheduler based approach, register allocation is the key step in determining the influence of limited register file size on performance. Most of the approaches suggested for register allocation either perform it after scheduling, like a typical compiler, or try to solve register allocation and scheduling in an integrated manner. Goodman et al. [7] suggested a DAG driven technique for register allocation, but their approach needs a pre-pass scheduling. Berson et al. [8] propose to solve register allocation and instruction scheduling in an integrated manner. However, register allocation before scheduling is desirable when minimization of register file size is more important than the length of the code sequence. Our technique does register allocation before scheduling, using the concept of register reuse chains [9] with significant extensions.
A number of techniques for memory exploration have been reported in the literature [10, 11, 12]. To the best of our knowledge, no approach considers the register file as a part of storage space exploration.
Our integrated on-chip storage evaluation methodology is presented in the next section. Techniques for exploring the register file size, the number of windows and the cache size are described in Section 3. Exploration and trade-off results with real life examples are presented in Section 4. Applications of the proposed approach are discussed in Section 5, with conclusions in the last section.
2. INTEGRATED ON-CHIP STORAGE EVALUATION METHODOLOGY
Storage exploration is a part of the design space exploration phase of the overall methodology. The proposed technique for storage space exploration is shown in figure 1. The cycle count for executing the application on the chosen processor and memory configuration is estimated. A parameterized model of the processor as well as of the memory is used. The parameters of the data cache include size, line size, associativity, replacement policy and access time. The processor configuration specification includes the register file and window organization, along with pipeline information and functional unit (FU) operation capability and latency.
Register allocation is done on unscheduled code using the concept of reuse chains [9] with significant extensions [13]. The proposed register allocation technique is briefly described in the next section. We have defined the cost of merging reuse chains considering spills, and we have developed a systematic way of merging these chains. A priority based resource constrained list scheduler is used for performance estimation. Further, we have proposed a novel technique for global performance estimation based on usage analysis of variables. Global performance estimation is done without code generation.
Further, we estimate the overheads due to limited register windows and the data cache. We have integrated these techniques to explore register file size, window and cache configurations.
The overall execution time estimate (ET) for an application on the specified memory and processor configuration can be expressed as follows:

ET = et_R + oh_W + oh_C    (1)

where
et_R : execution time when the register file contains R registers,
oh_W : additional schedule overhead due to limited register windows, and
oh_C : additional schedule overhead due to cache misses.
et_R can be further expressed by the following equation:

et_R = bet + oh_dep + spill_R * t_R    (2)

where
bet : base execution time considering constraints of resources other than storage,
oh_dep : overhead due to additional dependencies inserted during register allocation,
spill_R : the number of register spills, and
t_R : the delay associated with each register spill.
Computation of et_R is described in the next section. oh_W can be further expressed by the following equation:

oh_W = spill_W * t_W    (3)

where
spill_W : the number of window spills and restores, and
t_W : the delay associated with each register window spill.
oh_C can be further expressed by the following equation:

oh_C = miss_C * t_C    (4)

where
miss_C : the number of cache misses, and
t_C : the cache miss penalty.

Computation of oh_W and oh_C is explained in the next section. t_W is computed from the register window size and the latency of the 'store' instruction. t_C is computed using the block size and the delays associated with each data transfer. The storage configuration selector, knowing all the execution time estimates, selects a suitable processor and memory configuration to meet the desired performance.

3. EXPLORATION OF REGISTER FILE SIZE, NUMBER OF REGISTER WINDOWS AND CACHE SIZE

In this section we propose the techniques used to estimate execution time under the various design alternatives to be explored. As mentioned earlier, the register file size, the number of register windows and their sizes, and the cache size are considered while exploring on-chip storage.

3.1 Execution Time Estimation with Limited Registers
The input application (written in C) is profiled using gprof to find the execution count of each basic block as well as of each function. The estimated execution times are weighted by these execution counts. An intermediate representation is generated using SUIF [14], and control and dependency analysis is done on this representation. The control flow graph is generated at the function level, whereas the data flow graph is generated at the basic block level.
For each basic block B, local register allocation is performed using a modified register reuse chains approach [13], taking as input the data flow graph and the number of registers to be used for local register allocation (say k). The data flow graph may be modified because of additional dependencies as well as spills inserted during register allocation. This modified data flow graph is taken as input by a priority based resource constrained list scheduler, which produces a schedule estimate. This estimate is multiplied by the execution frequency of block B to compute the local estimate (LE_{B,k}) for this block.
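As an illustration of this step, the table of local estimates could be filled as sketched below. allocate_registers and list_schedule stand in for the reuse chain allocator [13] and the priority based list scheduler (not shown here); all names are ours, not part of the actual implementation.

# Hypothetical driver that fills the table LE_{B,k} for one basic block:
# allocate with k registers, schedule the modified data flow graph, and
# weight the cycle estimate by the block's profiled execution frequency.
def local_estimates_for_block(dfg, exec_freq, max_k,
                              allocate_registers, list_schedule):
    le = {}
    for k in range(1, max_k + 1):
        modified_dfg = allocate_registers(dfg, k)  # may add deps and spills
        le[k] = list_schedule(modified_dfg) * exec_freq
    return le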
Local estimates are produced for all the basic blocks contained in a function, for the complete range of register file sizes to be explored. The schedule overheads needed to handle global needs with a limited number of registers are computed using lifetime analysis of variables. For each block, we need information on the variables used, defined, consumed, and live at the entry and exit points of the block. This global needs overhead is also generated for each basic block over the complete range of numbers of registers. Then, we decide on the optimal distribution of the available registers (say n) into registers that handle local register allocation (k) and registers that handle global needs (n − k), such that the overall schedule estimate for that block is minimized.
The overall estimate for a block B can be expressed as

OE_B = min_k ( LE_{B,k} + GE_{B,n−k} )    (5)

where OE_B is the total schedule estimate for basic block B, and GE_{B,n−k} is the overhead to handle global needs with n − k registers.
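A minimal sketch of this selection, assuming the tables of LE_{B,k} and GE_{B,n−k} values have been computed as described above (names are illustrative):

# Equation (5): choose the split of n registers between local allocation
# (k) and global needs (n - k) that minimizes the block's overall estimate.
# le_B and ge_B map register counts to precomputed LE_{B,k} and GE_{B,n-k}.
def overall_block_estimate(le_B, ge_B, n):
    return min(le_B[k] + ge_B[n - k]
               for k in le_B if (n - k) in ge_B)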
OE_B values for all blocks are summed up to produce estimates at the function level. Estimates for all functions are added together to produce the overall estimate for the application, i.e. et_R. So et_R can be expressed as

et_R = Σ_(each function) Σ_(each basic block B) OE_B    (6)
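As an illustration, the whole cost model of equations (1), (3), (4) and (6) reduces to a few lines; the sketch below assumes the per-block estimates and the spill and miss counts have already been obtained as described, and all names are ours.

# Compose the overall execution time estimate for one storage configuration.
# block_estimates: {function_name: [OE_B for each basic block B]}
def estimate_execution_time(block_estimates, spill_W, t_W, miss_C, t_C):
    et_R = sum(sum(oes) for oes in block_estimates.values())  # equation (6)
    oh_W = spill_W * t_W                                      # equation (3)
    oh_C = miss_C * t_C                                       # equation (4)
    return et_R + oh_W + oh_C                                 # equation (1)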
3.2 Exploration of Number of Register Windows
Processors with a register window scheme typically assume a set of registers organized as a circular buffer. When a procedure is called (i.e., a context switch occurs), the tail of the buffer is advanced in order to allocate a new window of registers that the procedure can use for its locals. On overflow, registers are spilled from the head of the buffer to make room for the new window at the tail. These registers are not reloaded until a chain of returns makes it necessary. We consider the context switches that are due to function calls and returns.
[Figure 1: Storage exploration technique. The application and performance constraints, a parameterized memory model and a parameterized processor model (including register file and window configurations) feed the storage explorer, which produces et_R, oh_W and oh_C using register allocation with reuse chains, global needs handling, a list scheduler, a stack based analysis of window spills, and sim-cache. The storage configuration selector uses the resulting execution time estimate ET to select the register file size, register window configuration and cache configuration.]

Bhatt et al. [15] proposed a technique to estimate the number of register window spills and restores. It is based on generating a trace of the calls and returns from an execution of the application. We used their basic idea of instrumenting the application and generating such a trace. Their approach then builds a finite automaton for each number of register windows, and the trace is fed to these automata one at a time to compute the number of window spills and restores in each case. Our approach, on the other hand, uses a very simple and efficient stack based analysis to compute window spills and restores: the trace is traversed only once, and the spills and restores for all numbers of register windows are computed simultaneously.
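A minimal sketch of this analysis is given below, under our assumption that the trace is a sequence of 'C' (call) and 'R' (return) events; it is an illustration, not the exact implementation. For w windows, the number of windows held in registers is the call depth minus the number of contexts already spilled; a call that exceeds w spills the oldest window, and a return into a spilled context triggers a restore.

# One pass over the call/return trace computes window spills and restores
# for every candidate number of windows w simultaneously.
def window_spills_restores(trace, max_windows):
    ws = range(2, max_windows + 1)
    depth = 0                                  # active contexts on the stack
    saved = {w: 0 for w in ws}                 # contexts spilled to memory
    spills = {w: 0 for w in ws}
    restores = {w: 0 for w in ws}
    for event in trace:
        if event == 'C':
            depth += 1
            for w in ws:
                if depth - saved[w] > w:       # all w windows in use:
                    spills[w] += 1             # spill the oldest one
                    saved[w] += 1
        else:                                  # 'R'
            depth -= 1
            for w in ws:
                if depth > 0 and saved[w] == depth:  # caller's window was
                    restores[w] += 1                 # spilled: restore it
                    saved[w] -= 1
    return {w: (spills[w], restores[w]) for w in ws}

For the chosen number of windows, the resulting count gives spill_W, and hence oh_W = spill_W * t_W as in equation (3).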
Depending on the total number of registers available and the number of register windows, the window size is decided. The register window configuration determines the number of registers which can be used for allocation within a context. We use the techniques proposed in the previous subsection to estimate execution time when each function gets a specified number of registers for allocation. Window spills (spill_W) due to a limited number of register windows are computed as shown above. As mentioned earlier (equation 3), the execution penalty due to window spills and restores (oh_W) is computed using the window size and the latency associated with the load and store operations of the processor. This additional penalty is added to the estimates produced by the technique to produce the overall execution time estimate.
3.3 Exploration of Cache Size
Register file and register window constraints have been considered so far. Now we consider the cache as a part of the storage hierarchy for design space exploration.
We observe that the number of memory locations required to store spilled scalar variables and register windows is usually small compared to the total number of cache locations. Therefore, we assume that the spilling overhead is insensitive to the cache organization. This observation allows us to estimate the two independently.
To know the memory access profile (total number of accesses, hits, misses, etc.) we need to generate the addresses of the memory accesses, and a simulator is then required to simulate those accesses. Since memory access patterns are typically application dependent, we can use any standard tool set to find the memory access profile. Once we know the number of misses for a particular cache, the block size and the delay information let us compute the additional schedule overhead due to cache misses, using the following equation for the miss penalty:

d = α_1 + α_2 * (b − 1)

where α_1 is the time required (in processor clock cycles) for transferring the first data item from the next level memory to the cache, α_2 is the time required for each of the remaining transfers, and b is the block size. For our experiments we have chosen α_1 and α_2 as 8 and 3 respectively; 3 clock cycles are required for each load/store on the LEON processor [16]. This additional delay is added to the base execution time estimated by our methodology to get the overall execution time estimate. Cache misses depend on the block size: a small block size may not give the desired advantage of locality of reference, while a larger block size leads to a heavy miss penalty. While performing exploration, for each cache size we took different possible combinations of block size, associativity and replacement policy, and identified which combination gives the minimum misses.
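The resulting overhead computation is small enough to state directly; a sketch with the parameter values quoted above (function and parameter names are ours):

# Cache overhead of equation (4), with miss penalty d = a1 + a2 * (b - 1).
# alpha1 = 8 and alpha2 = 3 are the values used in our experiments.
def cache_overhead(miss_C, block_size, alpha1=8, alpha2=3):
    t_C = alpha1 + alpha2 * (block_size - 1)
    return miss_C * t_C

For example, a block size of 8 gives t_C = 8 + 3 * 7 = 29 cycles per miss.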
We have used the SimpleScalar tool set [17] to compute the number of cache misses for different cache sizes. It assumes a MIPS-like processor with some minor changes. Since we do not require timing information from the SimpleScalar tools, we use the sim-cache simulator, which is fast compared to sim-outorder; sim-cache does not use processor timing information and gives only cache miss statistics. It takes only the address trace, along with the cache configurations, as input. Since we explore the data cache here, and a large amount of data is stored in memory in the form of arrays, this array storage as well as its sequence of access does not depend significantly on the processor. Under this assumption, we have used the sim-cache simulator to find the cache miss statistics.
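A hypothetical driver for this sweep is sketched below. It assumes a SimpleScalar installation and a benchmark binary compiled for the simulated ISA; the cache specifier follows the SimpleScalar convention <name>:<nsets>:<bsize>:<assoc>:<repl>, but the exact flags and statistic names should be verified against the local tool version.

import subprocess

# Invoke sim-cache for one data cache configuration and parse the
# dl1 miss count from the statistics it prints on stderr.
def dl1_misses(nsets, bsize, assoc, repl, program):
    spec = "dl1:%d:%d:%d:%s" % (nsets, bsize, assoc, repl)  # e.g. dl1:32:32:1:l
    run = subprocess.run(["sim-cache", "-cache:dl1", spec, program],
                         capture_output=True, text=True)
    for line in run.stderr.splitlines():
        if line.startswith("dl1.misses"):
            return int(line.split()[1])
    raise RuntimeError("dl1.misses not found in sim-cache output")

For a fixed cache size, one would sweep block size, associativity and replacement policy with nsets * bsize * assoc held constant, and keep the configuration with the fewest misses.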
4. RESULTS
4.1 Trade-off between Number of Register Windows and Window Size
Figure 2 shows the impact of register file size on execution time for quick_sort. When these results were generated, only a single register file of the specified size was assumed, and all the registers were allowed to be used for register allocation in any function. Any schedule overhead due to a limited number of register windows was ignored while generating these estimates. The results indicate that execution time decreases with an increase in register file size, but saturates after a certain register file size is reached. For this application, 8 registers are sufficient; beyond that, the cycle count is not reduced further.
The variation in the number of window spills and restores with the number of register windows is shown in figure 3. The curve saturates at 9 windows, meaning that there would be no window spills and restores with 9 or more register windows.
We are interested in the trade-off between the number of windows and the window size. For a given total number of registers, the window size differs with the number of windows. While generating the results (figure 4), we assumed that the register file is distributed into windows of equal size. We also assume that within a context, the number of registers available for register allocation is equal to the window size. Depending on the performance requirement, a suitable register file size can be chosen, and for the chosen register file size, the number of windows and hence the window size (the number of registers in a window) can be decided.
At one extreme, when the number of windows is small, the time overhead due to context switches dominates the cycle count. At the other extreme, when the number of windows is large for the same total number of registers, the individual window size becomes small and the overhead due to loads and stores (within a context) dominates the cycle count.
[Figure 2: Impact of register file size on execution time (quick_sort): execution time in cycles vs. register file size.]

[Figure 3: Impact of number of register windows (quick_sort): number of window spills and restores vs. number of register windows.]

[Figure 4: Trade-off between number of windows and their sizes: execution time in cycles vs. number of register windows (1 to 6), for register files of 12, 15, 16, 18 and 20 registers.]
4.2 Trade-off between Register File Size and On-chip Data Cache Size

Execution time estimates for various benchmark applications were generated for different register file sizes and different data cache sizes. We have not considered the impact of cache size variation on memory latency, but it can be considered by choosing appropriate values of α_1 and α_2. Consider the results produced for the matrix-mult program for different register file sizes and memory configurations (figure 5). Some interesting trade-offs can be observed. Based on the generated execution time estimates and the input performance constraint, suitable configurations can be suggested. For example, if the application should not take more than 1.0E+05 cycles, then one of the following configurations can be suggested:

1. 12 registers with 4K data cache
2. 15 registers with 2K data cache
3. 20 registers with 1K data cache

[Figure 5: Results for matrix-mult: execution time in cycles vs. register file size (R3–R31), for data caches D1_1K, D1_2K, D1_4K and D1_8K.]
4.3 Execution Time Validation
Performance estimation with varying on-chip storage configurations was carried out for selected benchmark applications. Three processors, namely ARM (ARM7TDMI, a RISC) [18], Trimedia (TM-1000, a VLIW) [19] and LEON (a processor with register windows) [16], were chosen for experimentation and validation. TM-1000's five-issue-slot instruction format enables up to five operations to be scheduled in parallel in a single VLIW instruction. To check the correctness of our techniques, we validated our results against the numbers produced by standard tool sets.
Validation shows that our estimates are within 10% of the actual performance numbers produced by the standard tool sets; the actual figures were 9.6%, 3.3% and 9.7% for ARM7TDMI, TM-1000 and LEON respectively. Further, this technique was nearly 77 times faster than the simulator based technique. Validation results for LEON are shown in figure 6.
The generated results were also validated against VHDL level simulation for a collision detection application, developed at IIT Delhi for detecting the collision of an object with a camera [20]. The execution time estimate produced by our estimator (443278 cycles) is within 10.33% of the estimate produced by tsim (494375 cycles) and within 5.26% of the estimate produced by VHDL simulation.
[Figure 6: Validation on LEON: normalized estimates (million cycles) from tsim-leon and ASSIST for quick_sort (*4), insertion (/3), heap_sort (*4), bubble_sort (/7), matrix-mult (*1), lattice (*15) and biquad (*35).]
4.4 ADPCM Encoder and Decoder Storage Exploration

So far, we have presented results for small illustrative applications. Here, we present results for real life benchmark applications, namely the ADPCM encoder and decoder, which are part of the MediaBench suite [21]. The data files used are clinton.pcm and clinton.adpcm. The processor chosen is LEON [16].
To explore the storage organization, we first generated execution time estimates using the techniques proposed in this paper. Since a high level of nesting is not present in these applications, window spills are minimal; thus, window configuration exploration results are not included. Exploration results for the adpcm rawaudio encoder and decoder are shown in figures 7 and 8 respectively. The results show that increasing the data cache size from 1K to 2K significantly improves the performance of the encoder but not of the decoder. If the encoder is to complete its execution within 16 million cycles, then we have to use at least a 2K data cache, irrespective of the register file size. We also observe that there is no significant performance improvement beyond a register file size of 11 for either application.
[Figure 7: Results for adpcm rawaudio encoder. Figure 8: Results for adpcm rawaudio decoder. Execution time estimates in cycles vs. register file size (R3–R30), for data caches D1_1K, D1_2K, D1_4K, D1_8K and D1_16K.]
5. APPLICATIONS
Optimizing the register file size is useful in many ways. Some of the benefits are listed here.

• If a smaller register file can be used, then a register address needs fewer bits. Thus, we can think of either reducing the instruction width or providing more room for the opcode, so that new application specific instructions can be easily accommodated.

• Reduction in switching activity, and thus savings in terms of power consumption.

• In case the instruction width cannot be reduced by sparing some registers, these registers or their addresses can be used efficiently for specific purposes. For example, hard-wiring some registers with fixed values will help remove some move instructions. There are other possibilities as well.

Now, we present a case study showing the use of 'spare' registers in smoothing the co-processor interface to the LEON processor.
5.1 Utilizing Spare Register Addresses to Interface Co-processors
In an interesting use, spare register addresses may be used to address co-processor registers. Typically, co-processors as well as accelerators need to be interfaced to the processor for transmitting operands from the processor to the co-processor and results from the co-processor back to the processor. This can be achieved very efficiently if the co-processor input/output registers are mapped into the register address space of the processor.
We have come across two different scenarios in interfacing RISC architectures to co-processors. MIPS allows direct transfer of values between the main register file and the co-processor register file. Operand values therefore need to be moved from the main register file to the co-processor register file and, similarly, result values produced in co-processor registers need to be moved back to the main register file. Applying the proposed idea, i.e. using the addresses of the 'spare' registers predicted by our technique to address co-processor registers, can save such moves.
LEON provides a floating point unit (FPU) interface which can connect any co-processor. It does not allow direct transfer of values between the integer unit register file and the floating point register file, so the communication between processor and co-processor is through memory. This means the operand values are first stored into memory and then loaded into the co-processor register file. After the co-processor has produced its results, these results are first stored into memory and then loaded into the main register file. Applying our idea of using the addresses of 'spare' registers predicted by our technique can achieve significant benefits in this case, because all such loads and stores can be avoided.
An approach for complete SoC synthesis is being developed at IIT Delhi [22]. In this approach, the application C program is run to obtain a profile and identify the computation intensive part of the application. Synthesizable VHDL is automatically generated for this computation intensive part, along with the necessary interfaces.
We extend this synthesis approach to a LEON processor based ASIP. The level 1 data/instruction caches can be configured to meet the needs of the application, which can be predicted using our techniques. The number of register windows can also be easily reconfigured by specifying a generic constraint in the top level VHDL description file. For software synthesis we need a retargetable compiler which can generate code for the chosen processor and memory configuration.
As mentioned earlier, we have chosen the collision detection application for this study. We generated a VHDL description of a co-processor for 'update sums' and simulated the complete application using the LEON VHDL simulator. The execution time reduced from 467912 cycles to 178573 cycles with the use of the co-processor.
We have used the serial FPU interface for the co-processor; with this interface, the main processor and the co-processor cannot run in parallel. As discussed earlier, in the case of LEON the parameters can be passed only through memory, so all the parameters have to be stored into memory when computed by the integer unit (IU) and then loaded from memory into the co-processor. What we propose is to use the addresses of 'extra' registers to address co-processor registers. The VHDL code of the processor was modified such that the FPU register file is synthesized as part of the IU register file. We modified only the decoding of the register addresses of a couple of 'extra' registers, so that after decoding they point to FPU registers (figure 9). Thus, when a co-processor parameter is generated by the IU, it can be written directly to one of the FPU registers. This way some loads from and stores to memory are saved. In this case, each invocation of the co-processor reduces from 36 cycles to 4. Simulation results for collision detection using LEON and a co-processor show that 11446 cycles are saved, which is 6.4% of the total application time, just by using the addresses of two 'extra' IU registers to point to FPU registers. The reduction is modest because the co-processor is not invoked very frequently.
[Figure 9: Use of our technique for the co-processor interface. Before: decoder 1 maps IU register addresses to the IU register file, while decoder 2 maps co-processor register addresses to the co-processor register file. After: the modified decoder maps the 'spare' IU register addresses to the co-processor register file, so co-processor registers are addressed directly by IU register addresses.]
6. CONCLUSIONS AND FUTURE WORK
We have developed a complete strategy to explore the on-chip storage architecture of Application Specific Instruction Set Processors. This involves deciding a suitable register file size, number of register windows and on-chip memory configuration. Our technique requires neither a code generator nor a simulator for a specific target architecture. Further, the processor description required for retargeting is very simple. Apart from the on-chip storage related parameters, we can also vary the number and types of functional units. The assumption is that we have a configurable and synthesizable HDL description of a processor core which can be customized for the application.
The proposed technique has been validated for several benchmarks over a range of processors by comparing our estimates with the results obtained from standard simulation tools. The processors include ARM7TDMI, LEON and Trimedia (TM-1000). The usefulness of our technique is also demonstrated by a case study in which the register file address space “saved” by our approach is used for an efficient co-processor interface.
Presently our technique has some limitations. The impact of register file size on clock period is ignored. Further, the compiler modifications required to support a variable register file are not considered. We plan to address these limitations in future work.
7. REFERENCES
[1] M.K. Jain, M. Balakrishnan, and A. Kumar. ASIP Design
Methodologies: Survey and Issues. In Proc. of VLSI 2001,
pages 76–81.
[2] A. D. Gloria and P. Faraboschi. An Evaluation System for
Application Specific Architectures. In Proc. of Micro 23,
pages 80–89, 1990.
[3] J. Kin et al. Power Efficient Media processors: Design Space
Exploration. In Proc. of the DAC 1999, pages 321–326.
[4] T.V.K. Gupta et al. Processor Evaluation in an Embedded
Systems Design Environment. In Proc. VLSI 2000, pages
98–103.
[5] N. Ghazal et al. Retargetable Estimation Scheme for DSP
Architecture Selection. In Proc. of ASP-DAC 2000, pages
485–489.
[6] M.K. Jain et al. Evaluating Register File Size in ASIP
Design. In Proc. of CODES 2001, pages 109–114.
[7] J.R. Goodman and W.C. Hsu. Code Scheduling and Register
Allocation in Large Basic Blocks. In Proc. of ICS 1988,
pages 444–452.
[8] D.A. Berson, R. Gupta, and M.L. Soffa. URSA: A Unified
Resource Allocator for Registers and Functional Units in
VLIW Architectures. In Proc. of IFIP WG 10.3 (Concurrent
Systems) 1993, pages 243–254.
[9] Y. Zhang and H.J. Lee. Register Allocation Over a
Dependence Graph. In Proc. of CASES 1999.
[10] B.L. Jacob, P.M. Chen, S.R. Silverman, and T.N. Mudge. An
Analytical Model for Designing Memory Hierarchies. IEEE
Transactions on Computers, 45(10):1180–1194, October
1996.
[11] D. Kirvoski, C. Lee, M. Potkonjak, and W.H.
Mangione-Smith. Application-Driven Synthesis of
Memory-Intensive Systems-on-Chip. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems,
18(9):1316–1326, September 1999.
[12] S.G. Abraham and S.A. Mahlke. Automatic and Efficient
Evaluation of Memory Hierarchies for Embedded Systems.
Technical Report HPL-1999-132, HP Laboratories, Palo Alto, October 1999.
[13] M.K. Jain, M. Balakrishnan, and Anshul Kumar. An Efficient
Technique for Exploring Register File Size in ASIP
Synthesis. In Proc. of CASES 2002, pages 252–261.
[14] SUIF Homepage. “http://suif.stanford.edu/”.
[15] V. Bhatt, M. Balakrishnan, and A. Kumar. Exploring the
Number of Register Windows in ASIP Synthesis. In Proc. of
VLSI/ ASPDAC 2002, pages 223–229.
[16] LEON Homepage. “http://www.gaisler.com/leon.html”.
[17] Simplescalar Homepage. “http://www.simplescalar.com”.
[18] ARM Ltd. Homepage. “http://www.arm.com”.
[19] Trimedia Homepage. “http://www.trimedia.com”.
[20] S. K. Lodha and S. Gupta. A FPGA based Real Time Collision Detection and Avoidance. B. Tech. Thesis, Department of Computer Science and Engineering, IIT Delhi, 1997.
[21] MediaBench Homepage. “http://cares.icsl.ucla.edu/MediaBench”.
[22] A. Singh et al. SoC Synthesis with Automatic Hardware
Software Interface Generation. In Proc. VLSI 2003, January
2003.