Power Aware Embedded Operating System Design

by Travis C. Furrer

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Computer Science and Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2000

© 2000 Massachusetts Institute of Technology. All Rights Reserved.

Signature of Author: Department of Electrical Engineering and Computer Science, May 19, 2000

Certified by: Anantha Chandrakasan, Associate Professor of Electrical Engineering, Thesis Supervisor

Accepted by: Arthur C. Smith, Professor of Electrical Engineering, Chairman, Department Committee on Graduate Theses

Abstract

The μAMPS low-power distributed wireless sensor project seeks to design small embedded systems that will require an operating system (OS). I chose to port the Embedded Cygnus Operating System (eCos) to the StrongARM 1100 microprocessor (SA-1100) so that it can be used on the μAMPS system. The OS was debugged and tested for execution from both RAM and ROM on an SA-1100 evaluation board. A description of how the OS was ported and debugged is given.
A simple form of Dynamic Voltage Scaling (DVS) was implemented and energy-efficiency experiments were done. FIR filtering with a variable-length filter was used as a sample application to show that DVS provides significant energy savings. Detailed results from these experiments are presented. Additional ideas for ways to save energy in the OS, left as future work, are included.

Thesis Supervisor: Anantha Chandrakasan
Title: Associate Professor of Electrical Engineering

Acknowledgements

I would like to acknowledge and thank Anantha for inspiring me to work with his group, for arranging for my research assistantship funding, for giving me the freedom to organize my own project, and for his enthusiasm about each success along the way. I feel privileged to have worked with him and his group.

I would also like to thank the following people and organizations for their invaluable help, without which I could not have completed this thesis:

My parents, for their loving and wise direction that led me to pursue this degree, for always trying to give me the best, and for letting me call them to talk anytime about anything. Without their persistent encouragement I could not have made it through MIT. I hope they're able to stick around to see me through many more successes! I would also like to thank the rest of my family for serving as an encouragement and a good example to me while I was in college.

Rex, Amit, Manish, SeongHwan, Jim, Eugene, Wendy, Alice, PaulPeter, and everyone else in Anantha's group who have contributed to the optimistic, courteous, friendly, humorous, and intelligent atmosphere here. I am impressed with each of you and your research. Special thanks to Rex, who provided the DVS test board for use with the Brutus and helped run the experiments. Special thanks also to Manish for help designing the variable-length filter experiment.

Red Hat Software (formerly Cygnus Solutions), for giving me early access to their StrongARM 110 version of eCos.
Thanks also to various people of the ecos-discuss@sourceware.cygnus.com mailing list, such as Andrew Lunn, who provided source code to measure CPU load under eCos.

The brains behind the zephyr help instance, who are available 24 hours a day to answer any question. It was from them that I learned many important skills that allowed me to do a project of this kind.

Charles Leiserson, Harald Prokop, and Jamey Hicks for their involvement with my UROP research last year, which initially led me onto the topic of low power software (or "cool" software, as Charles says).

My friend Ryan, for preventing much anguish for me by advising me (from experience!) to get my thesis written early.

"The harder I work, the luckier I get." - Alvin Furrer

Table of Contents

1 Introduction
  1.1 Motivation: The μAMPS Project
  1.2 Background
    1.2.1 General Low Power Software Techniques
    1.2.2 Low Power OS Techniques
    1.2.3 The StrongARM 1100 Microprocessor
    1.2.4 Embedded Real-Time Operating Systems
    1.2.5 The Embedded Cygnus Operating System
  1.3 Overview
2 Porting the OS
  2.1 Initial Work on the Source Code
    2.1.1 Copying the SA-110 HAL as a Starting Point
    2.1.2 Porting of Self-Contained Functions and Macros
    2.1.3 Overview of the Boot Sequences
    2.1.4 Building the Virtual Memory Page Tables
    2.1.5 Enabling the MMU
  2.2 The Debugging Process
3 eCos Applications
  3.1 Early Demo Programs
  3.2 Developing a DVS Demo
4 Ideas for Energy Efficiency Improvements
  4.1 DVS-Related Techniques
    4.1.1 Choose the Right Boot Frequency
    4.1.2 Energy vs. Quality Scaling
    4.1.3 Thread-based Voltage Scheduling
  4.2 Non-DVS Techniques
    4.2.1 Know your Hardware
    4.2.2 Efficient/Restricted use of Memory
5 Conclusion
A DVS Test Board Specifications
  A.1 Control Signals from Brutus to DVS Test Board
  A.2 Specification for DVS Test Board Inputs
  A.3 AC Characteristics
B eCos Regression Test Results
C The Big Picture of StrongARM Power Consumption
D Instructions
  D.1 Building eCos for the Brutus
  D.2 Programming Flash Memories
E Raw Data From Experiments
  E.1 Data from Voltage/Frequency Scaling Experiment
  E.2 Data From Variable-Length Filter DVS Experiment
F Photos
Bibliography

List of Figures

Figure 1.1: Diagram of a μAMPS Node
Figure 1.2: SA-1100 Block Diagram
Figure 1.3: SA-1100 Power and Clock Supply Sources and States During Power-Down Modes
Figure 1.4: Scalability of eCos [5]
Figure 1.5: Modularity of eCos [5]
Figure 2.1: SA-1100 vs. SA-110
Figure 2.2: eCos Configuration Tool
Figure 2.3: eCos Configuration for Brutus Platform
Figure 2.4: SA-1100 Memory Map for the entire 4Gb 32-bit Address Space
Figure 2.5: Memory Layout for STUBS Startup
Figure 2.6: Memory Layouts for RAM and ROM Startup
Figure 2.7: Before Enabling the MMU
Figure 2.8: After Enabling the MMU
Figure 3.1: Measured Energy per Operation vs. Frequency and Supply Voltage
Figure 3.2: Energy per Operation with Voltage Scaling vs. without Voltage Scaling
Figure 3.3: Screen Shot from DVS Demo
Figure A.1: Transients in output voltage of DVS Board
Figure F.1: Photo of Brutus Board
Figure F.2: Photo of DVS Test Board
Figure F.3: Photo of Brutus with DVS Board connected
Figure F.4: Screen Shot of Graphical Demo on Brutus LCD
Figure F.5: StrongARM 1100 Chip Photo
List of Tables

Table 2.1: Boot Sequence for RAM Startup
Table 2.2: Boot Sequence for ROM Startup
Table 2.3: Boot Sequence for STUBS Startup
Table 2.4: Virtual Memory Mapping as specified in Page Tables
Table 3.1: Voltages used with DVS for each Clock Frequency
Table A.1: Voltage Control Signals
Table A.2: DVS Test Board Voltage Control Signals
Table B.1: eCos Regression Tests Results for Brutus Port
Table C.1: Breakdown of Power Consumption of SA-110 processor when running Dhrystone
Table E.1: Data from Voltage/Frequency Scaling Experiment
Table E.2: Data from Variable-Length Filter DVS Experiment

Chapter 1

Introduction

1.1 Motivation: The μAMPS Project

The micro-Adaptive Multi-domain Power-aware Sensors (μAMPS) project [19] is being done by students in the Integrated Circuits and Systems group of the MIT Microsystems Technologies Laboratory. The project vision is to perform efficient distributed remote sensing using small wireless sensor nodes.
Each node is battery-powered and contains a microsensor (for collecting remote data), an embedded microprocessor (for local pre-processing of sensor data), and a radio transceiver (for wireless transmission of sensor data to a central base station). The type of sensor and other details will depend on the application, but the μAMPS system is being designed as a general substrate to be used in various remote sensing applications. Although the long-term vision is for the nodes to be fully integrated into a custom system-on-chip (SOC), the initial prototypes will be built with commercially available components, including the Intel StrongARM 1100¹ low-power microprocessor [9], assembled on a circuit board (Figure 1.1).

My contribution to the μAMPS project has been to provide the operating system (OS) to control the software running on each node. The nodes have several special needs that constrain the type of OS that can be used. For example, because the signal from a sensor must be sampled at a precise rate even while other computing tasks are being performed, a real-time and multi-threaded OS is needed. More importantly, though, the OS needs to operate under strict power/energy constraints because the nodes are battery powered. The design and implementation of an OS that can meet the needs of the μAMPS project is the main topic of this thesis.

1.2 Background

Some background will be helpful before discussing the actual project. I will first make some comments on the concept of low power software in general, and introduce an important low power software technique that can be implemented at the OS level specifically. After this comes a short look at the StrongARM 1100 microprocessor. Finally, I will comment on embedded real-time operating systems in general and explain why I chose eCos as the OS on which to base my project.

1. StrongARM and ARM are registered trademarks of Advanced RISC Machines Limited.
[Figure 1.1: Diagram of a μAMPS Node. Labels in the diagram: External Stimulus (application-specific); Sensor (acoustic, seismic, etc.); A/D Converter; To/From Base Station or Remote Node; DRAM (16Mb EDO); To PC (for debugging); Battery.]

1.2.1 General Low Power Software Techniques

Many techniques for energy efficiency [1, 30] are applicable at the level of hardware, but there are also methods that can be applied in software.² In this thesis, I use the term low power software to refer to software on which some sort of optimization has been done (whether at the algorithmic level in the source code, at lower levels by a compiler, or both) with the energy consumption of the system in mind. Traditionally, runtime has been used as the metric for most optimizations because performance is so important. In battery powered embedded systems, however, battery lifetime is at stake and thus energy consumption is at least as important as performance.

2. It is worthwhile to make a special note of any techniques that are unique to software (no effort in hardware could accomplish a similar improvement). However, it is still useful to consider techniques that attempt to use software enhancements to make up for inefficient hardware. For example, the hardware available on a satellite is fixed, but more energy-efficient system software could perhaps be uploaded.

The difference between energy consumption and power consumption is a distinction that needs to be highlighted. Although there are cases in which one might be interested in minimizing peak power consumption, for the μAMPS nodes we are interested in minimizing overall energy consumption (while maintaining the required functionality) since that is what matters for battery life. The term "low-power software" seems to imply simply that we are minimizing average power, but that would not be enough since time is also a factor in overall energy consumption. For this reason, the term "energy efficient software" might have been more accurate.
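The distinction between power and energy can be made concrete with a small calculation. The following is only an illustrative sketch: the function name and the power and runtime figures are hypothetical and are not measurements from this thesis. It shows that a version of a program with lower average power can still consume more total energy if it runs longer.

```python
def energy_joules(avg_power_watts, runtime_seconds):
    """Energy is average power multiplied by the time it is drawn: E = P_avg * t."""
    return avg_power_watts * runtime_seconds

# Hypothetical numbers, chosen only to illustrate the distinction.
slow = energy_joules(avg_power_watts=0.20, runtime_seconds=10.0)  # lower power, longer runtime
fast = energy_joules(avg_power_watts=0.30, runtime_seconds=5.0)   # higher power, shorter runtime

print(f"lower-power version:  {slow:.2f} J")
print(f"higher-power version: {fast:.2f} J")
```

Under these assumed figures the higher-power version finishes sooner and uses less energy overall, which is the quantity that determines battery life.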
The most obvious low power software technique is to make the software "lean" by eliminating inefficient code or unnecessary functionality that increases code size, which affects energy consumption by wasting both memory and clock cycles. Other techniques include: compiler enhancements for more energy efficient code generation, compiler enhancements for memory alignment of code and data for cache performance, algorithmic transformations for performance or for cache efficiency, algorithmic transformations for trading off energy vs. quality, and so on [8, 13, 26, 35, 36].

Optimizing software for energy consumption is not unlike optimizing for performance. In fact, very often these two can coexist: most transformations that decrease runtime (increase performance) will also reduce energy consumption. This is an important point to keep in mind. The synergy between these two types of optimizations is convenient. Also, when evaluating low power optimizations we need to be careful to distinguish between those that would already have been done traditionally for performance's sake, and those that are unique to low-power systems.

An example of an OS optimization that would be done for the sake of energy, but never for performance, is to coarsen the grain of scheduling at times when the performance of all threads is non-critical (i.e. it is okay for threads to wait for a longer time while other threads run). Since context switches have a cost in terms of energy (due to the flushing of caches, saving and restoring state, etc.), minimizing them by coarsening the scheduling would save energy. But since performance is non-critical in these situations, it would make no sense to do coarsening for the sake of performance.

Naturally, to perform low power software optimizations you need to have a detailed understanding of how energy is consumed by the hardware. While some components consume energy at a constant power level that cannot be significantly affected by software (i.e.
leakage), the energy consumption of other components can be affected dramatically depending on how the software operates (for example, energy expended in SRAM due to cache misses). While there are large similarities between embedded systems that make certain low power optimizations generally useful, one should watch out for details of energy consumption in particular hardware that could affect optimization tradeoffs that are made with appropriate use of Amdahl's law.³ For example, if your floating point unit is an off-chip coprocessor and communicating with it is expensive in terms of energy, it may actually be more efficient to perform certain floating point calculations using software emulation instead. There is clearly a different tradeoff between energy and performance in this case, as compared to one where floating point operations are energy-efficient, and thus the types of software optimizations used should be different.

One of the frustrations in designing low-power software is that it almost always involves stripping away functionality that is considered unnecessary, but that seems useful. We may wish to retain some of this functionality at least part of the time. A power aware system attempts to provide different levels of quality/functionality (and thus energy consumption) at different times by intelligently monitoring the available system energy and activity, and user preferences (i.e. desired latency or quality). By providing high functionality/performance sometimes and dynamically scaling down to more limited functionality at other times, the hope is to have the best of both worlds: average energy consumption is reduced without sacrificing peak performance. This technique is useful as long as the overhead associated with scaling functionality does not outweigh the average energy savings it provides.⁴
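The break-even condition in the last sentence can be sketched numerically. This is a minimal model under stated assumptions, not a method taken from this thesis: the function names, the two-level (high/low) quality model, the choice of an always-high-quality system as the baseline, and all numbers are hypothetical.

```python
def average_energy_with_scaling(e_high, e_low, fraction_high, overhead):
    """Average energy per task when a fraction of tasks run at high quality,
    the rest run at reduced quality, and managing the scaling costs a fixed
    overhead per task (all values in the same energy unit)."""
    return fraction_high * e_high + (1.0 - fraction_high) * e_low + overhead

def scaling_pays_off(e_high, e_low, fraction_high, overhead):
    """Assumed baseline: a system that always runs at high quality.
    Scaling is worthwhile only if it lowers the average energy per task."""
    return average_energy_with_scaling(e_high, e_low, fraction_high, overhead) < e_high

# Hypothetical values: high quality costs 10 units, low quality costs 4 units,
# and high quality is needed for one task in four.
print(scaling_pays_off(e_high=10.0, e_low=4.0, fraction_high=0.25, overhead=0.5))
print(scaling_pays_off(e_high=10.0, e_low=4.0, fraction_high=0.25, overhead=5.0))
```

In this model the first case saves energy and the second does not, because the assumed management overhead exceeds the savings from running at reduced quality.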
3. Although Amdahl's law should always be kept in mind, we may sometimes need to consider energy efficiency in almost every area simultaneously in order to meet an absolute energy constraint.

4. There is overhead involved in executing the algorithms that are used to manage the scaling. There may also be a small amount of energy used by any additional hardware needed to support the scaling. If these overheads turn out to be large, it may have been better to use a traditional low-power design approach.

1.2.2 Low Power OS Techniques

Since the OS is at the heart of the system, it is likely in some cases to be an important place to apply general low power software techniques. We need to focus on the OS in addition to applications because it performs certain privileged system management activities that can have an important impact on energy efficiency. Also note that any techniques that happen to be platform specific (for example, those for which cache size or other details of the hardware are important parameters) are more naturally applied to the OS code since it is usually very platform-specific anyway (applications, on the other hand, may be compiled to run on several different platforms and therefore such techniques may not be as feasible).

Research has been done on how to implement OS schedulers that make good use of the idle and sleep modes of both the processor and peripherals [14-18, 27, 31]. This is a "power aware" technique because the OS can actively manage power consumption by varying the set of enabled functional units while still meeting the needs of user applications. Implementing power awareness at the OS level is advantageous since it can improve energy efficiency even for applications that are not themselves power aware.

There are several ways in which an OS can be power aware. Any of the common OS features might be implemented to vary their functionality based on the available energy or user demands.
For example, a power aware file system/disk driver might provide synchronous disk access when necessary for robustness, but switch to asynchronous disk access at other times so that energy could be saved by clustering disk accesses together and minimizing the number of disk accesses (assuming that there is a constant component to the cost in terms of energy of each separate disk access). Another example might be a power aware scheduler that modifies the scheduling policy in such a way that allows dynamic trade-offs between performance and power consumption.

The most important example of power awareness in the OS for this thesis, however, is the use of a technique called dynamic voltage scaling (DVS)⁵ [22, 24, 33]. DVS takes advantage of an important property of static digital CMOS logic: the energy consumed to perform a given computation (i.e. to execute a sequence of instructions) is roughly proportional to the square of the supply voltage, as can be seen in the following formula [28]:

$$\text{Energy} = C_{tot} V_{DD}^2 + I_{leak} V_{DD} \frac{N}{f}$$

where $C_{tot}$ is the total switched capacitance of the computation, $V_{DD}$ is the supply voltage, $I_{leak}$ is the leakage current, $N$ is the number of clock cycles taken by the computation, and $f$ is the clock frequency.⁶ Note that $C_{tot}$ is the sum of the total switched capacitances in each individual clock cycle. The total switched capacitance will be different for different clock cycles because different instructions are being executed and different functional units are being used. Here, we ignore these details by lumping everything into the single value $C_{tot}$.

It has been shown previously [3, 38] that the quadratic dependency on voltage can be exploited to provide significant energy savings by statically scaling down frequency and voltage together. DVS, on the other hand, takes advantage of this concept to scale frequency and voltage in the context of dynamically changing system activity and requirements (i.e. at runtime). Because the fraction of time spent at different frequencies is highly application dependent, so are the actual energy savings that can be achieved with DVS. In fact, there are even unusual cases in which DVS can actually be worse in terms of battery life. Battery life is sometimes shorter when drawing energy at a constant rate than when drawing the same amount of energy in periodic pulses [21]. In many cases DVS might fall short of the best-case (in terms of battery life) pattern of power consumption despite its energy efficiency.

The fundamental software problem of DVS is in implementing an algorithm that decides exactly what frequency/voltage level to use at each instant in time. Algorithms that do this are called voltage scheduling algorithms, and they are a topic of ongoing research [23, 25, 29]. The simplest such algorithms are interval-based. Time is divided into equal intervals and the algorithm uses data (such as processor utilization in terms of idle cycles) gathered in previous intervals to determine the frequency/voltage for the next interval. Although such algorithms provide enough energy savings in many cases to justify the use of DVS, they are not optimal. Potential enhancements could involve mechanisms to dynamically vary the interval length, and to better predict the processor usage in the next interval by independently considering the activity of each thread (this is discussed at the end of [25]).

5. It is interesting to note that Transmeta's LongRun™ technology [37] is essentially dynamic voltage scaling. The TM5400 can apparently scale from 500 MHz at 1.2 V to 700 MHz at 1.6 V.

6. Recall that the maximum achievable frequency depends on voltage (this can be seen in Figure 3.1 on page 49), which constrains the combinations of values we can use to reach a desired energy efficiency.

1.2.3 The StrongARM 1100 Microprocessor

The Intel StrongARM 1100 processor was chosen for μAMPS for several reasons.
Foremost is probably that it has a very high performance/power ratio.⁷ However, it is also a good choice because it has built-in capability for software-controlled frequency scaling, which is needed for DVS. Supply voltage scaling, however, requires off-chip hardware, which will be built into the μAMPS board.⁸ The SA-1100 is also attractive because it has several integrated peripheral units. As you can see from Figure 1.2, there are general-purpose I/O pins (GPIO), 5 specialized serial ports, and a built-in LCD display controller.

In addition to software-controllable clock frequency, the SA-1100 also has idle and sleep modes. The idle mode stops the clock to the ARM core, but most of the peripherals remain active. Power consumption in idle mode is lowered by about a factor of 5. The sleep mode shuts almost everything off, and power consumption is lowered by orders of magnitude. Figure 1.3 gives more detail about which functional units remain active in which modes. For complete details on the SA-1100, consult the SA-1100 developer's manual [9].

7. The Hitachi SH3 processor has a higher performance/power ratio, but lower raw performance than the StrongARM.

8. The second generation StrongARM chips, due in mid 2000, will supposedly have voltage scaling built in also.

[Figure 1.2: SA-1100 Block Diagram (from Intel SA-1100 Developer's Manual). Blocks named in the diagram include: ARM SA-1 Core, instruction cache (16 Kbytes), IMMU, DMMU, Minicache, Read Buffer, Write Buffer, System Bus, System Control Module (SCM) with OS Timer, General-Purpose I/O, Interrupt Controller, and Reset Controller, DMA Controller, LCD Controller, Peripheral Control Module (PCM), Peripheral Bus, and Serial Channels 0-4 (USB, SDLC, IrDA, UART, CODEC).]

[Figure 1.3: SA-1100 Power and Clock Supply Sources and States During Power-Down Modes (from Intel SA-1100 Developer's Manual). Columns: Module; Supply Source (Pwr, Clk); Run (Pwr, Clk); Idle (Pwr, Clk); Sleep (Pwr, Clk). Supply sources: VDD with the 3.6864 MHz clock, VDDX with the 32.768 kHz clock. Modules: CPU, MMUs (I&D), Write buffer, Read buffer, JTAG, OS timer, LCD controller, Serial channel 0-4, Memory and PCMCIA control, Real-time clock, Interrupt controller, Power manager, General-purpose I/O, Pin pads.]

Software development for the StrongARM 1100 is done on the "Brutus" evaluation board [10, 11]. Once the software works on the Brutus, porting it to the initial μAMPS board should not be difficult since it will be similar to the Brutus. The Brutus computer was designed as a test platform to demonstrate almost all of the capabilities of the StrongARM 1100 microprocessor. Its components include:

- SA-1100 Microprocessor
- Memory System (16Mb DRAM, 512K SRAM, 256K Flash, 256K ROM)
- Two PCMCIA Slots
- 320x240 Color LCD Screen
- Audio Accessories (microphone, speaker)
- HEX LED Display (one digit)
- Touch Screen
- Keyboard
- Two RS-232 Serial I/O Interfaces

Since most of these components will not be present on the μAMPS board, direct OS support (i.e. drivers) for them will not be needed. However, these peripherals (especially the LCD screen and keyboard) are helpful for providing direct interaction with demonstration applications that are meant to be run only on the Brutus.

1.2.4 Embedded Real-Time Operating Systems

Some of the applications of μAMPS will be real-time.
(fn. 9) Rather than writing an embedded real-time operating system (RTOS) of our own from scratch, we chose to start with an existing RTOS and make modifications as needed. The current state of RTOS's is amazingly diverse; there are well over 100 different available RTOS's [5, 34] with many distinguishing factors. For µAMPS, we need an open source RTOS because we intend to modify the source code to add power aware features. StrongARM support is preferable, so that we don't have to port it ourselves. We need scalability since we want to be able to have very lean code (no unneeded features). Preferences regarding other distinguishing features are less important.

Footnote 9: For a concise introduction to the topic of real-time systems, look at [32]. For more than you probably want to know about rate-monotonic real-time scheduling, refer to [12]. The topic of real-time systems has been studied for decades and is rather advanced, and thus I do not attempt to discuss it in this thesis.

1.2.5 The Embedded Cygnus Operating System

After reviewing a long list of RTOS's, my conclusion was that the Embedded Cygnus (fn. 10) Operating System (eCos) [6], from Cygnus Solutions, would be suitable as a starting point for use with our µAMPS prototype. The features of eCos that make it attractive for this project are:

- scalability: eCos has over 200 configuration options (which can be chosen using a handy configuration tool, seen in Figure 1.4) for fine grain scalability, and code size can be as small as a few kilobytes.
- compatibility: eCos has µITRON compatibility, and will soon have EL/IX [7] compatibility, which makes it more compatible with Linux.
- multi-platform: Should we ever choose to stray from StrongARM, eCos is more likely to support our next choice of platform.
- modularity: eCos is implemented with a hardware abstraction layer (HAL) that makes it easier to port to new platforms. It is also implemented in such a way that makes it easy to plug in a custom scheduler, device driver, etc. (see Figure 1.5).
- open source: eCos source code is freely downloadable at sourceware.cygnus.com
- development toolchain: eCos uses the standard GNU toolchain.
- support: There is an active mailing list (ecos-discuss) for free support. Increasing volume on this mailing list indicates that eCos is becoming more popular.

Footnote 10: eCos was named prior to the acquisition of Cygnus Solutions by Red Hat Software, which happened in January 2000.

[Figure 1.4: Scalability of eCos [5] (from Cygnus eCos Market Backgrounder)]

[Figure 1.5: Modularity of eCos [5]]

Unfortunately, eCos does not currently allow the dynamic loading of code. Until eCos supports this feature, it will not be easy for us to download additional application code onto a µAMPS node after deployment. Instead, the entire OS and application (which are actually compiled together into the same binary image) must be replaced together.

Other OS's that were considered did not appear to meet our needs as well as eCos. For example, embedded Linux has real-time support, but is not nearly lean enough and is not scalable (it is not trivial to eliminate the file system, for example). The uClinux OS is leaner, but has not yet been ported to any ARM platforms. ChorusOS (from Sun) is not open source. LynxOS has an interesting patented interrupt handling mechanism, but is not open source and also doesn't support ARM. RTEMS is open source and actually has advantages over eCos, but has no ARM support. Several other RTOS's were considered and all had similar issues.

1.3 Overview

Chapter 2 gives details about what is involved in porting and debugging an embedded operating system such as eCos. Chapter 3 describes the DVS demo application that I developed. Some ideas for energy efficiency improvements are given in Chapter 4.
After the conclusion in Chapter 5, several appendices give useful data and more information about the OS and DVS experiments.

Chapter 2
Porting the OS

It is of course necessary to get the basic OS features working before any experimentation can be done with special low-power OS features like DVS. This chapter describes how eCos was ported to run on the Brutus. A lot of low-level details had to be taken care of to get eCos to boot properly.

2.1 Initial Work on the Source Code

Fortunately, eCos is designed with portability in mind. Most of the kernel is written in C++ and is entirely portable. The platform dependent code is isolated in what is called the Hardware Abstraction Layer (HAL) and is written in C. Porting eCos to a new platform means creating a HAL for that platform.

The HAL code contains the routines necessary to boot and initialize the system. It also contains several functions and macros that are used by the rest of the eCos code. Although there are many features in common between microprocessors, the method of accessing and controlling the features is different for each processor. For example, most platforms have caches, interrupts, timers, and MMU's. The HAL presents a common interface for these features to the platform independent part of eCos, through various functions and macros. Most of these functions are short and self-contained (it doesn't take very many instructions to mask an interrupt, for example), and thus are simple to port.

Since there are often many different platforms based on the same architecture (for example, ARM, Intel, and Cirrus Logic all make their own chips based on the ARM instruction set), the HAL is divided further into sections that are architecture-specific and sections that are platform-specific. This made the job of porting eCos to the Brutus easier because the ARM architecture was already supported.
The parts of the HAL that actually needed to be ported included about 5000 lines of C and ARM assembly code scattered across no more than two dozen files.

There are two different ways to run eCos code: from RAM or from ROM. Running code from ROM requires first programming the ROM, while running code in RAM requires first downloading the code into RAM. Since downloading code into RAM is faster and easier than reprogramming a ROM, it is generally desirable to perform debugging on code in RAM. However, since the SA-1100 always boots from ROM (fn. 11), there is a small portion of code that must run from ROM and implement some sort of download protocol to allow other code to be downloaded and run in RAM. For the ARM Software Development Toolkit (SDT), this small portion of code is called Angel. For eCos, which uses the GNU toolchain, this code is called a GDB stub. The protocol is specific to the GNU Debugger (GDB), which is used on a PC to download the code to the Brutus over an RS-232 connection. It is called a "stub" because it implements only a small subset of the GDB remote debugging protocol: just enough to allow code to be downloaded and run from RAM (fn. 12).

Footnote 11: When the SA-1100 is powered on, it begins executing code starting from address zero, which happens to fall in the ROM area of memory. This makes sense because RAM will normally contain garbage at boot time, and thus the boot code could not be run from there.

The first task in porting eCos was to get a working set of GDB stub ROMs. Since much of the initialization code in the HAL is necessary even for GDB stubs, most of the HAL had to be ported before GDB stubs would work. All 5000 lines of code had to be ported before any real testing or debugging could be done.
Fortunately, porting the GDB stubs is the majority of the work that needs to be done to get all of eCos working (except for several additional bugs that needed to be fixed), since the remaining code is platform independent and already debugged on other platforms.

Since GDB has a feature for using the ARM debugging protocol (ADP), it can connect to a Brutus running Angel. Therefore it would be possible to use Angel to debug eCos in RAM instead of first debugging the GDB stubs. However, various issues in connecting to Angel with GDB make this undesirable. For example, after stopping execution at a breakpoint there are bugs that prevent the execution from continuing further. This renders breakpoints useless and one is left with only single-stepping, which severely limits debugging capability. Also, Angel does not support multi-threaded debugging (the ability to control execution on a thread-by-thread basis). So I chose to port the GDB stubs and use them instead. However, the Angel source code is a valuable reference when writing the GDB stubs (much of the early boot code is the same).

Since GDB stubs are a subset of eCos (they both use the same HAL), the process of compiling them is very similar to compiling an eCos kernel. Section 2.1.1 describes how the Brutus HAL code was prepared for configuration and compilation in the eCos configuration tool. Section 2.1.2 goes on to discuss how the various functions and macros in the HAL were ported (as discussed earlier in this section). Then, Section 2.1.3 gives an overview of the boot sequence, which is the final and most difficult part of the HAL porting. Section 2.1.4 and Section 2.1.5 describe how the SA-1100's virtual memory system is initialized at boot time.

Footnote 12: It is the necessity of this separate downloading step that qualifies eCos as an embedded operating system. Since embedded computers often have no text displays or keyboards, all communication and debugging must be done from a remote host.
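To give a feel for how little a stub has to implement at the transport level, the following is a minimal sketch of the packet framing used by the GDB remote serial protocol: each packet travels as `$<payload>#<checksum>`, where the checksum is the sum of the payload bytes modulo 256, written as two hex digits. This is not code from the eCos stubs; the function names `rsp_checksum` and `rsp_frame` are hypothetical, and a real stub would also handle acknowledgements, escaping, and the serial driver.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Sum of the payload bytes modulo 256 (the protocol's checksum). */
unsigned char rsp_checksum(const char *payload)
{
    unsigned char sum = 0;
    while (*payload != '\0')
        sum = (unsigned char)(sum + (unsigned char)*payload++);
    return sum;
}

/* Frame a payload as "$<payload>#<xx>".
 * Returns 0 on success, -1 if the output buffer is too small.
 * (Hypothetical helper for illustration only.) */
int rsp_frame(const char *payload, char *out, size_t out_size)
{
    /* '$' + payload + '#' + two hex digits + terminating NUL */
    size_t needed = strlen(payload) + 5;
    if (out_size < needed)
        return -1;
    snprintf(out, out_size, "$%s#%02x", payload,
             (unsigned)rsp_checksum(payload));
    return 0;
}
```

For example, framing the single-character payload `g` (the "read registers" request) with `rsp_frame("g", buf, sizeof buf)` leaves `$g#67` in `buf`, since the ASCII code of `g` is 0x67.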
2.1.1 Copying the SA-110 HAL as a Starting Point

Since Cygnus had already created an eCos HAL for the EBSA285 (the SA-110/21285 evaluation board), I chose to use this as a starting point for my Brutus/SA-1100 HAL. There is enough similarity between the SA-110 and SA-1100 that this saved some work. The shaded areas in Figure 2.1 show the features of the SA-1100 that are identical to the SA-110. Unshaded areas are units that either did not exist in the SA-110, or were significantly changed, or used to be part of the 21285 companion chip but were integrated onto the SA-1100. Any code that accesses features in the unshaded areas of the SA-1100 in this figure needs to be ported or completely rewritten.

[Figure 2.1: SA-1100 vs. SA-110 (from Intel SA-1100 Developer's Manual)]

[Figure 2.2: eCos Configuration Tool]

The eCos configuration tool, seen in Figure 2.2, provides a graphical user interface to the actual eCos source code. At the top level (seen in the figure), the user can enable or disable entire sections of the code called packages. At lower levels, the user can perform fine-grain configurations within each package. The hierarchical menu is not hard-coded into the tool; the tool actually generates it by looking at the source code repository. Once a specific set of configurations has been chosen, the tool copies the source code for the enabled packages into a build tree (separate from the source code repository) and the configurations are written into the build tree in the form of several header files containing C macros. The rest of the eCos source code includes these header files and uses the macros to behave as configured.

After copying all the files of the EBSA285 HAL ($ECOS_SRC/packages/hal/arm/ebsa285/) and doing string replacements to create a distinct Brutus HAL ($ECOS_SRC/packages/hal/arm/brutus/) in the eCos source code repository ($ECOS_SRC), it is possible to build an eCos kernel with the new Brutus HAL using the configuration tool (Figure 2.3).

[Figure 2.3: eCos Configuration for Brutus Platform]

2.1.2 Porting of Self-Contained Functions and Macros

The first file to be ported is:

    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/include/hal_brutus.h

because it contains a large number of one-line macros describing the register locations of the SA-1100 that are used throughout the rest of the HAL code. The register locations in this file are copied from the SA-1100 developer's manual, as are several special bitmask macros for some of the registers. Browsing these macros is a good way to become familiar with the SA-1100's features. For example, here are the definitions of the macros for the real-time clock (RTC). The register names correspond to the names used in the SA-1100 manual:

    /* SA-1100 Internal Registers for System Control Module
       Real-Time Clock Definitions */
    #define SA1100_REGRCNR                  REG32_PTR(0x90010004)
    #define SA1100_REGRTAR                  REG32_PTR(0x90010000)
    #define SA1100_REGRTSR                  REG32_PTR(0x90010010)
    #define SA1100_REGRTTR                  REG32_PTR(0x90010008)
    /* RTSR */
    #define SA1100_ALARMDETECTED            0x1
    #define SA1100_1HZRISINGEDGEDETECTED    0x2
    #define SA1100_ALARMINTERRUPTENABLE     0x4
    #define SA1100_1HZINTERRUPTENABLE       0x8
    /* RTTR */
    #define SA1100_CLOCKDIVIDERCOUNTMASK    0x0000FFFF
    #define SA1100_TRIMDELETE_COUNTMASK     0x03FF0000

Once these basic macros are defined, several larger (but still small) macros and functions can be ported. For example, below are the functions used by the eCos code to mask or unmask interrupts.
These are defined in $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/src/brutus_misc.c:

Original (for SA-110):

    void hal_interrupt_mask(int vector)
    {
        *SA110_IRQCONT_IRQENABLECLEAR = 1 << vector;
    }

    void hal_interrupt_unmask(int vector)
    {
        *SA110_IRQCONT_IRQENABLESET = 1 << vector;
    }

Ported (for SA-1100):

    void hal_interrupt_mask(int vector)
    {
        *SA1100_REGICMR &= ~(1 << vector);
    }

    void hal_interrupt_unmask(int vector)
    {
        *SA1100_REGICMR |= (1 << vector);
    }

This example is representative of the type of editing that needs to be done in a large number of macros and functions throughout the HAL in the following files (this is not necessarily an exhaustive list):

    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/include/hal_cache.h
    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/include/hal_platform_ints.h
    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/include/hal_diag.h
    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/include/pkgconf/hal_arm_brutus.h
    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/include/plf_stub.h
    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/src/plf_stub.c
    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/src/hal_diag.c
    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/src/brutus_misc.c
    $ECOS_SRC/packages/io/serial/v1_2_10/include/pkgconf/io_serial.h
    $ECOS_SRC/packages/io/serial/v1_2_10/src/arm/brutus_serial.c

2.1.3 Overview of the Boot Sequences

The most complex part of the Brutus HAL is the initialization code. Fortunately, the main flow of the boot sequence is the same for all ARM processors and thus did not need to be ported. The file that contains the code that runs when the processor boots is:

    $ECOS_SRC/packages/hal/arm/arch/v1_2_10/src/vectors.S

and this file did not need to be edited at all. However, the very first thing this code does is call a PLATFORM_SETUP macro. This macro, which is for platform specific initialization, is defined in:

    $ECOS_SRC/packages/hal/arm/brutus/v1_2_10/include/hal_platform_setup.h

and this file required a complete rewrite.
Before we discuss any details of boot-time initialization, it is worth reviewing the list of actions that are performed when eCos boots. The boot sequence is different depending on whether eCos is running from ROM or RAM. The boot sequence for the GDB stubs is similar to ROM startup, but there are differences.

During the boot sequence, the Brutus HEX LED display is used to display numbers which indicate visibly which part of the boot process is currently executing. If the SA-1100 crashes or hangs during the boot sequence, the user can use the value on the HEX display for clues to debug the problem. The LED value at the end of each boot sequence is zero. Normally, the boot sequence occurs so quickly that you see nothing but a zero on the display. If you see any other number, you know immediately that the corresponding step has failed.

Table 2.1, Table 2.2, and Table 2.3 list the actions performed as part of the boot sequence for each startup type (RAM, ROM, STUBS), along with the LED values that are displayed for each. Note that some details have been left out of these tables. The best way to discover the complete details of what happens when eCos boots is to read the source code directly.

A couple of details of the boot sequences exist for energy efficiency, as recommended by the SA-1100 manual. First, notice that the instruction cache is enabled early (even before the MMU is enabled), in order to allow the boot code to run much more efficiently. This means that later, when the MMU is enabled, the ICACHE needs to be temporarily disabled, flushed, and re-enabled. Second, the clock frequency is set and clock switching (a feature that allows the core clock to double in frequency relative to the memory clock) is enabled early in the boot sequence, in order for the boot code to run at the desired frequency (which can be chosen to trade off boot time with boot energy).
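The LED progress scheme described above can be sketched as follows. This is a host-side illustration under stated assumptions, not the eCos boot code (which is ARM assembly): the LED register is replaced by an ordinary variable (`fake_led_reg`), and `run_boot`, `boot_step_fn`, `step_ok` and `step_fail` are hypothetical names. The code for a step is written to the display just before the step runs, and codes count down so that the last step shows zero.

```c
#include <stdint.h>

/* Stand-in for the memory-mapped HEX LED register (hypothetical; on real
 * hardware this would be a fixed address taken from the board manual). */
volatile uint32_t fake_led_reg = 0xFFu;
volatile uint32_t *boot_led = &fake_led_reg;

typedef int (*boot_step_fn)(void);   /* a boot step returns 0 on success */

/* Runs the steps in order. Before each step, its code is shown on the LED;
 * codes count down so the final step is announced as 0.
 * Returns -1 if every step succeeded, otherwise the index of the step that
 * failed, in which case the LED keeps showing that step's code. */
int run_boot(const boot_step_fn *steps, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        *boot_led = (count - 1u - i) & 0xFu;  /* announce the next action */
        if (steps[i]() != 0)
            return (int)i;                    /* a real boot would hang here */
    }
    return -1;
}

/* Trivial demo steps for trying the sketch on a host machine. */
int step_ok(void)   { return 0; }
int step_fail(void) { return 1; }
```

With four steps of which the third fails, `run_boot` returns 2 and the display is left at 1, which is the kind of clue the paragraph above describes: the value on the display identifies the step that did not complete.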
Table 2.1: Boot Sequence for RAM Startup

Hex LED Readout | Next Boot Action Performed
8 | If already in supervisor mode, jumps to step 5 (below).
7 | Sets up exception vectors for undefined instruction and software interrupt exceptions.
6 | Switches to supervisor mode.
5 | Sets up exception vectors for IRQ, FIQ, prefetch abort, data abort.
4 | Initializes stack pointers. Initializes CPSR, SPSR. Clears BSS.
3 | Platform specific hardware initialization in hal_hardware_init(). For Brutus, this sets up the interrupt environment by masking all interrupts and setting them all to do IRQ and not FIQ.
2 | (Nothing)
1 | Invokes static constructors (for all the C++ code of eCos).
0 | Starts the eCos kernel by calling cyg_start().

Table 2.2: Boot Sequence for ROM Startup

Hex LED Readout | Next Boot Action Performed
(blank) | Enters SVC mode, sets frequency, flushes caches. Initializes HEX LED display.
F | Enables instruction cache.
E | Initializes peripheral pins and GPIOs. Clears OS timer count register.
D | Sets clock frequency and enables clock switching.
C | Initializes memory interfaces (DRAM waveforms, ROM type, etc.).
B | Builds virtual memory page tables.
A | Disables domain access control. Sets page table base address register.
9 | Enables MMU and caches.
8 | (Nothing)
7 | Sets up exception vectors for undefined instruction and software interrupt.
6 | Makes sure we are in supervisor mode.
5 | Sets up exception vectors for IRQ, FIQ, prefetch abort, data abort.
4 | Sets up reset exception vector (for warm reset). Relocates data from ROM to RAM. Initializes stacks. Initializes CPSR and SPSR. Clears BSS.
3 | Platform specific hardware initialization in hal_hardware_init(). For Brutus, this sets up the interrupt environment by masking all interrupts and setting them all to do IRQ and not FIQ.
2 | (Nothing)
1 | Invokes static constructors (for all the C++ code of eCos).
0 | Starts the eCos kernel by calling cyg_start().

Table 2.3: Boot Sequence for STUBS Startup

Hex LED Readout | Next Boot Action Performed
(blank) | Enters SVC mode, sets frequency, flushes caches. Initializes HEX LED display.
F | Enables instruction cache.
E | Initializes peripheral pins and GPIOs. Clears OS timer count register.
D | Sets clock frequency and enables clock switching.
C | Initializes memory interfaces (DRAM waveforms, ROM type, etc.).
B | Builds virtual memory page tables.
A | Disables domain access control. Sets page table base address register.
9 | Enables MMU and caches.
8 | (Nothing)
7 | (Nothing)
6 | Makes sure we are in supervisor mode.
5 | Sets up exception vectors for software interrupt, IRQ, FIQ, prefetch abort, data abort.
4 | Sets up reset exception vector (for warm reset). Relocates data from ROM to RAM. Initializes stacks. Initializes CPSR and SPSR. Clears BSS.
3 | Platform specific hardware initialization in hal_hardware_init(). For Brutus, this sets up the interrupt environment by masking all interrupts and setting them all to do IRQ and not FIQ.
2 | Initializes stubs (initializes serial port, etc.).
1 | Invokes static constructors (for all the C++ code of eCos).
0 | Starts the eCos kernel by calling cyg_start().

2.1.4 Building the Virtual Memory Page Tables

One step of the boot sequence implemented by the PLATFORM_SETUP macro (mentioned in the previous section) is to build the virtual memory page tables. The use of virtual memory allows for flexibility in defining the memory layout. Implementing and debugging the code that builds the page tables and enables the MMU (next section) is one of the more difficult tasks of porting eCos to the Brutus.
The layout of physical memory, shown in Figure 2.4, is already determined by the SA-1100 platform and cannot be altered:

[Figure 2.4: SA-1100 Memory Map for the entire 4Gb 32-bit Address Space (from Intel SA-1100 Developer's Manual). From the top of the address space down:
  Reserved (384 Mbyte)
  Zeros Bank (128 Mbyte): cache flush replacement data; reads return zero
  Dynamic Memory (512 Mbyte), starting at 0xC0000000: DRAM Banks 3 down to 0 (128 Mbyte each)
  Internal Registers (1 GB), starting at 0x80000000: LCD and DMA Registers, Memory and Expansion Registers, System Control Module Registers, Peripheral Control Module Registers (256 Mbyte each)
  Reserved (1 GB), starting at 0x40000000
  PCMCIA Interface (512 Mbyte), starting at 0x20000000: PCMCIA Socket 0 and Socket 1 Space (256 Mbyte each)
  Static Memory (Flash, SRAM), starting at 0x00000000: Static Bank Select 3 down to 0 (128 Mbyte each)]

Note that a large portion of the address space is devoted to the "Internal Registers." When addresses in this range are used, the SA-1100 routes the data to or from special internal registers instead of to off-chip memory. This is the mechanism by which most of the features of the chip are accessed and controlled.

Also notice that ROM is found at address 0x00000000, while DRAM begins at address 0xC0000000. Although the SA-1100 is not the only processor with a memory layout like this, it is somewhat uncommon to have ROM at address zero. The reason is that the ARM architecture (as well as many others) requires the exception vectors to begin at address zero, and generally the exception vectors need to be in RAM so that software can exchange exception handlers at runtime. Since the SA-1100 has ROM at address zero, if you want exception vectors to be in RAM you will have to enable the MMU and create a non-flat (fn. 13) virtual memory mapping.
The ARM architecture defines the page tables and other details of the virtual memory system (these are not specific to the SA-1100). A two-level page table scheme is used. The L1 page table occupies 16Kb of memory, contains 4096 entries, and each entry governs a 1Mb section of the address space (for a total of 4Gb). The L2 page tables each occupy 1Kb of memory, contain 256 entries, and each entry governs 4Kb of the address space (for a total of 1Mb). There is also a special "sub-page" feature that allows access control (but not address remapping) to extend to a resolution of 1Kb.

As with any multi-level page table scheme, the use of the lower level page tables is optional. Thus, I chose to avoid the use of L2 page tables because they waste memory and make each page table walk (which is implemented in hardware) less efficient. By using only an L1 page table, several "holes" are left in the address space where the latter part of a range of addresses (that is less than the 1Mb resolution, such as the 256K of boot ROM) has a mapping in the virtual memory system but has no underlying physical memory (and thus accesses to these addresses could be unpredictable). This is a sacrifice that I chose to make for the sake of energy efficiency (fn. 14).

Footnote 13: By "non-flat," I simply mean that the predicate (virtual address == physical address) does not hold true for all locations in memory, and thus address translation is actually necessary.

Footnote 14: I make no claim here as to how much this affects energy efficiency because I did not perform actual experiments to measure this.

Table 2.4 shows the memory layout that I chose to use for eCos on the Brutus:

Table 2.4: Virtual Memory Mapping as specified in Page Tables

Purpose of Memory Area | Size | Physical Address Range | Virtual Address Range
Boot ROM | 1 Mb | 0x00000000..0x000FFFFF | 0x04000000..0x040FFFFF
Peripheral Control Module (PCM) Registers | 1 Mb | 0x80000000..0x800FFFFF | 0x80000000..0x800FFFFF
System Control Module (SCM) Registers | 1 Mb | 0x90000000..0x900FFFFF | 0x90000000..0x900FFFFF
Memory Control Registers | 1 Mb | 0xA0000000..0xA00FFFFF | 0xA0000000..0xA00FFFFF
DMA/LCD Registers | 2 Mb | 0xB0000000..0xB01FFFFF | 0xB0000000..0xB01FFFFF
DRAM Bank 1 | 4 Mb | 0xC0000000..0xC03FFFFF | 0x00000000..0x003FFFFF
DRAM Bank 2 | 4 Mb | 0xC8000000..0xC83FFFFF | 0x00400000..0x007FFFFF
DRAM Bank 3 | 4 Mb | 0xD0000000..0xD03FFFFF | 0x00800000..0x00BFFFFF
DRAM Bank 4 | 4 Mb | 0xD8000000..0xD83FFFFF | 0x00C00000..0x00FFFFFF
Zeros Bank | 128 Mb | 0xE0000000..0xE80FFFFF | 0xE0000000..0xE80FFFFF

RAM has been remapped to address zero, while ROM has been moved up to address 0x04000000 (fn. 15). The rest of the address space has a flat mapping to avoid confusion. No mapping is created for SRAM or Flash since they are not used at this time (fn. 16). Any reads or writes to virtual addresses outside any of the ranges in the table will produce page faults. The access permissions and cacheability are not shown in the table, but DRAM Bank 4 is mapped uncacheable and unbufferable, so that it can be used for things such as the LCD frame buffer (which is required to be uncacheable and unbufferable). This memory mapping is one of the few things that will need to be adjusted when eCos is finally ported from the Brutus to the µAMPS prototype.

Footnote 15: This location for ROM will be okay as long as there is less than 64Mb of DRAM (otherwise the DRAM starting at address zero will occupy space beyond 0x04000000).

Footnote 16: Before running eCos, it is important that the Brutus switches be set to enable DRAM instead of SRAM, and to enable 32-bit wide ROM accesses.

Once the memory map is decided, eCos needs to be configured so that the linker knows where to place code and data when compiling eCos. Normally, this would require the direct use of a linker script, which is beyond the level of many programmers. Therefore the eCos configuration tool has a graphical feature to aid in the automatic generation of a linker script (and a couple of other related files). Figure 2.5 and Figure 2.6 show the memory layouts as they were defined in the eCos configuration tool for STUBS, RAM, and ROM startup.

You can see in Figure 2.5 that the GDB stub code resides in ROM starting at address 0x04000000 and can use the entire 256Kb range of ROM if necessary, while (writable) data is limited to the lower 16Kb of RAM. The remaining part of RAM is reserved for downloading code that is to be run from RAM. Notice that the data section appears in both the RAM and ROM regions in the layout. This is because this section contains initialized data and is copied (relocated) from ROM to RAM at boot time. The 800Kb reserved section at the bottom of RAM leaves room for the exception vectors.

Looking at Figure 2.6, we see that eCos programs compiled for RAM startup only use the first three banks of DRAM (as mentioned earlier, the fourth bank is mapped uncacheable and unbufferable and is for special uses such as the LCD buffer). The 32Kb of reserved space in the lower part of DRAM is for the GDB stubs and the virtual memory page tables (which reside at 0x00004000). Programs compiled for ROM startup use RAM very similarly, and use ROM in the same way it was used for STUBS startup (except now it is much more likely that the entire 256Kb range would be needed).

[Figure 2.5: Memory Layout for STUBS Startup]

[Figure 2.6: Memory Layouts for RAM and ROM Startup]

2.1.5 Enabling the MMU

Once the page tables have been created, the next step is to enable virtual memory using them.
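As an illustration of what "creating the page tables" amounts to with an L1-only scheme, the sketch below fills a 4096-entry table with 1 MB section descriptors for part of the mapping in Table 2.4 and then translates addresses through it the way the hardware walk would. It is written in C for readability and is not the thesis code (the real table is built in boot-time assembly and must be 16 KB aligned). The names `l1_map`, `l1_build` and `l1_translate` are hypothetical, and the cache and buffer flags are assumptions, except that DRAM Bank 4 is left uncached and unbuffered as stated in the text. The descriptor layout follows the ARM first-level section format: bits [31:20] hold the section base, bits [11:10] the access permissions, and bits [1:0] = 10 mark a section.

```c
#include <stdint.h>

#define L1_ENTRIES     4096u        /* 4096 entries x 1 MB = 4 GB */
#define SECTION_SHIFT  20
#define DESC_SECTION   0x2u         /* bits [1:0] = 10: section descriptor */
#define DESC_B         (1u << 2)    /* bufferable */
#define DESC_C         (1u << 3)    /* cacheable */
#define DESC_BIT4      (1u << 4)    /* conventionally written as 1 on ARMv4 */
#define DESC_AP_RW     (3u << 10)   /* AP = 11: read/write (assumed policy) */

/* Map size_mb consecutive 1 MB sections starting at virt onto phys. */
void l1_map(uint32_t *l1, uint32_t virt, uint32_t phys,
            uint32_t size_mb, uint32_t flags)
{
    for (uint32_t i = 0; i < size_mb; i++) {
        uint32_t base = phys + (i << SECTION_SHIFT);
        l1[(virt >> SECTION_SHIFT) + i] =
            (base & 0xFFF00000u) | DESC_AP_RW | DESC_BIT4 | DESC_SECTION | flags;
    }
}

/* Build a subset of the Table 2.4 mapping; unmapped entries stay 0 (fault). */
void l1_build(uint32_t *l1)
{
    for (uint32_t i = 0; i < L1_ENTRIES; i++)
        l1[i] = 0;
    l1_map(l1, 0x04000000u, 0x00000000u, 1, DESC_C);           /* boot ROM */
    l1_map(l1, 0x90000000u, 0x90000000u, 1, 0);                /* SCM registers */
    l1_map(l1, 0x00000000u, 0xC0000000u, 4, DESC_C | DESC_B);  /* DRAM Bank 1 */
    l1_map(l1, 0x00400000u, 0xC8000000u, 4, DESC_C | DESC_B);  /* DRAM Bank 2 */
    l1_map(l1, 0x00800000u, 0xD0000000u, 4, DESC_C | DESC_B);  /* DRAM Bank 3 */
    l1_map(l1, 0x00C00000u, 0xD8000000u, 4, 0);                /* Bank 4: uncached */
    l1_map(l1, 0xFFF00000u, 0xC0000000u, 1, 0);                /* page table alias */
}

/* Software model of the hardware walk: 0 on success, -1 on a fault. */
int l1_translate(const uint32_t *l1, uint32_t virt, uint32_t *phys)
{
    uint32_t desc = l1[virt >> SECTION_SHIFT];
    if ((desc & 0x3u) != DESC_SECTION)
        return -1;
    *phys = (desc & 0xFFF00000u) | (virt & 0x000FFFFFu);
    return 0;
}
```

After `l1_build`, translating virtual address 0x00004000 (where the page tables reside) yields physical address 0xC0004000, and the alias entry at the top of the address space makes 0xFFF04000 resolve to the same physical location. An address outside every mapped range, such as 0x10000000, reports a fault.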
Enabling virtual memory should be as simple as setting the page table base address register to the physical address of the page table (0xC0004000) and turning the MMU on. However, when the memory mapping is not flat (for the region of code that actually enables the MMU), things get complicated because the address of the next instruction "magically" changes when the MMU is enabled (which is effectively like a jump instruction). On some pipelined architectures, it is possible to place a branch in exactly the right position so that execution will continue seamlessly. However, this does not work on the SA-1100, so an interesting "hack" must be performed instead. Figure 2.7 illustrates the steps of the hack that occur prior to the activation of the MMU, and Figure 2.8 shows the steps that happen afterward:

[Figure 2.7: Before Enabling the MMU (Steps 1 and 2)]

[Figure 2.8: After Enabling the MMU (Steps 3 to 5)]

In Step 1, the page tables have been built normally (DRAM mapped to 0x00000000, ROM mapped to 0x04000000, etc.) and the program counter points directly to the code that is executing from ROM. If we enabled the MMU in this condition, the next instruction would be fetched from a nonsensical location in DRAM because the address in the program counter would suddenly be mapped there.
To prevent this from happening, Step 2 temporarily overwrites the page table entry for the corresponding page in DRAM so that it points to the page in ROM where the code is executing. Since the memory mapping for this page is now flat, the MMU can be safely enabled. In Step 3, the MMU is enabled and code is still running from the same place in ROM. We now need to update the PC, however, so that the code will be running at its new virtual address. This is done by a branch instruction (Step 4) that sends the PC to the next instruction at its correct virtual address. Now we need to restore the page table entry that was temporarily overwritten so that we can again access the first page of DRAM. This is done in Step 5, and now things are back to normal with the MMU enabled.

There is one detail that was glossed over, however. Although the steps as described would work in general, there is a complication as we have defined the memory map for eCos because the page table itself is in the first page of DRAM, the very area of memory that becomes temporarily unavailable in Step 2. The page table must have a mapping in the virtual address space, however, in order for us to restore the page table entry in Step 5. I accomplished this by creating an alias to the first page of DRAM in the page table. The very last page table entry (which was otherwise unused) points to the first page in DRAM where the page tables are stored. When the temporarily overwritten page table entry is restored in Step 5, the code writes to the page table at the address 0xFFF04000, which points to the correct place in DRAM. By creating this alias to the page in memory where the page tables are stored, we are assured of always being able to write to them if necessary.

Some subtle points need to be mentioned. First, the TLBs must be flushed each time the page tables are edited, as is done in Steps 2 and 5.
Second, the caches are temporarily disabled during all of these steps to prevent aliases from being created in them after the MMU is enabled. Finally, note that any registers that contain addresses must be updated when the MMU is enabled. For example, if the link register (LR) contained a return address, it will need to be translated before returning. If the MMU ever needed to be disabled, there would be a similar set of steps to follow. However, with eCos on the Brutus it is never necessary to disable the MMU.

2.2 The Debugging Process

Once all of the code in the HAL was finally ported to the Brutus as described in the various parts of Section 2.1, the next step was to compile GDB stubs and begin debugging the code in ROM. The debugging of the early parts of the boot sequence was the hardest because the only feedback available was from the HEX LED display (serial ports are not initialized until late in the boot sequence). To debug the code that performs the steps described in Section 2.1.5, for example, I actually had to write code to display the values in the SA-1100 registers one digit at a time.

Once the GDB stubs are working, it is possible to connect to the Brutus with gdb and download code just as would be done with Angel. The next step is to compile a full eCos kernel and test an eCos application. Since eCos comes with over 150 test programs, the existing test suite was used to verify the functionality of eCos on the Brutus. Initially, most of the tests passed. Tests that failed were mostly due to a bug in the floating point code that gcc generates to pack or unpack doubles.

Chapter 3 eCos Applications

3.1 Early Demo Programs

To verify that eCos was working reasonably well on the Brutus, and to get some more experience writing eCos applications, a few simple demo programs were written. My earliest test programs print messages to the serial port, which are then displayed in the gdb terminal on the PC.
It wasn't long, however, before we wanted to be able to interact with the Brutus directly using its own peripherals. I ported the LCD and keyboard drivers from some of Intel's code (which was written to run on Angel), and used these to write a series of graphical demos. One of the demos, for example, displays colored bouncing squares on the LCD screen. Each square is animated by a separate thread. A screen shot of this program is shown in Figure F.4.

The next step before implementing DVS was to write a test program to allow the frequency of the SA-1100 to be changed at runtime. I wrote this program as an extension of the graphical colored squares demo so that any runtime errors would be visibly obvious. The keyboard is used to control the frequency as desired. The SA-1100 manual actually recommends that the clock frequency be set only at boot time. However, the SA-1100 clock frequency can be safely changed at any time, as long as certain peripherals are not in use (the clock signal to some peripherals, such as the serial ports, is unstable during the 150 µs that it takes for the PLL to re-lock). One issue that surfaced in developing this program is that some peripherals need to be re-configured when the frequency is changed. For example, the LCD pixel clock is derived from the core clock using a configurable divider. When the core clock frequency changes, the divisor must be updated in order to maintain the same LCD refresh rate.

3.2 Developing a DVS Demo

Now that frequency scaling worked, I was ready to begin developing a program to do actual Dynamic Voltage Scaling. The first part that needed to be developed was code to determine the processor load. The average load value over each interval is what is used (in the simplest implementation of DVS) to determine what frequency to use in the next interval (in this case, intervals are on the order of a second or two).
The method this program uses for determining load is to create a thread of lowest priority (so that it only runs when nothing else is ready to run) with an infinite loop. A counter is incremented inside the loop, so that the counter value can be used to determine roughly how much time the processor has spent running the lowest priority thread. A load-monitoring program implemented in this manner was posted to the ecos-discuss mailing list (by Andrew Lunn, an eCos user from Switzerland). This load monitoring code has a cleverly written calibration step which temporarily makes the counter thread highest priority. The counter value after one second at highest priority is used as the 100% load reference point.

One problem with this method for determining processor load is that it requires a loop to be running when the application would otherwise be idle. Since the SA-1100 idle mode consumes significantly less energy, it would be nice to be able to use it. An alternate scheme for determining processor load (which has not yet been implemented) would be to make the lowest priority thread simply idle the processor, and check the value of the OS timer count register upon entering and exiting idle mode. Subtracting the two values gives the amount of time the processor spent idling. Since the OS timer counts at a known frequency (32.7 kHz), a calibration step would not be required.

The frequency scaling and load monitoring code was enough to create a dynamic frequency scaling demo. I was able to write a program which adjusted the clock frequency depending on the processor utilization of the application. The program tried to keep the load at a certain value (around 95%) by raising or lowering the frequency. Since the application was performing very predictable periodic tasks, this demo worked well. The only remaining feature needed to complete a true DVS demo was the voltage scaling code.
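The load measurement and frequency selection described in this section can be sketched in C as follows. This is only an illustration under stated assumptions, not the code used in the demo: the function names `load_percent` and `next_freq_index`, and the policy of choosing the lowest frequency that would have kept the observed work at or below the target, are hypothetical. The frequency list is taken from Table 3.1 and the 95% target from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Core clock settings in kHz, lowest first (values from Table 3.1). */
static const uint32_t freq_khz[] = {
     59000,  73700,  88500, 103200, 118000, 132700,
    147500, 162200, 176900, 191700, 206400
};
#define NUM_FREQS ((int)(sizeof(freq_khz) / sizeof(freq_khz[0])))

/* Target utilization from the text (about 95%). */
#define TARGET_LOAD_PCT 95u

/*
 * Convert an idle-loop count into a load percentage.
 * full_idle_count is the calibration value: the count reached when the
 * counter thread has the processor to itself for one whole interval.
 */
unsigned load_percent(uint32_t idle_count, uint32_t full_idle_count)
{
    assert(full_idle_count > 0);
    if (idle_count >= full_idle_count)
        return 0;
    return 100u - (unsigned)(((uint64_t)idle_count * 100u) / full_idle_count);
}

/*
 * Pick the table index for the next interval (hypothetical policy):
 * the lowest frequency at which the work just observed would have
 * produced a load no higher than TARGET_LOAD_PCT.
 */
int next_freq_index(int cur, unsigned load_pct)
{
    assert(cur >= 0 && cur < NUM_FREQS);
    /* Round up so the chosen frequency is never slightly too low. */
    uint64_t needed_khz =
        ((uint64_t)freq_khz[cur] * load_pct + TARGET_LOAD_PCT - 1) / TARGET_LOAD_PCT;
    for (int i = 0; i < NUM_FREQS; i++) {
        if (freq_khz[i] >= needed_khz)
            return i;
    }
    return NUM_FREQS - 1; /* saturated: stay at the top setting */
}
```

For example, a 50% load measured at 206.4 MHz corresponds to roughly 108.6 MHz of work at the 95% target, so the sketch selects 118.0 MHz. Note that an idle-loop count depends on the clock frequency in use, so a real implementation would have to calibrate (or rescale) `full_idle_count` per frequency; that detail is ignored here.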
To determine which voltage to use at each processor frequency, the DVS Test Board (shown in Figure F.2) was hooked up to the Brutus and the experiment shown in Figure 3.1 was done. This experiment shows the energy consumed at each frequency and voltage combination (a nice illustration of why we want to do DVS in the first place), and also shows the lowest voltage at which the processor can run at each frequency without crashing. A small safety margin was added to these voltages and the values given in Table 3.1 were decided upon.

Figure 3.1: Measured Energy per Operation vs. Frequency and Supply Voltage (axes: Core Voltage (V), Frequency (MHz))

Core Frequency (MHz)   Core Supply (V)   D4-D0 (Hex)
206.4                  1.500             0A
191.7                  1.400             0C
176.9                  1.350             0D
162.2                  1.275             10
147.5                  1.225             12
132.7                  1.175             14
118.0                  1.100             17
103.2                  1.025             1A
88.5                   0.975             1C
73.7                   0.925             1E
59.0                   0.900             1F

Table 3.1: Voltages used with DVS for each Clock Frequency

The test application that was chosen to demonstrate DVS is variable-length filtering. The DVS demo program performs FIR filtering of an input signal using simple convolution. Varying the length of the FIR filter changes both the amount of computation required to perform the filtering and the quality of the output signal. Figure 3.2 shows how the resulting energy (per instruction) is affected by varying filter length/quality. For this application, DVS provides a 60% energy savings over frequency scaling alone.

Figure 3.2: Energy per Operation with Voltage Scaling vs. without Voltage Scaling (x-axis: Filter Quality (Impulse Response Length))

Figure 3.3: Screen Shot from DVS Demo

Chapter 4 Ideas for Energy Efficiency Improvements

Although my experiments showed that even a simple implementation of DVS can give good results, this work is only a beginning.
More energy efficient techniques would need to be used to better meet the needs of the µAMPS project. This chapter explores several ideas, some better than others.

4.1 DVS-Related Techniques

4.1.1 Choose the Right Boot Frequency

It was alluded to earlier (in Chapter 2) that the clock frequency and voltage that are used during system initialization can be traded off against the time the initialization takes. If initialization time is not critical and other parts of the system are not burning significant amounts of energy during this time, the processor should be kept at the lowest possible frequency (and voltage) while booting. However, in other cases it may be ideal to boot at higher frequencies. This is a trade-off that should be considered once more details of a particular µAMPS application become clear. This technique is only useful in applications where the nodes are for some reason frequently rebooted (recall that the SA-1100 must reboot when coming out of sleep mode).

4.1.2 Energy vs. Quality Scaling

The OS can help DVS spend more time at lower voltages by appropriately varying the quality of certain features. For example, software floating point operations can be done with varying levels of accuracy as required by the application.

4.1.3 Thread-based Voltage Scheduling

The simple interval-based DVS technique used in the experiment of Chapter 3 will perform poorly for certain applications with irregular load patterns. It has been suggested ([22]-[25]) that using information about the execution patterns of individual threads could provide better input for voltage scheduling algorithms. Perhaps the voltage scheduling could actually be integrated into the OS scheduler. Unfortunately, I did not have time to further develop this idea.

4.2 Non-DVS Techniques

4.2.1 Know your Hardware

There is no substitute for understanding the details of the hardware that affect energy efficiency.
Examples of features significant for energy efficiency on the SA-1100 are:

- Be careful not to let any of the SA-1100 GPIO pins float (see page 11-184 of the SA-1100 manual). Floating pins can cause unnecessary transitions to occur in the pads of the SA-1100, which are powered by the 3.3V supply and thus can expend significant amounts of energy. All GPIOs should either be configured as outputs, or driven by the devices at the other end. Since the reset state of the SA-1100 configures all GPIOs as inputs (to avoid contention), the boot sequence should configure the correct pins as outputs as soon as possible.
- Use FIQs when appropriate. Avoiding the unnecessary saving and restoring of registers (necessary for IRQs) can be good both for performance and for energy efficiency.
- In the special case of a system that uses an LCD panel that does not require the AC bias signal from the SA-1100's LCD controller, the AC bias level should be set to the minimum value to save energy.
- Make sure the DRAM waveform configuration registers are set up properly. The detailed timing of the signals between the SA-1100 and DRAM can affect energy on the memory bus. There may be different choices of timing configurations that work equally well, but some may be more energy efficient than others.
- Use DMA-driven (instead of interrupt-driven) I/O when appropriate. Not only does this prevent the ARM core from doing extra work, but the DMA controller can also do more efficient block transfers, saving energy both on the SA-1100 and in the external memory system.

4.2.2 Efficient/Restricted use of Memory

In some systems the DRAM consumes a large percentage of overall power. Special attention to the use of memory by the OS can affect this. For example, the memory map that is chosen will affect energy consumption in the TLBs and elsewhere.
If half of the time the SA-1100 is accessing memory addresses in the range 0xFFFF#### and the other half of the time it is accessing memory in the range 0x0000####, there will be a lot of transitions occurring in the TLBs and caches because of the differing upper address bits. Defining the memory map so that the two ranges are adjacent might save energy. Keeping the memory map relatively flat can also help avoid the use of L2 page tables, which take more memory and make page table walks take more energy. Details of how memory allocation is performed will also make a difference in the memory address access patterns, which could affect energy consumption.

Chapter 5 Conclusion

Although I have laid the groundwork for the µAMPS OS, there is plenty of work left to do. Future work might include:

- More research on DVS.
- µAMPS application fine-tuning: do benchmarking on the actual µAMPS system and fine-tune the OS and application for best energy efficiency.
- Implement an ARMulator model for the µAMPS board to allow energy-true simulation of software.
- Compiler enhancements for energy efficiency, such as using the Thumb instruction set in addition to ARM instructions for code compactness (Thumb is not supported on the StrongARM 1100 but another ARM chip could be used).
- Consider how to make higher-level OSs, such as Linux, more energy efficient (file systems, virtual memory paging, networking, dynamic memory allocation, process handling, parallel/distributed processing, etc.).

I am glad to have had this opportunity to learn and grow in several areas before my MIT education is drawn to a close. Through the course of writing this thesis I have been introduced to real-time operating systems, embedded software development, the ARM architecture, and more details of the GNU toolchain and C programming. The µAMPS project now has a small OS to use on its prototype hardware and to begin improving upon.
We also now have some real experience with the energy savings that can be achieved by DVS. This project has been challenging and even overwhelming at times, but the experience is valuable and will perhaps help future projects go more smoothly.

Appendix A DVS Test Board Specifications

This appendix gives some specifications for the DVS Test Board (see Figure F.2) that was designed by Rex Min for use with the Brutus.

A.1 Control Signals from Brutus to DVS Test Board

We used 5 GPIO signals to control the voltage level. Fortunately, the Brutus has exactly 5 GPIO pins connected to easily accessible test points. Wires were soldered to these test points for connection to the DVS test board. Table A.1 shows how the control wires were connected:

Test Point   DVS Board Input   GPIO   Also Functions As
TP16         D4                9      left green LED
TP17         D3                8      right green LED
TP93         D2                20     red LED
TP25         D1                26     RCLK out
TP26         D0                27     32KHz Out

Table A.1: Voltage Control Signals

A.2 Specification for DVS Test Board Inputs

Table A.2 shows which values of the DVS board inputs correspond to which voltage output levels. This information was used to write the code that allows the Brutus to set the voltage. There is more than one value that produces 1.250 V, but only one of them is used by the software to produce this voltage level. If inputs D4 or D3 to the DVS test board are left floating, the board will take the value from its switches instead of from the control wires. The board was designed to honor the SA-1100's maximum safe voltage of 1.6 V, even though the LTC1736 regulator chip that was used can output up to 2 V.
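As a rough illustration of how software might use Tables A.1 and A.2, the following C sketch converts a 5-bit D4-D0 code into an output voltage and into a bit mask of the GPIO pins that carry the code. The function names `dvs_code_to_mv` and `dvs_code_to_gpio_mask` are hypothetical, and the closed-form voltage expressions are simply a fit to the values listed in Table A.2 (50 mV steps for codes 0x08-0x0F, 25 mV steps for codes 0x10-0x1F); they are not taken from the thesis code. Codes 0x00-0x07, which Table A.2 marks "S" rather than with a voltage, are reported as -1.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Output voltage in millivolts for a 5-bit D4..D0 code, following the
 * values listed in Table A.2. Returns -1 for codes 0x00-0x07, which the
 * table marks "S" instead of giving a voltage.
 */
int dvs_code_to_mv(unsigned code)
{
    assert(code <= 0x1F);
    if (code < 0x08)
        return -1;
    if (code < 0x10)
        return 1600 - 50 * (int)(code - 0x08);   /* 1.600 V down to 1.250 V */
    return 1275 - 25 * (int)(code - 0x10);       /* 1.275 V down to 0.900 V */
}

/*
 * Bit mask of the SA-1100 GPIO pins that must be driven high for a given
 * code, using the wiring in Table A.1 (D0..D4 on GPIO 27, 26, 20, 8, 9).
 */
uint32_t dvs_code_to_gpio_mask(unsigned code)
{
    static const unsigned gpio_for_bit[5] = { 27, 26, 20, 8, 9 }; /* D0..D4 */
    uint32_t mask = 0;

    assert(code <= 0x1F);
    for (int bit = 0; bit < 5; bit++) {
        if (code & (1u << bit))
            mask |= (uint32_t)1 << gpio_for_bit[bit];
    }
    return mask;
}
```

For instance, code 0x0A (the 206.4 MHz setting in Table 3.1) maps to 1500 mV and sets GPIO 8 and GPIO 26. Actually driving the pins is hardware-specific and is left out of the sketch.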
D4-D0 (Hex)   D4-D0 (Decimal)   D4-D0 (Binary)   Output Voltage (to SA-1100) (V)
00            0                 00000            S
01            1                 00001            S
02            2                 00010            S
03            3                 00011            S
04            4                 00100            S
05            5                 00101            S
06            6                 00110            S
07            7                 00111            S
08            8                 01000            1.600
09            9                 01001            1.550
0A            10                01010            1.500
0B            11                01011            1.450
0C            12                01100            1.400
0D            13                01101            1.350
0E            14                01110            1.300
0F            15                01111            1.250*
10            16                10000            1.275
11            17                10001            1.250
12            18                10010            1.225
13            19                10011            1.200
14            20                10100            1.175
15            21                10101            1.150
16            22                10110            1.125
17            23                10111            1.100
18            24                11000            1.075
19            25                11001            1.050
1A            26                11010            1.025
1B            27                11011            1.000
1C            28                11100            0.975
1D            29                11101            0.950
1E            30                11110            0.925
1F            31                11111            0.900

Table A.2: DVS Test Board Voltage Control Signals

A.3 AC Characteristics

Figure A.1 shows a scope plot of the output voltage of the DVS board (while driving the Brutus) as the voltage is changed between various levels. Zooming in revealed that the rising and falling transients are less than 100 µs in all cases. This is roughly the same amount of time that the SA-1100 takes to re-lock its PLL when changing clock frequency (150 µs). Because these times are on the same order of magnitude, it is not necessary for the DVS software to introduce a waiting period between the adjustment of voltage and the adjustment of frequency.

Figure A.1: Transients in output voltage of DVS Board (x-axis: Seconds)

Appendix B eCos Regression Test Results

Table B.1 is a list of the regression test programs that are included with eCos 1.3.1. For each test, the result after running it on the debugged Brutus port is indicated. Note that the serial driver still contains a bug that will need to be fixed later, and thus a few tests failed.
Result   Test
P        hal/arm/brutus/v1_3_1/tests/dram-test
P        hal/common/v1_3_1/tests/cache
P        hal/common/v1_3_1/tests/context
N/A      hal/common/v1_3_1/tests/intr
P        io/serial/v1_3_1/tests/serial1
FAIL     io/serial/v1_3_1/tests/serial2
FAIL     io/serial/v1_3_1/tests/serial3
FAIL     io/serial/v1_3_1/tests/serial4
FAIL     io/serial/v1_3_1/tests/serial5
FAIL     io/serial/v1_3_1/tests/tty1
FAIL     io/serial/v1_3_1/tests/tty2
P        kernel/v1_3_1/tests/bin_sem0
P        kernel/v1_3_1/tests/bin_sem1
P        kernel/v1_3_1/tests/bin_sem2
P        kernel/v1_3_1/tests/clock0
P        kernel/v1_3_1/tests/clock1
P        kernel/v1_3_1/tests/clockcnv
P        kernel/v1_3_1/tests/cnt_sem0
P        kernel/v1_3_1/tests/cnt_sem1
P        kernel/v1_3_1/tests/except1
P        kernel/v1_3_1/tests/flag0
P        kernel/v1_3_1/tests/flag1
P        kernel/v1_3_1/tests/intr0
P        kernel/v1_3_1/tests/kclock0
P        kernel/v1_3_1/tests/kclock1
P        kernel/v1_3_1/tests/kexcept1
P        kernel/v1_3_1/tests/kintr0
P        kernel/v1_3_1/tests/kmbox1
P        kernel/v1_3_1/tests/kmemfix1
P        kernel/v1_3_1/tests/kmemvar1
P        kernel/v1_3_1/tests/kmutex0
P        kernel/v1_3_1/tests/kmutex1
P        kernel/v1_3_1/tests/ksched1
P        kernel/v1_3_1/tests/ksem0
P        kernel/v1_3_1/tests/ksem1
P        kernel/v1_3_1/tests/kflag0
P        kernel/v1_3_1/tests/kflag1
P        kernel/v1_3_1/tests/kthread0
P        kernel/v1_3_1/tests/kthread1
P        kernel/v1_3_1/tests/mbox1
P        kernel/v1_3_1/tests/memfix1
P        kernel/v1_3_1/tests/memfix2
P        kernel/v1_3_1/tests/memvar1
P        kernel/v1_3_1/tests/memvar2
P        kernel/v1_3_1/tests/mutex0
P        kernel/v1_3_1/tests/mutex1
P        kernel/v1_3_1/tests/mutex2
P        kernel/v1_3_1/tests/mutex3
P        kernel/v1_3_1/tests/sched1
P        kernel/v1_3_1/tests/sync2
P        kernel/v1_3_1/tests/sync3
P        kernel/v1_3_1/tests/thread0
P        kernel/v1_3_1/tests/thread1
P        kernel/v1_3_1/tests/thread2
P        kernel/v1_3_1/tests/release
P        kernel/v1_3_1/tests/kill
P        kernel/v1_3_1/tests/thread_gdb
P        kernel/v1_3_1/tests/tm_basic
P        kernel/v1_3_1/tests/dhrystone
P        kernel/v1_3_1/tests/stress_threads
P        kernel/v1_3_1/tests/kcache1
P        kernel/v1_3_1/tests/kcache2
P        language/c/libc/v1_3_1/tests/ctype/ctype
P        language/c/libc/v1_3_1/tests/i18n/setlocale
P        language/c/libc/v1_3_1/tests/setjmp/setjmp
P        language/c/libc/v1_3_1/tests/signal/signal1
N/A      language/c/libc/v1_3_1/tests/signal/signal2
P        language/c/libc/v1_3_1/tests/stdio/sprintf1
P        language/c/libc/v1_3_1/tests/stdio/sprintf2
P        language/c/libc/v1_3_1/tests/stdio/sscanf
P        language/c/libc/v1_3_1/tests/stdio/stdiooutput
P        language/c/libc/v1_3_1/tests/stdlib/abs
P        language/c/libc/v1_3_1/tests/stdlib/atexit
P        language/c/libc/v1_3_1/tests/stdlib/atoi
P        language/c/libc/v1_3_1/tests/stdlib/atol
P        language/c/libc/v1_3_1/tests/stdlib/bsearch
P        language/c/libc/v1_3_1/tests/stdlib/div
P        language/c/libc/v1_3_1/tests/stdlib/getenv
P        language/c/libc/v1_3_1/tests/stdlib/labs
P        language/c/libc/v1_3_1/tests/stdlib/ldiv
P        language/c/libc/v1_3_1/tests/stdlib/qsort
P        language/c/libc/v1_3_1/tests/stdlib/malloc1
P        language/c/libc/v1_3_1/tests/stdlib/malloc2
P        language/c/libc/v1_3_1/tests/stdlib/malloc3
P        language/c/libc/v1_3_1/tests/stdlib/rand1
P        language/c/libc/v1_3_1/tests/stdlib/rand2
P        language/c/libc/v1_3_1/tests/stdlib/rand3
P        language/c/libc/v1_3_1/tests/stdlib/rand4
P        language/c/libc/v1_3_1/tests/stdlib/realloc
P        language/c/libc/v1_3_1/tests/stdlib/srand
P        language/c/libc/v1_3_1/tests/stdlib/strtol
P        language/c/libc/v1_3_1/tests/stdlib/strtoul
P        language/c/libc/v1_3_1/tests/string/memchr
P        language/c/libc/v1_3_1/tests/string/memcmp1
P        language/c/libc/v1_3_1/tests/string/memcmp2
P        language/c/libc/v1_3_1/tests/string/memcpy1
P        language/c/libc/v1_3_1/tests/string/memcpy2
P        language/c/libc/v1_3_1/tests/string/memmove1
P        language/c/libc/v1_3_1/tests/string/memmove2
P        language/c/libc/v1_3_1/tests/string/memset
P        language/c/libc/v1_3_1/tests/string/strcat1
P        language/c/libc/v1_3_1/tests/string/strcat2
P        language/c/libc/v1_3_1/tests/string/strchr
P        language/c/libc/v1_3_1/tests/string/strcmp1
P        language/c/libc/v1_3_1/tests/string/strcmp2
P        language/c/libc/v1_3_1/tests/string/strcoll1
P        language/c/libc/v1_3_1/tests/string/strcoll2
P        language/c/libc/v1_3_1/tests/string/strcpy1
P        language/c/libc/v1_3_1/tests/string/strcpy2
P        language/c/libc/v1_3_1/tests/string/strcspn
P        language/c/libc/v1_3_1/tests/string/strlen
P        language/c/libc/v1_3_1/tests/string/strncat1
P        language/c/libc/v1_3_1/tests/string/strncat2
P        language/c/libc/v1_3_1/tests/string/strncpy1
P        language/c/libc/v1_3_1/tests/string/strncpy2
P        language/c/libc/v1_3_1/tests/string/strpbrk
P        language/c/libc/v1_3_1/tests/string/strrchr
P        language/c/libc/v1_3_1/tests/string/strspn
P        language/c/libc/v1_3_1/tests/string/strstr
P        language/c/libc/v1_3_1/tests/string/strtok
P        language/c/libc/v1_3_1/tests/string/strxfrm1
P        language/c/libc/v1_3_1/tests/string/strxfrm2
P        language/c/libc/v1_3_1/tests/time/asctime
P        language/c/libc/v1_3_1/tests/time/clock
P        language/c/libc/v1_3_1/tests/time/ctime
P        language/c/libc/v1_3_1/tests/time/gmtime
P        language/c/libc/v1_3_1/tests/time/localtime
P        language/c/libc/v1_3_1/tests/time/mktime
P        language/c/libc/v1_3_1/tests/time/strftime
P        language/c/libc/v1_3_1/tests/time/time
P        language/c/libm/v1_3_1/tests/vectors/acos
P        language/c/libm/v1_3_1/tests/vectors/asin
P        language/c/libm/v1_3_1/tests/vectors/atan
P        language/c/libm/v1_3_1/tests/vectors/atan2
P        language/c/libm/v1_3_1/tests/vectors/ceil
P        language/c/libm/v1_3_1/tests/vectors/cos
P        language/c/libm/v1_3_1/tests/vectors/cosh
P        language/c/libm/v1_3_1/tests/vectors/exp
P        language/c/libm/v1_3_1/tests/vectors/fabs
P        language/c/libm/v1_3_1/tests/vectors/floor
P        language/c/libm/v1_3_1/tests/vectors/fmod
P        language/c/libm/v1_3_1/tests/vectors/frexp
P        language/c/libm/v1_3_1/tests/vectors/ldexp
P        language/c/libm/v1_3_1/tests/vectors/log
P        language/c/libm/v1_3_1/tests/vectors/log10
P        language/c/libm/v1_3_1/tests/vectors/modf
P        language/c/libm/v1_3_1/tests/vectors/pow
P        language/c/libm/v1_3_1/tests/vectors/sin
P        language/c/libm/v1_3_1/tests/vectors/sinh
P        language/c/libm/v1_3_1/tests/vectors/sqrt
P        language/c/libm/v1_3_1/tests/vectors/tan
P        language/c/libm/v1_3_1/tests/vectors/tanh
P        devs/wallclock/v1_3_1/tests/wallclock
P        devs/watchdog/v1_3_1/tests/watchdog
P        compat/uitron/v1_3_1/tests/test1
P        compat/uitron/v1_3_1/tests/test2
P        compat/uitron/v1_3_1/tests/test3
P        compat/uitron/v1_3_1/tests/test4
P        compat/uitron/v1_3_1/tests/test5
P        compat/uitron/v1_3_1/tests/test6
P        compat/uitron/v1_3_1/tests/test7
P        compat/uitron/v1_3_1/tests/test8
P        compat/uitron/v1_3_1/tests/test9
P        compat/uitron/v1_3_1/tests/testcx2
P        compat/uitron/v1_3_1/tests/testcx3
P        compat/uitron/v1_3_1/tests/testcx4
P        compat/uitron/v1_3_1/tests/testcx5
P        compat/uitron/v1_3_1/tests/testcx6
P        compat/uitron/v1_3_1/tests/testcx7
P        compat/uitron/v1_3_1/tests/testcx8
P        compat/uitron/v1_3_1/tests/testcx9
P        compat/uitron/v1_3_1/tests/testcxx
P        compat/uitron/v1_3_1/tests/testintr

Table B.1: eCos Regression Tests Results for Brutus Port

Appendix C The Big Picture of StrongARM Power Consumption

Although this data [1] is for the StrongARM 110 and not the 1100, it gives a good general idea of how much power goes to each functional unit in a processor of this type. In the StrongARM 1100, we would expect a considerable amount of power to also be consumed by certain peripheral controllers (such as the LCD controller) when they are enabled.

Unit            Power
I-cache         27%
I-box           18%
D-cache         16%
Clock           10%
IMMU            9%
E-box           8%
DMMU            8%
Write Buffer    2%
Bus Interface   2%
PLL             1%

Table C.1: Breakdown of Power Consumption of SA-110 processor when running Dhrystone

Appendix D Instructions

D.1 Building eCos for the Brutus

The following directions are for building eCos on Linux. This is a bit harder than building on WinNT, since there you can do everything inside the eCos Configuration Tool. When working on Linux, however, you are doing mostly the same thing that the Configuration Tool does, but without the convenience of the graphical interface.
There are a couple of exceptions: for example, you must use the configuration tool if you need to edit a memory layout. Also note that when you edit the configurations manually on Linux, there is no automatic checking of the 'requires' and 'precludes' rules. You should be very careful not to violate these rules unless you know what you are doing. Use the following steps to guide you through the process:

1. Set the following variables appropriately (these values are only examples):

    setenv ECOS_SRC /opt/ecos/src/ecos-1.2.10
    setenv ECOS_BUILD /opt/ecos/build/ecos-ram-1.2.10
    setenv ECOS_INSTALL /opt/ecos/install/ecos-ram-1.2.10

2. Use one of the commands below to create an eCos build tree. Add options as necessary to control which packages are included in the build.

RAM:

    tclsh $ECOS_SRC/packages/pkgconf.tcl --force --target=arm --platform=brutus \
        --startup=ram --prefix=$ECOS_INSTALL --builddir=$ECOS_BUILD \
        --srcdir=$ECOS_SRC/packages \
        --enable-CYGPKG_HAL --enable-CYGPKG_INFRA --enable-CYGPKG_KERNEL \
        --enable-CYGPKG_LIBC --enable-CYGPKG_LIBM --enable-CYGPKG_ERROR \
        --enable-CYGPKG_HAL_ARM --enable-CYGPKG_HAL_ARM_BRUTUS \
        --enable-CYGPKG_IO --enable-CYGPKG_IO_SERIAL \
        --enable-CYGPKG_DEVICES_WALLCLOCK

ROM:

    tclsh $ECOS_SRC/packages/pkgconf.tcl --force --target=arm --platform=brutus \
        --startup=rom --prefix=$ECOS_INSTALL --builddir=$ECOS_BUILD \
        --srcdir=$ECOS_SRC/packages \
        --enable-CYGPKG_HAL --enable-CYGPKG_INFRA --enable-CYGPKG_KERNEL \
        --enable-CYGPKG_LIBC --enable-CYGPKG_LIBM --enable-CYGPKG_ERROR \
        --enable-CYGPKG_HAL_ARM --enable-CYGPKG_HAL_ARM_BRUTUS \
        --enable-CYGPKG_IO --enable-CYGPKG_IO_SERIAL \
        --enable-CYGPKG_DEVICES_WALLCLOCK

STUBS:

    tclsh $ECOS_SRC/packages/pkgconf.tcl --force --target=arm --platform=brutus \
        --startup=stubs --prefix=$ECOS_INSTALL --builddir=$ECOS_BUILD \
        --srcdir=$ECOS_SRC/packages \
        --enable-CYGPKG_HAL --enable-CYGPKG_INFRA \
        --enable-CYGPKG_HAL_ARM --enable-CYGPKG_HAL_ARM_BRUTUS

3. cd $ECOS_BUILD/pkgconf

4. emacs pkgconf.mak
   Replace "1100" with "110" until a newer version of gcc has been installed.

5. emacs *.h
   Edit the header files to configure the fine-grain options of eCos. Listed below with each type of configuration are some examples of variables you may need to change from their defaults. Note that some header files are automatically generated and you should never edit them by hand.

Fine grain configuration for RAM startup:

hal.h:
    #define CYGDBG_HAL_DEBUG_GDB_INCLUDE_STUBS
    #define CYGDBG_HAL_DEBUG_GDB_BREAK_SUPPORT
    #undef  CYGDBG_HAL_DEBUG_GDB_CTRLC_SUPPORT
    #define CYGDBG_HAL_DEBUG_GDB_THREAD_SUPPORT

hal_arm_brutus.h:
    #define CYGHWR_HAL_ARM_BRUTUS_STARTUP ram
    #define CYGHWR_HAL_ARM_BRUTUS_PROCESSOR_CLOCK 191700

libc.h:
    #define CYGSEM_LIBC_PER_THREAD_RAND
    #define CYGNUM_LIBC_MALLOC_MEMPOOL_SIZE 4000000

io_serial.h:
    #define CYGPKG_IO_SERIAL_ARM_BRUTUS
    #define CYGPKG_IO_SERIAL_ARM_BRUTUS_SERIAL1

Fine grain configuration for ROM startup:

hal.h:
    #define CYGDBG_HAL_DEBUG_GDB_INCLUDE_STUBS
    #define CYGDBG_HAL_DEBUG_GDB_BREAK_SUPPORT
    #undef  CYGDBG_HAL_DEBUG_GDB_CTRLC_SUPPORT
    #define CYGDBG_HAL_DEBUG_GDB_THREAD_SUPPORT

hal_arm_brutus.h:
    #define CYGHWR_HAL_ARM_BRUTUS_STARTUP rom
    #define CYGHWR_HAL_ARM_BRUTUS_PROCESSOR_CLOCK 191700

libc.h:
    #define CYGSEM_LIBC_PER_THREAD_RAND
    #define CYGNUM_LIBC_MALLOC_MEMPOOL_SIZE 4000000

Fine grain configuration for ROM stubs:

hal.h:
    #undef  CYGFUN_HAL_COMMON_KERNEL_SUPPORT
    #undef  CYGPKG_HAL_EXCEPTIONS
    #define CYGDBG_HAL_DEBUG_GDB_INCLUDE_STUBS
    #define CYGDBG_HAL_DEBUG_GDB_BREAK_SUPPORT
    #undef  CYGDBG_HAL_DEBUG_GDB_CTRLC_SUPPORT
    #undef  CYGDBG_HAL_DEBUG_GDB_THREAD_SUPPORT
    #define CYG_HAL_ROM_MONITOR

hal_arm_brutus.h:
    #define CYGHWR_HAL_ARM_BRUTUS_STARTUP stubs
    #define CYGHWR_HAL_ARM_BRUTUS_PROCESSOR_CLOCK 191700

6. cd ..

7. setenv CC arm-elf-gcc

8. make
   This will generate the install tree, which will contain the eCos library that you need to link with your eCos applications. Watch carefully for any warnings during the build.

For stubs only, to generate the actual image to burn into the flash you now do this:

1. make -C hal/common/v1_2_10/src/stubrom

2. cd hal/common/v1_2_10/src/stubrom

3. arm-elf-objcopy -O binary stubrom stubrom.img

4. Now follow the directions for writing the stubrom.img into a flash.

If you compiled eCos for RAM or ROM start, you can now build your applications with this new kernel by pointing the PKG_INSTALL_DIR variable in their makefiles to $ECOS_INSTALL.

D.2 Programming Flash Memories

If you compiled your eCos application as a ROM image or built eCos gdb stubs as a ROM image, you will need to program the image onto a pair of flashes. You will need to have the ARM SDT installed to do this. The installation CD contains a version for Solaris that you should install on a Sun somewhere. Make sure that the flashes you want to program are in the left-hand pair of sockets (labelled U34, U44), that Angel is in the right-hand sockets (labelled U35, U45), and that there is a serial cable with a null modem connecting the Sun to the front-most serial port on the Brutus. Now execute the following instructions on the Sun (assuming both fmu.axf and your compiled stubrom.img are in the current directory):

    armsd -remote -adp -port 1 -line 38400
    load fmu.axf
    go
    writeflash 262144 stubrom.img
    readflash 262144 stubrom.chk
    quit
    quit
    diff stubrom.img stubrom.chk

Appendix E Raw Data From Experiments

E.1 Data from Voltage/Frequency Scaling Experiment

Table E.1 gives the actual measured data that was used to generate Figure 3.1. This data was acquired on February 18, 2000, using a Keithley 2400 Sourcemeter. The numbers are current in mA at the 5 V supply (footnote 17) to the DVS Test Board, as it was powering the Brutus running an eCos application that keeps the processor at 100% load and allows voltage and frequency to be adjusted from the keyboard. Note that there are some additional voltage values in this data (0.925, 0.975, 1.025, 1.075, 1.125, 1.175, 1.225, and 1.275) that were excluded from Figure 3.1 because they would have caused a confusing scale change in the voltage axis at 1.300V.
The shaded area of the table corresponds to voltage/frequency combinations at which the SA-1100 failed to run. By comparing the voltage values along the upper edge of the shaded area to the values in Table 3.1, you can see that we have used a suitable safety margin to prevent DVS from causing the SA-1100 to crash under adverse conditions (such as high temperatures). Finally, note that the upper bound on energy savings that can be achieved with our DVS on this hardware is:

$$1 - \frac{\frac{1}{59000000} \times 5.16 \times 0.00824}{\frac{1}{206400000} \times 5.16 \times 0.06102} = 52.8\%$$

17. The nominal 5V supply actually averaged about 5.16V.

Core           Frequency (MHz)
Voltage (V)    59.0    73.7    88.5    103.2   118.0   132.7   147.5   162.2   176.9   191.7   206.4
1.600          25.35   29.91   34.45   38.98   43.58   48.20   52.54   57.35   61.16   65.10   69.52
1.550          23.62   27.90   32.23   36.46   40.80   45.11   49.22   53.75   57.35   61.08   65.15
1.500          22.01   26.05   30.13   34.09   38.17   42.17   46.15   50.35   53.64   57.22   61.02
1.450          20.46   24.21   28.03   31.75   35.57   39.33   43.00   46.96   50.11   53.46   57.04
1.400          19.16   22.67   26.20   29.71   33.28   36.81   40.22   43.87   46.89   50.07   53.33
1.350          17.71   20.99
1.300          16.43   19.50
1.275          16.09   19.03
1.250          15.38   18.22
1.225          14.80   17.54
1.200          14.14   16.77
1.175          13.70   16.25
1.150          13.07   15.52
1.125          12.54   14.90
1.100          11.92   14.19
1.075          11.66   13.83
1.050          11.06   13.15
1.025          10.59   12.59
1.000          10.03   11.94
0.975          9.71    11.53
0.950          9.17    10.92
0.925          8.76    10.42
0.900          8.24    9.83

Table E.1: Data from Voltage/Frequency Scaling Experiment

E.2 Data From Variable-Length Filter DVS Experiment

The measured data used to create Figure 3.2 appears in Table E.2. This data was collected on February 20, 2000, using a Keithley 2000 Sourcemeter. The sourcemeter was connected between the output of the DVS test board and the power supply input of the SA-1100 core, so this experiment does not take into account the power drawn by the DC-DC converter itself.
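The upper-bound arithmetic of Section E.1 can be reproduced with a short C sketch. The helper names `energy_per_cycle` and `energy_saving` are hypothetical; the calculation simply treats energy per clock cycle as supply voltage times current divided by clock frequency, and compares the lowest and highest operating points.

```c
#include <assert.h>

/* Energy per clock cycle in joules: supply voltage (V) * current (A) / clock (Hz). */
double energy_per_cycle(double volts, double amps, double hz)
{
    assert(hz > 0.0);
    return volts * amps / hz;
}

/* Fractional saving of the low operating point relative to the high one. */
double energy_saving(double e_low, double e_high)
{
    assert(e_high > 0.0);
    return 1.0 - e_low / e_high;
}
```

With the Table E.1 currents at the measured 5.16 V supply, `energy_saving(energy_per_cycle(5.16, 0.00824, 59.0e6), energy_per_cycle(5.16, 0.06102, 206.4e6))` comes out at roughly 0.528, matching the 52.8% quoted in Section E.1. The same two functions can be applied to the core-side measurements in Table E.2 by substituting the core voltage and current at each operating point.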
The Brutus was running an application that performs convolutional filtering of a signal using FIR filters of variable length (from 32 to 192 in steps of 8). For each filter length, the application would lower the frequency and voltage as much as possible while still filtering the input data at the rate it was being generated. The experiment was also done with varying frequency but fixed voltage as a control. Again, some simple arithmetic shows that the amount by which we can scale down the energy used to filter the same input data but with varying output quality is:

$$1 - \frac{\frac{1}{59000000} \times 0.900 \times 0.01359}{\frac{1}{206400000} \times 1.500 \times 0.10854} = 73.7\%$$

| FIR Filter Length | Freq. (MHz) / Voltage (V) | Load (%) | Current with varying voltage (mA) | Current at fixed 1.5V (mA) |
|---|---|---|---|---|
| 32 | 59.0 / 0.900 | 91 | 13.59 | 35.71 |
| 40 | 59.0 / 0.900 | 95 | 13.69 | 35.90 |
| 48 | 73.7 / 0.950 | 92 | 17.64 | 43.61 |
| 56 | 73.7 / 0.950 | 96 | 17.71 | 43.84 |
| 64 | 88.5 / 0.975 | 93 | 22.98 | 51.42 |
| 72 | 88.5 / 0.975 | 97 | 23.09 | 51.65 |
| 80 | 103.2 / 1.000 | 94 | 28.80 | 59.15 |
| 88 | 103.2 / 1.000 | 98 | 28.94 | 59.45 |
| 96 | 118.0 / 1.025 | 95 | 36.87 | 66.63 |
| 104 | 118.0 / 1.025 | 98 | 37.01 | 66.89 |
| 112 | 132.7 / 1.100 | 95 | 46.42 | 73.94 |
| 120 | 147.5 / 1.150 | 93 | 54.93 | 80.98 |
| 128 | 147.5 / 1.150 | 96 | 55.05 | 81.19 |
| 136 | 162.2 / 1.200 | 93 | 64.45 | 87.99 |
| 144 | 162.2 / 1.200 | 97 | 64.59 | 88.20 |
| 152 | 176.9 / 1.300 | 94 | 77.29 | 94.93 |
| 160 | 176.9 / 1.300 | 97 | 77.41 | 95.08 |
| 168 | 191.7 / 1.400 | 95 | 88.85 | 101.74 |
| 176 | 191.7 / 1.400 | 98 | 88.98 | 101.91 |
| 184 | 206.4 / 1.500 | 95 | 108.43 | 108.43 |
| 192 | 206.4 / 1.500 | 99 | 108.54 | 108.54 |

Table E.2: Data from Variable-Length Filter DVS Experiment

Appendix F

Photos

Figure F.1: Photo of Brutus Board

Figure F.2: Photo of DVS Test Board

Figure F.3: Photo of Brutus with DVS Board connected

Figure F.4: Screen Shot of Graphical Demo on Brutus LCD

Figure F.5: StrongARM 1100 Chip Photo

References

[1] K. Asanovic, "Vector Microprocessors," Ph.D. thesis, University of California, Berkeley, Spring 1998.
[2] A.
Chandrakasan, "Basics of Low Power Circuit and Logic Design," [Online tutorial], Available: http://www-mtl.mit.edu/research/icsystems/tutorials/
[3] A. Chandrakasan, R. Amirtharajah, S.H. Cho, J. Goodman, G. Konduri, J. Kulik, W. Rabiner, A. Wang, "Design Considerations for Distributed Microsensor Systems," In Proc. IEEE 1999 Custom Integrated Circuits Conference (CICC '99), May 1999, pp. 279-286.
[4] Cygnus Solutions, "Cygnus eCos Public License Version 1.0," [Online document], Available: http://www.cygnus.com/ecos/ecoslicense.html
[5] Cygnus Solutions, "Cygnus eCos Market Backgrounder," [Online document], Available: http://www.cygnus.com/ecos/mrktbgrnd.pdf
[6] Cygnus Solutions, "Cygnus eCos White Paper," [Online document], Available: http://www.cygnus.com/ecos/wp.pdf
[7] Cygnus Solutions, "EL/IX White Paper," [Online document], Available: http://sourceware.cygnus.com/elix/whitepaper.html
[8] M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, "Cache-Oblivious Algorithms," extended abstract submitted for publication, Available: http://supertech.lcs.mit.edu/cilk/papers/
[9] Intel, "Intel StrongARM SA-1100 Microprocessor Developer's Manual," August 1999.
[10] Intel, "StrongARM SA-1100 Developer Board Firmware Kit User's Guide," November 1998.
[11] Intel, "StrongARM SA-1100 Microprocessor Evaluation Platform User's Guide," October 1998.
[12] M. Klein, et al., A Practitioner's Handbook for Real-time Analysis: Guide to Rate Monotonic Analysis for Real-time Systems, Boston: Kluwer Academic Publishers, 1993.
[13] T. C. Lee, V. Tiwari, "A Memory Allocation Technique for Low-Energy Embedded DSP Software," Proceedings of the 1995 IEEE Symposium on Low Power Electronics, San Diego, CA, October 1995.
[14] J. Lorch, "A Complete Picture of the Energy Consumption of a Portable Computer," Masters Thesis, Computer Science, University of California at Berkeley, December 1995.
[15] J. Lorch, A.
Smith, "Energy Consumption of Apple Macintosh Computers," IEEE Micro, 18(6):54-63, November/December 1998.
[16] J. Lorch, A. Smith, "Reducing Processor Power Consumption by Improving Processor Time Management in a Single-User Operating System," Proceedings of the Second ACM International Conference on Mobile Computing and Networking, Rye Brook, NY, 143-154, November 1996.
[17] J. Lorch, A. Smith, "Scheduling Techniques for Reducing Processor Energy Use in MacOS," Wireless Networks, 3(5):311-324, October 1997.
[18] J. Lorch, A. Smith, "Software Strategies for Portable Computer Energy Management," IEEE Personal Communications Magazine, 5(3):60-73, June 1998.
[19] R. Min, T. Furrer, A. Chandrakasan, "Dynamic Voltage Scaling Techniques for Distributed Microsensor Networks," WVLSI '00, April 2000.
[20] MIT MTL Integrated Circuits and Systems group, μAMPS project web site, Available: http://www-mtl.mit.edu/research/icsystems/uamps/
[21] M. Pedram, Q. Wu, "Design Considerations for Battery-Powered Electronics," In Proceedings of the 36th Design Automation Conference, 1999.
[22] T. Pering, "Dynamic Voltage Scaling and an Overview of Low-Power Microprocessors," Presentation given at the University of Washington, Nov. 1998, Available: http://infopad.eecs.berkeley.edu/~pering/lpsw/lpsw.html
[23] T. Pering, R. Brodersen, "Energy Efficient Voltage Scheduling for Real-Time Operating Systems," work in progress paper for RTAS'98. Available: http://infopad.eecs.berkeley.edu/~pering/lpsw/lpsw.html
[24] T. Pering, T. Burd, R. Brodersen, "Dynamic Voltage Scaling and the Design of a Low-Power Microprocessor System," Workshop on Low-Power Microprocessors at the 1998 International Symposium of Computer Architecture, June 1998.
[25] T. Pering, T. Burd, R. Brodersen, "The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms," In 1998 International Symposium on Low-Power Electronics and Design.
[26] H.
Prokop, "Cache-Oblivious Algorithms," Masters Thesis, MIT Department of Electrical Engineering and Computer Science, June 1999.
[27] Q. Qiu, M. Pedram, "Dynamic Power Management Based on Continuous-Time Markov Decision Processes," In Proceedings of the 36th Design Automation Conference, 1999.
[28] A. Sinha, "Energy-Scalable Software," Masters Thesis, MIT Department of Electrical Engineering and Computer Science, 2000.
[29] Y. Shin, K. Choi, "Power Conscious Fixed Priority Scheduling for Hard Real-Time Systems," In Proceedings of the 36th Design Automation Conference, 1999.
[30] W. Shiue, C. Chakrabarti, "Memory Exploration for Low Power Embedded Systems," In Proceedings of the 36th Design Automation Conference, 1999.
[31] M. Srivastava, A. Chandrakasan, R. Brodersen, "Predictive System Shutdown and Other Architectural Techniques for Energy Efficient Programmable Computation," In IEEE Transactions on VLSI Systems, Vol. 4, No. 1, March 1996.
[32] D. Stepner, N. Rajan, D. Hui, "Embedded Application Design Using a Real-Time OS," In Proceedings of the 36th Design Automation Conference, 1999.
[33] A. Stratakos, T. Burd, R.W. Brodersen, "Integrated Voltage Regulator and Clock Generator for Dynamic Voltage and Frequency Scaling," [Online presentation], Available: http://bwrc.eecs.berkeley.edu/burd/gpp/slides/IcSeminar.11-96/
[34] H. Takada, "Designing Small-Scale Embedded Systems with uITRON Kernel," [Online presentation], Available: http://www.ertl.ics.tut.ac.jp/~hiro/escs98-ohp.pdf
[35] V. Tiwari, "Logic and System Design for Low Power Consumption," Ph.D. thesis, Princeton University, 1996.
[36] V. Tiwari, S. Malik, A. Wolfe, T.C. Lee, "Instruction Level Power Analysis and Optimization of Software," Journal of VLSI Signal Processing Systems, Vol. 13, No. 2, August 1996, Available: http://www.ee.princeton.edu/~vivek/publications.html
[37] Transmeta Corporation, "The Technology Behind Crusoe™ Processors," [Online Whitepaper], January 2000, pp.
16-17, Available: http://www.transmeta.com/crusoe/download/pdf/crusoetechwp.pdf
[38] A. Wang, W. Rabiner Heinzelman, and A. Chandrakasan, "Energy-Scalable Protocols for Battery-Operated Microsensor Networks," IEEE Workshop on Signal Processing Systems (SiPS '99), October 1999, Taipei, Taiwan.