Where did it come from?
Anybody reading this in the UK will no doubt be familiar with Acorn's BBC micro.
Some, sadly, seem to feel that the company never made it beyond that "odd thing with
the black and red keys", while others can cast their mind back to the moment they
booted RISC OS 4 on their new Kinetic, and gloat.
Either way, Acorn made use of the 6502 processor in the Atom, some kits, and some
rackmount machines in the late seventies. As 1980 rolled in, the BBC went looking
for a computer to fit a series of programmes they wanted to produce. Unlike
these days, when the programmes are much more likely to be made to fit the
computer, the BBC had in mind the sort of specification it was looking for. A
number of companies well
known at the time tendered their designs. Acorn revamped the Atom design, throwing
into it as much as possible, and building an entire working machine from the ground
up in a matter of days. That's the stuff legends are made of, and that seems to be the
stuff Acorn was good at, like "Hey, guys, let's pull an all-nighter and write an
operating system".
The BBC loved the machine, and the rather naffly named "The Micro Program" was
released in 1982 alongside the BBC microcomputer. It filled school computer rooms.
Many were sold. Not many in American terms, but staggering in European terms.
The BBC micro, like earlier Acorn machines, was based around the 6502 processor as were other popular computers such as the Apple II.
From the outset, you could have colour graphics and text on-screen. Not to be
outdone, the BBC micro offered seven screen 'modes' of varying types, ranging
from high resolution monochrome to eight colours (plus eight flashing colours)
and an eight colour 'teletext' mode that only required 1K of memory per screen.
It also offered a cassette interface for cheap and cheerful use; on-board
provision for a floppy disc interface (you only needed to add a couple of ICs,
such as the 1772 disc controller); serial; four channel analogue and eight
channel digital I/O; the Tube for co-processors; a 1MHz system bus for serious
fiddling and for harddiscs... and, by adding a couple of extra components,
built-in networking.
Econet might have been slow and simple, but it was a revolution in those days -
days when Bill Gates, among other notable gaffes, is said to have asked "what's
a network?" (though this may well be urban legend). In any case, running
multiple processor systems and networking all sorts of machines was something
that Acorn users were au fait with long before the PC marketplace kicked off,
never mind implemented such things itself.
However, Acorn had their sights set on the future, and between 1983 and 1985 the
ARM processor was designed by Steve Furber and Sophie Wilson (or Roger Wilson,
back then). This was a leap of faith and optimism: only a year previously they
had released a 32K 8 bit machine, and now they were designing a 32 bit machine
that could cope with up to 16Mb of RAM, and some ROM as well.
Why?
Acorn continued to produce the BBC micro and variants. Indeed, the production of
their most successful version of the BBC micro - the Master - only finished in
May 1993. However, a decade earlier, in 1983, it was quite clear to the
innovators inside Acorn that the next generation of machine should provide
something far better than rehashing old ideas over and over. Therein lay the
problem: which processor to use?
There was nothing that stood out from the crowd. Acorn had produced a machine with
the 16 bit 6502-alike, the 65C816, but this wasn't up to the vision that Acorn had.
They tried all of the 16 and 32 bit processors available by building second processor
units for the BBC micro to aid in their evaluation.
So there was one idea left. To make the processor that they were looking for.
Something that kept the ideals of the 6502, but provided raw power. Something small,
cheap - both to produce and to power - and something fairly simple both internally and
to program. The important early design decisions were to use a fixed instruction
length (which makes it possible to accurately disassemble any random memory
address simply by looking to see what is there - every instruction is word aligned),
and to use a load/store model.
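To see what the load/store model means in practice, here is a sketch of my own
(not from any Acorn documentation) of adding one to a value held in memory.
Where a CISC processor might add to the memory location directly, the ARM must
load, modify, and store:

LDR   R0, [R1]        ; load the word at the address held in R1
ADD   R0, R0, #1      ; do the arithmetic in a register
STR   R0, [R1]        ; store the result back to memory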
At that time, companies were talking about extending their CISC processors. The
8086 became the 80186 (briefly), the 80286, and so on to the processor it is
today. RISC processors existed, but the majority of them were designed in-house
as embedded controllers. Acorn took their ideas and requirements and wrote a
BASIC program that emulated the ARM 1 instruction set. The designers of the
processor were new to processor design, and some of the tools used were not
exactly cutting edge. This prevented the processor design from being large and
complex, which in its way was the best thing, and is now being spun as a 'plus'
for the ARM processor, as indeed it is.
While Acorn had very clear ideas of what they wanted the processor to do, they
also wanted good all-round performance, rather than something so tailored to the
end design that it would obsolete itself.
So. For the processor, Acorn rolled their own.
Please, take a moment to consider this.
Not only did Acorn create an entire powerful and innovative operating system
with a tiny crew (Microsoft probably employs more people to clean their toilets
than Acorn employed in total); but they also designed their own chipset.
So basically these guys designed an entire computer from the ground up, on a tiny
budget and with a tiny workforce.
You can fault Acorn for many things - lack of development, lack of advertising - but
you can never fault them for having the sheer balls to pull it off in the first place.
At the time the "Archimedes" was released, it was widely touted as the world's fastest
desktop machine. It also boasted a display system that could spit out loads of different
resolutions. My A5000 (same video hardware) can output 640x480 at 256 colours, or
800x600 at 16 colours. It doesn't sound impressive, but this was using hardware
developed in the mid '80s. The rest of the world (save Apple Macs) was using
CGA and the like; or Hercules for the truly deranged!
Not a lot was made of the fact that the machines were RISC. Maybe Acorn figured the
name of the operating system (RISC OS) was a big hint. Maybe they figured they had
enough going for the machine without getting all geeky.
So when, in the early '90s, Apple announced the world's first RISC desktop machine,
we laughed. And Acorn ran a good-humoured advert in the Times welcoming Apple
to RISC.
The chipset was:

ARM2
This is the central processor, and the name originally stood for "Acorn
RISC Machine" (rather than "Advanced RISC Machines", or whatever they've
called it today).

MEMC1 (Anna)
This was the MEMory Controller. It was very soon replaced by the MEMC1a,
which I do not think had a name.
The RiscPC generation of machines use an MMU (Memory Management Unit).

VIDC1 (Arabella)
This was the VIDeo Controller, though due to all it was capable of doing to
pixels and sound, many knew it as the Very Ingenious Display Contraption.
Certainly, the monitors that cannot be supported under RISC OS are few and
far between. It is a trivial matter to switch from a modern 21" SVGA monitor
to a television monitor.
The RiscPC generation of machines use the VIDC20, which takes it the logical
step further. Unfortunately, the VIDC is no longer able to keep up with the
latest advances in display driver technology. Enter J. Kortink with his
ViewFinder.

IOC (Albion)
This was the Input/Output Controller, and it looked after podules and
keyboards and basically anything that did I/O. In a flash of inspiration, it
offered an IIC interface which is available on the expansion bus. My teletext
receiver is hooked into this.
RiscPC generation machines use the IOMD which is like a souped up IOC.
The ARM250 (mezzanine / macrocell) offered the ARM chipset on one piece of
silicon. It was used in the A3010 and A3020 machines. It may have also been used in
the A4000, but I've not seen inside such a machine.
The original operating system for the ARM-based machine was to be ARX, but it
was taking too long and running over budget, so Arthur was designed. It has been
said that Arthur - in large part a porting of the BBC MOS - took its name from
"A RISC operating system by Thursday". Sadly, it has a lot of the hang-ups of
the BBC micro: a lack of memory protection ('modules' run in SVC mode, when
really only the kernel should), a plethora of unrelated things done with the
OS_Byte SWI, the service call mechanism...
From Arthur came RISC OS, which improved certain aspects of the system, but
perhaps the most significant improvement was the Desktop. Instead of a bizarre
looking (and horribly coloured) thing that could only run one task at a time, it
introduced proper co-operative multitasking.
The debate between pre-emptive and co-operative multitasking is never-ending,
but I feel that Acorn wanted co-operative; that it was a design decision instead
of a cop-out. Because, while it makes programs slightly harder to write and the
system more liable to problems with errant tasks, it fits so beautifully into
Acorn's ethos. There's no process 'protection' like on Unix. You can drop to a
privileged processor mode with little more than a SWI call, and a lot of stuff
(that probably shouldn't) runs in SVC mode.
Because, at its heart, RISC OS is a hacker's operating system. Not the same type
of 'hacking' that Linux and NetBSD come from - such things were not known in the
home/office computer sector in those days - but in its way, RISC OS is
practically begging for you to whip out the disassembler and start poking around
its internals.
The original Arthur PRMs said that any serious application would be written in
assembler (a view they later changed, to suggest serious applications would be
written in C).
When the ARM processor team split off into ARM Ltd, they adopted a new
numbering system for the processors. Originally, the numerical suffix reflected
the revision of the device: the ARM 1, the ARM 2, the ARM 3... followed by the
ARM two-and-a-half, which became 250 in the tradition of multiplying version
numbers by a hundred.
Now, a single digit reflects the macrocell, as always - ARM6, ARM7...
A two-digit number denotes a self-contained processor with basic interface
circuitry, like the ARM60 (and the VIDC20, which is not strictly a processor,
but part of the ARM chipset).
A three-digit number denotes the processor macrocell combined with other
macrocells, or custom logic, like the ARM610 and the ARM710. Because of the
simplicity of the designs, and the predefined parts, the ARM610 went from
specification to silicon in under four months. Short development times are invaluable
for custom devices where every development day matters... It also matters that ARM's
designs will arrive on time, so you don't end up with your computer or PDA (or
whatever) sitting there awaiting the processor. Within ARM's converted barn, a
line of opened champagne bottles lines the staircase - a testament to how many
of their designs worked from the very first silicon implementation, which is
virtually every single one of them.
So there you have it.
From an idea to a global leader in microprocessors (Intel has said recently it is making
more ARM silicon than x86 silicon), the ARM processor's birth is wrapped in
spectacular innovation.
While it is not entirely certain where RISC OS is heading, one thing is for sure. The
beautiful processor in our RISC OS machines is going from strength to strength.
We at Heyrick wish ARM Ltd all the best...
Where might you find an ARM?
The ARM processor is a powerful, low-cost, efficient, low-power (consumption,
that is) RISC processor. It was originally designed for the Archimedes desktop
computer, but somewhat ironically, numerous factors about its design make it
unsuitable for use in a desktop machine (for example, the MMU and cache are the
wrong way around). However, many of those same factors make it an exceptional
choice for embedded applications.
So while many PCs can scream "Intel(R) inside", there are a steadily increasing
number of devices that could scream "ARM inside" - only ARM doesn't have an ego
anywhere near as large as Intel's. Oh, and yes, I am aware that Intel are fabricating
ARM processors. Oh what a tangled web we weave...
[last updated January 2002]
Gameboy Advance games console
Daewoo inet.top.box
Bush Internet TV / box
Datcom 2000 digital satellite receiver
Pace digital satellite receiver (supplied as part of the Sky package)
Numerous other digital cable / satellite receivers
Hauppauge WinTV DVB-S PC TV card
Oracle NC
LG Java computer
Millipede Apex Imager video board
Paradise AiTV set top box
Sony MZ-R90 minidisc
Win-Jam
JVC's digital camera 'Pixstar'
Lexmark Z12/22/32/42/52 colour Jetprinter
Samsung office laser printer
Samsung SmartJet MFP (printer/scanner/copier/fax)
Xerox colour inkjet printer
Digital logic analyzers from Controlware
IHU-2 Experimental Space Flight Computer
Siemens video phone
Wizcom's Quicktionary
Various GSM handsets, from the likes of Alcatel, AEG, Ericsson, Kenwood,
NEC, Nokia...
Cable/ADSL modems, by manufacturers such as Cayman Systems, D-Link,
and Zoom.
3Com 3CD990-TX-97 10/100 PCI NIC with 3XP processor
Routers, bus adaptors, servers, crypto, gateways...
POS systems
Smart cards
Adaptec PCI to Ultra2 SCSI 64 bit RAID controller
ATA drive electronics controller systems (bare)
Iomega HipZip digital audio player
C pen, with OCR and IrDA
HP/Ericsson/Compaq pocket PCs
Psion series 5 hand-held PC (5mx used 36MHz ARM710T)
Various PDAs
And, of course, all of us using Archimedes / BBC (A30x0) / NetStation /
RiscPC / A7000 / Mico / RiscStation computers!!!
This is not a complete list. Visit http://www.arm.com/ for a full list, with links to each
item.
Some of the above may not use ARM processors themselves, but other hardware
produced by ARM. It is rather difficult to discover what is actually inside half
of these things without owning one and taking it apart!
For images and interesting background information, see the details of the IHU-2
experimental space flight computer.
RISC vs CISC
You can read a reply to this text further down.
In the early days of computing, you had a lump of silicon which performed a number
of instructions. As time progressed, more and more facilities were required, so more
and more instructions were added. However, according to the 20-80 rule, 20% of the
available instructions are likely to be used 80% of the time, with some instructions
only used very rarely. Some of these instructions are very complex, so creating them
in silicon is a very arduous task. Instead, the processor designer uses microcode. To
illustrate this, we shall consider a modern CISC processor (such as a Pentium or
68000 series processor). The core, the base level, is a fast RISC processor. On top of
that is an interpreter which 'sees' the CISC instructions, and breaks them down into
simpler RISC instructions.
Already, we can see a pretty clear picture emerging. Why, if the processor is a simple
RISC unit, don't we use that? Well, the answer lies more in politics than design.
However, Acorn saw this and, not being constrained by the need to remain totally
compatible with earlier technologies, decided to implement their own RISC
processor.
Up until now, we've not really considered the real differences between RISC and
CISC, so...
A Complex Instruction Set Computer (CISC) provides a large and powerful range of
instructions, which is correspondingly less flexible to implement in silicon.
For example, the 8086 microprocessor family has these instructions:
JA    Jump if Above
JAE   Jump if Above or Equal
JB    Jump if Below
...
JPO   Jump if Parity Odd
JS    Jump if Sign
JZ    Jump if Zero
There are 32 jump instructions in the 8086, and the 80386 adds more. I've not read a
spec sheet for the Pentium-class processors, but I suspect it (and MMX) would give
me a heart attack!
By contrast, the Reduced Instruction Set Computer (RISC) concept is to identify
the sub-components of those complex instructions and use those instead. As these
are much simpler, they can be implemented directly in silicon, so will run at
the maximum possible speed. Nothing is 'translated'.
There are only two Jump instructions in the ARM processor - Branch and Branch with
Link. The "if equal, if carry set, if zero" type of selection is handled by condition
options, so for example:
BLNV  Branch with Link NeVer (useful!)
BLEQ  Branch with Link if EQual
and so on. The BL part is the instruction, and the following part is the condition. This
is made more powerful by the fact that conditional execution can be applied to most
instructions! This has the benefit that you can test something, then only do the next
few commands if the criteria of the test matched. No branching off, you simply add
conditional flags to the instructions you require to be conditional:
SWI    "OS_DoSomethingOrOther"   ; call the SWI
MVNVS  R0, #0                    ; If failed, set R0 to -1
MOVVC  R0, #0                    ; Else set R0 to 0

Or, for the 80486:

       INT  $...whatever...      ; call the interrupt
       CMP  AX, 0                ; did it return zero?
       JE   failed               ; if so, it failed, jump to fail code
       MOV  DX, 0                ; else set DX to 0
return
       RET                       ; and return
failed
       MOV  DX, 0FFFFH           ; failed - set DX to -1
       JMP  return
The odd flow in that example is designed to allow the fastest non-branching
throughput in the 'did not fail' case. This is at the expense of two branches in the
'failed' case.
I am not, however, an x86 coder, so that can possibly be optimised - mail me if you
have any suggestions...
Most modern CISC processors, such as the Pentium, use a fast RISC core with an
interpreter sitting between the core and the instruction stream. So when you are
running Windows95 on a PC, it is not that much different to trying to get W95
running on the software PC emulator. Just imagine the power hidden inside the
Pentium...
Another benefit of RISC is that it offers a large number of registers, most of
which can be used as general purpose registers.
This is not to say that CISC processors cannot have a large number of registers;
some do. However, for its use, a typical RISC processor requires more registers
to give it additional flexibility. Gone are the days when you had two general
purpose registers and an 'accumulator'.
One thing RISC does offer, though, is register independence. As you have seen
above, the ARM register set defines at minimum R15 as the program counter and
R14 as the link register (although, after saving the contents of R14, you can
use this register as you wish). R0 to R13 can be used in any way you choose,
although the Operating System defines R13 as a stack pointer. You can, if you
don't require a stack, use R13 for your own purposes. APCS applies firmer rules
and assigns more functions to registers (such as Stack Limit). However, none of
these - with the exception of R15 and sometimes R14 - is a constraint applied by
the processor. You do not need to worry about saving your accumulator in long
instruction sequences, you simply make good use of the available registers.
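As a small illustration (my own sketch, not from the original text), even R14
becomes just another general purpose register once its value has been preserved:

STMFD  R13!, {R14}    ; tuck the return address away on the stack
MOV    R14, #42       ; R14 is now free for our own use
; ...do stuff with R0-R12 and R14...
LDMFD  R13!, {PC}     ; return by pulling the saved address into R15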
The 8086 offers you fourteen registers, but with caveats:
The first four (A, B, C, and D) are Data registers (a.k.a. scratch-pad
registers). They are 16bit, and are accessed as two 8bit registers each, thus
register A is really AH (A, high-order byte) and AL (A, low-order byte). These
can be used as general purpose registers, but they can also have dedicated
functions - Accumulator, Base, Count, and Data.
The next four registers are Segment registers for Code, Data, Extra, and Stack.
Then come the five Offset registers: Instruction Pointer (PC), SP and BP for the stack,
then SI and DI for indexing data.
Finally, the flags register holds the processor state.
As you can see, most of the registers are tied up with the bizarre memory addressing
scheme used by the 8086. So only four general purpose registers are available, and
even they are not as flexible as ARM registers.
The ARM processor differs again in that it has a reduced number of instruction
classes (Data Processing, Branching, Multiplying, Data Transfer, Software Interrupts).
A final example of minimal registers is the 6502 processor, which offers you:
Accumulator  -  for results of arithmetic instructions
X register   -  First general purpose register
Y register   -  Second general purpose register
PC           -  Program Counter
SP           -  Stack Pointer, offset into page one (at &01xx)
PSR          -  Processor Status Register - the flags
While it might seem like utter madness to only have two general purpose registers, the
6502 was a very popular processor in the '80s. Many famous computers have been
built around it.
For the Europeans: consider the Acorn BBC Micro, Master, Electron...
For the Americans: consider the Apple II and the Commodore PET.
The Oric uses a 6502, and the C64 uses a variant of the 6502.
(in case you were wondering, the Speccy uses the other popular processor - the ever
bizarre and freaky Z80)
So if entire systems could be created with a 6502, imagine the flexibility of the ARM
processor.
It has been said that the 6502 is the bridge between CISC design and RISC. Acorn
chose the 6502 for their original machines, such as the Atom and the System
series of rack units. They went from there to design their own processor - the
ARM.
To summarise the above, the advantages of a RISC processor are:

Quicker time-to-market. A smaller processor will have fewer instructions, and
the design will be less complicated, so it may be produced more rapidly.

Smaller 'die size' - the RISC processor requires fewer transistors than
comparable CISC processors...
This in turn leads to a smaller silicon size (I once asked Russell King of
ARMLinux fame where the StrongARM processor was - and I was looking
right at it, it is that small!)
...which, in turn again, leads to less heat dissipation. Most of the heat of my
ARM710 is actually generated by the 80486 in the slot beside it (and that's
when it is supposed to be in 'standby').

Related to all of the above, it is a much lower power chip. ARM design
processors in static form so that the processor clock can be stopped
completely, rather than simply slowed down. The Solo computer (designed for
use in third world countries) is a system that will run from a 12V battery,
charging from a solar panel.

Internally, a RISC processor has a number of hardwired instructions.
This was also true of the early CISC processors, but these days a typical CISC
processor has a heart which executes microcode instructions which correlate to
the instructions passed into the processor. Ironically, this 'heart' tends to be
RISC. :-)

As touched on by Matthias below, a RISC processor's simplicity does not
necessarily refer to a simple instruction set.
He quotes LDREQ R0,[R1,R2,LSR #16]!, though I would prefer to quote the
26 bit instruction LDMEQFD R13!, {R0,R2-R4,PC}^ which restores R0, R2,
R3, R4, and R15 from the fully descending stack pointed to by R13. The stack
is adjusted accordingly. The '^' pushes the processor flags into R15 as well as
the return address. And it is conditionally executed. This allows a tidy 'exit
from routine' to be performed in a single instruction.
Powerful, isn't it?
The RISC concept, however, does not state that all the instructions are simple.
If that were true, the ARM would not have a MUL, as you can do the exact
same thing by looping with ADD. No, the RISC concept means the silicon is
simple. It is a simple processor to implement.
I'll leave it as an exercise for the reader to figure out the power of Matthias'
example instruction. It is exactly on par with my example, if not slightly more
so!
For a completion of this summary, and some very good points regarding the ARM
processor, keep reading...
In response to the original version of this text, Matthias Seifert replied with a more
specific and detailed analysis. He has kindly allowed me to reproduce his message
here...
RISC vs ARM
You shouldn't call it "RISC vs CISC" but "ARM vs CISC". For example, conditional
execution of (almost) any instruction isn't a typical feature of RISC processors, but can
only(?) be found on ARMs. Furthermore there are quite some people claiming that an
ARM isn't really a RISC processor as it doesn't provide only a simple instruction set,
i.e. you'll hardly find any CISC processor which provides a single instruction as
powerful as a
LDREQ R0,[R1,R2,LSR #16]!
Today it is wrong to claim that CISC processors execute the complex instructions
more slowly; modern processors can execute most complex instructions in one
cycle. They may need very long pipelines to do so (up to 25 stages or so with a
Pentium III), but nonetheless they can. And complex instructions provide a big
potential of optimisation, i.e. if you have an instruction which took 10 cycles with the
old model and get the new model to execute it in 5 cycles you end up with a speed
increase of 100% (without a higher clock frequency). On the other hand ARM
processors executed most instruction in a single cycle right from the start and thus
don't have this optimisation potential (except the MUL instruction).
The argument that RISC processors provide more registers than CISC processors isn't
right. Just take a look at the (good old) 68000, it has about the same number of
registers as the ARM has. And that 80x86 compatible processors don't provide more
registers is just a matter of compatibility (I guess). But this argument isn't completely
wrong: RISC processors are much simpler than CISC processors and thus take up
much less space, thus leaving space for additional functionality like more registers.
On the other hand, a RISC processor with only three or so registers would be a pain to
program, i.e. RISC processors simply need more registers than CISC processors for
the same job.
And the argument that RISC processors have pipelining whereas CISCs don't is
plainly wrong, e.g. the ARM2 hadn't, whereas the Pentium has...
The advantages of RISC against CISC are those today:

RISC processors are much simpler to build, which again results in the
following advantages:
o easier to build, i.e. you can use already existing production facilities
o much less expensive; just compare the price of an XScale with that of a
  Pentium III at 1 GHz...
o less power consumption, which again gives two advantages:
   - much longer use of battery driven devices
   - no need for cooling of the device, which again gives two advantages:
      - smaller design of the whole device
      - no noise

RISC processors are much simpler to program, which doesn't only help the
assembler programmer, but the compiler designer too. You'll hardly find any
compiler which uses all the functions of a Pentium III optimally...
And then there are the benefits of the ARM processors:

Conditional execution of most instructions, which is a very powerful thing,
especially with large pipelines, as you have to refill the whole pipeline every
time a branch is taken; that's why CISC processors make a huge effort at
branch prediction

The shifting of registers while other instructions are executed, which means
that shifts take up no time at all (the 68000 took one cycle per bit to shift)

The conditional setting of flags, i.e. ADD and ADDS, which becomes
extremely powerful together with the conditional execution of instructions
(a short sketch follows)
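A minimal sketch of my own (not from Matthias' message) - adding the 64 bit
value held in R2 (low word) and R3 (high word) to the one in R0 and R1:

ADDS  R0, R0, R2      ; add the low words, updating the flags (note the S)
ADC   R1, R1, R3      ; add the high words plus any carry from below
ADD   R4, R4, #1      ; a plain ADD, by contrast, leaves the flags alone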







The free use of offsets when accessing memory, i.e.
LDR R0,[R1,#16]
LDR R0,[R1,#16]!
LDR R0,[R1],#16
LDR R0,[R1,R2]
LDR R0,[R1,R2]!
LDR R0,[R1],R2
...
The 68000 could only increase the address register by the size of the data read
(i.e. by 1, 2 or 4). Just imagine how much better an ARM processor can be
programmed to draw (not only) a vertical line on the screen - see the sketch
below.
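A rough sketch of my own, assuming one byte per pixel, with R0 holding the
colour, R1 the screen address of the top of the line, R2 the height in pixels,
and R3 the length of one screen row in bytes:

.vline
STRB  R0, [R1], R3    ; plot one pixel, then step down one whole row
SUBS  R2, R2, #1      ; one fewer pixel to go (setting the flags)
BNE   vline           ; loop until the count reaches zero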







The (almost) free use of all registers with all instructions (which may well be
an advantage of any RISC processor). It simply is great to be able to use
ADD PC,PC,R0,LSL #2
MOV R0,R0
B R0is0
B R0is1
B R0is2
B R0is3
...
or even
ADD PC,PC,R0,LSL #3
MOV R0,R0
MOV R1,#1
B Continue
MOV R2,#2
B Continue
MOV R2,#4
B Continue
MOV R2,#8
B Continue
...
I used this technique when programming my C64 emulator even more
excessively to emulate the 6510. There the shift is 8 which gives 256 bytes for
each instruction to emulate. Within those 256 bytes there is not only the code
for the emulation of the instruction but also the code to react to interrupts, the
fetching of the next instruction and the jump to the emulation code of that
instruction, i.e. the code to emulate the CLC (clear C flag) looks like this:
ADD   R10,R10,#1        ; increment PC of 6510 to point to next instruction
BIC   R6,R6,#1          ; clear C flag of 6510 status register
LDR   R0,[R12,#64]      ; read 6510 interrupt state
CMP   R0,#0             ; interrupt occurred?
BNE   &00018040         ; yes -> jump to interrupt handler
LDRB  R1,[R4,#1]!       ; read next instruction
ADD   PC,R5,R1,LSL #8   ; jump to emulation code
MOV   R0,R0             ; lots of these to fill up the 256 bytes
This means that there is only one single jump for each instruction emulated.
By this (and a bit more) the emulator is able to reach 76% of the speed of the
original C64 with an A3000, 116% with an A4000, 300% with an A5000 and
3441% with my RiscPC (SA at 287 MHz). The code may look hard to handle,
but the source of it looks much better:
;-----------;
; $18 - CLC ;
;-----------;
ADD R10,R10,#1          ; increment PC of 6510
BIC R6,R6,#%00000001    ; clear C flag of 6510 status register
FNNextCommand           ; do next command
FNFillFree              ; fill remaining space
My reply to his reply (!)
The RISC/CISC debate continues. Looking in a few books, it would seem to come
down to whether or not microcode is used - thus RISC or CISC is determined more by
the actual physical design of the processor than by what instructions or how many
registers it offers. This would support the view that some maintain that the 6502 was
an early RISC processor. But I'm not going there...
My other comment... 3441%. Wow.
Processor Types
ARM 1 (v1)
This was the very first ARM processor. Actually, when it was first manufactured in
April 1985, it was the very first commercial RISC processor. Ever.
As a testament to the design team, it was "working silicon" in its first
incarnation, it exceeded its design goals, and it used less than 25,000
transistors.
The ARM 1 was used in a few evaluation systems on the BBC micro (Brazil - BBC
interfaced ARM), and a PC machine (Springboard - PC interfaced ARM).
It is believed a large proportion of Arthur was developed on the Brazil hardware.
In essence, it is very similar to an ARM 2 - the differences being that R8 and
R9 are not banked in IRQ mode, there's no multiply instruction, no LDR/STR with
register-specified shifts, and no co-processor gubbins.
ARM evaluation system for BBC Master
(original picture source not known - downloaded from a website full of BBC-related images
this version created by Rick Murray to include zoomed-up ARM down the bottom...)
ARM 2 (v2)
Experience with the ARM 1 suggested improvements that could be made. Such
additions as the MUL and MLA instructions allowed for real-time digital signal
processing. Back then, it was to aid in generating sounds. Who could have predicted
exactly how suitable to DSP the ARM would be, some fifteen years later?
In 1985, Acorn hit hard times which led to it being taken over by Olivetti. It took two
years from the arrival of the ARM to the launch of a computer based upon it...
...those were the days my friend, we thought they'd never end.
When the first ARM-based machines rolled out, Acorn could gladly announce to the
world that they offered the fastest RISC processor around. Indeed, the ARM processor
kicked ass across the computing league tables, and for a long time was right up there
in the 'fastest processors' listings. But Acorn faced numerous challenges. The
computer market was in disarray, with some people backing IBM's PC, some the
Amiga, and all sorts of little itty-bitty things. Then Acorn went and launched a
machine offering Arthur (which was about as nice as the first release of
Windows), which had no user base, precious little software, and not much third
party support. But they succeeded.
The ARM 2 processor was the first to be used within the RISC OS platform, in the
A305, A310, and A4x0 range, and it was used on all of the early machines,
including the A3000. It is clocked at 8MHz, which translates to approximately
four and a half million instructions per second (0.56 MIPS/MHz).
No current image - can you help?
ARM 3 (v2as)
Launched in 1989, this processor built on the ARM 2 by offering 4K of cache
memory and the SWP instruction. The desktop computers based upon it were
launched in 1990.
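SWP exchanges a register with a word in memory in a single, uninterruptible
operation, which is exactly what a semaphore needs. A minimal sketch of my own
(the lock address in R1 and the 'busy' label are assumptions for illustration):

MOV   R0, #1          ; the value meaning 'claimed'
SWP   R0, R0, [R1]    ; swap it with the lock word in one bus operation
CMP   R0, #0          ; what was in the lock beforehand?
BNE   busy            ; non-zero - somebody else already holds it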
Internally, via the dedicated co-processor interface, CP15 was 'created' to provide
processor control and identification.
Several speeds of ARM 3 were produced. The A540 runs a 26MHz version, and the
A4 laptop runs a 24MHz version. By far the most common is the 25MHz version used
in the A5000, though those with the 'alpha variant' have a 33MHz version.
At 25MHz, with 12MHz memory (a la A5000), you can expect around 14 MIPS (0.56
MIPS/MHz).
It is interesting to note that the ARM3 doesn't 'perform' faster - both the ARM2
and the ARM3 average 0.56 MIPS/MHz. The speed boost comes from the higher clock
speed, and the cache.
Oh, and just to correct a common misunderstanding, the A4 is not a squashed down
version of the A5000. The A4 actually came first, and some of the design choices
were reflected in the later A5000 design.
ARM3 with FPU
(original picture downloaded from Arcade BBS, archive had no attribution)
ARM 250 (v2as)
The 'Electron' of ARM processors, this is basically a second level revision of the
ARM 3 design which removes the cache, and combines the primary chipset (VIDC,
IOC, and MEMC) into the one piece of silicon, making the creation of a
cheap'n'cheerful RISC OS computer a simple thing indeed. This was clocked at
12MHz (the same as the main memory), and offers approximately 7 MIPS (0.58
MIPS/MHz).
This processor isn't as terrible as it might seem. That the A30x0 range was built with
the ARM250 was probably more a cost-cutting exercise than intention. The ARM250
was designed for low power consumption and low cost, both important factors in
devices such as portables, PDAs, and organisers - several of which were developed
and, sadly, none of which actually made it to a release.
No current image - can you help?
ARM 250 mezzanine
This is not actually a processor. It is included here for historical interest. It seems the
machines that would use the ARM250 were ready before the processor, so early
releases of the machine contained a 'mezzanine' board which held the ARM 2, IOC,
MEMC, and VIDC.
ARM 4 and ARM 5
These processors do not exist.
More and more people began to be interested in the RISC concept, as at the same sort
of time common Intel (and clone) processors showed a definite trend towards higher
power consumption and greater need for heat dissipation, neither of which are friendly
to devices that are supposed to be running off batteries.
The ARM design was seen by several important players as being the epitome of sleek,
powerful RISC design.
It was at this time a deal was struck between Acorn, VLSI (long-time
manufacturers of the ARM chipset), and Apple. This led to the death of the Acorn
RISC Machine, as Advanced RISC Machines Ltd was born. This new company was
committed to design and support specifically for the processor, without the hassle and
baggage of RISC OS (the main operating system for the processor and the desktop
machines). Both of those would be left to Acorn.
In the change from being a part of Acorn to being ARM Ltd in its own right, the
whole numbering scheme for the processors was altered.
ARM 610 (v3)
This processor brought with it two important 'firsts'. The first 'first' was full 32 bit
addressing, and the second 'first' was the opening for a new generation of ARM based
hardware.
Acorn responded by making the RiscPC. In the past, critics were none-too-keen on
the idea of slot-in cards for things like processors and memory (as used in the
A540), and by this time many people were getting extremely annoyed with the
inherent memory limitations of the older hardware; the MEMC can only address 4Mb
of memory, and you can add more by daisy-chaining MEMCs - an idea that not only
sounds hairy, it is hairy!
The RiscPC brought back the slot-in processor with a vengeance. Future 'better'
processors were promised, and a second slot was provided for alien processors such as
the 80486 to be plugged in. As for memory, two SIMM slots were provided, and the
memory was expandable to 256Mb. This does not sound like much now that modern
PCs come with half that as standard. However, you can get a lot of mileage from
a RiscPC fitted with a puny 16Mb of RAM.
But, always, we come back to 32 bit operation. It has been with us and known
about ever since the first RiscPC rolled out, but few people noticed, or cared.
Now, as the new generation of ARM processors drop the 26 bit 'emulation' modes,
we RISC OS users are faced with the option of getting ourselves sorted, or
dying.
Ironically, the other mainstream operating systems for the RiscPC hardware -
namely ARMLinux and NetBSD/arm32 - are already fully 32 bit.
Several speeds were produced: 20MHz, 30MHz, and the 33MHz part used in the
RiscPC.
The ARM610 processor features an on-board MMU to handle memory, a 4K cache, and
it can even switch itself from little-endian operation to big-endian operation.
The 33MHz version offers around 28 MIPS (0.84 MIPS/MHz).
The RiscPC ARM610 processor card
(original picture by Rick Murray, © 2002)
ARM 710 (v3)
As an enhancement of the ARM610, the ARM 710 offers an increased cache size (8K
rather than 4K), clock frequency increased to 40MHz, improved write buffer and
larger TLB in the MMU.
Additionally, it supports CMOS/TTL inputs, Fastbus, and 3.3V power but these
features are not used in the RiscPC.
Clocked at 40MHz, it offers about 36 MIPS (0.9 MIPS/MHz); combined with the
additional clock speed, this makes it run an appreciable amount faster than the
ARM 610.
ARM710 side by side with an 80486, the coin is a British 10 pence coin.
(original picture by Rick Murray, © 2001)
ARM 7500
The ARM7500 is a RISC based single-chip computer with memory and I/O control
on-chip to minimise external components. The ARM7500 can drive LCD
panels/VDUs if required, and it features power management. The video controller can
output up to a 120MHz pixel rate and 32bit sound, and there are four A/D
convertors on-chip for connection of joysticks etc.
The processor core is basically an ARM710 with a smaller (4K) cache.
The video core is a VIDC2.
The IO core is based upon the IOMD.
The memory/clock system is very flexible, designed for maximum uses with
minimum fuss. Setting up a system based upon the ARM7500 should be fairly simple.
ARM 7500FE
A version of the ARM 7500 with hardware floating point support.
ARM7500FE, as used in the Bush Internet box.
(original picture by Rick Murray, © 2002)
StrongARM / SA110 (v4)
The StrongARM took the RiscPC from around 40MHz to 200-300MHz and showed a
speed boost that was more than the hardware should have been able to support.
Still severely bottlenecked by the memory and I/O, the StrongARM made the RiscPC
fly.
The processor was the first ARM to feature separate instruction and data caches,
and this caused quite a lot of self-modifying code to fail including, amusingly,
Acorn's own runtime compression system. But on the whole, the incompatibilities
were not more painful than an OS upgrade (anybody remember the RISC OS 2 to RISC
OS 3 upgrade, and all the programs that used SYS "OS_UpdateMEMC", 64, 64 for a
speed boost freezing the machine solid?).
In instruction terms, the StrongARM can offer half-word loads and stores, and
signed half-word and byte loads and stores. Also provided are instructions for
multiplying two 32 bit values (signed or unsigned) and returning a 64 bit
result. This is documented in the ARM assembler user guide as only working in
32-bit mode; however, experimentation will show you that they work in 26-bit
mode as well. Later documentation confirms this.
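For example (my own illustration, not taken from the user guide):

UMULL R0, R1, R2, R3  ; unsigned: R1 (high word) : R0 (low word) = R2 * R3
SMULL R0, R1, R2, R3  ; the same, treating R2 and R3 as signed values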
The cache has been split into separate instruction and data cache (Harvard
architecture), with both of these caches being 16K, and the pipeline is now five stages
instead of three.
In terms of performance... at 100MHz, it offers 114MIPS which doubles to 228MIPS
at 200MHz (1.14 MIPS/MHz).
A StrongARM mounted on a LART board.
In order to squeeze the maximum from a RiscPC, the Kinetic includes fast RAM on
the processor card itself, as well as a version of RISC OS that installs itself on the
card. Apparently it flies due to removing the memory bottleneck, though this does
cause 'issues' with DMA expansion cards.
A Kinetic processor card.
SA1100 variant
This is a version of the SA110 designed primarily for portable applications. I mention
it here as I am reliably informed that the SA1100 is the processor inside the 'faster'
Panasonic satellite digibox. It contains the StrongARM core, MMU, cache, PCMCIA,
general I/O controller (including two serial ports), and a colour/greyscale LCD
controller. It runs at 133MHz or 200MHz and it consumes less than half a watt of
power.
Thumb
The Thumb instruction set is a reworking of the ARM set, with a few things omitted.
Thumb instructions are 16 bits wide (instead of the usual 32). This allows for greater
code density in places where memory is restricted. The Thumb set can only address
the first eight registers, and there are no conditional execution instructions. Also, the
Thumb cannot do a number of things required for low-level processor exceptions, so
the Thumb instruction set will always come alongside the full ARM instruction set.
Exceptions and the like can be handled in ARM code, with Thumb used for the more
regular code.
Other versions
These versions are afforded less coverage due, mainly, to my not owning nor having
access to any of these versions.
While my site started as a way to learn to program the ARM under RISC OS, the
future is in embedded devices using these new systems, rather than the old 26 bit
mode required by RISC OS...
...and so, these processors are something I would like to detail, in time.
M variants
This is an extension of the version three design (ARM 6 and ARM 7) that provides
the extended 64 bit multiply instructions.
These instructions became a main part of the instruction set in the ARM version 4
(StrongARM, etc).
T variants
These processors include the Thumb instruction set (and, hence, no 26 bit mode).
E variants
These processors include a number of additional instructions which provide
improved performance in typical DSP applications; the 'E' stands for
"Enhanced DSP".
The future
The future is here. Newer ARM processors exist, but they are 32 bit devices.
This means, basically, that RISC OS won't run on them until all of RISC OS is
modified to be 32 bit safe. As long as BASIC is patched, a reasonable software base
will exist. However all C programs will need to be recompiled. All relocatable
modules will need to be altered. And pretty much all assembler code will need to be
repaired. In cases where source isn't available (ie, anything written by Computer
Concepts), it will be a tedious slog.
It is truly one of the situations that could make or break the platform.
I feel that, as long as a basic C compiler/linker is made FREELY available, we
should go for it. It need not be a 'good' compiler, as long as it is a drop-in
replacement
for Norcroft CC version 4 or 5. Why this? Because RISC OS depends upon
enthusiasts to create software, instead of big corporations. And without inexpensive
reasonable tools, they might decide it is too much to bother with converting their
software, so may decide to leave RISC OS and code for another platform.
I, personally, would happily download a freebie compiler/linker and convert much of
my own code. It isn't plain sailing for us - think of all of the library code that needs to
be checked. It will be difficult enough to obtain a 32 bit machine to check the code
works correctly, never mind all the other pitfalls. Asking us for a grand to support the
platform is only going to turn us away in droves. Heck, I'm still using ARM 2 and
ARM 3 systems. Some of us smaller coders won't be able to afford such a radical
upgrade. And that will be VERY BAD for the platform. Look how many people use
the FREE user-created Internet suite in preference to commercial alternatives. Look at
all of the support code available on Arcade BBS. Much of that will probably go, yes.
But would a platform trying to re-establish itself really want to say goodbye to the
rest?
I don't claim my code is wonderful, but if only one person besides myself makes good
use of it - then it has been worth it.
Click here to learn more on 32 bit operation
The Stack
The 6502 microprocessor features support for a stack, located at &1xx in memory
and extending for 256 bytes. It also featured instructions which operated more
quickly on page zero (&0xx).
Both of these are inflexible, and not in keeping with the RISC concept.
The ARM processor provides instructions for manipulating the stack (LDM and
STM). The actual location where your stack lays its hat is entirely up to you
and the rules of good programming.
For example:
MOV    R13, #&8000
STMFD  R13!, {R0-R12, R14}
would work, but is likely to scribble your registers over something important. So
typically you would set R13 to the end of your workspace, and stack backwards from
there.
These are conventions used in RISC OS. You can replace R13 with any register
except R14 (if you need it) and R15. As R14 and R15 have a defined purpose, the
next register down is R13, so that is used as the stack pointer.
Likewise, in RISC OS, the stacks are fully descending (FD - pushed with STMFD,
also written STMDB, and popped with LDMFD, also written LDMIA), which means the
stack grows downwards in memory and the updated stack pointer points to the last
word pushed (an 'empty' stack, by contrast, would point to the next free
location).
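To make that concrete, here is a little sketch of my own, with R13 starting at
an illustrative &8000:

STMFD  R13!, {R0, R1} ; R0 is stored at &7FF8, R1 at &7FFC
                      ; R13 is now &7FF8 - the last word used
LDMFD  R13!, {R0, R1} ; reloads both, leaving R13 back at &8000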
You can, quite easily, shirk convention and stack using whatever register you like
(R0-R13 and R14 if you don't need it) and also you can set up any kind of stack you
like, growing up, growing down, pointer to next free or last used... But be aware that
when RISC OS provides you with stack information (if you are writing a module,
APCS assembler, BASIC assembler, or a transient utility, for example) it will
pass the address in R13 and expect you to be using a fully descending stack. So while
you can use whatever type of stack/location that suits you, it is suggested you follow
the OS style. It makes life easier.
If you are not sure what a stack is, exactly, then consider it a temporary dumping area.
When you start your program, you will want to put R14 somewhere so you know
where to branch to in order to exit. Likewise, every time you BL, you will want to put
R14 someplace if you plan to call another BL.
To make this clearer:
; ...entry, R14 points to exit location
BL   one
BL   two
MOV  PC, R14        ; exit

.one
; R14 points to instruction after 'BL one'
...do stuff...
MOV  PC, R14        ; return

.two
; R14 points to instruction after 'BL two'
...do stuff...
BL   three
MOV  PC, R14        ; return

.three
; R14 points to instruction after 'BL three'
B    four           ; no return

.four
; Not a BL, so R14 unchanged
MOV  PC, R14        ; returns from .three because R14 not changed.
Take a moment to work through that code. It is fairly simple. And it is fairly
obvious that something needs to be done with R14, otherwise you won't be able to
exit. Now, a viable answer is to shift R14 into some other register. So now
consider that the "...do stuff..." parts use ALL of the remaining registers.
Now what? Well, what we need is a controlled way to dump R14 into memory until
we come to need it.
That's what a stack is.
That code again:
; ...entry, R14 points to exit location, we assume R13 is set up
STMFD  R13!, {R14}
BL     one
BL     two
LDMFD  R13!, {PC}   ; exit

.one
; R14 points to instruction after 'BL one'
STMFD  R13!, {R14}
...do stuff...
LDMFD  R13!, {PC}   ; return

.two
; R14 points to instruction after 'BL two'
STMFD  R13!, {R14}
...do stuff...
BL     three
LDMFD  R13!, {PC}   ; return

.three
; R14 points to instruction after 'BL three'
B      four         ; no return

.four
; Not a BL, so R14 unchanged
LDMFD  R13!, {PC}   ; returns from .three because R14 not changed.
A quick note, you can write:
STMFD  R13!, {R14}
...do stuff...
LDMFD  R13!, {R14}
MOV    PC, R14
but the STM/LDM does NOT keep track of which stored values belong in which
registers, so you can store R14, and reload it directly into PC thus disposing of the
need to do a MOV afterwards.
The caveat is that the registers are saved in ascending order...
STMFD R13!, {R7, R0, R2, R1, R9, R3, R14}
will save R0, R1, R2, R3, R7, R9, and R14 (in that order). So code like:
STMFD R13!, {R0, R1}
LDMFD R13!, {R1, R0}
to swap two registers will not work.
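If you did want to swap two registers via the stack, you could pull the words
back one at a time into the opposite registers - a sketch of my own:

STMFD  R13!, {R0, R1} ; R0 goes to the lower address, R1 above it
LDR    R1, [R13], #4  ; pull the word that was R0 into R1
LDR    R0, [R13], #4  ; pull the word that was R1 into R0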
Please refer to this document for details on STM/LDM and how to use a stack.
Memory Management
Introduction
The RISC OS machines work with two different types of memory - logical and
physical.
The logical memory is the memory as seen by the OS, and the programmer. Your
application begins at &8000 and continues until &xxxxx.
The physical memory is the actual memory in the machine.
Under RISC OS, memory is broken into pages. Older machines have a page of
8/16/32K (depending on installed memory), and newer machines have a fixed 4K
page. If you were to examine the pages in your application workspace, you would
most likely see that the pages were seemingly random, not in order. The pages relate
to physical memory, combined to provide you with xxxx bytes of logical memory.
The memory controller is constantly shuffling memory around so that each task that
comes into operation 'believes' it is loaded at &8000. Write a little
application to count how many wimp polls occur every second, and you'll begin to
appreciate how much is going on in the background.
MEMC : Older systems
In ARM 2, 250, and 3 machines; the memory is controlled by the MEMC (MEMory
Controller). This unit can cope with an address space of 64Mb, but in reality can only
access 4Mb of physical memory. The 64Mb space is split into three sections:
0Mb - 32Mb   :  Logical RAM
32Mb - 48Mb  :  Physical RAM
48Mb - 64Mb  :  System ROMs and I/O
Parts of the system ROMs and I/O are mapped over each other, so reading from it
gives you code from ROM, and writing to it updates things like the VIDC
(video/sound).
It is possible to fit up to 16Mb of memory to an older machine, but you will need a
matched MEMC for each 4Mb. People have reported that simply fitting two MEMCs
(to give 8Mb) is either hairy or unreliable, or both. In practice, the hardware to do this
properly only really existed for the A540 machine, where each 4Mb was a slot-in
memory card with an on-board MEMC. Other solutions for, say, the A5000 and the
A410, are elaborate bodges. Look at http://www.castle.org.uk/castle/upg25.htm for an
example of what is required to fit 8Mb into an A5000!
The MEMC is capable of restricting access to pages of memory in certain ways, either
complete access, no access, no access in USR mode, or read-only access. Older
versions of RISC OS only implemented this loosely, so you need to be in SVC mode
to access hardware directly but you could quite easily trample over memory used by
other applications.
MMU : Newer systems
The newer systems, with ARM6 or later processor, have an MMU built into the
processor. This consists of the translation look-aside buffer (TLB), access control
logic, and translation table walk logic. The MMU supports memory accesses based
upon 1Mb sections or 4K pages. The MMU also provides support for up to 16
'domains', areas of memory with specific access rights.
The TLB caches 64 translated entries. If the entry is for a virtual address, the
access control logic determines whether access is permitted. If it is, the MMU
outputs the appropriate physical address, otherwise it signals the processor to
abort.
If the TLB misses (it doesn't contain an entry for the virtual address), the walk logic
will retrieve the translation information from the (full) translation table in physical
memory.
If the MMU should be disabled, the virtual address is output directly as the physical
address.
It gets a lot more complicated, suffice to say that more access rights are possible and
you can specify memory to be bufferable and/or cacheable (or not), and the page size
is fixed to 4K. A normal RiscPC offers two banks of RAM, and is capable of
addressing up to 256Mb of RAM in fairly standard PC-style SIMMs, plus up to 2Mb
of VRAM double-ported with the VIDC, plus hardware/ROM addressing.
On the RiscPC, the maximum address space of an application is 28Mb. This is not
a restriction of the MMU, but a restriction of the 26-bit processor mode used by
RISC OS. A 32-bit processor mode could, in theory, allocate the entire 256Mb to
a single task.
All current versions of RISC OS are 26-bit.
System limitations
Consider a RiscPC with an ARM610 processor.
The cache is 4K.
The bus speed is 16MHz (note, only slightly faster than the A5000!), and the
hardware does not support burst-mode for memory accesses.
Upon a context switch (ie, making an application 'active') you need to remap its
memory to begin at &8000 and flush the cache.
I'll leave you to do the maths. :-)
Memory schemes and multitasking
Introduction
This is a reference, designed to help you understand the various types of memory
handling and multitasking that exist.
Memory is a resource that needs careful management. It is expensive (£/Mb is much
higher for memory than for conventional harddisc storage). A good system will offer
flexible facilities trading off speed for functionality.
You need memory because it is fast. It is rarely as fast as the processor these
days, but it is much faster than harddiscs. And we need big memories, so we can
hold the large programs and large amounts of data that seem to be around. It
boggles the mind that a commercial mainframe did accounts and stuff with a mere
4K of memory.
Typically, there will be three or four, possibly five, kinds of storage in the computer.
1. Level 1 cache
This is inside the processor, usually operating at the core speed of the
processor. It is between 4K and 32K usually.
2. Level 2 cache
If the difference between the processor speed and system memory is quite
large, you will often have a level 2 cache. This is mounted on the
motherboard, and typically runs at a speed roughly halfway between the
processor speed and the speed of the system memory.
It is usually between 64K and 512K. RISC OS machines do not have Level 2
cache.
3. Level 3 cache
If your processor is running at some silly speed (such as 1GHz) and your
system memory is running at a tenth of that, you might like a chunk (say a Mb
or two) of cache between level 2 and system memory, so that you can further
improve speed.
Each layer of cache is getting slower, until we reach...
4. System memory
Your DRAM, SRAM, SIMMs, DIMMs, or whatever you have fitted. Speeds
range from 2MHz in the old home computers, to around 133MHz in a typical
PC compatible. Older PCs use 33MHz or 66MHz buses.
The ARM2/250 machines have an 8MHz bus, the ARM3 machines (A5000,...)
have a 12MHz bus, the RiscPC has a 16MHz bus. In these cases, only the
ARM2 is clocked at the same speed as the bus. The ARM3 is clocked at 25 or
30MHz, the ARM610 at 33MHz, the ARM710 at 40MHz and the StrongARM
at a variety of speeds up to 280-ish MHz.
5. Harddisc
Slow, huge, cheap.
Basic monoprogramming
This is where all of the memory is just available, and you run one application at a
time. The kernel/OS/BIOS (whatever) sits in one place, either in RAM or ROM and it
is mapped into the address map.
Consider:
.----------------.        .----------------.
|   OS in ROM    |        | Device drivers |
|                |        |     in ROM     |
|----------------|        |----------------|
|                |        |                |
|      Your      |        |      Your      |
|  application   |        |  application   |
|                |        |----------------|
|----------------|        |                |
|System workspace|        |   OS in RAM    |
'----------------'        '----------------'
The first example is similar to the layout of the BBC microcomputer. The second is
not that different to a basic MS-DOS system, the OS is loaded low in memory, the
BIOS is mapped in at the top, and the application sits in the middle.
To be honest, the first example is used a lot under RISC OS as well. It is
exactly what a standard application is supposed to believe. The OS uses page
zero (&0000 - &7FFF) for internal housekeeping, it (your app) begins at &8000,
and the hardware/OS sit way up in the ether at &3800000.
Memory management under RISC OS is more complex, but this is how a typical
application will see things.
When the memory is organised in this way, only one application can be running.
When the user enters a command, if it is an application then that application is copied
from disc into memory, then it is executed. When the application is done with, the
operating system reappears, waiting for you to give it something else to do.
Basic multiprogramming
Here, we are running several applications. While they are not running concurrently (to
do so would be impossible; a processor can only do one thing at a time), the amount
of time given to each application is tiny, so the system spends a lot of time faffing
around hopping from one application to the next, giving you the illusion that n
applications are all happily running together on your computer.
Memory is typically handled as non-contiguous blocks. On an ARM machine, pages
are brought together to fake a chunk of memory beginning at &8000. Anybody who
has tried an address translation in their allocated memory will know two things:
firstly, it is near impossible to get an actual physical memory address out of the OS;
secondly, the pages really aren't contiguous. The following program demonstrates this:
END = &10000 : REM Constrain our wimpslot to 32K
DIM willow% 16
SYS "Wimp_SlotSize", -1, -1 TO slot%
SYS "OS_ReadMemMapInfo" TO page%
PRINT "Using " + STR$(slot% DIV page%) + " pages, each page being " + STR$(page%) + " bytes."
PRINT "Pages used: ";
more% = slot% DIV page%
FOR loop% = 0 TO (more% - 1)
  willow%!0 = 0                       : REM page number (filled in by the OS)
  willow%!4 = &8000 + (loop% * page%) : REM logical address to look up
  willow%!8 = 0                       : REM page protection level
  willow%!12 = -1                     : REM end of list
  SYS "OS_FindMemMapEntries", willow%
  IF loop% > 0 THEN PRINT ", ";
  PRINT STR$(willow%!0);
NEXT
PRINT
END
This outputs something similar to:
Using 8 pages, each page being 4096 bytes.
Pages used: 2555, 2340, 2683, 2682, 2681, 2680, 2679, 2678
RISC OS handles this by keeping every task loaded in memory. Applications are
then 'paged in' by remapping the memory pointers in the page tables; consequently,
other tasks are mapped out.
Windows/Unix systems load applications into memory, supported by a system called
'virtual memory' which dumps unused pages to disc in order to free system memory
for applications that need it. I am not sure how Windows organises its memory, if it
does it in a style similar to RISC OS (ie, remap to start from a specific address) or if
each application is just told 'you are here'.
Virtual memory is useful, as you can fit a 32Mb program into 16Mb of memory if you
are careful how you load it, and swap out old parts for new parts as necessary.
Some systems use a lazy-paging form of memory. In this case, only the first page of
memory is filled by the application when execution starts. As more of the application
is executed, the operating system fills in the parts as required.
By contrast, under RISC OS an application needs to load completely before it can run.
Consider loading, well, practically anything off of floppy disc. It takes time.
Virtual memory
When you no longer have enough actual physical memory, you may have virtual memory:
a set of memory locations that don't exist, but which the operating system tries real
hard to convince you do. And in the centre of the ring, keeping control, is the MMU
(Memory Management Unit - inspired name, no?).
[note: you need an MMU anyway when your memory is broken into remappable
pages, this just seemed like a good time to introduce it!]
When the processor is instructed to jump to &8000 to begin executing an application,
it passes the address &8000 to the MMU. This translates the address into the correct
real address and outputs this on the address lines, say &12FC00. The processor is not
aware of this, the application is not aware of this, the computer user is not aware of
this.
So we can take this one stage further, by mapping onwards into memory that does not
exist at all. In this case, the MMU will hiccup and say "Oi! You! No!" and the
operating system will be called in a panic (correctly known as a "page fault"). The
operating system will be calm and collected and think, "Ah, virtual memory". A
little-used page of real memory will be shoved out to disc, then the page that the
MMU was trying to find will be loaded in its place. The memory map will be updated
accordingly, then control will be handed back to the user application at the exact
point the page fault occurred. The application, unknowing of all this palaver, will
perform that instruction again, only this time the MMU will (happily?) output the
correct address to the memory system, and all will continue.
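As a toy illustration of that dance, here is a BASIC sketch (the frame count, the reference string and the round-robin eviction are all invented for the example - real systems use the smarter replacement policies described later):

REM Toy demand paging: 4 page frames, a 'program' touching 8 virtual pages
frames% = 4
DIM frame%(frames% - 1)
FOR f% = 0 TO frames% - 1 : frame%(f%) = -1 : NEXT : REM -1 = frame empty
victim% = 0 : faults% = 0
DATA 0, 1, 2, 3, 0, 4, 5, 0, 6, 7
FOR ref% = 1 TO 10
  READ want%
  hit% = FALSE
  FOR f% = 0 TO frames% - 1
    IF frame%(f%) = want% THEN hit% = TRUE
  NEXT
  IF NOT hit% THEN
    REM page fault: shove a victim out, 'load' the wanted page in its place
    faults% += 1
    frame%(victim%) = want%
    victim% = (victim% + 1) MOD frames%
  ENDIF
NEXT
PRINT "10 accesses, "; faults%; " page faults"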
Page tables and the MMU
The page table exists to map each page onto an address. This allows the operating
system to keep track of which memory is pretending to be which. However, it is more
complex than that. Some pages cannot be remapped, some pages are doubly mapped, some
are not to be touched in user mode code, some aren't to be touched at all. Some are
read only. Some just don't exist. All of this must be kept track of.
So the MMU takes an address, looks it up in the page table, and spits out the correct
address.
Let's do some maths. We'll assume a 4K page size (a la RISC OS on a RiscPC). A
32bit address space holds a million 4K pages (2^32 / 4096 = 1048576). With a million
pages, you'll need a million entries. In the ARM MMU, each entry takes 7 words, so
that is 1048576 x 28 bytes - 28Mb - just to index our memory.
It gets better. Every single memory reference will be passed through the MMU. So
we'll want it to operate in nanoseconds. Faster, if possible.
In reality, it is somewhat easier, as most typical machines don't have enough memory
to fill the entire address space; indeed many can't get close for technical reasons
(the RiscPC can have 258Mb maximum RAM, or 514Mb with Kinetic - the
extra 2Mb is the VRAM). Even so, the page tables will get large.
So there are three options:

  * Have a huge array of fast registers in the MMU. Costly. Very.
  * Hold the page tables in main memory. Slow. Very.
  * Compromise. Cache the entries for the active pages in the MMU, and keep the rest in main memory.
An example. A RiscPC: 64Mb of RAM, 2Mb of VRAM, 4Mb of ROM and hardware
I/O (double mapped). That's 73400320 bytes, or 17920 pages. It would take 71680
bytes just to store one word per page, but an address on its own isn't much use. Seven
words make up an entry in the ARM's MMU, so our 17920 pages would require
17920 x 28 = 501760 bytes in order to fully index the memory.
You just can't store that lot in the MMU. So you store a snippet - say 16K worth -
and keep the rest in RAM.
The TLB
The Translation Lookaside Buffer is a way to make paging even more responsive.
Typically, a program will make heavy use of a few pages and barely touch the rest.
Even if you plan to byte-read the entire memory map, you will make four thousand
odd accesses within one page before moving on to the next.
A solution to this is to fit a little bit of hardware in the MMU that can map virtual
addresses to their physical counterparts without traversing the page table. This is the
TLB. It lives within the MMU and contains details of a small number of pages (usually
between four and sixty four; the ARM610's TLB has thirty two entries).
Now, when we have a page lookup, we first pass our virtual address to the TLB which
will check all of the addresses stored, and the protection level. If a match is found, the
TLB will spit out the physical address and the page table isn't touched.
If a miss is encountered, the TLB will evict one of its entries and load in the page
information looked up from the page table. The TLB then knows the newly requested
page, so it can quickly satisfy the next memory access - chances are it will fall
within the page just requested.
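In BASIC, the logic of the lookup might be sketched like this (the four entry table, the round-robin eviction and the pretend page table walk are all inventions for illustration - the real TLB is hardware, and rather cleverer):

entries% = 4
DIM tlb_virt%(entries% - 1), tlb_phys%(entries% - 1)
FOR e% = 0 TO entries% - 1 : tlb_virt%(e%) = -1 : NEXT
evict% = 0
PRINT ~FNtranslate(&8000) : REM miss - the page table is walked, TLB loaded
PRINT ~FNtranslate(&8004) : REM hit - same page, no table walk needed
END
DEF FNtranslate(vaddr%)
  LOCAL e%, page%, offset%, phys%
  page% = vaddr% DIV 4096 : offset% = vaddr% MOD 4096
  REM 1: look in the TLB first
  FOR e% = 0 TO entries% - 1
    IF tlb_virt%(e%) = page% THEN = tlb_phys%(e%) * 4096 + offset%
  NEXT
  REM 2: miss - walk the page table, then cache the translation in the TLB
  phys% = FNpagewalk(page%)
  tlb_virt%(evict%) = page% : tlb_phys%(evict%) = phys%
  evict% = (evict% + 1) MOD entries%
  = phys% * 4096 + offset%
DEF FNpagewalk(page%) = page% + 256 : REM pretend mapping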
So far we have figured on the hardware doing all of this, as in the ARM processor.
Some RISC processors (such as the Alpha and the MIPS) will pass the TLB miss
problem to the operating system. This may allow the OS to use some intelligence to
pre-load certain pages into the TLB.
Page size
Users of a RISC OS 3.5 system running on an ARM610 with two or more large (say,
20Mb) applications running will know the value of a 4K page. Because it's bloody
slow. To be fair, this isn't the fault of the hardware, but more the WIMP doing stuff
the kernel should do (as happens in RISC OS 3.7), and doing it slower!
Like with harddisc LFAUs, what you need is a sensible trade-off between page
granularity and page size. You could reduce the wastage by making pages small, say
256 bytes, but then you would need a lot of memory to store the page table - and a
bigger page table is slower to scan through. Or you could have 64K pages, which
make the page table small, but can waste huge amounts of memory.
To consider, a 32K program would require eight 4K pages, or sixty four 512 byte
pages. If your system remaps memory when shuffling pages around, it is quicker to
move a smaller number of large pages than a larger number of small pages.
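(To put rough numbers on the wastage - a made-up example - a 30000 byte program needs eight 4K pages (32768 bytes, wasting 2768) but fifty nine 512 byte pages (30208 bytes, wasting just 208). On average you expect around half a page of waste per program; small pages waste less, at the cost of a page table eight times the size.)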
The MEMC in older RISC OS machines had a fixed page table. So the size of page
depended upon how much memory was utilised.
MEMORY     PAGE SIZE
0.5Mb      8K
1Mb        8K
2Mb        16K
4Mb        32K
3Mb wasn't a valid option, and 4Mb is the limit. You can increase this by fitting a
slave MEMC, in which case you are looking at 2 lots of 4Mb (invisible to the
OS/user).
In a RiscPC, the MMU accesses a number of 4K pages. The limits are due, I suspect,
to the system bus or memory system, not the MMU itself.
Most commercial systems use page sizes of the order of 512 bytes to 64K.
The later ARM processors (ARM6 onwards) and the Intel Pentium both use page
sizes of 4K.
Page replacement algorithms
When a page fault occurs, the operating system has to pick a page to dump, to allow
the required page to be loaded. There are several ways this may be achieved. None of
them is perfect; each is a compromise between accuracy and efficiency.
Not Recently Used
This requires two bits to be reserved in the page table for each page: a 'written to'
bit and a 'referenced' bit. Upon each access, the paging hardware (and it must be done
in hardware, for speed) will set the bits as necessary. Then, on a fixed interval -
either when idling or upon a clock interrupt? - the operating system will clear these
bits. This allows you to track recent page accesses, so when flushing out a page you
can spot those that have not recently been written or referenced. NRU removes, at
random, one of the pages that has not recently been used. While it is not the best way
of deciding which pages to remove, it is simple and gives reasonably good results.
First-In First-Out
It is hoped you are familiar with the concept of FIFO, from buffering and the like. If
you are not, consider the lame analogy of the hose pipe in which the first water in will
be the first water to come out the other end. It is rarely used, I'll leave the whys and
where-fores as an exercise for the bemused reader. :-)
Second Chance
A simple modification to the FIFO arrangement is to look at the access bit, and if it is
zero then we know the page is not in current use and can be thrown. If the bit is set,
then the page is shifted to the end of the page list as if it was a new access, and the
page search continues.
What we are doing here is looking for a page unused since the last period (clock
tick?). If by some miracle ALL the pages are current and active, then Second Chance
will revert to FIFO.
Clock
Although Second Chance is good, all that page shuffling is inefficient, so the pages
are instead referenced in a circular list (ie, a clock). If the page being examined is
in use, we move on and look at the next page. With no concept of the start and end of
the list, we just keep going until we come to a usable page.
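A sketch of one sweep of the clock hand in BASIC (the 'used' bits are randomised here; on a real system the paging hardware would have set them):

pages% = 8
DIM used%(pages% - 1)
FOR p% = 0 TO pages% - 1 : used%(p%) = RND(2) - 1 : NEXT
hand% = 0
REPEAT
  found% = (used%(hand%) = 0)
  IF found% THEN
    PRINT "Evict page "; hand%
  ELSE
    used%(hand%) = 0 : REM in use: clear the bit, give it another chance
    hand% = (hand% + 1) MOD pages%
  ENDIF
UNTIL found%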
Least Recently Used
LRU is possible, but it isn't cheap. You maintain a list of all the pages, sorted by the
most recently used at the front of the list, to the least recently used at the back. When
you need a page, you pull the last entry and use it. Because of the speed required,
this is only really possible in hardware, as the list should be updated on each
memory access.
Not Frequently Used
In an attempt to simulate LRU in software, the OS scans the available pages on each
clock tick and increments a counter (held in memory, one for each page) whenever the
page's referenced bit is set.
Unfortunately, it doesn't forget. Code heavily used then no longer necessary (such
as a rendering core) will keep a high count for quite a while. Meanwhile, code that is
not called often but should be all the more responsive, such as redraw code, will have
a lower count and thus stands the chance of being kicked out - even though it is the
no-longer-needed renderer that really ought to go.
But this can be fixed, and the fix emulates LRU quite well. It is called aging. Just
before the count is incremented, it is shifted one bit to the right. So after a number of
shifts the count will be zero unless the bit is added. Here you might be wondering how
adding a bit can work, if you've just shifted a bit off. The answer is simple. The added
bit is added to the leftmost position, ie most significant.
To make this clearer...

Once upon a time : 0 0 1 0 1 1
Clock tick       : 0 0 0 1 0 1
Clock tick       : 0 0 0 0 1 0
Memory accessed  : 1 0 0 0 0 1
Clock tick       : 0 1 0 0 0 0
Memory accessed  : 1 0 1 0 0 0

(each clock tick shifts the count one place to the right; if the page was accessed in
the last interval, a 1 is inserted at the left)
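In BASIC, the aging step might be sketched as follows (six bit counters, as in the table above; the array size is invented). Each call to PROCtick is one clock tick; setting refd%() beforehand marks a page as accessed in that interval:

DIM count%(7), refd%(7)
count%(0) = %001011        : REM the page from the table above
PROCtick                   : REM clock tick      -> %000101
PROCtick                   : REM clock tick      -> %000010
refd%(0) = TRUE : PROCtick : REM memory accessed -> %100001
PROCtick                   : REM clock tick      -> %010000
refd%(0) = TRUE : PROCtick : REM memory accessed -> %101000
PRINT ~count%(0)           : REM &28 = %101000
END
DEF PROCtick
  LOCAL p%
  FOR p% = 0 TO 7
    count%(p%) = count%(p%) >> 1
    IF refd%(p%) THEN count%(p%) = count%(p%) OR %100000
    refd%(p%) = FALSE
  NEXT
ENDPROC

When a page has to go, the page with the lowest count is the victim.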
Multitasking
There is no such thing as true multitasking (despite what they may claim in the
advocacy newsgroups). To multitask properly, you need a processor per process, with
all the relevant bits so processes are not kept waiting. Effectively, a separate computer
for each task.
However, it is possible to provide the illusion of running several things at once. In the
old days, things happened in the background under interrupt control. Keyboards were
scanned, clocks were updated. As computers became more powerful, more stuff
happened in the background. Hugo Fiennes wrote a soundtracker player that runs on
interrupts, so works in the background. You set it going, it carries on independent of
your code.
So people began to think of applying this to applications. After all, most of an
application's time is spent waiting for user input. In fact, an application may easily
do sweet sod all for almost 100% of the time - measured by an event counter in
Emily's polling loop, I type around one character a second while the RiscPC polls a
few hundred times a second. That was measured in a multitasking application, using
polling speed as a yardstick; imagine the figures in a single-tasking program. So the
idea was arrived at: we can load several programs into memory, provide them with some
standard facilities and messaging systems, and then let each run for a predefined
duration. When the duration is up, we pass control to the next program. When that has
used its time, we go to the next program, and so on.
As a brief aside, I wish to point out Schrödinger's cat. A rather cute little moggy, but
an extremely important one. It is physically impossible to measure system polling
speed in software, and pretty difficult to measure it in hardware. You see, the very act
of performing your measurement will affect the results. And you cannot easily
'account' for the time taken to make your measurements because measuring yourself is
subject to the same artefacts as when measuring other things. You can only say 'to hell
with it', and have your program report your polling rate as being 379 polls/sec,
knowing that your measuring code may be eating around 20% of the available time,
and use the figures in a relative form rather than trying to state "My computer
achieves 379 polls every second". While there is no untruth in that, your computer
might do 450 if you weren't so busy watching! You simply can't be JAFO.
...and you need to go to school/college and get bored rigid to find out what relevance
any of this has to your cat. Mine is sitting on my monitor, asleep, blissfully unaware
of all these heavy scientific concepts. She's probably got the right idea...
Co-operative multitasking
One such way of multitasking is relatively clean and simple. The application, once
control has passed to it, has full control for as long as it needs. When it has finished,
control is explicitly passed back to the operating system.
This is the multitasking scheme used in RISC OS.
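What that hand-over looks like in practice: the heart of every RISC OS desktop application is a poll loop, and the call to Wimp_Poll is the moment control is given up. A minimal skeleton (task name invented, error handling omitted):

DIM block% 256
SYS "Wimp_Initialise", 200, &4B534154, "PollDemo" TO ,task%
quit% = FALSE
REPEAT
  SYS "Wimp_Poll", 0, block% TO reason%
  REM reasons 17/18 = message; message code 0 = Message_Quit
  IF reason% = 17 OR reason% = 18 THEN IF block%!16 = 0 THEN quit% = TRUE
UNTIL quit%
SYS "Wimp_CloseDown", task%, &4B534154
END

Between the Wimp_Poll calls, the machine is entirely yours; nothing else runs until you poll again.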
Pre-emptive multitasking
Seen as the cure to all the world's ills by many advocates who have seen Linux (not
Windows!), this works differently. Your application is given a timeslice. You can
process whatever you want in your timeslice. When your timeslice is up, control is
wrested away and given to another process. You have no say in the matter, peon.
I don't wish to get into an advocacy war here. My personal preference is co-operative;
however, I don't feel that either is the answer. Rather, a hybrid using both
technologies could make for a clean system. The major drawback of CMT is that if an
application crashes or goes into a never-ending loop, control won't come back, and the
application needs to be forcibly killed off.
Niall Douglas wrote a pre-emption system for RISC OS applications. Surprisingly,
you didn't really notice anything much until an application entered some heavy
processing (say, ChangeFSI), at which point life carried right on as normal while the
task that would otherwise have stalled the machine chugged away in the background.
32 bit operation
A lot of this information is taken from the ARM assembler manual. I didn't have a 32
bit processor at the time, so trusted the documentation...
As it happens, the documentation erroneously stated that UMULL and UMLAL could
only be performed in 32bit mode. That is incorrect: if your processor can do it
(ie: StrongARM), it will work in 32bit OR 26bit...
The ARM2 and ARM3 have a 32 bit data bus and a 26 bit address bus. On later
versions of the ARM, both the data bus and the address bus are a full 32 bits wide.
This explains how a "32 bit processor" can be referred to as 26 bit. The data width and
instruction/word size is 32 bit, and always has been, but the address bus is only 24 bit.
Oh, whoops, I said 26 bit, didn't I?
:-) Well, as PC is always word aligned, the lower two bits will always be zero in an
address, so on the ARM2/ARM3 processor these bits hold the processor mode setting.
The width of PC is, effectively, 26 bit even though only 24 bits are actually used.
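To illustrate how the bits pack together, here is a BASIC fragment that unpicks a 26 bit style PC/PSR word (the value itself is invented for the example):

psr% = &AC00B0A3            : REM N and C set, I and F set, SVC26 mode
pc%   = psr% AND &03FFFFFC  : REM bits 2 - 25 : the program counter
mode% = psr% AND %11        : REM bits 0 - 1  : processor mode (3 = SVC26)
PRINT "N Z C V = "; (psr% >>> 31) AND 1; (psr% >>> 30) AND 1; (psr% >>> 29) AND 1; (psr% >>> 28) AND 1
PRINT "I F     = "; (psr% >>> 27) AND 1; (psr% >>> 26) AND 1
PRINT "PC      = &"; STR$~pc%; ", mode "; mode%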
This was no problem on the older machines. 4Mb memory was the norm, some people
upgraded to 8Mb, and 16Mb was the theoretical limit.
However a RiscPC with a 26 bit program counter would not have been possible, as 26
bits only allows you to address up to %11111111111111111111111100 (67108860, ie a
64Mb range). The RiscPC allows for 258Mb of memory to be installed.
This, incidentally, explains the 28Mb size limit for application tasks; the system is
expected to be compatible with the older RISC OS API.
The majority of the assembler site has been written regarding 26 bit mode of
operation, which is compatible with the versions of RISC OS currently available (ie,
RISC OS 2 to RISC OS 4); though some parts cover 32 bit modes (one example
briefly runs in SVC32!), and I have noted parts of the examples that are 32 bit
unfriendly.
Those with a RiscPC, Mico, RiscStation, A7000 etc have the ability to run a fully 32
bit operating system; indeed ARMLinux is such an operating system. RISC OS is not,
because RISC OS needs, for the moment, to remain compatible with existing versions.
It is the old dichotomy. It is wonderful to have a nice shiny new fully 32 bit version of
RISC OS, but not so good when you realise a lot of your must-have software won't so
much as load!
RISC OS isn't totally 26 bit - some of the handlers need to work in 32 bit mode. A
full conversion, however, is limited by money (who's going to pay for RISC OS to be
fully converted, and who's going to pay for new development tools to rebuild their
code? PD software is strong on RISC OS) and also by necessity (lots of people use
Impression, but CC is no longer with us; it is quite likely Impression won't work on
an updated RISC OS, so people will see no reason to upgrade if their desired software
won't work).
Why is this even an issue?
Newer ARM processors will not support 26 bit operation. Several hybrids were made
(ARM6, ARM7, StrongARM), but time has come to draw the line. You can either add
the complexity of a 26/32 bit system, or you can go 32 bit only and have a simpler,
smaller processor.
Either we go with the flow, or get left behind... So really, this is an issue, and we don't
have a choice.
32 bit architecture
The ARM architecture changed significantly with the introduction of the ARM6
series. Below, we shall describe the differences in behaviour between 26 bit and 32 bit
operation.
In the ARM6, the program counter was extended to a full 32 bits. As a result:

  * The PSR had to be separated from the PC into its own register, the CPSR
    (Current Program Status Register).
  * The PSR can no longer be saved with the PC when changing processor modes;
    instead, each privileged mode now has an extra register - the SPSR (Saved
    Program Status Register) - to hold the previous mode's PSR.
  * Instructions have been added to use these new status registers.
A further change was the addition of extra privileged processor modes, allowed by the
PSR now having a full 32 bits to use. These modes are used to handle Undefined
instruction and Abort exceptions. Consequently:

  * Undefined instructions, aborts, and supervisor code no longer have to share the
    same mode. This has removed restrictions on Supervisor mode programs which
    existed on earlier ARMs.
  * The availability of these features in the ARM6 series (and other later
    compatible chips) is set by one of several on-chip control registers. One of
    three processor configurations can be selected:
    o 26 bit program and data space. This configuration forces ARM to operate
      with a 26 bit address space. In this configuration only the four 26 bit
      modes are available (refer to the Processor modes description); it is
      impossible to select a 32 bit mode. This configuration is set at reset on
      all current ARM6 and 7 series processors.
    o 26 bit program space and 32 bit data space. This is the same as the 26 bit
      program and data space configuration, except that address exceptions are
      disabled to allow data transfer operations to access the full 32 bit
      address space.
    o 32 bit program and data space. This configuration extends the address
      space to 32 bits, and introduces major changes to the programmer's model.
      In this configuration you can select any of the 26 bit and the 32 bit
      processor modes (see Processor modes below).
When configured for a 32 bit program and data space, the ARM6 and ARM7 series
support ten overlapping processor modes of operation:

  * User mode: the normal program execution state; or User26 mode: a 26 bit version
  * FIQ mode: designed to support a data transfer or channel process; or FIQ26 mode:
    a 26 bit version
  * IRQ mode: used for general purpose interrupt handling; or IRQ26 mode: a 26 bit
    version
  * SVC mode: a protected mode for the operating system; or SVC26 mode: a 26 bit
    version
  * Abort mode (abbreviated to ABT mode): entered after a data or instruction
    prefetch abort
  * Undefined mode (abbreviated to UND mode): entered when an undefined instruction
    is executed.
When in a 26 bit processor mode, the programmer's model reverts to that of earlier 26
bit ARM processors. The behaviour is the same as that of the ARM2aS macrocell with
the following alterations:

  * Address exceptions are only generated by ARM when it is configured for 26 bit
    program and data space. In other configurations the OS may still simulate the
    behaviour of address exceptions, using external logic such as a memory
    management unit to generate an abort if the 64Mbyte range is exceeded, and
    converting that abort into an 'address exception trap' for the application.
  * The new instructions to transfer data between general registers and the program
    status registers remain operative. They can be used by the operating system to
    return to a 32 bit mode after calling a binary containing code written for a
    26 bit ARM.
  * When in a 32 bit program and data space configuration, all exceptions (including
    Undefined Instruction and Software Interrupt) return the processor to a 32 bit
    mode, so the operating system must be modified to handle them.
  * If the processor attempts to write to a location between &0 and &1F inclusive
    (ie, the exception vectors), hardware prevents the write operation and generates
    a data abort. This allows the operating system to intercept all changes to the
    exception vectors and redirect the vector to some veneer code. The veneer code
    should place the processor in a 26 bit mode before calling the 26 bit exception
    handler.
In all other respects, when operating in a 26 bit mode the ARM behaves like a 26 bit
ARM. The relevant bits of the CPSR appear to be incorporated back into R15 to form
the PC/PSR, with the I and F bits in bits 27 and 26. The instruction set behaves like
that of the ARM2aS macrocell, with the addition of the MRS and MSR instructions.
The registers available on the ARM6 (and later) are:

User      FIQ       IRQ       SVC       ABT       UND
R0 ------ R0 ------ R0 ------ R0 ------ R0 ------ R0
R1 ------ R1 ------ R1 ------ R1 ------ R1 ------ R1
R2 ------ R2 ------ R2 ------ R2 ------ R2 ------ R2
R3 ------ R3 ------ R3 ------ R3 ------ R3 ------ R3
R4 ------ R4 ------ R4 ------ R4 ------ R4 ------ R4
R5 ------ R5 ------ R5 ------ R5 ------ R5 ------ R5
R6 ------ R6 ------ R6 ------ R6 ------ R6 ------ R6
R7 ------ R7 ------ R7 ------ R7 ------ R7 ------ R7
R8        R8_fiq    R8 ------ R8 ------ R8 ------ R8
R9        R9_fiq    R9 ------ R9 ------ R9 ------ R9
R10       R10_fiq   R10 ----- R10 ----- R10 ----- R10
R11       R11_fiq   R11 ----- R11 ----- R11 ----- R11
R12       R12_fiq   R12 ----- R12 ----- R12 ----- R12
R13       R13_fiq   R13_irq   R13_svc   R13_abt   R13_und
R14       R14_fiq   R14_irq   R14_svc   R14_abt   R14_und
--------------------- R15 (PC) --------------------------
--------------------- CPSR ------------------------------
          SPSR_fiq  SPSR_irq  SPSR_svc  SPSR_abt  SPSR_und

(registers joined by dashes are shared between modes)

The 26 bit modes (User26, FIQ26, IRQ26, SVC26) use the same register banks as their
32 bit counterparts, except that in those modes R15 serves as the combined PC/PSR.
In short, the 32 bit differences are:

  * The PC is a full 32 bits wide, and is used purely as a program counter.
  * The PSR is contained within its own register, the CPSR.
  * Each privileged mode has a private SPSR register in which to save the CPSR.
  * There are two new privileged modes, each of which has private copies of R13
    and R14.
The CPSR and SPSR registers
The allocation of the bits within the CPSR (and the SPSR registers to which it is
saved) is:
31 30 29 28 .......  7  6  5  4  3  2  1  0
 N  Z  C  V          I  F  -  M4 M3 M2 M1 M0

M4 M3 M2 M1 M0   Mode
 0  0  0  0  0   User26 mode
 0  0  0  0  1   FIQ26 mode
 0  0  0  1  0   IRQ26 mode
 0  0  0  1  1   SVC26 mode
 1  0  0  0  0   User mode
 1  0  0  0  1   FIQ mode
 1  0  0  1  0   IRQ mode
 1  0  0  1  1   SVC mode
 1  0  1  1  1   ABT mode
 1  1  0  1  1   UND mode
Please refer to the (26 bit) PSR for information on the N, Z, C, V flags and the I and F
interrupt flags.
So what does it mean in practice?
Most ARM code will work correctly. The only things that will not work are any
operations which fiddle with R15 to set the processor status. Unfortunately, this
isn't as easy to fix as it seems.
I examined a 9K program (a MODE 7 teletext frame viewer, written in C) for potential
problems, basically looking for:

  * A MOVS with R15 as the destination.
  * Any LDMFD suffixed with the '^' character and loading R15.

About 64 instructions fell into one of these categories.
There are likely to be few ways to make the conversion process automatic. Basically...

  * How will the system know what is data, and what is code? Actually, a clever
    rules-based program should be able to make a fairly good guess, but is a
    "fairly good guess" good enough?
  * There is NO simple instruction replacement. An automatic system probably could
    patch in the required instructions and jiggle the code around, but this could
    cause unexpected side effects, like an ADR directive no longer being in range.
  * It is incredibly hacky. Surely it is much better to recompile, or to repair the
    source code.
It is NOT easy. Such a small change, but with such far-reaching consequences.
In comp.sys.acorn.programmer, Stewart Brodie answered my query with a hint that
may be useful to people intending to work with 32 bit code:
> How is it possible, if 32 bit code uses MSR/MRS to transfer status and
> register, and older ARMs don't have those instructions?
> Are we into "black magic" code for this?

You take advantage of the fact that the encodings for MSR and MRS act as
NOPs on ARM2 and ARM3 ;-) With some careful arrangement, you can write
fairly tight code.

To refer back to earlier postings, an example of when MOVS pc, lr in a
32-bit mode is useful (entered in SVC or IRQ mode, IRQs disabled):

  ADR    r14, CallBackRegs
  TEQ    PC, PC
  LDREQ  r0, [r14, #16*4]    ; The CPSR
  MSREQ  SPSR_cxsf, r0       ; put into SPSR_svc/SPSR_irq ready for MOVS
  LDMIA  r14, {r0-r14}^      ; Restore user registers
  NOP
  LDR    r14, [r14, #15*4]   ; The pc
  MOVS   pc, r14             ; Back we go (32-bit safe - SPSR set up)

(CallBackRegs contains user mode registers: R0-R15, plus the CPSR if in a
32-bit mode)
Download a 32 bit code scanner (12K)
Where is the example?
In the logical place, in the document describing the processor status register...
What about old stuff for which we don't have sources?
There are two options...
The first option is a one-time conversion. We can use an intelligent disassembler
(such as D.Ruck's !ARMalyser) to provide us with a source of the software, with the
32bit-unsafe parts identified. I used this method to cobble together a 32bit version
of one of my modules.
For fairly short things, this will be okay. For large projects... I shudder to think!
One thing to be especially aware of is that some older software uses tricks like
popping flags into 'unused' bits of addresses. A good example here is software that
uses bits 0-27 as an address and bits 28-31 as flags...
1 << 28 = 268435456
What this means, in essence, is that the software will work fine on all older machines
- including the majority of RiscPCs for which 256Mb was the limit of installable
memory.
If, though, we run this on a 512Mb Iyonix (which is no longer out of the realms of
possibility), as soon as it is loaded to an address over 256Mb ... bit 28 will be set!
The code will need to be examined to ensure such things don't occur, and if they do,
it'll need to be worked around.
As far as I'm aware, while APCS-R requires flags to be saved, I've yet to see my C
compiler generate code that depends upon the saving of flags across function calls.
The typical example is:
Note that the N, Z, C and V flags from lr at the instant of entry must be
reinstated; it is not sufficient merely to preserve the PSR across the call.
Consider, a function ProcA which tail continues to ProcB as follows:
  CMPS   a1, #0
  MOVLT  a2, #255
  MOVGE  a2, #0
  B      ProcB
If ProcB merely preserves the flags it sees on entry, rather than restoring
those from lr, the wrong flags may be set when ProcB returns direct to
ProcA's caller.
While it has not been my experience that the C compiler generates such code, humans
can - and much worse. This, too, must be taken into account. And all those values
ORRed into R14 to directly twiddle the processor flags on return...
The other method is to make a new computer. All we need do is load up a few old
modules, poke our application at 'troublesome' points, and force everything to be in
an area of memory that we consider 'safe'. Then we let our program loose, with the
same sort of critical care that you'd give a hungry cat in a room full of budgies...
This, more or less, is what Aemulor does.
But, at a cost.
From the "Inside Aemulor" article on the Foundation RISC User (issue 11; January
2003) CD-ROM, we encounter a very important point:
From RISC OS's perspective, the Aemulor RMA is a normal dynamic area, but
Aemulor remaps the memory at an address below 64Mb so that it becomes
addressable within the 26-bit environment. Because this emulated RMA is
visible to all applications, native 32-bit applications are also restricted to a
maximum size of 28Mb each (as per RISC OS 4) whilst Aemulor is running. It
is hoped that this limitation can be removed with a later version.
Or, as they say: There's no such thing as a free lunch.
Having said that, the use of Aemulor is essential for all those must-have programs that
either cannot sensibly be modernised, or are unlikely to be modernised.
I have heard that somebody is 32bitting Impression Publisher. Well, you know, I
heard once that somebody was porting Mozilla to RISC OS. Who knows, maybe I'm
wrong... :-)
What API changes have there been?
The "Technical information on 26/320bit RISC OS binary interfaces" (v0.2) states:
Many existing APIs do not actually require flag preservation, such as service
call entries. In this case, simply changing MOVS PC... to MOV PC... and LDM
{}^ to LDM {} is sufficient to achieve 32-bit compatibility.
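For the simple cases, the change really is that mechanical. A sketch using BASIC's built-in assembler (labels invented; the two returns differ by a single S):

DIM code% 16
FOR pass% = 0 TO 2 STEP 2
  P% = code%
  [ OPT pass%
    .return26
    MOVS PC, R14   ; 26 bit API: return AND restore the caller's flags
    .return32
    MOV  PC, R14   ; 32 bit safe: return, leave the CPSR well alone
  ]
NEXT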
That advice is possibly worse than useless, as it doesn't specify exactly which APIs
need flag preservation and which don't. Is it safe to assume that everything not
otherwise described is safe?
The best thing to do is get hold of that document and browse through it. Please do not
simply 'assume' that things will work if you don't save flags.
Generally this is the case, but unless you have a RISC OS 3.10 machine to test on...
Copyright © 2004 Richard Murray