Understanding ARMv7a Architecture and Instructions

Hello there and welcome to COMS10015 Computer Architecture, week 5, lecture 4. Earlier in the week we looked at various general concepts to do with instruction set architecture and instruction set architecture design. I'd like to finish off the week in this lecture by looking at how those general concepts apply within a specific real-world instruction set architecture, namely ARMv7a.

This name is a little cryptic, but essentially what we're talking about is version 7 of the ARM instruction set architecture, and specifically the variant that they call the A or application profile. The A profile arguably represents the most general-purpose variant of ARMv7. It's intended for use in contexts where you might find an operating system, for example on mobile phones or laptops. It contrasts with alternatives such as the microcontroller profile, ARMv7M, which is essentially a special-purpose, cut-down variant intended for use in embedded computing platforms.

Versus some of the examples that we've used so far, ARMv7a is a real instruction set architecture, which means it's implemented within a range of real microarchitectures you can actually hold in your hand and make use of. The example on the slide is the ARM Cortex-A7 microarchitecture, an instance of which forms a central component within early generations of the Raspberry Pi platform.

The ARMv7a instruction set architecture is a RISC-like design in the sense that it follows a RISC-like philosophy in the majority of cases. However, it's also possible to identify cases where it deviates from this philosophy, and therefore you could argue that it's not a RISC design per se. Either way, it's reasonable to classify ARMv7a as a load-store architecture using a word size of 32 bits. I'm going to try and give you a limited introduction to interesting or relevant features in ARMv7a, structured in the same way as when we looked at instruction set architectures in a general sense.
That is, I'm first going to cover the specification of state and then the specification of instructions. Particularly when discussing the specification of instructions, I'm going to make use of ARM assembly language in all of the examples on the slides. It's important to keep in mind that the goal of doing so is simply to show the relationship between how you'd normally express instructions as a programmer and what the execution semantics of those instructions are. The goal definitely isn't for you to be an expert assembly language programmer by the end of the lecture, although the associated lab worksheet does explore this topic in some more detail.

ARMv7a specifies a 16-entry general-purpose register file. Within an encoded instruction, we refer to those registers using a 4-bit register address or index. But in an assembly language program, we can refer to the registers in one of two different ways, shown on the left-hand and right-hand sides of this slide respectively. On the left-hand side, you can see the registers referred to by index: they're simply named R0 through to R15. On the right-hand side, in contrast, you can see those same registers referred to using a more human-readable identifier. For example, A1 through to A4 are so called because they're used to store arguments during function calls.

On the slides I've written that this is a general-ish purpose register file, and what I mean by that is that some of the registers clearly do have a special-purpose role. You can see, for example, that register 15 is actually the program counter. Or, to put it a different way, in ARMv7a the program counter isn't a special-purpose register; it forms part of the general-purpose register file. What that means in practice, then, is that although some of the registers have a special-purpose role, they're all generally addressable.
That is, any instruction that allows us to specify a register address as some operand can use the program counter as that operand in more or less the same way as any other register.

ARMv7a also specifies two special-purpose registers called the Current Program Status Register, or CPSR, and the Saved Program Status Register, or SPSR. I'm going to focus on the CPSR, but keep in mind that both have the same format, as illustrated in the middle of the slide, and can be described as allowing us to control or configure execution in various ways and also inspect the status of execution. What that means concretely is that if we write certain bits in the CPSR, then this will act to control execution of subsequent instructions in some way. Conversely, if we read certain bits from the CPSR, then this gives us some information about the execution of previous instructions.

In the diagram that describes the CPSR format, you can see pertinent examples towards both the right-hand or least significant end and the left-hand or most significant end. Bits 0 through to 4 constitute a 5-bit field labelled M. This holds the processor mode and reflects the privilege level in some sense. For example, depending on the processor mode, we might either allow or disallow access to certain regions of memory or execution of certain instructions. Bits 28 to 31, on the other hand, are four 1-bit flags labelled N, Z, C and V. These flags capture the results of any comparisons that might have been performed during execution of previous instructions.

Keep in mind that because the CPSR is a special-purpose register, we can't make use of it in the same way as elements of the general-purpose register file. In order to write a value into the CPSR or read a value from it, we need to make use of two special-purpose instructions called MSR and MRS respectively.

As part of the specification of state, ARMv7a uses a memory model with a byte-addressable, 2^32-element address space, which you can see illustrated on the slides.
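As a brief aside before turning to the memory model, the CPSR layout described above can be made concrete with a small sketch. It is written in Python rather than ARM assembly purely so that it can be run anywhere. The function name `decode_cpsr` and the example value are hypothetical. The field positions for M follow the lecture; the individual positions of N, Z, C and V within bits 31 to 28 are taken from the ARM architecture documentation rather than from the transcript.

```python
def decode_cpsr(cpsr: int) -> dict:
    """Split a 32-bit CPSR value into the fields discussed in the lecture."""
    return {
        "N": (cpsr >> 31) & 1,   # negative flag, bit 31
        "Z": (cpsr >> 30) & 1,   # zero flag, bit 30
        "C": (cpsr >> 29) & 1,   # carry flag, bit 29
        "V": (cpsr >> 28) & 1,   # overflow flag, bit 28
        "M": cpsr & 0b11111,     # processor mode, bits 4 down to 0
    }


# Hypothetical example value: Z and C set, mode field 0b10000 (user mode).
example = 0x60000010
print(decode_cpsr(example))
```

In a real program the value passed to such a function would come from an MRS instruction, since the CPSR cannot be read like a general-purpose register.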
Using the same terminology we had previously, a byte-addressable address space means the access granularity is 8 bits, i.e., each element that we're able to access is 8 bits in size. The addresses we use to do so are 32 bits in size, meaning that they range from 0 through to 2^32 - 1. This gives us a total address space size of 4 gigabytes.

The memory model also includes some rules which govern the semantics of accesses within that address space. You can see some example instances listed on the slide here. The first rule relates to alignment. For example, if we focus on standard 32-bit ARM instructions, the rule says that such instructions must be located at, or aligned to, a 4-byte boundary. These instructions are therefore allowed to be located at addresses 0, 4 or 8, but not 1, 3 or 7. The same sort of rules apply to data. For example, 32-bit words, 16-bit half-words and 8-bit bytes must be located at, or aligned to, 4-byte, 2-byte or 1-byte boundaries respectively.

The second rule relates to endianness. The rule basically says that access to instructions, for instance within the fetch phase of the fetch-decode-execute cycle, will always be little-endian. In contrast, access to data, for example within the execute phase of the fetch-decode-execute cycle, could be either little-endian or big-endian. Although not unique to ARMv7a, this feature is somewhat unusual. The decision between little-endian and big-endian data access isn't made on a per-instruction basis, but is instead controlled globally, in some sense setting execution into a little-endian or big-endian mode. By default that mode is little-endian, and that's what we'll assume from here on.

Moving on to the specification of instructions, then. The first topic to consider is how instructions are encoded and therefore decoded. In short, ARMv7a makes use of a fixed-length instruction encoding, where each encoded instruction is 32 bits in length.
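Before looking at encoding in more detail, the alignment and endianness rules above can be illustrated with a minimal sketch. It models memory as a Python `bytearray` and applies the alignment rule exactly as the lecture states it; the helper names `is_aligned` and `store_word` are hypothetical, and the `big_endian` parameter merely stands in for the global endianness mode.

```python
def is_aligned(address: int, size_in_bytes: int) -> bool:
    """True if address sits on a boundary that is a multiple of the access size."""
    return address % size_in_bytes == 0


def store_word(memory: bytearray, address: int, value: int,
               big_endian: bool = False) -> None:
    """Store a 32-bit word, enforcing the 4-byte alignment rule."""
    if not is_aligned(address, 4):
        raise ValueError(f"word access at {address:#x} is not 4-byte aligned")
    byte_order = "big" if big_endian else "little"
    memory[address:address + 4] = value.to_bytes(4, byte_order)


memory = bytearray(8)
store_word(memory, 0, 0x11223344)                   # little-endian (the default)
store_word(memory, 4, 0x11223344, big_endian=True)  # big-endian
print(memory.hex())  # 4433221111223344
```

The same 32-bit value ends up with its least significant byte first at address 0 and its most significant byte first at address 4, which is the whole difference between the two modes.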
Looking at the diagram on the slide, however, you could argue that instruction encoding is one topic where ARMv7a deviates from a RISC-like philosophy, in the sense that it includes many different instruction formats. Some of those formats are relatively complex, which implies the process of encoding and decoding instructions is also relatively complex. Notice that the majority of instruction formats can be described as three-address, in the sense that they allow the specification of up to three register address operands. There are exceptions, however. Both the short and long multiply formats allow the specification of four register address operands rather than three. Recall we used exactly this example while discussing various challenges related to instruction encoding.

Also notice that the standard data processing format is somewhat unusual. Although this is a three-address format in the sense that it allows the specification of three register address operands, the encoding is such that the field associated with the second operand, labelled operand here, is 12 bits rather than 4 bits. The purpose and meaning of this is something that we'll come back to and explain later in the lecture.

As you'd expect of a general-purpose instruction set architecture, ARMv7a includes a range of fairly standard instruction classes. The most straightforward example is the so-called data processing or ALU-like class. This class is straightforward in the sense that the selection and semantics of instruction types and variants within it are straightforward. You can see some examples shown on the slide here. The first two such examples show variants of arithmetic-type addition instructions. These respectively allow us to add the immediate one, or the contents of general-purpose register number two, to general-purpose register number one, in both cases storing the result in general-purpose register number zero. By default at least, these instructions don't update the N, Z, C or V flags in the special-purpose CPSR register.
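To make the data processing format a little more tangible, here is a minimal sketch that pulls the fields out of an encoded instruction. The slide's diagram is not reproduced in this transcript, so the bit positions below are taken from the ARM architecture documentation; the function name `decode_data_processing` is hypothetical, and the two example words are the standard encodings of the two addition variants just described.

```python
def decode_data_processing(word: int) -> dict:
    """Extract the fields of a 32-bit data processing instruction."""
    return {
        "cond": (word >> 28) & 0xF,    # 4-bit condition field
        "I": (word >> 25) & 1,         # 1 if the second operand is an immediate
        "opcode": (word >> 21) & 0xF,  # operation, e.g. 0b0100 for add
        "S": (word >> 20) & 1,         # 1 if the flags in the CPSR are updated
        "Rn": (word >> 16) & 0xF,      # 4-bit first operand register address
        "Rd": (word >> 12) & 0xF,      # 4-bit destination register address
        "operand2": word & 0xFFF,      # 12-bit second operand field
    }


ADD_REGISTER = 0xE0810002   # add r0, r1, r2
ADD_IMMEDIATE = 0xE2810001  # add r0, r1, #1

for word in (ADD_REGISTER, ADD_IMMEDIATE):
    print(f"{word:#010x}", decode_data_processing(word))
```

Both words decode to the same opcode, Rn and Rd, with S equal to 0 because neither updates the flags; they differ only in the I bit and in how the 12-bit second operand field is interpreted.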
However, we can enable updates to the flags by adding an s onto the end of the instruction identifier. For example, the instruction adds r0, r1, r2 has the same meaning as add r0, r1, r2, in the sense that it adds general-purpose register number one to general-purpose register number two and stores the result in general-purpose register number zero. However, it also updates the flags in the CPSR register to reflect the result computed. This allows us, for example, to capture whether or not a carry resulted from the associated addition, or whether that addition produced a result that was zero or non-zero.

The so-called flexible second operand of any data processing instruction can take one of four forms, shown on the slide here. These relate to the slightly unusual instruction format that we discussed earlier. We've already seen examples of forms 1 and 2, because these relate to that second operand being either an immediate or a register address respectively. Forms 3 and 4 are slightly more exotic in the sense that the second operand specifies data via some limited form of computation. The idea is that the data we end up with results from taking the contents of some general-purpose register and either shifting or rotating it by some distance. That distance might be specified by an immediate or by a register address operand.

One way to think about this is that ARMv7a is offering us the ability to fuse together a standard data processing instruction with either a shift or a rotate, and as a result we get two operations for the price of one instruction. This renders instruction execution more efficient in a range of different situations, such as the computation of addresses, where shifts are commonly applied in order to perform some scaling of a base address or an offset.
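The four forms of the flexible second operand can be sketched as follows. This is a simplified model rather than a faithful emulation: it assumes shift and rotate distances in the range 0 to 31, ignores the way immediates are packed into the 12-bit field, and uses hypothetical helper names (`operand2`, `add`) and keyword parameters invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def lsl(value: int, distance: int) -> int:
    """Logical shift left within 32 bits."""
    return (value << distance) & MASK32


def lsr(value: int, distance: int) -> int:
    """Logical shift right within 32 bits."""
    return (value & MASK32) >> distance


def ror(value: int, distance: int) -> int:
    """Rotate right within 32 bits."""
    distance %= 32
    value &= MASK32
    return ((value >> distance) | (value << (32 - distance))) & MASK32


SHIFTS = {"lsl": lsl, "lsr": lsr, "ror": ror}


def operand2(regs, imm=None, rm=None, shift=None, shift_imm=None, shift_reg=None):
    """Evaluate the flexible second operand (distances assumed to be 0..31)."""
    if imm is not None:                # form 1: immediate
        return imm & MASK32
    value = regs[rm]
    if shift is None:                  # form 2: register
        return value
    if shift_imm is not None:          # form 3: register shifted by immediate
        return SHIFTS[shift](value, shift_imm)
    return SHIFTS[shift](value, regs[shift_reg] & 31)  # form 4: shifted by register


def add(regs, rd, rn, **second_operand):
    """Model add rd, rn, <operand2> without updating any flags."""
    regs[rd] = (regs[rn] + operand2(regs, **second_operand)) & MASK32


regs = [0] * 16
regs[1] = 0x1000  # base address of a hypothetical array of 32-bit words
regs[2] = 3       # element index

# Models: add r0, r1, r2, lsl #2   i.e. r0 = base + index * 4
add(regs, 0, 1, rm=2, shift="lsl", shift_imm=2)
print(hex(regs[0]))  # 0x100c
```

The final call shows the address computation case mentioned above: scaling the index by 4 and adding it to the base takes one instruction rather than two.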
In addition to the set of what you might call computational data processing instructions, there's also a small set of comparison instructions available in ARMv7a. Whereas computational instructions typically store a result in one or more general-purpose registers, comparison instructions simply update the flags in the special-purpose CPSR register. The first example compares general-purpose register number zero with general-purpose register number one, updating the flags in the CPSR register with respect to the computation of their difference, i.e., subtracting general-purpose register number one from general-purpose register number zero. We can subsequently tell, for example, whether general-purpose register number zero is equal to general-purpose register number one by inspecting the Z or zero flag within the CPSR. If subtracting general-purpose register number one from general-purpose register number zero produced zero, then we know that they're equal to each other, and if it didn't, they aren't. Keep in mind that, unlike an addition for example, we don't need to add an s onto the end of the instruction identifier here in order to force an update to the flags in the CPSR. The only purpose of these instructions is to update those flags, so in some sense the s is implicit.

The second important instruction class is the so-called data movement class. Rather than performing computation on data, data movement instructions simply move or copy data around in some way. As a starting point, you could consider the two instruction types illustrated on the slide as being representative of this instruction class. The first type are so-called immediate-to-register or register-to-register moves. The first two examples move some source, either the immediate one or general-purpose register number one, into a destination, which in both cases is general-purpose register number zero. Keep in mind that the term copy arguably describes what's going on here more precisely.
What I mean by that is that having moved general-purpose register number one into general-purpose register number zero, for example, we don't lose or forget the value stored in general-purpose register number one. After the move, both general-purpose registers have the same value, namely whatever value general-purpose register number one had beforehand.

The second type are so-called single-shot memory accesses. The two examples shown here respectively load general-purpose register number zero from, or store it to, memory at an address dictated by general-purpose register number one. These are called single-shot memory accesses in the sense that both instructions perform one memory access only. As suggested by the notation used, these instructions load and store 4 bytes, or 32 bits, of data respectively. The instruction identifiers LDR and STR can therefore be read as load register and store register.

This really is just a starting point, however, and in fact ARMv7a includes a range of much more exotic data movement instructions and addressing modes. The first case on the slide here describes the so-called multi-shot memory accesses. The idea is that the LDM and STM instructions are analogous to LDR and STR, except that instead of loading or storing one general-purpose register from or to memory, they load or store n general-purpose registers. In this case, n is equal to 3. Of course, we'll perform the same number of memory accesses whether we use n LDR instructions or one LDM instruction. However, using one LDM instruction reduces the memory footprint of our program from n instructions to one instruction, and therefore also reduces the number of fetch-decode-execute cycles that have to be performed. As a result, using one LDM instruction might plausibly be more efficient.

The second case describes so-called multi-shot stack memory access instructions.
The push and pop instructions allow us to respectively push values to, or pop values from, a stack that's maintained in memory. The two instructions work analogously to STM and LDM respectively, in the sense that they store or load multiple general-purpose registers to or from memory. However, in addition to doing so, they also manipulate the stack pointer or SP register so as to keep track of where the top of the stack is in memory.

This slide attempts to give a flavour of some of the different addressing modes that are available in ARMv7a. The first case captures a range of examples that differ in terms of their access granularity. For example, the first two variants load one byte, or 8 bits, of data from memory at an address determined by general-purpose register number one, whereas the bottom two variants load 2 bytes, or 16 bits, of data. Within the top and bottom pairs, you can identify variants that either sign-extend or zero-extend the value loaded from memory in order to form a 32-bit result, which is then stored in general-purpose register number zero. The second case captures various examples that differ in terms of the addressing mode used. For example, although the assembly language syntax used might be completely unfamiliar, the second and third variants are very directly using the concepts of indexed and scaled indexed addressing modes that we encountered previously.

The third and final important instruction class relates to the management of control flow. This is an interesting instruction class within the context of ARMv7a because of the unusual way in which control flow is managed. At face value at least, there are only two control flow instruction types available: the so-called branch or B instruction, and the branch and link or BL instruction. Both of these instruction types have immediate and computed variants, which imply relative and absolute branch types respectively. You can see examples of the four possible variants on the slide here.
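As a brief aside before looking at control flow in more detail, the difference between the sign-extending and zero-extending loads described above can be sketched as follows. The slide itself is not reproduced here, so the pairing of each call with the ARM identifiers LDRB, LDRSB, LDRH and LDRSH is an assumption about which variants it shows; the helper `load` and the memory contents are hypothetical, and little-endian data access is assumed.

```python
def load(memory: bytearray, address: int, size: int, signed: bool) -> int:
    """Load size bytes (little-endian) and extend the value to a 32-bit result."""
    raw = bytes(memory[address:address + size])
    value = int.from_bytes(raw, "little", signed=signed)
    return value & 0xFFFFFFFF  # keep the 32-bit register representation


memory = bytearray([0x80, 0xFF, 0x00, 0x00])

print(hex(load(memory, 0, 1, signed=False)))  # like ldrb:  0x80
print(hex(load(memory, 0, 1, signed=True)))   # like ldrsb: 0xffffff80
print(hex(load(memory, 0, 2, signed=False)))  # like ldrh:  0xff80
print(hex(load(memory, 0, 2, signed=True)))   # like ldrsh: 0xffffff80
```

Zero extension fills the upper bits of the register with zeros, whereas sign extension copies the most significant bit of the loaded value into them, so the same byte in memory can yield a small positive or a negative 32-bit value.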
Beyond these four variants, however, there are a couple of more subtle points to keep in mind. The first stems from the fact that the program counter forms part of the general-purpose register file and so is generally addressable. This implies that we've extended the definition of what we previously referred to as a computed branch, because basically any instruction, for example a data processing or data movement instruction, can perform the analogue of a branch simply by specifying the program counter as the destination operand. Such an instruction would write a new value into the program counter as a result of the instruction semantics and thereby have some influence over control flow.

The second point is that in our discussion so far we haven't mentioned the concept of conditional branches at all. Clearly support for conditional branches is important, so the question is how does ARMv7a provide it? The answer is a little unusual, in the sense that within ARMv7a every instruction is conditionally executed using a concept called predicated execution. As a result, a conditional branch instruction is simply a special case of a conditional instruction more generally. The idea is that within the encoding of every instruction i, there's a 4-bit field which identifies a predicate or condition p. The execution of the instruction is modelled as shown in the middle of the slide. Basically, having fetched and decoded the instruction, we then evaluate the predicate. If the predicate evaluates to true, then execution of that instruction proceeds as normal. However, if the predicate evaluates to false, then the instruction is discarded and not executed or, if you prefer, it's translated into a no-op.

From the programmer's perspective, these predicates are specified by adding a suffix onto the end of the instruction identifier within the assembly language description of the instruction. Look at the second example on the slide, for instance.
The instruction that looks like bne is actually a branch or b instruction with the suffix ne added onto the end. NE stands for not equal. This identifies the predicate which tests whether the Z field within the CPSR register is equal to zero. Overall then, the semantics of this instruction are such that if the Z field of the CPSR register is equal to zero, we execute an unconditional branch instruction derived by stripping off the suffix from the original. On the other hand, if the Z field of the CPSR register is not equal to zero, we don't execute that branch instruction; we simply discard it and carry on as normal. This allows us to support exactly the style of conditional branch instruction we wanted, albeit via a more general-purpose mechanism.

This table captures all 16 possible predicates, for each one including a description in logical terms and in English, and also the suffix that one needs to add to the base instruction in order to specify the predicate itself. The final two predicates are interesting cases. The always predicate evaluates to true, meaning that the instruction always executes, i.e., unconditionally. In some sense this is the default. In contrast, the never predicate evaluates to false, meaning that the instruction never executes. I can't think of too many uses for this, but essentially what it means is that any instruction can be turned into a no-op, rather than there being a dedicated no-op instruction.

Effective use of predicated execution, as supported by ARMv7a, requires quite some care with respect to the design and implementation of programs. As an example, consider the GCD function implemented in C, shown on the left-hand side of this slide. Notice that the body of the function makes use of a while loop, whose condition expression tests whether a is not equal to b. The body of the while loop includes an if statement with both an if and an else clause.
The condition expression for the if statement tests whether a is greater than b. The assembly language on the right-hand side of the slide implements this structure in a fairly direct way, assuming that a and b are stored in general-purpose registers number zero and one respectively. It starts by comparing a and b with each other and then uses a BEQ, or branch instruction with the predicate EQ for equal. If a is equal to b, then the branch is executed and the loop terminates. Otherwise, if a is not equal to b, then the branch is not executed and we proceed by executing the body of the loop. The result of the comparison between a and b is retained, so it then uses a BLT, or branch instruction with the predicate LT for less than. If a is less than b, then the branch is executed and control flow moves to the else clause. Otherwise, if a is not less than b, then the branch is not executed and control flow moves to the if clause. Having executed either the if or the else clause, we then make use of an unconditional branch back to the start of the loop in order to repeat the process all over again. Keep in mind that our implementation constitutes seven instructions, and each iteration of the loop requires the execution of either one or three branch instructions depending on whether the loop terminates or not.

Through careful use of predicated execution, we can improve on our implementation as shown on the right-hand side of this slide. This new implementation constitutes just four instructions and requires the execution of just one branch instruction per iteration of the loop. I'm going to intentionally leave it as an exercise for you to think about how and why this works correctly. But at a high level at least, keep in mind that what we've done is flatten the if statement by using predicated execution. Now we have a situation where only one or the other of those subtraction instructions will actually be executed, depending on the result of the comparison between a and b.
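For anyone who wants to check their answer to that exercise, here is a small Python model of predicated execution. The slide is not reproduced in this transcript, so the four-instruction sequence in the comments (cmp, subgt, sublt, bne) is an assumption: it is the well-known way of writing this loop and it matches the description above, but it may differ in detail from the slide. The names `cmp_flags`, `PREDICATES` and `gcd_predicated` are hypothetical, only a handful of the 16 predicates are modelled, and both inputs are assumed to be positive and less than 2^31.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a: int, b: int) -> dict:
    """Flags produced by cmp a, b, i.e. by computing a - b on 32-bit values."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "N": result >> 31,                       # result is negative
        "Z": int(result == 0),                   # result is zero
        "C": int(a >= b),                        # carry set means no borrow
        "V": ((a ^ b) & (a ^ result)) >> 31,     # signed overflow
    }


# A subset of the predicates, expressed in terms of the flags.
PREDICATES = {
    "eq": lambda f: f["Z"] == 1,
    "ne": lambda f: f["Z"] == 0,
    "gt": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "lt": lambda f: f["N"] != f["V"],
    "al": lambda f: True,
}


def gcd_predicated(a: int, b: int) -> int:
    """Model the flattened loop; assumes a and b are positive."""
    r0, r1 = a, b
    while True:
        flags = cmp_flags(r0, r1)         # loop: cmp   r0, r1
        if PREDICATES["gt"](flags):       #       subgt r0, r0, r1
            r0 = (r0 - r1) & MASK32
        if PREDICATES["lt"](flags):       #       sublt r1, r1, r0
            r1 = (r1 - r0) & MASK32
        if not PREDICATES["ne"](flags):   #       bne   loop
            return r0


print(gcd_predicated(12, 18))  # 6
```

The key observation is that neither subtraction has the s suffix, so the flags set by the comparison survive unchanged until the final branch: at most one subtraction executes per iteration, and the branch is taken exactly when the two registers differed at the comparison.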
To sum up then, this was really just a limited introduction to the ARMv7a instruction set architecture, intended to act as a starting point for more hands-on exploration during the associated lab slots. Beyond that though, there are lots of reasons why ARMv7a makes an interesting case study. I've tried to enumerate some of those on the slides. Fundamentally, I think one of the most important reasons is that it acts as a very clear bridge between theory and practice. To put it a different way, in the rest of the week we've really looked at some fairly theoretical, general and quite abstract concepts to do with instruction set architecture and instruction set architecture design. However, hopefully you can see now that a majority of those concepts pan out in this, a real-world instruction set architecture that you can actually make use of concretely. Don't underestimate the value of this when set within the context of the unit objectives more generally. There's a non-trivial chance that you have a device in your pocket or on your desk right now that supports ARMv7a. So if you consider that one of the objectives of the unit was to explain end-to-end how real computers work, then you can think about this lecture as representing one more step towards doing so.

It's intended for use in contexts where you might find an operating system, for example on mobile phones or laptops. It contrasts with alternatives such as the mobile profile, so ARMv7M, which is essentially a special purpose cut-down variant intended for use in embedded computing platforms. Versus some of the examples that we've used so far, ARMv7a is a real instruction set architecture, which means it's implemented within a range of real microarchitectures you can actually hold in your hand and make use of. The example on the slide is of the ARM Cortex-A7 microarchitecture, an instance of which forms a central component within early generations of the Raspberry Pi platform. The ARMv7a instruction set architecture is a risk-like design in the sense that it follows a risk-like philosophy in the majority of cases.

However, it's also possible to identify cases where it deviates from this philosophy and therefore you could argue that it's not a risk design per se. Either way, it's reasonable to classify ARMv7a as a load-store architecture using a word size of 32 bits. I'm going to try and give you a limited introduction to interesting or relevant features in ARMv7a, structured in the same way as when we looked at instruction set architectures in a general sense. That is, I'm first going to cover the specification of state and then the specification of instructions.

Particularly when discussing the specification of instructions, I'm going to make use of ARM assembly language in all of the examples on the slides. It's important to keep in mind that the goal of doing so is simply to show the relationship between how you'd normally express instructions as a programmer and what the execution semantics of those instructions is. The goal definitely isn't that you're an expert assembly language programmer by the end of the lecture, although the associated lab worksheet does explore this topic in some more detail.

ARMv7a specifies a 16-entry general-purpose register file. Within an encoding instruction, we refer to those registers using a 4-bit register address or index. But in an assembly language program, we can refer to the registers in one of two different ways, shown on the left and the right-hand side of this slide respectively. On the left-hand side, you can see the registers referred to by index.

They're simply named R0 through to R15. On the right hand side, in contrast, you can see those same registers referred to using a more human-readable identifier. For example, A1 through to A4 are so called because they're made use of to store arguments during function calls.

On the slides I've written that this is a generalish purpose register file and what I mean by that is that some of the registers clearly do have a special purpose role. You can see for example that register 15 is actually the program counter or to put it different way, in ARMv7a the program counter isn't a special purpose register, it forms part of the general purpose register file. What that means in practice then is that although some of the registers have a special purpose role, they're all generally addressable. That is, any instruction that allows us to specify a register address as some operand can use the program counter as that operand in more or less the same way as any other register.

ARMv7a also specifies two special purpose registers called the Current Program Status Register or CPSR and the Saved Program Status Register or SPSR. I'm going to focus on the CPSR, but keep in mind that both have the same format as Illustrator. in the middle of the slides and can be described as allowing us to control or configure execution in various ways and also inspect the status of execution. What that means concretely is that if we write certain bits in the CPSR then this will act to control execution of subsequent instructions in some way. Conversely if we read certain bits from the CPSR then this gives us some information about the execution of previous instructions.

In the diagram that describes the CPSR format, you can see pertinent examples towards both the right hand or least significant and the left hand or most significant ends. Bits 0 through to 4 constitute a 5-bit field labelled M. This holds the processor mode and reflects the privilege level in some sense.

For example, depending on the processor mode we might either allow or disallow access to certain regions of memory or execution of certain instructions. Bits 28 to 31 on the other hand are four one-bit flags labeled N, Z, C and V. These flags capture the results of any comparisons that might have been performed during execution of previous instructions. Keep in mind that because the CPSR is a special purpose register we can't make use of it in the same way as elements of the general purpose register file. In order to write a value into the CPSR or read a value from it, we need to make use of two special purpose instructions called MRS and MSR. As part of the specification of state, ARMv7a uses a memory model with a byte addressable 2 to the 32 element address space, which you can see illustrated on the slides.

Using the same terminology we had previously, this means the access granularity is 8 bits. i.e. each element that we're able to access is 8 bits in size. The addresses we use to do so are 32 bits in size, meaning that they range from 0 through to 2 to the 32 minus 1. This gives us a total address space size of 4 gigabytes.

The memory model also includes some rules which govern the semantics of accesses within that address space. You can see some example instances listed on the slide here. The first rule relates to alignment.

For example, if we focus on standard 32-bit ARM instructions, the rule says that such instructions must be located at or aligned to a 4-byte boundary. These instructions are allowed to be located at addresses 0, 4 or 8 therefore, but not 1, 3 or 7. The same sort of rules apply to data. For example, 32-bit words, 16-bit half words, and 8-bit bytes must be located at or aligned to 4-byte, 2-byte, or 1-byte boundaries respectively. The second rule relates to endianness. The rule basically says that access to instructions, for instance within the fetch phase of the fetch-decode-execute cycle, will always be little endian.

In contrast, access to data, for example within the execute phase of the fetch-decode-execute cycle, could either be little endian or big endian. Although not unique to ARMv7a, this feature is somewhat unusual. The decision between little-endian and big-endian data access isn't made on a per-instruction basis, but instead controlled globally, in some sense setting execution into a little-endian or big-endian mode.

By default that mode is little-endian, and that's what we'll assume from here on. Moving on to the specification of instructions then. The first topic to consider is how instructions are encoded and therefore decoded. In short, ARMv7a makes use of a fixed length instruction encoding, where each encoded instruction is 32 bits in length.

Looking at the diagram on the slide however, you could argue this is one topic where ARMv7a deviates from a RISC-like philosophy in the sense that it includes many different instruction formats. Some of those formats are relatively complex, which implies the process of encoding and decoding instructions is also relatively complex. Notice that the majority of instruction formats can be described as three address, in the sense that they allow the specification of up to three register address operands.

There are exceptions however. Both the short and long multiply formats allow the specification of four register address operands rather than three. Recall we used exactly this example while discussing various challenges related to instruction encoding. Also notice that the standard data processing format is somewhat unusual. Although this is a three address format in the sense that it allows the specification of three register address operands, the encoding is such that the field associated with the second operand, labelled operand here, is 12 bits rather than 4 bits.
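As a rough illustration of what decoding the standard data processing format involves, the sketch below pulls the individual fields out of a 32-bit instruction word. The field positions follow the usual layout of this format (condition in the top four bits, then the immediate flag, opcode, S flag, two 4-bit register fields and the 12-bit operand field), and the word 0xE0810002 is assumed to be the encoding of add r0, r1, r2. The function name decode_data_processing is hypothetical.

```python
def decode_data_processing(word: int) -> dict:
    """Split a 32-bit data processing instruction word into its fields."""
    return {
        "cond":     (word >> 28) & 0xF,   # 4-bit condition (predicate) field
        "imm_flag": (word >> 25) & 0x1,   # 1 if the operand field holds an immediate
        "opcode":   (word >> 21) & 0xF,   # which data processing operation
        "s_flag":   (word >> 20) & 0x1,   # 1 if the flags should be updated
        "rn":       (word >> 16) & 0xF,   # first source register
        "rd":       (word >> 12) & 0xF,   # destination register
        "operand2": word & 0xFFF,         # 12-bit second operand field
    }

# Assumed encoding of: add r0, r1, r2
fields = decode_data_processing(0xE0810002)
```

For this word the sketch reports destination register 0, first source register 1 and an operand field whose low four bits select register 2. The 12-bit operand field is simply extracted here without interpreting it.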

The purpose and meaning of this is something that we'll come back and explain later in the lecture. As you'd expect, as a general purpose instruction set architecture, ARMv7a includes a range of fairly standard instruction classes. The most straightforward example is the so-called data processing or ALU-like class. This class is straightforward in the sense that selection and semantics of instruction types and variants within it are straightforward. You can see some examples shown on the slide here.

The first two such examples show variants of arithmetic type addition instructions. These respectively allow us to add the immediate one, or the contents of general purpose register two, to general purpose register number one, in both cases storing the result in general purpose register number zero. By default at least, these instructions don't update the N, Z, C or V flags in the special purpose CPSR register.

However, we can enable such updates by adding an s onto the end of the instruction identifier. For example, the instruction adds r0, r1, r2 has the same meaning as add r0, r1, r2, in the sense that it adds general purpose register number one to general purpose register number two and stores the result in general purpose register number zero. However, it also updates those flags in the CPSR register to reflect the result computed.
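The effect of the s suffix can be sketched in Python as a 32-bit addition that also reports the four flags. This is a simplified model, not a description of any particular implementation; the function name adds and the dictionary used for the flags are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF

def adds(a: int, b: int):
    """Model a 32-bit addition that also computes the N, Z, C and V flags."""
    a &= MASK32
    b &= MASK32
    full = a + b                      # may need 33 bits
    result = full & MASK32            # what is written to the destination
    flags = {
        "N": (result >> 31) & 1,      # result is negative when read as signed
        "Z": int(result == 0),        # result is zero
        "C": int(full > MASK32),      # carry out of bit 31
        # signed overflow: both operands differ in sign from the result
        "V": int(((a ^ result) & (b ^ result) & 0x80000000) != 0),
    }
    return result, flags

wrapped, wrapped_flags = adds(0xFFFFFFFF, 1)     # wraps around to zero
overflow, overflow_flags = adds(0x7FFFFFFF, 1)   # largest positive value plus one
```

The first call produces zero with both Z and C set, and the second produces a negative result with V set.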

This allows us, for example, to capture whether or not a carry resulted from the associated addition, or whether that addition produced a result that was zero or non-zero. The so-called flexible second operand of any data processing instruction can take one of four forms shown on the slide here. These relate to the slightly unusual instruction format that we discussed earlier.

We've already seen examples of forms 1 and 2, because these relate to that second operand either being an immediate or a register address respectively. Forms 3 and 4 are slightly more exotic in the sense that they relate to the idea that that second operand specifies data via some limited form of computation. The idea is that the data that we end up with results from taking the contents of some general purpose register and either shifting or rotating it by some distance. That distance might be specified by an immediate or by a register address operand.

One way to at least think about this is that ARMv7a is offering us the ability to fuse together standard data processing instructions with either a shift or a rotate, and as a result we get some form of two for the price of one: we're able to perform two operations for the price of one instruction. This renders instruction execution more efficient in a range of different situations, such as the computation of addresses, where shifts are commonly applied in order to perform some scaling of a base address or an offset.
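The fused shift-then-operate idea can be sketched as follows. The helpers lsl, lsr, asr and ror model the shift and rotate operations on 32-bit values, and add_shifted is a hypothetical name for an addition whose second operand is shifted first, in the spirit of an instruction such as add r0, r1, r2, lsl #2. Only distances from 0 to 31 are handled; the special cases for larger distances are deliberately not modelled.

```python
MASK32 = 0xFFFFFFFF

def lsl(x: int, n: int) -> int:
    """Logical shift left by n bits (0 <= n <= 31)."""
    return (x << n) & MASK32

def lsr(x: int, n: int) -> int:
    """Logical shift right by n bits, filling with zeros."""
    return (x & MASK32) >> n

def asr(x: int, n: int) -> int:
    """Arithmetic shift right by n bits, filling with copies of the sign bit."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32                  # reinterpret as a negative Python integer
    return (x >> n) & MASK32

def ror(x: int, n: int) -> int:
    """Rotate right by n bits."""
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def add_shifted(rn: int, rm: int, shift, distance: int) -> int:
    """Model rd = rn + shift(rm, distance) as one fused operation."""
    return (rn + shift(rm, distance)) & MASK32

# Address of element 3 in an array of 4-byte words starting at 0x1000.
element_address = add_shifted(0x1000, 3, lsl, 2)
```

In the final line the index is scaled by four and added to the base in a single step, which is the address computation case mentioned above.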

In addition to the set of what you might call computational data processing instructions, there's also a small set of comparisons available in ARMv7a. Whereas computational instructions typically store a result in one or more general purpose registers, comparison instructions simply update the flags in the special purpose CPSR register. The first example compares general purpose register number zero with general purpose register number one, updating the flags in the special purpose CPSR register with respect to the computation of their difference, i.e. subtracting general purpose register number one from general purpose register number zero. We can subsequently tell, for example, whether general purpose register number zero is equal to general purpose register number one by inspecting the Z or zero flag within CPSR. If subtracting general purpose register number one from general purpose register number zero produced zero, then we know that they're equal to each other, and if it didn't, they aren't. Keep in mind that unlike an addition for example, we don't need to add an s onto the end of the instruction identifier here in order to force an update to the flags in CPSR.
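A comparison can be modelled as a subtraction whose result is thrown away, keeping only the flags. The sketch below is a simplified model under that reading; the name cmp_flags is an assumption, and the C flag follows the ARM convention that it is set when the subtraction does not need to borrow.

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(a: int, b: int) -> dict:
    """Model a comparison: compute a - b, keep only the N, Z, C and V flags."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32         # the difference itself is discarded
    return {
        "N": (result >> 31) & 1,
        "Z": int(result == 0),        # set exactly when a and b are equal
        "C": int(a >= b),             # set when no borrow was needed
        # signed overflow: operands differ in sign and the result's sign
        # differs from that of the first operand
        "V": int(((a ^ b) & (a ^ result) & 0x80000000) != 0),
    }

equal = cmp_flags(5, 5)
smaller = cmp_flags(3, 5)
```

With equal operands the Z flag is set, whereas comparing 3 with 5 leaves Z clear and sets N, since the difference is negative.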

The only purpose of these instructions is to update those flags, so in some sense the s is implicit. The second important instruction class is the so-called data movement class. Rather than performing computation on data, data movement instructions simply move or copy data around in some way. As a starting point, you could consider the two instruction types illustrated on the slide as being representative of this instruction class. The first type are so-called immediate to register or register to register moves.

The first two examples move some source, either the immediate one or general purpose register number one, into a destination, which in both cases is general purpose register number zero. Keep in mind that the term copy arguably describes what's going on here more precisely. What I mean by that is that having moved general purpose register number one into general purpose register number zero, for example, we don't lose or forget the value stored in general purpose register number one. After the move, both general purpose registers have the same value, namely whatever value general purpose register number one had beforehand.

The second type are so-called single-shot memory accesses. The two examples shown here respectively load general purpose register number zero from, or store it to, memory at an address dictated by general purpose register number one. These are called single-shot memory accesses in the sense that both instructions perform one memory access only. As suggested by the notation used, these instructions are loading and storing 4 bytes or 32 bits of data respectively.

The instruction identifiers LDR and STR can therefore be read as load register and store register. This really is just a starting point, however, and in fact ARMv7a includes a range of much more exotic data movement instructions and addressing modes. The first case on the slide here describes the so-called multi-shot memory accesses. The idea is that the LDM and STM instructions are analogous to LDR and STR, except that instead of loading or storing one general purpose register from or to memory, they load or store N general purpose registers from or to memory.

In this case, N is equal to 3. Of course, we'll perform the same number of memory accesses whether we use N LDR instructions or one LDM instruction. However, using one LDM instruction reduces the memory footprint of our program from N instructions to one instruction, and therefore also reduces the number of fetch-decode-execute cycles that have to be performed. As a result then, using one LDM instruction might plausibly be more efficient. The second case describes so-called multi-shot stack memory access instructions. The push and pop instructions allow us to respectively push values to or pop values from a stack that's maintained in memory.

The two instructions work analogously to LDM and STM in the sense that they load or store multiple general purpose registers to and from memory. However, in addition to doing so, they also manipulate the stack pointer or SP register so as to keep track of where the top of stack is in memory. This slide attempts to give a flavour of some of the different addressing modes that are available in ARMv7a.
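The push and pop behaviour described above can be sketched with a small model of memory, a register file and a stack pointer. This is a simplified illustration under some stated assumptions: the stack grows downwards from the top of a 64-byte memory, each register occupies 4 little-endian bytes, and the lowest numbered register ends up at the lowest address. The class name StackModel and its fields are hypothetical.

```python
class StackModel:
    """Tiny model of a descending stack of 32-bit words held in memory."""

    def __init__(self, size: int = 64):
        self.memory = bytearray(size)
        self.sp = size                # stack pointer starts at the top of memory
        self.regs = [0] * 16          # general purpose registers

    def push(self, reg_list):
        """Store several registers to memory and move the stack pointer down."""
        regs = sorted(reg_list)
        self.sp -= 4 * len(regs)
        for i, r in enumerate(regs):
            addr = self.sp + 4 * i
            self.memory[addr:addr + 4] = self.regs[r].to_bytes(4, "little")

    def pop(self, reg_list):
        """Load several registers from memory and move the stack pointer up."""
        regs = sorted(reg_list)
        for i, r in enumerate(regs):
            addr = self.sp + 4 * i
            self.regs[r] = int.from_bytes(self.memory[addr:addr + 4], "little")
        self.sp += 4 * len(regs)

model = StackModel()
model.regs[0:3] = [0x11, 0x22, 0x33]
model.push([0, 1, 2])                 # three registers, one instruction
sp_after_push = model.sp
model.regs[0:3] = [0, 0, 0]           # overwrite the registers
model.pop([0, 1, 2])                  # restore them from the stack
```

After the push the stack pointer has moved down by 12 bytes, and after the matching pop both the registers and the stack pointer are back to their original values.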

The first case captures a range of examples that differ in terms of their access granularity. For example, the first two variants load one byte or eight bits of data from memory at an address determined by general purpose register number one, whereas the bottom two variants load 2 bytes or 16 bits of data. Within the top and bottom, you can identify variants that either sign extend or zero extend the value loaded from memory in order to form a 32-bit result, which is then stored in general purpose register number zero. The second case captures various examples that differ in terms of the addressing mode used. For example, although the assembly language syntax used might be completely unfamiliar, the second and third variants are very directly using the concepts of indexed and scaled index addressing modes that we encountered previously.
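Both ideas can be sketched in a few lines of Python. The function load models a little-endian load of 1, 2 or 4 bytes with either sign or zero extension to 32 bits, and effective_address models the indexed and scaled index address computations. The names, the example memory contents and the comments naming byte and half word loads are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF

def load(memory: bytes, address: int, size: int, signed: bool) -> int:
    """Load size bytes (little endian) and extend the value to 32 bits."""
    raw = memory[address:address + size]
    value = int.from_bytes(raw, "little", signed=signed)
    return value & MASK32             # negative values become sign extended

def effective_address(base: int, index: int = 0, shift: int = 0) -> int:
    """Indexed (shift == 0) or scaled index (shift > 0) address computation."""
    return (base + (index << shift)) & MASK32

memory = bytes([0x80, 0xFF, 0x34, 0x12])

byte_zero = load(memory, 0, 1, signed=False)   # zero extended byte load
byte_sign = load(memory, 0, 1, signed=True)    # sign extended byte load
half_zero = load(memory, 0, 2, signed=False)   # zero extended half word load
half_sign = load(memory, 0, 2, signed=True)    # sign extended half word load

# Address of word number 3 in a table of 4-byte words starting at 0x2000.
scaled = effective_address(0x2000, index=3, shift=2)
```

The byte 0x80 becomes 0x00000080 when zero extended but 0xFFFFFF80 when sign extended, because its top bit is set.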

The third and final important instruction class relates to the management of control flow. This is an interesting instruction class within the context of ARMv7a because of the unusual way in which control flow is managed. At face value at least, there are only two control flow instruction types available: the so-called branch or B instruction, and the branch and link or BL instruction.

Both of these instruction types have immediate and computed variants, which imply relative and absolute branch types respectively. You can see examples of the four possible variants on the slide here. Beyond this, however, there are a couple of more subtle points to keep in mind.

The first of which stems from the fact that the program counter forms part of the general purpose register file and so is therefore generally addressable. This implies that we've extended the definition of what we previously referred to as a computed branch, because basically any instruction, for example a data processing or data movement instruction, can perform the analogy of a branch simply by specifying the program counter as the target operand. Such an instruction would write a new value into the program counter as a result of the instruction semantics and thereby have some influence over control flow. The second point is that in our discussion so far we haven't mentioned the concept of conditional branches at all. Clearly support for conditional branches is important, so the question is how does ARMv7a do that?

The answer is a little unusual in the sense that within ARMv7a every instruction is conditionally executed using a concept called predicated execution. As a result, the concept of a conditional branch instruction is simply a special case of conditional instruction more generally. The idea is that within the encoding of every instruction i, there's a 4-bit field which identifies a predicate or condition p. The execution of the instruction is modelled as shown in the middle of the slide.

Basically, having fetched and decoded the instruction, we then evaluate the predicate. If the predicate evaluates to true, then execution of that instruction proceeds as normal. However, if the predicate evaluates to false, then the instruction is discarded and not executed, or if you prefer, it's translated into a no-op. From the programmer's perspective, these predicates are specified by adding a suffix onto the end of the instruction identifier within the assembly language description of the instruction.

Look at the second example on the slide for instance. The instruction that looks like bne is actually a branch or b instruction with the suffix ne added onto the end. NE stands for not equals.

This identifies the predicate which tests whether the Z field within the CPSR register is equal to zero. Overall then, the semantics of this instruction are such that if the Z field of the CPSR register is equal to zero, we execute an unconditional branch instruction derived by stripping off the suffix from the original. On the other hand, if the Z field of the CPSR register is not equal to zero, we don't execute that branch instruction, we simply discard it and carry on as normal. Overall then, this allows us to support exactly the style of conditional branch instruction we wanted, albeit via a more general purpose mechanism. This table captures all 16 possible predicates, for each one including a description in logical terms, in English language, and also the suffix that one needs to add to the base instruction in order to specify the predicate itself.
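The predicates in such a table can be written down as tests on the four CPSR flags. The sketch below does this in Python for the fourteen flag-based conditions plus always; the sixteenth encoding is left out of the model. The dictionary CONDITIONS and the function executes are hypothetical names, and flags are represented as a dictionary with the keys N, Z, C and V.

```python
CONDITIONS = {
    "eq": lambda f: f["Z"] == 1,                       # equal
    "ne": lambda f: f["Z"] == 0,                       # not equal
    "cs": lambda f: f["C"] == 1,                       # carry set
    "cc": lambda f: f["C"] == 0,                       # carry clear
    "mi": lambda f: f["N"] == 1,                       # negative
    "pl": lambda f: f["N"] == 0,                       # positive or zero
    "vs": lambda f: f["V"] == 1,                       # overflow
    "vc": lambda f: f["V"] == 0,                       # no overflow
    "hi": lambda f: f["C"] == 1 and f["Z"] == 0,       # unsigned higher
    "ls": lambda f: f["C"] == 0 or f["Z"] == 1,        # unsigned lower or same
    "ge": lambda f: f["N"] == f["V"],                  # signed greater or equal
    "lt": lambda f: f["N"] != f["V"],                  # signed less than
    "gt": lambda f: f["Z"] == 0 and f["N"] == f["V"],  # signed greater than
    "le": lambda f: f["Z"] == 1 or f["N"] != f["V"],   # signed less or equal
    "al": lambda f: True,                              # always
}

def executes(suffix: str, flags: dict) -> bool:
    """Return True if an instruction with this suffix would be executed."""
    return CONDITIONS[suffix](flags)

flags_after_unequal_compare = {"N": 1, "Z": 0, "C": 0, "V": 0}
flags_after_equal_compare = {"N": 0, "Z": 1, "C": 1, "V": 0}

bne_taken = executes("ne", flags_after_unequal_compare)
bne_skipped = not executes("ne", flags_after_equal_compare)
```

With the Z flag clear the ne predicate holds and the branch executes; with the Z flag set it does not, and the instruction is discarded.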

The final two predicates are interesting cases. The always predicate evaluates to true, meaning that the instruction always executes unconditionally. In some sense this is the default. In contrast, the never predicate evaluates to false, meaning that the instruction never executes.

I can't think of too many uses for this, but essentially what it means is that any instruction can be turned into a no-op, rather than there being a dedicated no-op instruction. Effective use of predicated execution, as supported by ARMv7a, requires quite some care with respect to the design and implementation of programs. As an example, consider the GCD function implemented in C, shown on the left-hand side of this slide.

Notice that the body of the function makes use of a while loop, whose condition expression tests whether a is not equal to b. The body of the while loop includes an if statement with both an if and an else clause. The condition expression for the if statement tests whether a is greater than b.

The assembly language on the right hand side of the slide implements this structure in a fairly direct way, assuming that A and B are stored in general purpose registers number 0 and 1 respectively. It starts by comparing A and B with each other and then uses a BEQ, or branch instruction with the predicate EQ for equals. If A is equal to B then the branch is executed and the loop terminates. Otherwise if A is not equal to B then the branch is not executed and we proceed by executing the body of the loop. The result of the comparison between A and B is retained, so it then uses a BLT, or branch instruction with the predicate LT for less than.

If A is less than B, then the branch is executed and control flow moves to the else clause. Otherwise, if A is not less than B, then the branch is not executed and control flow moves to the if clause. Having executed either the if or the else clause, we then make use of an unconditional branch back to the start of the loop in order to repeat the process all over again.

Keep in mind that our implementation constitutes seven instructions and each iteration of the loop requires the execution of either one or three branch instructions depending on whether the loop terminates or not. Through careful use of predicated execution, we can improve on our implementation as shown on the right-hand side of this slide. This new implementation constitutes just four instructions and requires the execution of just one branch instruction per iteration of the loop.
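One way to experiment with the four-instruction version is to simulate it. The Python sketch below does so under several assumptions: the listing in the comments (cmp, subgt, sublt, bne) is a reconstruction of what such a loop might look like rather than a copy of the slide, both inputs are assumed to be positive, and the function names are hypothetical. A plain Python version of the subtraction-based GCD is included for comparison.

```python
MASK32 = 0xFFFFFFFF

def gcd_reference(a: int, b: int) -> int:
    """Subtraction-based GCD with the while/if/else structure described above."""
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a

def compare(a: int, b: int) -> dict:
    """Flags resulting from comparing a with b, i.e. from computing a - b."""
    result = (a - b) & MASK32
    return {
        "N": (result >> 31) & 1,
        "Z": int(result == 0),
        "V": int(((a ^ b) & (a ^ result) & 0x80000000) != 0),
    }

def gcd_predicated(r0: int, r1: int):
    """Simulate the assumed loop; return the result and the branches executed."""
    branches = 0
    while True:
        f = compare(r0, r1)                      # loop: cmp   r0, r1
        if f["Z"] == 0 and f["N"] == f["V"]:
            r0 = (r0 - r1) & MASK32              #       subgt r0, r0, r1
        if f["N"] != f["V"]:
            r1 = (r1 - r0) & MASK32              #       sublt r1, r1, r0
        branches += 1                            #       bne   loop
        if f["Z"] == 1:
            return r0, branches

result, branch_count = gcd_predicated(12, 18)
```

Note that neither subtraction carries an s suffix, so the flags set by the comparison survive until the final branch. For the inputs 12 and 18 the loop runs three times and executes one branch instruction per iteration.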

I'm going to intentionally leave it as an exercise for you to think about how and why this works correctly. But at a high level at least, keep in mind that what we've done is flattened the if statement by using predicated execution. Now we have a situation where only one or the other of those subtraction instructions will actually be executed, depending on the result of the comparison between A and B. To sum up then, this was really just a limited introduction to the ARMv7a instruction set architecture, intended to act as a starting point for more hands-on exploration during the associated lab slots.

Beyond that though, there are lots of reasons why ARMv7a makes an interesting case study. I've tried to enumerate some of those on the slide. Fundamentally, I think one of the most important reasons is that it acts as a very clear bridge between theory and practice.

To put it a different way, in the rest of the week we've really looked at some fairly theoretical, general and quite abstract concepts to do with instruction set architecture and instruction set architecture design. However, hopefully you can see now that a majority of those concepts pan out in this, a real world instruction set architecture that you can actually make use of concretely. Don't underestimate the value of this when set within the context of the unit objectives more generally. There's a non-trivial chance that you have a device in your pocket or on your desk right now that supports ARMv7a.

So if you consider that one of the objectives of the unit was to explain end-to-end how real computers work, then you can think about this lecture as representing one more step towards doing so.
