Think In Geek

In geek we trust

# ARM assembler in Raspberry Pi – Chapter 13

So far, all examples have dealt with integer values. But processors would be rather limited if they were only able to work with integer values. Fortunately they can work with floating point numbers. In this chapter we will see how we can use the floating point facilities of our Raspberry Pi.

## Floating point numbers

What follows is a quick recap of what a floating point number is.

A binary floating point number is an approximate representation of a real number with three parts: sign, mantissa and exponent. The sign is a single bit: 1 means a negative number and 0 a positive one. The mantissa represents a fractional magnitude. Similarly to 1.2345 we can have a binary `1.01110` where every digit is just a bit. The dot marks where the integer part ends and the fractional part starts. Note that there is nothing special in binary fractional numbers: `1.01110` is just $2^0 + 2^{-2} + 2^{-3} + 2^{-4} = 1.4375$ in decimal.

Usually numbers are normalized. This means that the mantissa is adjusted so the integer part is always 1, so instead of 0.00110101 we would represent 1.10101 (in fact a floating point number may be a denormal if this property does not hold, but such numbers lie in a very specific range so we can ignore them here). If the mantissa is adjusted so it always has a single 1 as the integer part, two things happen. First, we do not need to store the integer part (as it is always 1 in normalized numbers). Second, to make things sound we need an exponent which compensates for the mantissa being normalized. This means that the number -101.110111 (remember that it is a binary real number) will be represented by sign = 1, mantissa = 1.01110111 and exponent = 2 (because we moved the dot 2 digits to the left). Similarly, the number 0.0010110111 is represented with sign = 0, mantissa = 1.0110111 and exponent = -3 (we moved the dot 3 digits to the right).
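
To make the normalization step concrete, here is a small sketch in Python (used here only as a calculator, it is not part of the ARM toolchain). The helper names `parse_binary` and `normalize` are made up for this illustration. It parses a binary real number written as text and recovers the sign, the normalized mantissa and the exponent discussed above.

```python
import math

def parse_binary(text):
    """Return the value of a binary real such as '-101.110111'."""
    negative = text.startswith("-")
    digits = text.lstrip("+-")
    integer_part, _, fraction_part = digits.partition(".")
    value = float(int(integer_part, 2)) if integer_part else 0.0
    # Each bit after the dot weighs 2^-1, 2^-2, 2^-3, ...
    for position, bit in enumerate(fraction_part, start=1):
        value += int(bit) * 2.0 ** -position
    return -value if negative else value

def normalize(value):
    """Split a nonzero value into (sign, mantissa in [1, 2), exponent)."""
    sign = 1 if value < 0 else 0
    # math.frexp returns m in [0.5, 1) with abs(value) == m * 2**e,
    # so doubling m and decrementing e gives the 1.xxx form.
    m, e = math.frexp(abs(value))
    return sign, m * 2.0, e - 1

print(parse_binary("1.01110"))                    # 1.4375
print(normalize(parse_binary("-101.110111")))     # (1, 1.46484375, 2)
print(normalize(parse_binary("0.0010110111")))    # (0, 1.4296875, -3)
```

The mantissas printed in decimal, 1.46484375 and 1.4296875, are the binary values 1.01110111 and 1.0110111 respectively.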

In order for different computers to be able to share floating point numbers, IEEE 754 standardizes the format of a floating point number. VFPv2 supports two of the IEEE 754 formats: Binary32 and Binary64, usually known by their C types, `float` and `double`, or as single- and double-precision, respectively. In a single-precision floating point number the mantissa is 23 bits (+1 for the implicit integer one of normalized numbers) and the exponent is 8 bits (so the exponent ranges from -126 to 127). In a double-precision floating point number the mantissa is 52 bits (+1) and the exponent is 11 bits (so the exponent ranges from -1022 to 1023). A single-precision floating point number occupies 32 bits and a double-precision floating point number occupies 64 bits. Operating on double-precision numbers is, on average, one and a half to two times slower than operating on single-precision ones.
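
The field widths above can be checked with a short Python sketch that pulls a value apart into its three fields. The function names are hypothetical, and the exponent biases (127 and 1023) come from IEEE 754 itself, they are not spelled out in the text above. The sketch only makes sense for normalized numbers.

```python
import struct

def decode_binary32(x):
    """Return (sign, unbiased exponent, 23-bit fraction) of x as a float."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF      # 8 exponent bits
    fraction = bits & 0x7FFFFF                 # 23 stored mantissa bits
    return sign, biased_exponent - 127, fraction

def decode_binary64(x):
    """Return (sign, unbiased exponent, 52-bit fraction) of x as a double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF     # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)          # 52 stored mantissa bits
    return sign, biased_exponent - 1023, fraction

# -101.110111 in binary is -5.859375 in decimal
sign, exponent, fraction = decode_binary32(-5.859375)
print(sign, exponent, format(fraction, "023b"))
```

The printed fraction starts with `01110111`, which is the mantissa 1.01110111 without its leading integer 1, since that bit is implicit.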

Goldberg’s famous paper, “What Every Computer Scientist Should Know About Floating-Point Arithmetic”, is a classic reference that anyone serious about using floating point numbers should read.

## Coprocessors

As I stated several times in earlier chapters, ARM was designed to be very flexible. We can see this in the fact that ARM architecture provides a generic coprocessor interface. Manufacturers of system-on-chips may bundle additional coprocessors. Each coprocessor is identified by a number and provides specific instructions. For instance the Raspberry Pi SoC is a BCM2835 which provides a multimedia coprocessor (which we will not discuss here).

That said, there are two standard coprocessors in the ARMv6 architecture: 10 and 11. These two coprocessors provide floating point support for single and double precision, respectively. Although the floating point instructions have their own specific names, they are actually mapped to generic coprocessor instructions targeting coprocessor 10 and 11.

## Vector Floating-point v2

ARMv6 defines a floating point subarchitecture called the Vector Floating-point v2 (VFPv2). It is version 2 because earlier ARM architectures supported a simpler form, now called v1. As stated above, the VFP is implemented on top of the two standardized coprocessors 10 and 11. ARMv6 does not require VFPv2 to be implemented in hardware (one can always resort to a slower software implementation). Fortunately, the Raspberry Pi does provide a hardware implementation of VFPv2.

## VFPv2 Registers

We already know that the ARM architecture provides 16 general purpose registers `r0` to `r15`, where some of them play special roles: `r13`, `r14` and `r15`. Despite their name, these general purpose registers cannot be used to operate on floating point numbers, so VFPv2 provides us with some specific registers. These registers are named `s0` to `s31`, for single-precision, and `d0` to `d15` for double-precision. These are not 48 different registers. Instead every `dn` is mapped to two consecutive registers, `s(2n)` and `s(2n+1)`, where 0 ≤ `n` ≤ 15. For instance, `d1` overlaps `s2` and `s3`.

These registers are structured in 4 banks: `s0` to `s7` (`d0` to `d3`), `s8` to `s15` (`d4` to `d7`), `s16` to `s23` (`d8` to `d11`) and `s24` to `s31` (`d12` to `d15`). We will call the first bank (bank 0, `s0` to `s7`, `d0` to `d3`) the scalar bank, while the remaining three are vectorial banks (below we will see why).
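
The overlap between single- and double-precision registers and the bank layout are plain index arithmetic. The following Python sketch (the function names are invented for this example) spells it out.

```python
def singles_of(n):
    """Names of the two single-precision registers that alias d<n>."""
    if not 0 <= n <= 15:
        raise ValueError("VFPv2 only has d0 to d15")
    return (f"s{2 * n}", f"s{2 * n + 1}")

def double_of(m):
    """Name of the double-precision register that contains s<m>."""
    if not 0 <= m <= 31:
        raise ValueError("VFPv2 only has s0 to s31")
    return f"d{m // 2}"

def bank_of(name):
    """Bank number (0 to 3) of a register such as 's17' or 'd9'."""
    index = int(name[1:])
    # A bank holds 8 single-precision or 4 double-precision registers
    return index // 8 if name[0] == "s" else index // 4

print(singles_of(1))    # ('s2', 's3')
print(double_of(17))    # d8
print(bank_of("s17"), bank_of("d8"))   # 2 2
```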

VFPv2 provides three control registers but we will only be interested in one called `fpscr`. This register is similar to the `cpsr` as it keeps the usual comparison flags `N`, `Z`, `C` and `V`. It also stores two fields that are very useful, `len` and `stride`. These two fields control how floating point instructions behave. We will not care much about the remaining information in this register: status information of the floating point exceptions, the current rounding mode and whether denormal numbers are flushed to zero.

## Arithmetic operations

Most VFPv2 instructions are of the form `vname Rdest, Rsource1, Rsource2` or `vname Rdest, Rsource1`. They have three modes of operation.

• Scalar. This mode is used when the destination register is in bank 0 (`s0` to `s7` or `d0` to `d3`). In this case, the instruction operates only with `Rsource1` and `Rsource2`. No other registers are involved.
• Vectorial. This mode is used when the destination register and Rsource2 (or Rsource1 for instructions with only one source register) are not in bank 0. In this case the instruction will operate on as many registers (starting from the register given in the instruction and wrapping around the bank of that register) as defined in the field `len` of the `fpscr` (at least 1). The next register operated on is defined by the `stride` field of the `fpscr` (at least 1). If wrap-around happens, no register can be operated on twice.
• Scalar expanded (also called mixed vector/scalar). This mode is used if Rsource2 (or Rsource1 if the instruction only has one source register) is in bank 0, but the destination is not. In this case Rsource2 (or Rsource1 for instructions with only one source) is left fixed as the source. The remaining registers are operated on as in the vectorial case (that is, using `len` and `stride` from the `fpscr`).

OK, this looks pretty complicated, so let’s see some examples. Most instructions end in `.f32` if they operate on single-precision and in `.f64` if they operate on double-precision. We can add two single-precision numbers using `vadd.f32 Rdest, Rsource1, Rsource2` and two double-precision numbers using `vadd.f64 Rdest, Rsource1, Rsource2`. Note also that we can use predication in these instructions (but be aware that, as usual, predication uses the flags in `cpsr`, not those in `fpscr`). Predication is specified before the suffix, as in `vaddne.f32`.

```
/* For this example assume that len = 4, stride = 2 */
vadd.f32 s1, s2, s3     /* s1 ← s2 + s3.
                           Scalar operation because Rdest = s1 is in bank 0 */
vadd.f32 s1, s8, s15    /* s1 ← s8 + s15. Ditto */
vadd.f32 s8, s16, s24   /* s8  ← s16 + s24
                           s10 ← s18 + s26
                           s12 ← s20 + s28
                           s14 ← s22 + s30
                           or more compactly
                           {s8,s10,s12,s14} ← {s16,s18,s20,s22} + {s24,s26,s28,s30}
                           Vectorial, since Rdest and Rsource2 are not in bank 0 */
vadd.f32 s10, s16, s24  /* {s10,s12,s14,s8} ← {s16,s18,s20,s22} + {s24,s26,s28,s30}.
                           Vectorial, but note the wraparound inside the bank after s14 */
vadd.f32 s8, s16, s3    /* {s8,s10,s12,s14} ← {s16,s18,s20,s22} + {s3,s3,s3,s3}
                           Scalar expanded since Rsource2 is in bank 0 */
```
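
The register selection in these examples can be reproduced with a small Python model. This is only a sketch of the rules described above for single-precision registers, it is not how the hardware is specified, and the names `vector_registers` and `vadd_f32_plan` are made up for this illustration.

```python
def vector_registers(first, length, stride, bank_size=8):
    """Indices of the s registers used by one vector operand.

    Starts at s<first> and advances by stride, wrapping around
    inside the bank that contains s<first>.
    """
    bank_start = (first // bank_size) * bank_size
    offset = first - bank_start
    return [bank_start + (offset + i * stride) % bank_size
            for i in range(length)]

def vadd_f32_plan(dest, src1, src2, length, stride):
    """Return the (dest, src1, src2) register index lists of a vadd.f32."""
    if dest < 8:
        # Destination in bank 0: scalar operation
        return [dest], [src1], [src2]
    dests = vector_registers(dest, length, stride)
    firsts = vector_registers(src1, length, stride)
    if src2 < 8:
        # Scalar expanded: the second source stays fixed
        seconds = [src2] * length
    else:
        # Vectorial
        seconds = vector_registers(src2, length, stride)
    return dests, firsts, seconds

# vadd.f32 s10, s16, s24 with len = 4 and stride = 2
print(vadd_f32_plan(10, 16, 24, length=4, stride=2))
# ([10, 12, 14, 8], [16, 18, 20, 22], [24, 26, 28, 30])
```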

Once we have a rough idea of how we can operate floating points in VFPv2, a question remains: how do we load/store floating point values from/to memory? VFPv2 provides several specific load/store instructions.

We load/store one single-precision floating point using `vldr`/`vstr`. The address of the load/store must be already in a general purpose register, although we can apply an offset in bytes which must be a multiple of 4 (this applies to double-precision as well).

```
vldr s1, [r3]         /* s1 ← *r3 */
vldr s2, [r3, #4]     /* s2 ← *(r3 + 4) */
vldr s3, [r3, #8]     /* s3 ← *(r3 + 8) */
vldr s4, [r3, #12]    /* s4 ← *(r3 + 12) */

vstr s10, [r4]        /* *r4 ← s10 */
vstr s11, [r4, #4]    /* *(r4 + 4) ← s11 */
vstr s12, [r4, #8]    /* *(r4 + 8) ← s12 */
vstr s13, [r4, #12]   /* *(r4 + 12) ← s13 */
```

We can load/store several registers with a single instruction. In contrast to general load/store, we cannot load an arbitrary set of registers but instead they must be a sequential set of registers.

```
// Here precision can be s or d for single-precision and double-precision
// floating-point-register-set is {sFirst-sLast} for single-precision
// and {dFirst-dLast} for double-precision
vldm indexing-mode precision Rbase{!}, floating-point-register-set
vstm indexing-mode precision Rbase{!}, floating-point-register-set
```

The behaviour is similar to the indexing modes we saw in chapter 10. There is an Rbase register used as the base address of several loads/stores to/from floating point registers. There are only two indexing modes: increment after and decrement before. When using increment after, the address used to load/store each floating point register is increased by the size of the value (4 bytes for single-precision, 8 bytes for double-precision) after the load/store has happened. When using decrement before, the base address is first decremented by as many bytes as are going to be loaded/stored. Rbase is always updated in decrement before but it is optional to update it in increment after.

```
vldmias r4, {s3-s8}     /* s3 ← *r4
                           s4 ← *(r4 + 4)
                           s5 ← *(r4 + 8)
                           s6 ← *(r4 + 12)
                           s7 ← *(r4 + 16)
                           s8 ← *(r4 + 20) */
vldmias r4!, {s3-s8}    /* Like the previous instruction
                           but at the end r4 ← r4 + 24 */
vstmdbs r5!, {s12-s13}  /* *(r5 - 4 * 2) ← s12
                           *(r5 - 4 * 1) ← s13
                           r5 ← r5 - 4*2 */
```
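
The addresses touched by these two indexing modes can be worked out with a little arithmetic. The Python sketch below models it, with hypothetical function names and made-up base addresses. `size` is 4 for single-precision registers and would be 8 for double-precision ones.

```python
def increment_after(base, count, size=4):
    """Addresses used by an increment-after transfer and the updated base.

    The updated base is only written back when Rbase is followed by '!'.
    """
    addresses = [base + i * size for i in range(count)]
    return addresses, base + count * size

def decrement_before(base, count, size=4):
    """Addresses used by a decrement-before transfer and the updated base."""
    start = base - count * size
    addresses = [start + i * size for i in range(count)]
    return addresses, start

# Like loading {s3-s8} with r4 = 0x1000: six words, r4 would end at 0x1018
addresses, new_base = increment_after(0x1000, 6)
print([hex(a) for a in addresses], hex(new_base))

# Like storing {s12-s13} with r5 = 0x2000: s12 goes to r5 - 8, s13 to r5 - 4
addresses, new_base = decrement_before(0x2000, 2)
print([hex(a) for a in addresses], hex(new_base))
```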

For the usual stack operations, when we push several floating point registers onto the stack we will use `vstmdb` with `sp!` as the base register. To pop from the stack we will use `vldmia`, again with `sp!` as the base register. Given that these instruction names are very hard to remember, we can use the mnemonics `vpush` and `vpop`, respectively.

```
vpush {s0-s5}   /* Equivalent to vstmdb sp!, {s0-s5} */
vpop {s0-s5}    /* Equivalent to vldmia sp!, {s0-s5} */
```

## Movements between registers

Another operation that may be required sometimes is moving among registers. Similar to the `mov` instruction for general purpose registers there is the `vmov` instruction. Several movements are possible.

We can move floating point values between two floating point registers of the same precision.

`vmov s2, s3 /* s2 ← s3 */`

Between one general purpose register and one single-precision register. Note that data is not converted. Only bits are copied around, so be careful not to mix floating point values with integer instructions or the other way round.

```
vmov s2, r3   /* s2 ← r3 */
vmov r4, s5   /* r4 ← s5 */
```
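
To see why a bit copy is not a conversion, we can reinterpret the same 32 bits both ways. This Python sketch (the helper names are invented here) mimics what happens to the bit pattern, not the instruction itself.

```python
import struct

def bits_as_float(word):
    """Reinterpret a 32-bit integer pattern as a single-precision value."""
    return struct.unpack("<f", struct.pack("<I", word))[0]

def float_as_bits(value):
    """Return the 32-bit pattern of a single-precision value."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

# The integer 1 copied bit by bit is a tiny denormal, not 1.0
print(bits_as_float(1))

# And the floating point 1.0 read as an integer is a large number
print(hex(float_as_bits(1.0)))   # 0x3f800000
```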

Like the previous case but between two general purpose registers and two consecutive single-precision registers.

```
vmov s2, s3, r4, r10   /* s2 ← r4
                          s3 ← r10 */
```

Between two general purpose registers and one double-precision register. Again, note that data is not converted.

```
vmov d3, r4, r6   /* Lower32BitsOf(d3) ← r4
                     Higher32BitsOf(d3) ← r6 */
vmov r5, r7, d4   /* r5 ← Lower32BitsOf(d4)
                     r7 ← Higher32BitsOf(d4) */
```
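
The split of a double-precision value into its lower and higher 32 bits can be mimicked in Python. `split_double` and `join_double` are hypothetical helpers written for this sketch. They model the bit pattern only.

```python
import struct

def split_double(value):
    """Return (lower 32 bits, higher 32 bits) of a double-precision value."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    return bits & 0xFFFFFFFF, bits >> 32

def join_double(low, high):
    """Rebuild a double-precision value from its two 32-bit halves."""
    return struct.unpack("<d", struct.pack("<Q", (high << 32) | low))[0]

low, high = split_double(1.0)
print(hex(low), hex(high))       # 0x0 0x3ff00000
print(join_double(low, high))    # 1.0
```

Sign, exponent and the top of the mantissa all live in the higher half, which is why neither half is meaningful as a number on its own.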

## Conversions

Sometimes we need to convert from an integer to a floating point value and the other way round. Note that some conversions may lose precision, in particular when a floating point value is converted to an integer. There is a single instruction `vcvt` with a suffix `.T.S` where `T` (target) and `S` (source) can be `u32`, `s32`, `f32` and `f64` (`S` must be different from `T`, and at least one of them must be a floating point type). Both registers must be floating point registers, so in order to convert an integer to floating point or a floating point value to an integer, an extra `vmov` instruction from or to an integer register will be required before or after the conversion. Because of this, for a moment (between the two instructions) a floating point register will contain a value which is not an IEEE 754 value. Bear this in mind.

```
vcvt.f64.f32 d0, s0   /* Converts s0 single-precision value
                         to a double-precision value and stores it in d0 */

vcvt.f32.f64 s0, d0   /* Converts d0 double-precision value
                         to a single-precision value and stores it in s0 */

vmov s0, r0           /* Bit copy from integer register r0 to s0 */
vcvt.f32.s32 s0, s0   /* Converts s0 signed integer value
                         to a single-precision value and stores it in s0 */

vmov s0, r0           /* Bit copy from integer register r0 to s0 */
vcvt.f32.u32 s0, s0   /* Converts s0 unsigned integer value
                         to a single-precision value and stores it in s0 */

vmov s0, r0           /* Bit copy from integer register r0 to s0 */
vcvt.f64.s32 d0, s0   /* Converts s0 signed integer value
                         to a double-precision value and stores it in d0 */

vmov s0, r0           /* Bit copy from integer register r0 to s0 */
vcvt.f64.u32 d0, s0   /* Converts s0 unsigned integer value
                         to a double-precision value and stores it in d0 */
```
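
The loss of precision mentioned above is easy to observe. The following Python sketch models the numeric effect of two conversions, with made-up helper names. It assumes that the floating point to integer conversion discards the fractional part (which, as far as I know, is what the plain `vcvt` does) and it ignores out-of-range values.

```python
import struct

def to_single(value):
    """Round a number to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]

def float_to_s32(value):
    """Model of a float to signed integer conversion that truncates."""
    return int(value)   # int() discards the fractional part

# A 32-bit integer may need more than the 24 bits of a single-precision mantissa
print(to_single(16777216))   # 16777216.0
print(to_single(16777217))   # 16777216.0, the last bit is lost

# Converting to an integer drops the fractional part
print(float_to_s32(2.84))    # 2
print(float_to_s32(-2.84))   # -2
```

Converting the same integers to double-precision would be exact, since 52 bits of mantissa are more than enough for any 32-bit integer.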

## Modifying fpscr

The special register `fpscr`, where `len` and `stride` are set, cannot be modified directly. Instead we have to load `fpscr` into a general purpose register using the `vmrs` instruction. Then we operate on the register and move it back to the `fpscr` using the `vmsr` instruction.

The value of `len` is stored in bits 16 to 18 of `fpscr`. The value of `len` is not stored directly in these bits. Instead, we have to subtract 1 before setting the bits. This is because `len` cannot be 0 (it does not make sense to operate on 0 floating point values). This way the value `000` in these bits means `len` = 1, `001` means `len` = 2, …, `111` means `len` = 8. The following code sets `len` to 8.

```
/* Set the len field of fpscr to be 8 (bits: 111) */
mov r5, #7            /* r5 ← 7. 7 is 111 in binary */
mov r5, r5, LSL #16   /* r5 ← r5 << 16 */
vmrs r4, fpscr        /* r4 ← fpscr */
orr r4, r4, r5        /* r4 ← r4 | r5. Bitwise OR */
vmsr fpscr, r4        /* fpscr ← r4 */
```
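
The bit manipulation on the `len` field can be checked with a Python model. `set_len` and `get_len` are hypothetical names and the integer argument simply stands for the 32-bit contents of `fpscr`. Unlike the assembler above, which only ORs bits in (enough for `len` = 8, where all three bits end up set), this sketch clears the field first so that it also works when lowering `len`.

```python
LEN_SHIFT = 16
LEN_MASK = 0b111 << LEN_SHIFT   # bits 16 to 18

def set_len(fpscr, length):
    """Return a copy of fpscr with the len field encoding the given length."""
    if not 1 <= length <= 8:
        raise ValueError("len must be between 1 and 8")
    cleared = fpscr & ~LEN_MASK & 0xFFFFFFFF
    return cleared | ((length - 1) << LEN_SHIFT)

def get_len(fpscr):
    """Return the logical len encoded in fpscr (stored value plus one)."""
    return ((fpscr & LEN_MASK) >> LEN_SHIFT) + 1

print(hex(set_len(0, 8)))                   # 0x70000
print(get_len(set_len(0x00070000, 4)))      # 4
print(hex(set_len(0x00070000, 1)))          # 0x0
```

In assembler, clearing the field first would take one extra instruction such as `bic` before the `orr`.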

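`stride` is stored in bits 20 to 21 of `fpscr`. Unlike `len`, it is not encoded as the value minus one. As far as the ARM documentation of VFPv2 goes, only two encodings are defined: `00` in these bits means `stride` = 1 and `11` means `stride` = 2. The encodings `01` and `10` should not be used.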

## Function call convention and floating-point registers

Since we have introduced new registers we should state how to use them when calling functions. The following rules apply for VFPv2 registers.

• Fields `len` and `stride` of `fpscr` have all their bits as zero at the entry of a function and those bits must be zero when leaving it.
• We can pass floating point parameters using registers `s0` to `s15` and `d0` to `d7`. Note that passing a double-precision after a single-precision may involve discarding an odd-numbered single-precision register (for instance we can use `s0` and `d1`, but note that `s1` will be unused).
• All other floating point registers (`s16` to `s31` and `d8` to `d15`) must have their values preserved upon leaving the function. Instructions `vpush` and `vpop` can be used for that.
• If a function returns a floating-point value, the return register will be `s0` or `d0`.

Finally, a note about variadic functions like `printf`: you cannot pass a single-precision floating point value to such functions. Only doubles can be passed, so you will need to convert single-precision values into double-precision values. Note also that the usual integer registers are used (`r0` to `r3`), so you will only be able to pass up to 2 double-precision values in registers. The remaining ones must be passed on the stack. In particular for `printf`, since `r0` contains the address of the format string, you will only be able to pass one double-precision value in registers, in `{r2,r3}`.

## Assembler

Make sure you pass the flag `-mfpu=vfpv2` to `as`, otherwise it will not recognize the VFPv2 instructions.

## Colophon

You may want to check this official quick reference card of VFP. Note that it includes also VFPv3 not available in the Raspberry Pi processor. Most of what is there has already been presented here although some minor details may have been omitted.

In the next chapter we will use these instructions in a full example.

That’s all for today.

### 24 thoughts on “ARM assembler in Raspberry Pi – Chapter 13”

• I am enjoying the series – I seem to read about things here just after I have been looking into them 🙂

” The address of the load/store must be already in a general purpose register, although we can apply an offset in bytes which must be a multiple of 4 (this applies to double-precision as well). ”

Would the offset not need to be a multiple of 8 for double-precision?

• 111 means lean = 8 – should be len=8.

If I can answer my previous question, I guess that the *address* of the double is a 32-bit int, so an offset of 4 is valid?

• rferrer says:

Thanks, I fixed that lean that slipped in.

Regarding your question: a double-precision value should be 8-byte aligned per the AAPCS but, as far as I’ve tested, `vldr` and `vstr` do not seem to care very much about this constraint. If we follow the AAPCS our double-precision values would always be 8-byte aligned.

That alignment issue, though, is orthogonal to the offset itself. `vstr` and `vldr` are actually ARM generic coprocessor instructions with an appropriate 10 or 11 identifier for the coprocessor. Such generic instructions define an offset that must be a multiple of 4. Imagine we have `r1 ← 0x104` and the instruction is `vldr d0, [r1, #4]`. The effect will be `d0 ← *(r1 + 4)`, so `d0 ← *(0x108)`, which is 8-byte aligned. This would still be compliant with the AAPCS.

• Elie De Brauwer says:

One comment, you are giving the following example: “vcvt.f32.u32 s4, r3” however the assembler says:

Error: VFP single, double or Neon quad precision register expected — `vcvt.f32.u32 s4,r3′

Also vcvt is specified as in the VFP Quick Reference cad as VCVT{C}.U32 Fd, Sm so I think that the example of converting u32 to f32 should be a two step thing like:
vmov s4, r4
vcvt.f32.u32 s4, s4

• rferrer says:

Hi Elie,

I fixed that in the post.

Thank you very much!

• Henryk Sarat says:

Double point precision numbers consist of:
1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa.

This tripped me up for a bit. Please correct so no one else is confused!

• rferrer says:

Oops. Fixed. Thanks!

• RobG says:

A great series of posts, has been very helpful understanding my new RPi.

It would be really helpful to newcomers for each post to include links to the next and previous – this works well for 1-12, but I only just realised there were more!

• rferrer says:

Hi RobG,

I’ll try to fix though I’m not an ace in this wordpress thingy! 🙂

Kind regards,

• RobG says:

Excellent.

With help from this series and Cambridge tutorials, I’ve just managed to boot my PI into custom code. (I’d hesitate to call it an O/S yet, but I’ve got as far as a (very) basic CLI.

Many thanks – I found it hard getting useful info from the ARM documentation – had to keep cross-referencing the ARMv7 and ARM11 documents. This brings it all together nicely and reminded me of some stuff I’ve not had do manually for decades – (ZX Spectrum – I used to literally hand code z80 assembly to hex for that – this is much less painful).

• janan says:

thanks you , please can you tell me what is the diference between Vector Floating-point v2 (VFPv2) , and floating point unit …
and how to use the floating point support …
i’m using raspberry pi 2 under raspbian

• rferrer says:

Hi janan,

a floating point unit (FPU) is a generic name for a coprocessor or (more commonly nowadays) a functional unit inside a processor that performs floating point operations.

VFPv2 is the name ARM gives to an extension to the ARM architecture that provides floating point support in hardware. Note that, while the architectural “interface” to the VFPv2 is via the coprocessor mechanism, this does not preclude that it can be integrated in a single chip.

Kind regards,

• “These are not 48 different registers. Instead every dn is mapped to two consecutive sn and sn+1, where n is an even number lower than 31.”

This should be corrected to: “every d_n is mapped to two consecutive registers s_2n and s_2n+1, where 0 <= n <= 15."

• Roger Ferrer Ibáñez says:

Hi William,

thanks a lot for the suggestion. I applied it to the text.

Kind regards,

• Sam Ferguson says:

Will VFP instructions assemble from the terminal like earlier examples? I am getting an Error: selected processor does not support ARM mode `add.f32 s1, s2, s2′
Just wondering if this is a limitation or something I have not setup correctly.

Really enjoying the tutorials, thank you very much!

• Roger Ferrer Ibáñez says:

Hi Sam,

the instruction is called `vadd.f32` but likely you forgot to pass `-mfpu=vfpv2`.

Kind regards,

“Most VFPv2 instructions are of the form fname Rdest, Rsource1, Rsource2 or fname Rdest, Rsource1.”

but it appears that the instructions are prefixed with “v” and not “f”. Did I misunderstand the notation?

• Roger Ferrer Ibáñez says:

Hi Andreas,

no you did not. It was a mistake of mine. I have already fixed the post.

Thank you!

• zair says:

Great work of sharing things, Was able to follow up and experiment.
I was experimenting the code with intrinsics/ACLE,
vcvt_u32_f32 did not do the intended job, maybe my understanding of vcvt for converting floating point to fixed point is not clear.

— snip—
const float temp[2] = {2.84};
float32x2_t z = vld1_f32(temp);
uint32x2_t in = vcvt_u32_f32(z);

y[0] = vget_lane_u32(in, 0);
y[1] = vget_lane_u32(in, 1);

— end of snip —

It would be a great help if i get through getting 3 floating point parts in separate 3 integers.

• Roger Ferrer Ibáñez says:

Hi,

thanks!

I’m not expert in ACLE and I’m also not sure what you mean by “getting 3 floating point parts in separate 3 integers”? Is there a reason you are using types that apparently only encode 2 floats? And finally, these intrinsics I think they are only for NEON. But the Raspberry Pi model used in this tutorial does not support NEON.

Kind regards,
Roger

• Rohan says:

Hello Mr. Ibáñez,

Fields `len` and `stride` of `fpscr` are zero at the entry of a function and must be zero when leaving it.

`len` and `stride` cannot be zero, so did you mean to write one instead?

Thank you

• Roger Ferrer Ibáñez says:

Hi Rohan,

I was a bit unclear here: you’re right in that they cannot be logically zero. But because of that a physical zero means one. I will update the text.

Thank you.

• Christian Spielberger says:

Hi Roger,

There is a tiny mistake.

vldr s4, [r3, #12] /* s3 ← *(r3 + 12) */

r,c.

• Roger Ferrer Ibáñez says:

Hi Christian,

thanks for the heads up. Now it should be fixed.

Kind regards!
