<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Think In Geek</title>
	<atom:link href="http://thinkingeek.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://thinkingeek.com</link>
	<description>In geek we trust</description>
	<lastBuildDate>Tue, 14 May 2013 17:34:52 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>ARM assembler in Raspberry Pi – Chapter 14</title>
		<link>http://thinkingeek.com/2013/05/12/arm-assembler-raspberry-pi-chapter-14/</link>
		<comments>http://thinkingeek.com/2013/05/12/arm-assembler-raspberry-pi-chapter-14/#comments</comments>
		<pubDate>Sun, 12 May 2013 14:33:25 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=1003</guid>
		<description><![CDATA[In chapter 13 we saw the basic elements of VFPv2, the floating point subarchitecture of ARMv6. In this chapter we will implement a floating point matrix multiply using VFPv2. Disclaimer: I advise you against using the code in this chapter in commercial-grade projects unless you fully review it for both correctness and precision. Matrix multiply [...]]]></description>
				<content:encoded><![CDATA[<p>
In chapter 13 we saw the basic elements of VFPv2, the floating point subarchitecture of ARMv6. In this chapter we will implement a floating point matrix multiply using VFPv2.
</p>
<p><span id="more-1003"></span></p>
<p style="background-color: #ffe1e1; padding: 1em;">
<b>Disclaimer</b>: I advise you against using the code in this chapter in commercial-grade projects unless you fully review it for both correctness and precision.
</p>
<h2>Matrix multiply</h2>
<p>
Given two vectors <strong>v</strong> and <strong>w</strong> of size <em>r</em> where <strong>v</strong> = &lt;v<sub>0</sub>, v<sub>1</sub>, &#8230;, v<sub>r-1</sub>&gt; and <strong>w</strong> = &lt;w<sub>0</sub>, w<sub>1</sub>, &#8230;, w<sub>r-1</sub>&gt;, we define the <em>dot product</em> of <strong>v</strong> by <strong>w</strong> as the scalar <strong>v</strong> · <strong>w</strong> = v<sub>0</sub>×w<sub>0</sub> + v<sub>1</sub>×w<sub>1</sub> + &#8230; + v<sub>r-1</sub>×w<sub>r-1</sub>.
</p>
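<p>
As a minimal sketch of the definition above, the dot product can be written in C as a single loop. The names <code>dot_product</code> and <code>dot_product_check</code> are just illustrative choices, not something used later in the chapter.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two vectors of r elements each:
   v[0]*w[0] + v[1]*w[1] + ... + v[r-1]*w[r-1] */
float dot_product(const float *v, const float *w, size_t r)
{
    float sum = 0.0f;
    for (size_t i = 0; i < r; i++)
        sum += v[i] * w[i];
    return sum;
}

/* Illustrative self-check using values that are exactly representable in float */
void dot_product_check(void)
{
    const float v[3] = { 1.0f, 2.0f, 3.0f };
    const float w[3] = { 4.0f, 5.0f, 6.0f };
    /* 1*4 + 2*5 + 3*6 = 32 */
    assert(dot_product(v, w, 3) == 32.0f);
}
```

<p>
Calling <code>dot_product_check()</code> from a <code>main</code> function is enough to try it. Small integer values are used so that the comparison is not affected by rounding.
</p>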
<p>
We can multiply a matrix <code>A</code> of <code>n</code> rows and <code>m</code> columns (<code>n</code> x <code>m</code>) by a matrix <code>B</code> of <code>m</code> rows and <code>p</code> columns (<code>m</code> x <code>p</code>). The result is a matrix of <code>n</code> rows and <code>p</code> columns. Matrix multiplication may seem complicated but actually it is not. Every element of the result matrix is just the dot product (defined in the paragraph above) of the corresponding row of the matrix <code>A</code> by the corresponding column of the matrix <code>B</code> (this is why there must be as many columns in <code>A</code> as there are rows in <code>B</code>).
</p>
<p><img src="http://thinkingeek.com/wp-content/uploads/2013/04/matmul.png" alt="Matrix multiplication schema" width="579" height="504" class="aligncenter size-full wp-image-1006" /></p>
<p>
A straightforward implementation of the matrix multiplication in C is as follows.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">float</span> A<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>M<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// N rows of M columns each row</span>
<span style="color: #993333;">float</span> B<span style="color: #009900;">&#91;</span>M<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>P<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// M rows of P columns each row</span>
<span style="color: #666666; font-style: italic;">// Result</span>
<span style="color: #993333;">float</span> C<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>P<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #666666; font-style: italic;">// for each row of the result</span>
<span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> j <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;</span> P<span style="color: #339933;">;</span> j<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #666666; font-style: italic;">// and for each column</span>
  <span style="color: #009900;">&#123;</span>
    C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// Initialize to zero</span>
    <span style="color: #666666; font-style: italic;">// Now compute the dot product of the row by the column</span>
    <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> k <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> k <span style="color: #339933;">&lt;</span> M<span style="color: #339933;">;</span> k<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
       C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> A<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">*</span> B<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>
In order to simplify the example, we will assume that both matrices A and B are square matrices of size <code>n x n</code>. This simplifies the algorithm just a bit.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">float</span> A<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
<span style="color: #993333;">float</span> B<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// Result</span>
<span style="color: #993333;">float</span> C<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> j <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> j<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
    C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> k <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> k <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> k<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
       C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> A<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">*</span> B<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>
Matrix multiplication is an important operation used in many areas. For instance, in computer graphics it is usually performed on 3x3 and 4x4 matrices representing 3D geometry. So we will try to make a reasonably fast version of it (we do not aim at making the best one, though).
</p>
<p>
A first improvement we want to make to this algorithm is making the loops perfectly nested. There are some technical reasons for that which are beyond the scope of this chapter. So we will move the initialization of <code>C[i][j]</code> to 0 out of the loop nest, into a separate loop of its own.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">float</span> A<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
<span style="color: #993333;">float</span> B<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// Result</span>
<span style="color: #993333;">float</span> C<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> j <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> j<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
    C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> j <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> j<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
    <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> k <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> k <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> k<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
       C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> A<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">*</span> B<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
After this change, the interesting part of our algorithm, line 13, is inside a perfect nest of loops of depth 3.
</p>
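<p>
A quick way to convince ourselves that splitting the initialization does not change the result is to run both versions on the same input and compare. The following is only a sketch: the function names and the test values are assumptions made for this check, and <code>N</code> is fixed to 4.
</p>

```c
#include <assert.h>

#define N 4

/* Version with the initialization inside the i/j loops */
void matmul_imperfect(float A[N][N], float B[N][N], float C[N][N])
{
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
    }
}

/* Version with a separate zeroing loop and a perfect nest of depth 3 */
void matmul_perfect(float A[N][N], float B[N][N], float C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Illustrative helper: runs both versions on small integer-valued inputs */
void matmul_versions_check(void)
{
    float A[N][N], B[N][N], C1[N][N], C2[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            A[i][j] = (float)(i + j);
            B[i][j] = (float)(i - j);
        }

    matmul_imperfect(A, B, C1);
    matmul_perfect(A, B, C2);

    /* Both versions add the same products in the same order */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(C1[i][j] == C2[i][j]);
}
```

<p>
Both functions accumulate the products for each <code>C[i][j]</code> in the same order, so the results should match exactly and not just approximately.
</p>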
<h3>Accessing a matrix</h3>
<p>
It is relatively straightforward to access an array of just one dimension, like in <code>a[i]</code>. Just get <code>i</code>, multiply it by the size in bytes of each element of the array and then add the address of <code>a</code> (the base address of the array). So, the address of <code>a[i]</code> is just <code>a + ELEMENTSIZE*i</code>.
</p>
<p>
Things get a bit more complicated when our array has more than one dimension, like a matrix or a cube. Given an access like <code>a[i][j][k]</code> we have to compute which element is denoted by <code>[i][j][k]</code>. This depends on whether the language uses row-major order or column-major order. We assume row-major order here (like in the C language). So <code>[i][j][k]</code> must denote <code>k + j * NK + i * NK * NJ</code>, where <code>NK</code> and <code>NJ</code> are the number of elements in the corresponding dimensions. For instance, in a three-dimensional array of 3 x 4 x 5 elements, the element [1][2][3] is 3 + 2 * 5 + 1 * 5 * 4 = 33 (here <code>NK</code> = 5 and <code>NJ</code> = 4. Note that <code>NI</code> = 3 but we do not need it at all). We assume that our language indexes arrays starting from 0 (like C). If the language allows a lower bound other than 0, we first have to subtract the lower bound to get the position.
</p>
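<p>
The row-major formula can be checked against the layout that a C compiler actually uses. This is a small sketch for the 3 x 4 x 5 case only. The names <code>row_major_offset</code> and <code>row_major_check</code> are made up for the example.
</p>

```c
#include <assert.h>
#include <stddef.h>

enum { NI = 3, NJ = 4, NK = 5 };

/* Position of element [i][j][k], counted in elements from the start of the array */
size_t row_major_offset(size_t i, size_t j, size_t k)
{
    return k + j * NK + i * NK * NJ;
}

/* Illustrative check against a real C array of the same shape */
void row_major_check(void)
{
    static int a[NI][NJ][NK];

    /* [1][2][3] is 3 + 2*5 + 1*5*4 = 33 */
    assert(row_major_offset(1, 2, 3) == 33);

    /* Distance in bytes from the start of the array, divided by the element size */
    size_t bytes = (size_t)((char *)&a[1][2][3] - (char *)a);
    assert(bytes / sizeof(int) == row_major_offset(1, 2, 3));
}
```

<p>
Note that <code>NI</code> only appears in the declaration of the array, never in the formula itself.
</p>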
<p>
We can compute the position in a slightly better way if we reorder it. Instead of calculating <code>k + j * NK + i * NK * NJ</code> we will do <code>k + NK * (j + NJ * i)</code>. This way the whole calculation is just a repeated sequence of steps of the form <code>x + N<sub>i</sub> * y</code>, like in the example below.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> Calculating the address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> declared as <span style="color: #00007f; font-weight: bold;">int</span> C<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">5</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
<span style="color: #339933;">/*</span> &amp;C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> is<span style="color: #339933;">,</span> thus<span style="color: #339933;">,</span> C <span style="color: #339933;">+</span> ELEMENTSIZE <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#40;</span> k <span style="color: #339933;">+</span> NK <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> NJ <span style="color: #339933;">*</span> i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #339933;">/*</span> Assume i is <span style="color: #00007f; font-weight: bold;">in</span> r4<span style="color: #339933;">,</span> j <span style="color: #00007f; font-weight: bold;">in</span> r5<span style="color: #339933;">,</span> k <span style="color: #00007f; font-weight: bold;">in</span> r6 <span style="color: #00007f; font-weight: bold;">and</span> the base address of C <span style="color: #00007f; font-weight: bold;">in</span> r3 <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>        <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← NJ <span style="color: #009900; font-weight: bold;">&#40;</span>Recall that NJ = <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mul</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r4    <span style="color: #339933;">/*</span> r7 ← NJ <span style="color: #339933;">*</span> i <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r7<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r7    <span style="color: #339933;">/*</span> r7 ← j <span style="color: #339933;">+</span> NJ <span style="color: #339933;">*</span> i <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">5</span>        <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← NK <span style="color: #009900; font-weight: bold;">&#40;</span>Recall that NK = <span style="color: #ff0000;">5</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mul</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r7    <span style="color: #339933;">/*</span> r7 ← NK <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> NJ <span style="color: #339933;">*</span> i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r7<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7    <span style="color: #339933;">/*</span> r7 ← k <span style="color: #339933;">+</span> NK <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> NJ <span style="color: #339933;">*</span> i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>        <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← ELEMENTSIZE <span style="color: #009900; font-weight: bold;">&#40;</span>Recall that size of an <span style="color: #00007f; font-weight: bold;">int</span> is <span style="color: #ff0000;">4</span> bytes<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mul</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r7    <span style="color: #339933;">/*</span> r7 ← ELEMENTSIZE <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#40;</span> k <span style="color: #339933;">+</span> NK <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> NJ <span style="color: #339933;">*</span> i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r7<span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> r7    <span style="color: #339933;">/*</span> r7 ← C <span style="color: #339933;">+</span> ELEMENTSIZE <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#40;</span> k <span style="color: #339933;">+</span> NK <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> NJ <span style="color: #339933;">*</span> i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>
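
<p>
For readers who prefer to see the same steps in C, the assembler sequence above can be mirrored one instruction per statement. This is only a sketch under some assumptions: the function name <code>element_address</code> is hypothetical, the variable <code>r7</code> just imitates the register, and converting pointers to <code>uintptr_t</code> is assumed to give plain byte addresses, as it does on the Raspberry Pi.
</p>

```c
#include <assert.h>
#include <stdint.h>

enum { NI = 3, NJ = 4, NK = 5 };

/* Address of C[i][j][k] as C + ELEMENTSIZE * (k + NK * (j + NJ * i)),
   computed step by step like the assembler sequence */
uintptr_t element_address(uintptr_t base, uintptr_t i, uintptr_t j, uintptr_t k)
{
    uintptr_t r7;
    r7 = NJ * i;            /* r7 <- NJ * i */
    r7 = j + r7;            /* r7 <- j + NJ * i */
    r7 = NK * r7;           /* r7 <- NK * (j + NJ * i) */
    r7 = k + r7;            /* r7 <- k + NK * (j + NJ * i) */
    r7 = sizeof(int) * r7;  /* r7 <- ELEMENTSIZE * (k + NK * (j + NJ * i)) */
    r7 = base + r7;         /* r7 <- C + ELEMENTSIZE * (k + NK * (j + NJ * i)) */
    return r7;
}

/* Illustrative check: compare with the address the compiler computes for every element */
void element_address_check(void)
{
    static int C[NI][NJ][NK];

    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int k = 0; k < NK; k++)
                assert(element_address((uintptr_t)C, (uintptr_t)i, (uintptr_t)j, (uintptr_t)k)
                       == (uintptr_t)&C[i][j][k]);
}
```

<p>
Every step has the shape <code>x + N * y</code> (or just a multiplication), which is why the reordered formula maps so directly onto pairs of <code>mul</code> and <code>add</code> instructions.
</p>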

<h2>Naive matrix multiply of 4x4 single-precision</h2>
<p>
As a first step, let's implement a naive matrix multiply that follows the C algorithm above to the letter.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> matmul<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.data</span>
mat_A<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.1</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.2</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.1</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.2</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.1</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.3</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.3</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.1</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.5</span> 
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.6</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.4</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.1</span>
mat_B<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span>  <span style="color: #ff0000;">4.92</span><span style="color: #339933;">,</span>  <span style="color: #ff0000;">2.54</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #ff0000;">0.63</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #ff0000;">1.75</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span>  <span style="color: #ff0000;">3.02</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #ff0000;">1.51</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #ff0000;">0.87</span><span style="color: #339933;">,</span>  <span style="color: #ff0000;">1.35</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #339933;">-</span><span style="color: #ff0000;">4.29</span><span style="color: #339933;">,</span>  <span style="color: #ff0000;">2.14</span><span style="color: #339933;">,</span>  <span style="color: #ff0000;">0.71</span><span style="color: #339933;">,</span>  <span style="color: #ff0000;">0.71</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #339933;">-</span><span style="color: #ff0000;">0.95</span><span style="color: #339933;">,</span>  <span style="color: #ff0000;">0.48</span><span style="color: #339933;">,</span>  <span style="color: #ff0000;">2.38</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #ff0000;">0.95</span>
mat_C<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span>
       <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">float</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">0.0</span>
&nbsp;
format_result <span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;Matrix result is:\n%5.2f %5.2f %5.2f %5.2f\n%5.2f %5.2f %5.2f %5.2f\n%5.2f %5.2f %5.2f %5.2f\n%5.2f %5.2f %5.2f %5.2f\n&quot;</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
naive_matmul_4x4<span style="color: #339933;">:</span>
    <span style="color: #339933;">/*</span> r0 address of A
       r1 address of B
       r2 address of C
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span> Keep integer registers <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> First zero the <span style="color: #ff0000;">16</span> single<span style="color: #339933;">-</span>precision floating point elements of C <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">In</span> IEEE <span style="color: #ff0000;">754</span><span style="color: #339933;">,</span> all <span style="color: #0000ff; font-weight: bold;">bits</span> cleared means <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r2
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r6<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>
    b <span style="color: #339933;">.</span>Lloop_init_test
    <span style="color: #339933;">.</span>Lloop_init <span style="color: #339933;">:</span>
      <span style="color: #00007f; font-weight: bold;">str</span> r6<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #339933;">+</span>#<span style="color: #ff0000;">4</span>   <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r4 ← r6 then r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>Lloop_init_test<span style="color: #339933;">:</span>
      subs r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
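      bge <span style="color: #339933;">.</span>Lloop_init  <span style="color: #339933;">/*</span> branch while r5 ≥ <span style="color: #ff0000;">0</span><span style="color: #339933;">,</span> so the store runs <span style="color: #ff0000;">16</span> times <span style="color: #339933;">*/</span>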
&nbsp;
    <span style="color: #339933;">/*</span> We will use 
           r4 as i
           r5 as j
           r6 as k
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span> <span style="color: #339933;">/*</span> r4 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>Lloop_i<span style="color: #339933;">:</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of i <span style="color: #339933;">*/</span>
      <span style="color: #00007f; font-weight: bold;">cmp</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> if r4 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
      beq <span style="color: #339933;">.</span>Lend_loop_i
      <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>  <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
      <span style="color: #339933;">.</span>Lloop_j<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of j <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">cmp</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span> <span style="color: #339933;">/*</span> if r5 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
        beq <span style="color: #339933;">.</span>Lend_loop_j
        <span style="color: #339933;">/*</span> Compute the address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #00007f; font-weight: bold;">and</span> load it <span style="color: #00007f; font-weight: bold;">into</span> s0 <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> is C <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> i <span style="color: #339933;">+</span> j<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">mov</span> r7<span style="color: #339933;">,</span> r5               <span style="color: #339933;">/*</span> r7 ← r5<span style="color: #339933;">.</span> This is r7 ← j <span style="color: #339933;">*/</span>
        adds r7<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r7 ← r7 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r4 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> 
                                    This is r7 ← j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">.</span>
                                    We multiply i by the row size <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> elements<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        adds r7<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r7 ← r2 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r7 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span>
                                    This is r7 ← C <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span>
                                    We multiply <span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> by the size of the element<span style="color: #339933;">.</span>
                                    A single<span style="color: #339933;">-</span>precision floating point takes <span style="color: #ff0000;">4</span> bytes<span style="color: #339933;">.</span>
                                    <span style="color: #339933;">*/</span>
        vldr s0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r7<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">/*</span> s0 ← <span style="color: #339933;">*</span>r7 <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span> r6<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span> <span style="color: #339933;">/*</span> r6 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">.</span>Lloop_k <span style="color: #339933;">:</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of k <span style="color: #339933;">*/</span>
          <span style="color: #00007f; font-weight: bold;">cmp</span> r6<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span> <span style="color: #339933;">/*</span> if r6 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
          beq <span style="color: #339933;">.</span>Lend_loop_k
&nbsp;
          <span style="color: #339933;">/*</span> Compute the address of a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #00007f; font-weight: bold;">and</span> load it <span style="color: #00007f; font-weight: bold;">into</span> s1 <span style="color: #339933;">*/</span>
          <span style="color: #339933;">/*</span> Address of a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> is a <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> i <span style="color: #339933;">+</span> k<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
          <span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r6               <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r6<span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← k <span style="color: #339933;">*/</span>
          adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← <span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r4 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← k <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
          adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r0 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← a <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>k <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
          vldr s1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #009900; font-weight: bold;">&#93;</span>            <span style="color: #339933;">/*</span> s1 ← <span style="color: #339933;">*</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">*/</span>
&nbsp;
          <span style="color: #339933;">/*</span> Compute the address of b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #00007f; font-weight: bold;">and</span> load it <span style="color: #00007f; font-weight: bold;">into</span> s2 <span style="color: #339933;">*/</span>
          <span style="color: #339933;">/*</span> Address of b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> is b <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> k <span style="color: #339933;">+</span> j<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
          <span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r5               <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r5<span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← j <span style="color: #339933;">*/</span>
          adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← <span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r6 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← j <span style="color: #339933;">+</span> k <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
          adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← b <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> k <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
          vldr s2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #009900; font-weight: bold;">&#93;</span>            <span style="color: #339933;">/*</span> s2 ← <span style="color: #339933;">*</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">*/</span>
&nbsp;
          vmul<span style="color: #339933;">.</span>f32 s3<span style="color: #339933;">,</span> s1<span style="color: #339933;">,</span> s2      <span style="color: #339933;">/*</span> s3 ← s1 <span style="color: #339933;">*</span> s2 <span style="color: #339933;">*/</span>
          vadd<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s0<span style="color: #339933;">,</span> s3      <span style="color: #339933;">/*</span> s0 ← s0 <span style="color: #339933;">+</span> s3 <span style="color: #339933;">*/</span>
&nbsp;
          <span style="color: #00007f; font-weight: bold;">add</span> r6<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>           <span style="color: #339933;">/*</span> r6 ← r6 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
          b <span style="color: #339933;">.</span>Lloop_k               <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
        <span style="color: #339933;">.</span>Lend_loop_k<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
        vstr s0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r7<span style="color: #009900; font-weight: bold;">&#93;</span>            <span style="color: #339933;">/*</span> Store s0 back to C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>  <span style="color: #339933;">/*</span> r5 ← r5 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
        b <span style="color: #339933;">.</span>Lloop_j <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
       <span style="color: #339933;">.</span>Lend_loop_j<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
       <span style="color: #00007f; font-weight: bold;">add</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span> <span style="color: #339933;">/*</span> r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
       b <span style="color: #339933;">.</span>Lloop_i     <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>Lend_loop_i<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Restore integer registers <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> function <span style="color: #339933;">*/</span>
&nbsp;
&nbsp;
<span style="color: #339933;">.</span>globl main
main<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Keep integer registers <span style="color: #339933;">*/</span>
    vpush <span style="color: #009900; font-weight: bold;">&#123;</span>d0<span style="color: #339933;">-</span>d1<span style="color: #009900; font-weight: bold;">&#125;</span>          <span style="color: #339933;">/*</span> Keep floating point registers <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> Prepare <span style="color: #00007f; font-weight: bold;">call</span> to naive_matmul_4x4 <span style="color: #339933;">*/</span>
    ldr r0<span style="color: #339933;">,</span> addr_mat_A  <span style="color: #339933;">/*</span> r0 ← a <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> addr_mat_B  <span style="color: #339933;">/*</span> r1 ← b <span style="color: #339933;">*/</span>
    ldr r2<span style="color: #339933;">,</span> addr_mat_C  <span style="color: #339933;">/*</span> r2 ← c <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> naive_matmul_4x4
&nbsp;
    <span style="color: #339933;">/*</span> Now print the result matrix <span style="color: #339933;">*/</span>
    ldr r4<span style="color: #339933;">,</span> addr_mat_C  <span style="color: #339933;">/*</span> r4 ← c <span style="color: #339933;">*/</span>
&nbsp;
    vldr s0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">/*</span> s0 ← <span style="color: #339933;">*</span>r4<span style="color: #339933;">.</span> This is s0 ← c<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
    vcvt<span style="color: #339933;">.</span>f64<span style="color: #339933;">.</span>f32 d1<span style="color: #339933;">,</span> s0 <span style="color: #339933;">/*</span> Convert it <span style="color: #00007f; font-weight: bold;">into</span> a double<span style="color: #339933;">-</span>precision
                           d1 ← s0
                         <span style="color: #339933;">*/</span>
    vmov r2<span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> d1      <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r2<span style="color: #339933;">,</span>r3<span style="color: #009900; font-weight: bold;">&#125;</span> ← d1 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r6<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span>     <span style="color: #339933;">/*</span> Remember the <span style="color: #0000ff; font-weight: bold;">stack</span> pointer<span style="color: #339933;">,</span> we need to restore it later <span style="color: #339933;">*/</span>
                   <span style="color: #339933;">/*</span> r6 ← <span style="color: #46aa03; font-weight: bold;">sp</span> <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>  <span style="color: #339933;">/*</span> We will iterate from <span style="color: #ff0000;">1</span> to <span style="color: #ff0000;">15</span> <span style="color: #009900; font-weight: bold;">&#40;</span>because the 0th item has already been handled<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">60</span> <span style="color: #339933;">/*</span> Go to the last item of the matrix c<span style="color: #339933;">,</span> this is c<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>Lloop<span style="color: #339933;">:</span>
        vldr s0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">/*</span> s0 ← <span style="color: #339933;">*</span>r4<span style="color: #339933;">.</span> Load the current item <span style="color: #339933;">*/</span>
        vcvt<span style="color: #339933;">.</span>f64<span style="color: #339933;">.</span>f32 d1<span style="color: #339933;">,</span> s0 <span style="color: #339933;">/*</span> Convert it <span style="color: #00007f; font-weight: bold;">into</span> a double<span style="color: #339933;">-</span>precision
                               d1 ← s0
                             <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">sub</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">8</span>      <span style="color: #339933;">/*</span> Make room <span style="color: #00007f; font-weight: bold;">in</span> the <span style="color: #0000ff; font-weight: bold;">stack</span> for the double<span style="color: #339933;">-</span>precision <span style="color: #339933;">*/</span>
        vstr d1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span>       <span style="color: #339933;">/*</span> Store the double precision <span style="color: #00007f; font-weight: bold;">in</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">sub</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>      <span style="color: #339933;">/*</span> Move to the previous element <span style="color: #00007f; font-weight: bold;">in</span> the matrix <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>      <span style="color: #339933;">/*</span> One more item has been handled <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">cmp</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span>         <span style="color: #339933;">/*</span> if r5 != <span style="color: #ff0000;">16</span> go to next iteration of the <span style="color: #00007f; font-weight: bold;">loop</span> <span style="color: #339933;">*/</span>
        bne <span style="color: #339933;">.</span>Lloop
&nbsp;
    ldr r0<span style="color: #339933;">,</span> addr_format_result <span style="color: #339933;">/*</span> r0 ← &amp;format_result <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> printf <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> r6  <span style="color: #339933;">/*</span> Restore the <span style="color: #0000ff; font-weight: bold;">stack</span> after the <span style="color: #00007f; font-weight: bold;">call</span>  <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>
    vpop <span style="color: #009900; font-weight: bold;">&#123;</span>d0<span style="color: #339933;">-</span>d1<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr
&nbsp;
addr_mat_A <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> mat_A
addr_mat_B <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> mat_B
addr_mat_C <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> mat_C
addr_format_result <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> format_result</pre></td></tr></table></div>

<p>
That's a lot of code but it is not complicated. Unfortunately, computing the address of an array element takes a significant number of instructions. In our <code>naive_matmul_4x4</code> we have the three loops <code>i</code>, <code>j</code> and <code>k</code> of the C algorithm. We compute the address of <code>C[i][j]</code> in the loop <code>j</code> (there is no need to compute it every time in the loop <code>k</code>) in lines 52 to 63. The value contained in <code>C[i][j]</code> is then loaded into <code>s0</code>. In each iteration of loop <code>k</code> we load <code>A[i][k]</code> and <code>B[k][j]</code> in <code>s1</code> and <code>s2</code> respectively (lines 70 to 82). After the loop <code>k</code> ends, we can store <code>s0</code> back to the array position (kept in <code>r7</code>, line 90).
</p>
<p>
In order to print the result matrix we have to pass 16 floating point values to <code>printf</code>. Unfortunately, as stated in chapter 13, we first have to convert them into double-precision before passing them. Note also that the first value, once converted to double-precision, can be passed using registers <code>r2</code> and <code>r3</code>. All the remaining ones must be passed on the stack, and do not forget that the stack parameters must be pushed in reverse order. This is why, once the first element of the C matrix has been loaded in <code>{r2,r3}</code> (lines 117 to 120), we advance <code>r4</code> by 60 bytes. It now points to <code>C[3][3]</code>, the last element of the matrix C. We load the single-precision value, convert it into double-precision, push it onto the stack and then move register <code>r4</code> backwards to the previous element in the matrix (lines 128 to 137). Observe that we use <code>r6</code> as a marker of the stack, since we need to restore the stack once <code>printf</code> returns (line 122 and line 141). Of course we could avoid using <code>r6</code> and instead do <code>add sp, sp, #120</code> since this is exactly the number of bytes we push onto the stack (15 double-precision values, each taking 8 bytes).
</p>
<p>
I have not chosen the values of the two matrices randomly. The second one is (approximately) the inverse of the first. This way we will get the identity matrix (a matrix with all zeros but a diagonal of ones). Due to rounding issues the result matrix will not be the identity, but it will be pretty close. Let's run the program.
</p>
<pre>
$ ./matmul 
Matrix result is:
 1.00 -0.00  0.00  0.00
-0.00  1.00  0.00 -0.00
 0.00  0.00  1.00  0.00
 0.00 -0.00  0.00  1.00
</pre>
<h2>Vectorial approach</h2>
<p>
The algorithm we are trying to implement is fine but it is not the most optimizable. The problem lies in the way the loop <code>k</code> accesses the elements. Access <code>A[i][k]</code> is eligible for a multiple load as <code>A[i][k]</code> and <code>A[i][k+1]</code> are contiguous elements in memory. This way we could entirely avoid the loop <code>k</code> and perform a 4-element load from <code>A[i][0]</code> to <code>A[i][3]</code>. The access <code>B[k][j]</code> does not allow that since elements <code>B[k][j]</code> and <code>B[k+1][j]</code> have a full row in between. This is a <em>strided access</em> (the stride here is a full row of 4 elements, that is, 16 bytes). VFPv2 does not allow a strided multiple load, so we will have to load these elements one by one. Once we have all the elements of the loop <code>k</code> loaded, we can do a vector multiplication and a sum.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">naive_vectorial_matmul_4x4<span style="color: #339933;">:</span>
    <span style="color: #339933;">/*</span> r0 address of A
       r1 address of B
       r2 address of C
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span> Keep integer registers <span style="color: #339933;">*/</span>
    vpush <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span>               <span style="color: #339933;">/*</span> Floating point registers starting from s16 must be preserved <span style="color: #339933;">*/</span>
    vpush <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">-</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #339933;">/*</span> First zero <span style="color: #ff0000;">16</span> single floating point <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">In</span> IEEE <span style="color: #ff0000;">754</span><span style="color: #339933;">,</span> all <span style="color: #0000ff; font-weight: bold;">bits</span> cleared means <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r2
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r6<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>
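    <span style="color: #339933;">/*</span> Store first, then test, so that all <span style="color: #ff0000;">16</span> elements are zeroed <span style="color: #339933;">*/</span>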
    <span style="color: #339933;">.</span>L1_loop_init <span style="color: #339933;">:</span>
      <span style="color: #00007f; font-weight: bold;">str</span> r6<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #339933;">+</span>#<span style="color: #ff0000;">4</span>   <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r4 ← r6 then r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L1_loop_init_test<span style="color: #339933;">:</span>
      subs r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
      bne <span style="color: #339933;">.</span>L1_loop_init
&nbsp;
    <span style="color: #339933;">/*</span> Set the LEN field of FPSCR to be <span style="color: #ff0000;">4</span> <span style="color: #009900; font-weight: bold;">&#40;</span>value <span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #0b011                        <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← r5 &lt;&lt; <span style="color: #ff0000;">16</span> <span style="color: #339933;">*/</span>
    fmrx r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
    orr r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 | r5 <span style="color: #339933;">*/</span>
    fmxr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> We will use 
           r4 as i
           r5 as j
           r6 as k
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span> <span style="color: #339933;">/*</span> r4 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L1_loop_i<span style="color: #339933;">:</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of i <span style="color: #339933;">*/</span>
      <span style="color: #00007f; font-weight: bold;">cmp</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> if r4 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
      beq <span style="color: #339933;">.</span>L1_end_loop_i
      <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>  <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
      <span style="color: #339933;">.</span>L1_loop_j<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of j <span style="color: #339933;">*/</span>
       <span style="color: #00007f; font-weight: bold;">cmp</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span> <span style="color: #339933;">/*</span> if r5 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
        beq <span style="color: #339933;">.</span>L1_end_loop_j
        <span style="color: #339933;">/*</span> Compute the address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #00007f; font-weight: bold;">and</span> keep it <span style="color: #00007f; font-weight: bold;">in</span> r7 <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> is C <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> i <span style="color: #339933;">+</span> j<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">mov</span> r7<span style="color: #339933;">,</span> r5               <span style="color: #339933;">/*</span> r7 ← r5<span style="color: #339933;">.</span> This is r7 ← j <span style="color: #339933;">*/</span>
        adds r7<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r7 ← r7 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r4 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> 
                                    This is r7 ← j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">.</span>
                                    We multiply i by the row size <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> elements<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        adds r7<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r7 ← r2 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r7 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span>
                                    This is r7 ← C <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span>
                                    We multiply <span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> by the size of the element<span style="color: #339933;">.</span>
                                    A single<span style="color: #339933;">-</span>precision floating point takes <span style="color: #ff0000;">4</span> bytes<span style="color: #339933;">.</span>
                                    <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Compute the address of a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>
        adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>
        vldmia <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">-</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Load <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #339933;">/*</span> Compute the address of b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r5               <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r5<span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← j <span style="color: #339933;">*/</span>
        adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← b <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>j<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        vldr s16<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #009900; font-weight: bold;">&#93;</span>             <span style="color: #339933;">/*</span> s16 ← <span style="color: #339933;">*</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">.</span> This is s16 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s17<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s17 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s17 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s18<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">32</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s18 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">32</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s18 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s19<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">48</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s19 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">48</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s19 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
&nbsp;
        vmul<span style="color: #339933;">.</span>f32 s24<span style="color: #339933;">,</span> s8<span style="color: #339933;">,</span> s16      <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">,</span>s25<span style="color: #339933;">,</span>s26<span style="color: #339933;">,</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
        vmov<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s24           <span style="color: #339933;">/*</span> s0 ← s24 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s0<span style="color: #339933;">,</span> s25       <span style="color: #339933;">/*</span> s0 ← s0 <span style="color: #339933;">+</span> s25 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s0<span style="color: #339933;">,</span> s26       <span style="color: #339933;">/*</span> s0 ← s0 <span style="color: #339933;">+</span> s26 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s0<span style="color: #339933;">,</span> s27       <span style="color: #339933;">/*</span> s0 ← s0 <span style="color: #339933;">+</span> s27 <span style="color: #339933;">*/</span>
&nbsp;
        vstr s0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r7<span style="color: #009900; font-weight: bold;">&#93;</span>            <span style="color: #339933;">/*</span> Store s0 back to C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>  <span style="color: #339933;">/*</span> r5 ← r5 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
        b <span style="color: #339933;">.</span>L1_loop_j <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
       <span style="color: #339933;">.</span>L1_end_loop_j<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
       <span style="color: #00007f; font-weight: bold;">add</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span> <span style="color: #339933;">/*</span> r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
       b <span style="color: #339933;">.</span>L1_loop_i     <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L1_end_loop_i<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> Set the LEN field of FPSCR back to <span style="color: #ff0000;">1</span> <span style="color: #009900; font-weight: bold;">&#40;</span>value <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #0b011                        <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
    mvn r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← ~<span style="color: #009900; font-weight: bold;">&#40;</span>r5 &lt;&lt; <span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    fmrx r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">and</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 &amp; r5 <span style="color: #339933;">*/</span>
    fmxr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span>
&nbsp;
    vpop <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">-</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span>                <span style="color: #339933;">/*</span> Restore preserved floating registers <span style="color: #339933;">*/</span>
    vpop <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Restore integer registers <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> function <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
With this approach we can entirely remove the loop <code>k</code>, as we do 4 operations at once. Note that we have to modify <code>fpscr</code> so the field <code>len</code> is set to 4 (and restore it back to 1 when leaving the function).
</p>
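<p>
As a rough sketch of the bit arithmetic involved, the following C fragment models how the <code>len</code> field is updated. It assumes that the field occupies bits 16 to 18 of <code>fpscr</code> and that it stores the vector length minus one, which is consistent with the value 3 shifted left by 16 used in the assembler. The names <code>fpscr_with_len</code> and <code>fpscr_len</code> are hypothetical helpers made up for this illustration: they work on a plain integer and do not touch the real register.
</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: LEN lives in bits 16..18 of fpscr and encodes (length - 1). */
#define FPSCR_LEN_SHIFT 16u
#define FPSCR_LEN_MASK  (0x7u << FPSCR_LEN_SHIFT)

/* Hypothetical helper: returns a copy of fpscr whose LEN field encodes
   the requested vector length (1 to 8). All other bits are preserved. */
uint32_t fpscr_with_len(uint32_t fpscr, unsigned len)
{
    assert(len >= 1u && len <= 8u);
    uint32_t cleared = fpscr & ~FPSCR_LEN_MASK;            /* drop the old LEN */
    uint32_t field = (uint32_t)(len - 1u) << FPSCR_LEN_SHIFT;
    return cleared | field;                                /* install the new LEN */
}

/* Hypothetical helper: decodes the vector length stored in fpscr. */
unsigned fpscr_len(uint32_t fpscr)
{
    return (unsigned)((fpscr & FPSCR_LEN_MASK) >> FPSCR_LEN_SHIFT) + 1u;
}
```

<p>
With these helpers, <code>fpscr_with_len(0, 4)</code> yields <code>0x00030000</code> and <code>fpscr_with_len(0x00030000, 1)</code> yields 0 again. Clearing the field before setting it means the result does not depend on the previous length.
</p>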
<h3>Fill the registers</h3>
<p>
In the previous version we are not exploiting all the registers of VFPv2. Each row takes 4 registers and so does each column, so we end up using only 8 registers, plus 4 for the result and one in bank 0 for the summation. We got rid of the loop <code>k</code> to process <code>C[i][j]</code> at once. What if we processed <code>C[i][j]</code> and <code>C[i][j+1]</code> at the same time? This way we can fill all 8 registers of the banks that hold the columns and the products.
</p>
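<p>
Before looking at the assembler, a C sketch may help to see the idea. It is only a model of the computation, not a translation of the code: the function name <code>matmul_4x4_two_columns</code> is hypothetical, and the matrices are assumed to be flat row-major arrays of 16 floats, so element <code>[i][j]</code> is at index <code>4*i + j</code>. The arrays <code>prod0</code> and <code>prod1</code> play the role of the two groups of four registers that receive the products.
</p>

```c
#include <assert.h>
#include <stddef.h>

enum { DIM = 4 };  /* the matrices are DIM x DIM */

/* Hypothetical model: computes c = a * b for 4x4 row-major matrices,
   producing c[i][j] and c[i][j+1] in each iteration of the loop j. */
void matmul_4x4_two_columns(const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j += 2) {   /* j advances by 2 */
            float prod0[DIM];                /* row i times column j */
            float prod1[DIM];                /* row i times column j+1 */
            for (int k = 0; k < DIM; k++) {
                float a_ik = a[DIM * i + k];
                prod0[k] = a_ik * b[DIM * k + j];
                prod1[k] = a_ik * b[DIM * k + j + 1];
            }
            /* Sum the four products of each column, left to right */
            c[DIM * i + j]     = ((prod0[0] + prod0[1]) + prod0[2]) + prod0[3];
            c[DIM * i + j + 1] = ((prod1[0] + prod1[1]) + prod1[2]) + prod1[3];
        }
    }
}
```

<p>
The loop over <code>k</code> is written out here only because C has no vector registers. Every element of <code>c</code> is written exactly once, so this model does not rely on <code>c</code> being zero beforehand.
</p>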

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">naive_vectorial_matmul_2_4x4<span style="color: #339933;">:</span>
    <span style="color: #339933;">/*</span> r0 address of A
       r1 address of B
       r2 address of C
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span> Keep integer registers <span style="color: #339933;">*/</span>
    vpush <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s31<span style="color: #009900; font-weight: bold;">&#125;</span>               <span style="color: #339933;">/*</span> Floating point registers starting from s16 must be preserved <span style="color: #339933;">*/</span>
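    <span style="color: #339933;">/*</span> First zero the <span style="color: #ff0000;">16</span> single<span style="color: #339933;">-</span>precision elements of C <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">In</span> IEEE <span style="color: #ff0000;">754</span><span style="color: #339933;">,</span> all <span style="color: #0000ff; font-weight: bold;">bits</span> cleared means <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r2                            <span style="color: #339933;">/*</span> r4 ← C <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span>                           <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">16</span><span style="color: #339933;">.</span> Elements left to zero <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r6<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                            <span style="color: #339933;">/*</span> r6 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> We fall through to the store<span style="color: #339933;">.</span> Entering through the test would
       skip one store, so only <span style="color: #ff0000;">15</span> elements would be zeroed <span style="color: #339933;">*/</span>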
    <span style="color: #339933;">.</span>L2_loop_init <span style="color: #339933;">:</span>
      <span style="color: #00007f; font-weight: bold;">str</span> r6<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #339933;">+</span>#<span style="color: #ff0000;">4</span>   <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r4 ← r6 then r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L2_loop_init_test<span style="color: #339933;">:</span>
      subs r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
      bne <span style="color: #339933;">.</span>L2_loop_init
&nbsp;
    <span style="color: #339933;">/*</span> Set the LEN field of FPSCR to be <span style="color: #ff0000;">4</span> <span style="color: #009900; font-weight: bold;">&#40;</span>value <span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #0b011                        <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← r5 &lt;&lt; <span style="color: #ff0000;">16</span> <span style="color: #339933;">*/</span>
    fmrx r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
    orr r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 | r5 <span style="color: #339933;">*/</span>
    fmxr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> We will use 
           r4 as i
           r5 as j
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span> <span style="color: #339933;">/*</span> r4 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L2_loop_i<span style="color: #339933;">:</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of i <span style="color: #339933;">*/</span>
      <span style="color: #00007f; font-weight: bold;">cmp</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> if r4 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
      beq <span style="color: #339933;">.</span>L2_end_loop_i
      <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>  <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
      <span style="color: #339933;">.</span>L2_loop_j<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of j <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">cmp</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span> <span style="color: #339933;">/*</span> if r5 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
        beq <span style="color: #339933;">.</span>L2_end_loop_j
        <span style="color: #339933;">/*</span> Compute the address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #00007f; font-weight: bold;">and</span> keep it <span style="color: #00007f; font-weight: bold;">in</span> r7 <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> is C <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> i <span style="color: #339933;">+</span> j<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">mov</span> r7<span style="color: #339933;">,</span> r5               <span style="color: #339933;">/*</span> r7 ← r5<span style="color: #339933;">.</span> This is r7 ← j <span style="color: #339933;">*/</span>
        adds r7<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r7 ← r7 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r4 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> 
                                    This is r7 ← j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">.</span>
                                    We multiply i by the row size <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> elements<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        adds r7<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r7 ← r2 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r7 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span>
                                    This is r7 ← C <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span>
                                    We multiply <span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> by the size of the element<span style="color: #339933;">.</span>
                                    A single<span style="color: #339933;">-</span>precision floating point takes <span style="color: #ff0000;">4</span> bytes<span style="color: #339933;">.</span>
                                    <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Compute the address of a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
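        <span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>       <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r4 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
        adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r0 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← a <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>i <span style="color: #339933;">*</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>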
        vldmia <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">-</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Load <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #339933;">/*</span> Compute the address of b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">mov</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r5               <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r5<span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← j <span style="color: #339933;">*/</span>
        adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← b <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>j<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        vldr s16<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #009900; font-weight: bold;">&#93;</span>             <span style="color: #339933;">/*</span> s16 ← <span style="color: #339933;">*</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">.</span> This is s16 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s17<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s17 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s17 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s18<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">32</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s18 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">32</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s18 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s19<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">48</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s19 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">48</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s19 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #339933;">/*</span> Compute the address of b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r5 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← j <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
        adds <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>    <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← b <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>j <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        vldr s20<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #009900; font-weight: bold;">&#93;</span>             <span style="color: #339933;">/*</span> s20 ← <span style="color: #339933;">*</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">.</span> This is s20 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s21<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s21 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s21 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s22<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">32</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s22 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">32</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s22 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s23<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">48</span><span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> s23 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #46aa03; font-weight: bold;">r8</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">48</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is s23 ← b<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
&nbsp;
        vmul<span style="color: #339933;">.</span>f32 s24<span style="color: #339933;">,</span> s8<span style="color: #339933;">,</span> s16      <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">,</span>s25<span style="color: #339933;">,</span>s26<span style="color: #339933;">,</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
        vmov<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s24           <span style="color: #339933;">/*</span> s0 ← s24 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s0<span style="color: #339933;">,</span> s25       <span style="color: #339933;">/*</span> s0 ← s0 <span style="color: #339933;">+</span> s25 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s0<span style="color: #339933;">,</span> s26       <span style="color: #339933;">/*</span> s0 ← s0 <span style="color: #339933;">+</span> s26 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s0<span style="color: #339933;">,</span> s0<span style="color: #339933;">,</span> s27       <span style="color: #339933;">/*</span> s0 ← s0 <span style="color: #339933;">+</span> s27 <span style="color: #339933;">*/</span>
&nbsp;
        vmul<span style="color: #339933;">.</span>f32 s28<span style="color: #339933;">,</span> s8<span style="color: #339933;">,</span> s20      <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s28<span style="color: #339933;">,</span>s29<span style="color: #339933;">,</span>s30<span style="color: #339933;">,</span>s31<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s20<span style="color: #339933;">,</span>s21<span style="color: #339933;">,</span>s22<span style="color: #339933;">,</span>s23<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
&nbsp;
        vmov<span style="color: #339933;">.</span>f32 s1<span style="color: #339933;">,</span> s28           <span style="color: #339933;">/*</span> s1 ← s28 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s1<span style="color: #339933;">,</span> s1<span style="color: #339933;">,</span> s29       <span style="color: #339933;">/*</span> s1 ← s1 <span style="color: #339933;">+</span> s29 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s1<span style="color: #339933;">,</span> s1<span style="color: #339933;">,</span> s30       <span style="color: #339933;">/*</span> s1 ← s1 <span style="color: #339933;">+</span> s30 <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s1<span style="color: #339933;">,</span> s1<span style="color: #339933;">,</span> s31       <span style="color: #339933;">/*</span> s1 ← s1 <span style="color: #339933;">+</span> s31 <span style="color: #339933;">*/</span>
&nbsp;
        vstmia r7<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">-</span>s1<span style="color: #009900; font-weight: bold;">&#125;</span>         <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>j<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">,</span> s1<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">add</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r5 ← r5 <span style="color: #339933;">+</span> <span style="color: #ff0000;">2</span> <span style="color: #339933;">*/</span>
        b <span style="color: #339933;">.</span>L2_loop_j <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
       <span style="color: #339933;">.</span>L2_end_loop_j<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> j <span style="color: #339933;">*/</span>
       <span style="color: #00007f; font-weight: bold;">add</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span> <span style="color: #339933;">/*</span> r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
       b <span style="color: #339933;">.</span>L2_loop_i     <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L2_end_loop_i<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> Set the LEN field of FPSCR back to <span style="color: #ff0000;">1</span> <span style="color: #009900; font-weight: bold;">&#40;</span>value <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #0b011                        <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
    mvn r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← ~<span style="color: #009900; font-weight: bold;">&#40;</span>r5 &lt;&lt; <span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    fmrx r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">and</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 &amp; r5 <span style="color: #339933;">*/</span>
    fmxr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span>
&nbsp;
    vpop <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s31<span style="color: #009900; font-weight: bold;">&#125;</span>                <span style="color: #339933;">/*</span> Restore preserved floating registers <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Restore integer registers <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> function <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
Note that because we now process <code>j</code> and <code>j + 1</code>, <code>r5</code> (<code>j</code>) is now increased by 2 at the end of the loop. This is usually known as <em>loop unrolling</em> and it is always legal to do. In the unrolled loop we do more than one iteration of the original loop. The number of iterations of the original loop that we do in each iteration of the unrolled loop is the <em>unroll factor</em>. In this case, since the unroll factor (2) perfectly divides the number of iterations (4), we do not need an extra loop for the remainder iterations (such a remainder loop runs at most one iteration less than the unroll factor).
</p>
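<p>
As a rough C sketch of the idea, here is a simple loop unrolled by a factor of 2, together with the remainder loop that is needed when the number of iterations is not known to be a multiple of the unroll factor. The functions <code>sum_rolled</code> and <code>sum_unrolled2</code> are made up for this illustration and are not part of the matrix multiply.
</p>

```c
#include <stddef.h>

/* Original loop: one element per iteration. */
float sum_rolled(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Same loop with an unroll factor of 2 plus a remainder loop. */
float sum_unrolled2(const float *v, size_t n)
{
    float s = 0.0f;
    size_t i = 0;

    /* Unrolled loop: two iterations of the original loop per trip. */
    for (; i + 1 < n; i += 2) {
        s += v[i];
        s += v[i + 1];
    }

    /* Remainder loop: runs at most (unroll factor - 1) = 1 time. */
    for (; i < n; i++)
        s += v[i];

    return s;
}
```

<p>
Both functions add the elements in the same order, so they return the same value. When <code>n</code> is a multiple of 2, as in our 4x4 case, the remainder loop does not run at all.
</p>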
<p>
As you can see, the accesses to <code>b[k][j]</code> and <code>b[k][j+1]</code> are starting to become tedious. Maybe we should change the matrix multiply algorithm a bit more.
</p>
<h2>Reorder the accesses</h2>
<p>
Is there a way we can mitigate the strided accesses to the matrix B? Yes, there is one: we only have to permute the loop nest i, j, k into the loop nest k, i, j. You may be wondering whether this is legal. Checking the legality of such transformations is beyond the scope of this post, so you will have to trust me here: this permutation is fine. It means that our algorithm will now look like this.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">float</span> A<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
<span style="color: #993333;">float</span> B<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// Result</span>
<span style="color: #993333;">float</span> C<span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>N<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> j <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> j<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
    C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> k <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> k <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> k<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
    <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> j <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> j<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
       C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> A<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">*</span> B<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>
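<p>
If you would rather not just trust me, here is a small C sketch that computes the product with both loop nests so the results can be compared. The function names <code>matmul_ijk</code> and <code>matmul_kij</code> are only for this illustration, and N is assumed to be 4 as in the rest of the chapter.
</p>

```c
#include <string.h>

#define N 4

/* Original loop nest i, j, k: B is walked down a column (stride of N floats). */
void matmul_ijk(float A[N][N], float B[N][N], float C[N][N])
{
    /* In IEEE 754, all bits cleared means 0 */
    memset(C, 0, sizeof(float) * N * N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Permuted loop nest k, i, j: rows of B and C are walked contiguously. */
void matmul_kij(float A[N][N], float B[N][N], float C[N][N])
{
    memset(C, 0, sizeof(float) * N * N);
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```

<p>
In both versions every <code>C[i][j]</code> accumulates the same products in the same order of k (0, 1, 2, 3). Only the moment at which each partial sum is updated changes, which is why the permutation does not alter the result.
</p>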

<p>
The permuted loop nest may not seem very useful, but note that, now that k is in the outermost loop, it is easier to use vectorial instructions. Since N is 4, we can write the four iterations of the innermost loop j explicitly.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> k <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> k <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> k<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
     C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> A<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">*</span> B<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
     C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> A<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">*</span> B<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
     C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> A<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">*</span> B<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
     C<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">3</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> A<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">*</span> B<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">3</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>
If you remember chapter 13, VFPv2 instructions have a mixed mode when the <code>Rsource2</code> register is in bank 0. This case is a perfect match: we can load <code>C[i][0..3]</code> and <code>B[k][0..3]</code> with a load multiple and then load <code>A[i][k]</code> in a register of bank 0. Then we can multiply <code>A[i][k]*B[k][0..3]</code> and add the result to <code>C[i][0..3]</code>. As a bonus, the number of instructions is much lower.
</p>
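<p>
Before reading the assembler, it may help to model in C what one mixed-mode multiply followed by a vector add does when LEN is 4. This is only a sketch: the function names <code>vmul_mixed</code>, <code>vadd_vec</code> and <code>row_update</code> are hypothetical, and details of the real hardware such as register banks and strides are not modelled.
</p>

```c
#define LEN 4 /* vector length, as selected through the LEN field of fpscr */

/* Model of a mixed-mode multiply: dst[0..LEN-1] ← vec[0..LEN-1] * scalar */
void vmul_mixed(float dst[LEN], const float vec[LEN], float scalar)
{
    for (int i = 0; i < LEN; i++)
        dst[i] = vec[i] * scalar;
}

/* Model of a vector add: dst[0..LEN-1] ← a[0..LEN-1] + b[0..LEN-1] */
void vadd_vec(float dst[LEN], const float a[LEN], const float b[LEN])
{
    for (int i = 0; i < LEN; i++)
        dst[i] = a[i] + b[i];
}

/* One step of the k, i loop nest: C[i][0..3] += A[i][k] * B[k][0..3] */
void row_update(float c_row[LEN], float a_ik, const float b_row[LEN])
{
    float tmp[LEN]; /* plays the role of a temporary vector of registers */
    vmul_mixed(tmp, b_row, a_ik);
    vadd_vec(c_row, c_row, tmp);
}
```

<p>
Here <code>c_row</code> stands for <code>C[i][0..3]</code>, <code>b_row</code> for <code>B[k][0..3]</code> and <code>a_ik</code> for the scalar <code>A[i][k]</code> kept in bank 0.
</p>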

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">better_vectorial_matmul_4x4<span style="color: #339933;">:</span>
    <span style="color: #339933;">/*</span> r0 address of A
       r1 address of B
       r2 address of C
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span> Keep integer registers <span style="color: #339933;">*/</span>
    vpush <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span>               <span style="color: #339933;">/*</span> Floating point registers starting from s16 must be preserved <span style="color: #339933;">*/</span>
    vpush <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">-</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #339933;">/*</span> First zero the <span style="color: #ff0000;">16</span> single precision floats of C <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">In</span> IEEE <span style="color: #ff0000;">754</span><span style="color: #339933;">,</span> all <span style="color: #0000ff; font-weight: bold;">bits</span> cleared means <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r2
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r6<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>
    b <span style="color: #339933;">.</span>L3_loop_init_test
    <span style="color: #339933;">.</span>L3_loop_init <span style="color: #339933;">:</span>
      <span style="color: #00007f; font-weight: bold;">str</span> r6<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #339933;">+</span>#<span style="color: #ff0000;">4</span>   <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r4 ← r6 then r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L3_loop_init_test<span style="color: #339933;">:</span>
      subs r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
      bge <span style="color: #339933;">.</span>L3_loop_init
&nbsp;
    <span style="color: #339933;">/*</span> Set the LEN field of FPSCR to be <span style="color: #ff0000;">4</span> <span style="color: #009900; font-weight: bold;">&#40;</span>value <span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #0b011                        <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← r5 &lt;&lt; <span style="color: #ff0000;">16</span> <span style="color: #339933;">*/</span>
    fmrx r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
    orr r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 | r5 <span style="color: #339933;">*/</span>
    fmxr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> We will use 
           r4 as k
           r5 as i
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span> <span style="color: #339933;">/*</span> r4 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L3_loop_k<span style="color: #339933;">:</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of k <span style="color: #339933;">*/</span>
      <span style="color: #00007f; font-weight: bold;">cmp</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> if r4 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
      beq <span style="color: #339933;">.</span>L3_end_loop_k
      <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>  <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
      <span style="color: #339933;">.</span>L3_loop_i<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of i <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">cmp</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span> <span style="color: #339933;">/*</span> if r5 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
        beq <span style="color: #339933;">.</span>L3_end_loop_i
        <span style="color: #339933;">/*</span> Compute the address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> is C <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> r7<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">4</span>         <span style="color: #339933;">/*</span> r7 ← r2 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r5 &lt;&lt; <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is r7 ← c <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>i <span style="color: #339933;">*/</span>
        vldmia r7<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">-</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span>            <span style="color: #339933;">/*</span> Load <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Compute the address of A<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of A<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> is A <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>i <span style="color: #339933;">+</span> k<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>         <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r4 <span style="color: #339933;">+</span> r5 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← k <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>i <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>         <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r0 <span style="color: #339933;">+</span> <span style="color: #46aa03; font-weight: bold;">r8</span> &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← a <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>k <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        vldr s0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #009900; font-weight: bold;">&#93;</span>                  <span style="color: #339933;">/*</span> Load s0 ← a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #339933;">/*</span> Compute the address of B<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of B<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> is B <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>k<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">4</span>         <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r1 <span style="color: #339933;">+</span> r4 &lt;&lt; <span style="color: #ff0000;">4</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← b <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>k<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        vldmia <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span>           <span style="color: #339933;">/*</span> Load <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
&nbsp;
        vmul<span style="color: #339933;">.</span>f32 s24<span style="color: #339933;">,</span> s16<span style="color: #339933;">,</span> s0          <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">,</span>s25<span style="color: #339933;">,</span>s26<span style="color: #339933;">,</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">,</span>s0<span style="color: #339933;">,</span>s0<span style="color: #339933;">,</span>s0<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
        vadd<span style="color: #339933;">.</span>f32 s8<span style="color: #339933;">,</span> s8<span style="color: #339933;">,</span> s24           <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">,</span>s25<span style="color: #339933;">,</span>s26<span style="color: #339933;">,</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
&nbsp;
        vstmia r7<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">-</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span>            <span style="color: #339933;">/*</span> Store <span style="color: #009900; font-weight: bold;">&#123;</span>c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">add</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>  <span style="color: #339933;">/*</span> r5 ← r5 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is i = i <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
        b <span style="color: #339933;">.</span>L3_loop_i <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
       <span style="color: #339933;">.</span>L3_end_loop_i<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
       <span style="color: #00007f; font-weight: bold;">add</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span> <span style="color: #339933;">/*</span> r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is k = k <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
       b <span style="color: #339933;">.</span>L3_loop_k     <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L3_end_loop_k<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> Set the LEN field of FPSCR back to <span style="color: #ff0000;">1</span> <span style="color: #009900; font-weight: bold;">&#40;</span>value <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #0b011                        <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
    mvn r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← ~<span style="color: #009900; font-weight: bold;">&#40;</span>r5 &lt;&lt; <span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    fmrx r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">and</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 &amp; r5 <span style="color: #339933;">*/</span>
    fmxr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span>
&nbsp;
    vpop <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">-</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span>                <span style="color: #339933;">/*</span> Restore preserved floating registers <span style="color: #339933;">*/</span>
    vpop <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Restore integer registers <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> function <span style="color: #339933;">*/</span></pre></td></tr></table></div>
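
<p>
The last integer instructions of the function above clear the same FPSCR bits that were set to select a vector length of 4. As a rough sketch in C, the two masks look as follows. The function and macro names are made up for illustration and do not come from any ARM header; on the Raspberry Pi the value itself would still be read and written with <code>fmrx</code> and <code>fmxr</code>.
</p>

```c
#include <stdint.h>

/* The listing builds its mask from 0b011 shifted left by 16 (hypothetical name). */
#define LEN_MASK ((uint32_t)3 << 16)

/* orr r4, r4, r5: request vectors of 4 elements (LEN field value 3). */
uint32_t fpscr_set_len_4(uint32_t fpscr)
{
    return fpscr | LEN_MASK;
}

/* mvn r5, r5, LSL #16 followed by and r4, r4, r5:
   clear the same bits again, back to scalar operation (LEN field value 0). */
uint32_t fpscr_reset_len(uint32_t fpscr)
{
    return fpscr & ~LEN_MASK;
}
```

<p>
Note that <code>mvn</code> moves the bitwise complement of its shifted operand, which is why a single instruction is enough to build the clearing mask.
</p>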

<p>
A multiplication followed by an addition is a fairly common sequence, so we can replace
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vmul<span style="color: #339933;">.</span>f32 s24<span style="color: #339933;">,</span> s16<span style="color: #339933;">,</span> s0          <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">,</span>s25<span style="color: #339933;">,</span>s26<span style="color: #339933;">,</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">,</span>s0<span style="color: #339933;">,</span>s0<span style="color: #339933;">,</span>s0<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
vadd<span style="color: #339933;">.</span>f32 s8<span style="color: #339933;">,</span> s8<span style="color: #339933;">,</span> s24           <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">,</span>s25<span style="color: #339933;">,</span>s26<span style="color: #339933;">,</span>s27<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
with a single <code>vmla</code> instruction (multiply and accumulate).
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vmla<span style="color: #339933;">.</span>f32 s8<span style="color: #339933;">,</span> s16<span style="color: #339933;">,</span> s0          <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">,</span>s0<span style="color: #339933;">,</span>s0<span style="color: #339933;">,</span>s0<span style="color: #009900; font-weight: bold;">&#125;</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>
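
<p>
To see why the replacement is safe, here is a minimal sketch in C of what both sequences compute when the vector length is 4. The function names and the temporary array are assumptions made for illustration only: <code>acc</code> plays the role of <code>{s8,s9,s10,s11}</code>, <code>b</code> that of <code>{s16,s17,s18,s19}</code> and <code>a</code> that of the scalar <code>s0</code>.
</p>

```c
/* vmul.f32 s24, s16, s0 followed by vadd.f32 s8, s8, s24 */
void mul_then_add(float acc[4], const float b[4], float a)
{
    float tmp[4]; /* stands for {s24,s25,s26,s27} */
    for (int j = 0; j < 4; j++)
        tmp[j] = b[j] * a;
    for (int j = 0; j < 4; j++)
        acc[j] = acc[j] + tmp[j];
}

/* vmla.f32 s8, s16, s0: no temporary vector is needed */
void mul_accumulate(float acc[4], const float b[4], float a)
{
    for (int j = 0; j < 4; j++)
        acc[j] = acc[j] + b[j] * a;
}
```

<p>
The single-instruction form also frees <code>s24</code> to <code>s27</code>, so they no longer have to be saved and restored.
</p>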

<p>
Now we can also unroll loop <code>i</code>, again by a factor of 2. This gives us the <em>best</em> version.
</p>
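
<p>
In C terms, the structure we are aiming for is roughly the following. This is only a sketch: the function name is made up, and the inner loop over <code>j</code> stands for what a single vector instruction of length 4 does in the assembler version.
</p>

```c
/* C ← A * B for 4x4 matrices, with loop i unrolled by a factor of 2. */
void matmul_4x4_unrolled(float a[4][4], float b[4][4], float c[4][4])
{
    /* First zero the 16 elements of C. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            c[i][j] = 0.0f;

    for (int k = 0; k < 4; k++) {
        /* i advances by 2 because each iteration handles rows i and i+1. */
        for (int i = 0; i < 4; i += 2) {
            for (int j = 0; j < 4; j++) {
                c[i][j]     += a[i][k]     * b[k][j]; /* first multiply and add */
                c[i + 1][j] += a[i + 1][k] * b[k][j]; /* second multiply and add */
            }
        }
    }
}
```

<p>
Both rows reuse the same row <code>b[k]</code>, so it only has to be loaded once per iteration of loop <code>i</code>.
</p>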

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">best_vectorial_matmul_4x4<span style="color: #339933;">:</span>
    <span style="color: #339933;">/*</span> r0 address of A
       r1 address of B
       r2 address of C
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span> Keep integer registers <span style="color: #339933;">*/</span>
    vpush <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span>               <span style="color: #339933;">/*</span> Floating point registers starting from s16 must be preserved <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> First zero the <span style="color: #ff0000;">16</span> single-precision elements of C <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">In</span> IEEE <span style="color: #ff0000;">754</span><span style="color: #339933;">,</span> all <span style="color: #0000ff; font-weight: bold;">bits</span> cleared means <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r2
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r6<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>
    b <span style="color: #339933;">.</span>L4_loop_init_test
    <span style="color: #339933;">.</span>L4_loop_init <span style="color: #339933;">:</span>
      <span style="color: #00007f; font-weight: bold;">str</span> r6<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #339933;">+</span>#<span style="color: #ff0000;">4</span>   <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r4 ← r6 then r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L4_loop_init_test<span style="color: #339933;">:</span>
      subs r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
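      bge <span style="color: #339933;">.</span>L4_loop_init        <span style="color: #339933;">/*</span> repeat while r5 ≥ <span style="color: #ff0000;">0</span> so that all <span style="color: #ff0000;">16</span> elements are zeroed <span style="color: #339933;">*/</span>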
&nbsp;
    <span style="color: #339933;">/*</span> Set the LEN field of FPSCR to be <span style="color: #ff0000;">4</span> <span style="color: #009900; font-weight: bold;">&#40;</span>value <span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #0b011                        <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← r5 &lt;&lt; <span style="color: #ff0000;">16</span> <span style="color: #339933;">*/</span>
    fmrx r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
    orr r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 | r5 <span style="color: #339933;">*/</span>
    fmxr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> We will use 
           r4 as k
           r5 as i
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span> <span style="color: #339933;">/*</span> r4 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L4_loop_k<span style="color: #339933;">:</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of k <span style="color: #339933;">*/</span>
      <span style="color: #00007f; font-weight: bold;">cmp</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> if r4 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
      beq <span style="color: #339933;">.</span>L4_end_loop_k
      <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>  <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
      <span style="color: #339933;">.</span>L4_loop_i<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">loop</span> header of i <span style="color: #339933;">*/</span>
       <span style="color: #00007f; font-weight: bold;">cmp</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span> <span style="color: #339933;">/*</span> if r5 == <span style="color: #ff0000;">4</span> goto end of the <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
        beq <span style="color: #339933;">.</span>L4_end_loop_i
        <span style="color: #339933;">/*</span> Compute the address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of C<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> is C <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> r7<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">4</span>         <span style="color: #339933;">/*</span> r7 ← r2 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r5 &lt;&lt; <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is r7 ← c <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>i <span style="color: #339933;">*/</span>
        vldmia r7<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">-</span>s15<span style="color: #009900; font-weight: bold;">&#125;</span>            <span style="color: #339933;">/*</span> Load <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #339933;">,</span>s12<span style="color: #339933;">,</span>s13<span style="color: #339933;">,</span>s14<span style="color: #339933;">,</span>s15<span style="color: #009900; font-weight: bold;">&#125;</span> 
                                            ← <span style="color: #009900; font-weight: bold;">&#123;</span>c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span>   c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span>   c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span>   c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span>
                                               c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Compute the address of A<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of A<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> is A <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>i <span style="color: #339933;">+</span> k<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>         <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r4 <span style="color: #339933;">+</span> r5 &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← k <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>i <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>         <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r0 <span style="color: #339933;">+</span> <span style="color: #46aa03; font-weight: bold;">r8</span> &lt;&lt; <span style="color: #ff0000;">2</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← a <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>k <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>i<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        vldr s0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #009900; font-weight: bold;">&#93;</span>                  <span style="color: #339933;">/*</span> Load s0 ← a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        vldr s1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#93;</span>             <span style="color: #339933;">/*</span> Load s1 ← a<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #339933;">/*</span> Compute the address of B<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #339933;">*/</span>
        <span style="color: #339933;">/*</span> Address of B<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span> is B <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>k<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">4</span>         <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> ← r1 <span style="color: #339933;">+</span> r4 &lt;&lt; <span style="color: #ff0000;">4</span><span style="color: #339933;">.</span> This is <span style="color: #46aa03; font-weight: bold;">r8</span> ← b <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>k<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        vldmia <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span>           <span style="color: #339933;">/*</span> Load <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> b<span style="color: #009900; font-weight: bold;">&#91;</span>k<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: 
#339933;">*/</span>
&nbsp;
        vmla<span style="color: #339933;">.</span>f32 s8<span style="color: #339933;">,</span> s16<span style="color: #339933;">,</span> s0           <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">,</span>s0<span style="color: #339933;">,</span>s0<span style="color: #339933;">,</span>s0<span style="color: #009900; font-weight: bold;">&#125;</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
        vmla<span style="color: #339933;">.</span>f32 s12<span style="color: #339933;">,</span> s16<span style="color: #339933;">,</span> s1          <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s12<span style="color: #339933;">,</span>s13<span style="color: #339933;">,</span>s14<span style="color: #339933;">,</span>s15<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s12<span style="color: #339933;">,</span>s13<span style="color: #339933;">,</span>s14<span style="color: #339933;">,</span>s15<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s17<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s1<span style="color: #339933;">,</span>s1<span style="color: #339933;">,</span>s1<span style="color: #339933;">,</span>s1<span style="color: #009900; font-weight: bold;">&#125;</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
&nbsp;
        vstmia r7<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">-</span>s15<span style="color: #009900; font-weight: bold;">&#125;</span>            <span style="color: #339933;">/*</span> Store <span style="color: #009900; font-weight: bold;">&#123;</span>c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span>   c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span>   c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span>    c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span>
                                                 c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> c<span style="color: #009900; font-weight: bold;">&#91;</span>i<span style="color: #339933;">+</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #ff0000;">3</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #009900; font-weight: bold;">&#125;</span>
                                                ← <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s9<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s11<span style="color: #339933;">,</span>s12<span style="color: #339933;">,</span>s13<span style="color: #339933;">,</span>s14<span style="color: #339933;">,</span>s15<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">add</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r5 ← r5 <span style="color: #339933;">+</span> <span style="color: #ff0000;">2</span><span style="color: #339933;">.</span> This is i = i <span style="color: #339933;">+</span> <span style="color: #ff0000;">2</span> <span style="color: #339933;">*/</span>
        b <span style="color: #339933;">.</span>L4_loop_i <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
       <span style="color: #339933;">.</span>L4_end_loop_i<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> i <span style="color: #339933;">*/</span>
       <span style="color: #00007f; font-weight: bold;">add</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span> <span style="color: #339933;">/*</span> r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is k = k <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
       b <span style="color: #339933;">.</span>L4_loop_k     <span style="color: #339933;">/*</span> next iteration of <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
    <span style="color: #339933;">.</span>L4_end_loop_k<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> Here ends <span style="color: #00007f; font-weight: bold;">loop</span> k <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> Set the LEN field of FPSCR back to <span style="color: #ff0000;">1</span> <span style="color: #009900; font-weight: bold;">&#40;</span>value <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #0b011                        <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
    mvn r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← ~<span style="color: #009900; font-weight: bold;">&#40;</span>r5 &lt;&lt; <span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    fmrx r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">and</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 &amp; r5 <span style="color: #339933;">*/</span>
    fmxr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span>
&nbsp;
    vpop <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">-</span>s19<span style="color: #009900; font-weight: bold;">&#125;</span>                <span style="color: #339933;">/*</span> Restore preserved floating registers <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Restore integer registers <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> function <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<h2>Comparing versions</h2>
<p>
Out of curiosity I benchmarked the versions to see which one was fastest.
</p>
<p>
The benchmark consists of calling the matrix multiplication function 2<sup>21</sup> times in order to magnify the differences. While the input should also be randomized for a better benchmark, this setup more or less models contexts where a matrix multiplication is performed many times (for instance in graphics).
</p>
<p>
This is the skeleton of the benchmark.</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">main<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> addr_mat_A  <span style="color: #339933;">/*</span> r0 ← a <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> addr_mat_B  <span style="color: #339933;">/*</span> r1 ← b <span style="color: #339933;">*/</span>
    ldr r2<span style="color: #339933;">,</span> addr_mat_C  <span style="color: #339933;">/*</span> r2 ← c <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">21</span>
    <span style="color: #339933;">.</span>Lmain_loop_test<span style="color: #339933;">:</span> 
      <span style="color: #46aa03; font-weight: bold;">bl</span> &lt;&lt;tested<span style="color: #339933;">-</span>matmul<span style="color: #339933;">-</span>routine&gt;&gt; <span style="color: #339933;">/*</span> Change here with the matmul you want to <span style="color: #00007f; font-weight: bold;">test</span> <span style="color: #339933;">*/</span>
      subs r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
      bne <span style="color: #339933;">.</span>Lmain_loop_test
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr</pre></td></tr></table></div>

<p>
Here are the results. The one we named the best turned out to actually deserve that name: it is more than four times faster than the naive version.
</p>
<table>
<tr>
<th>Version</th>
<th>Time (seconds)</th>
</tr>
<tr>
<td>naive_matmul_4x4</td>
<td>6.41</td>
</tr>
<tr>
<td>naive_vectorial_matmul_4x4</td>
<td>3.51</td>
</tr>
<tr>
<td>naive_vectorial_matmul_2_4x4</td>
<td>2.87</td>
</tr>
<tr>
<td>better_vectorial_matmul_4x4</td>
<td>2.59</td>
</tr>
<tr>
<td>best_vectorial_matmul_4x4</td>
<td>1.51</td>
</tr>
</table>
<p>
That's all for today.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/05/12/arm-assembler-raspberry-pi-chapter-14/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ARM assembler in Raspberry Pi – Chapter 13</title>
		<link>http://thinkingeek.com/2013/05/12/arm-assembler-raspberry-pi-chapter-13/</link>
		<comments>http://thinkingeek.com/2013/05/12/arm-assembler-raspberry-pi-chapter-13/#comments</comments>
		<pubDate>Sun, 12 May 2013 14:33:12 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=911</guid>
		<description><![CDATA[So far, all examples have dealt with integer values. But processors would be rather limited if they were only able to work with integer values. Fortunately they can work with floating point numbers. In this chapter we will see how we can use the floating point facilities of our Raspberry Pi. Floating point numbers Following [...]]]></description>
				<content:encoded><![CDATA[<p>
So far, all examples have dealt with integer values. But processors would be rather limited if they were only able to work with integer values. Fortunately they can work with floating point numbers. In this chapter we will see how we can use the floating point facilities of our Raspberry Pi.
</p>
<p><span id="more-911"></span></p>
<h2>Floating point numbers</h2>
<p>
What follows is a quick recap of what a floating point number is.
</p>
<p>
A <em>binary floating point number</em> is an approximate representation of a real number with three parts: <em>sign</em>, <em>mantissa</em> and <em>exponent</em>. The <em>sign</em> is either 0 or 1, where 1 means a negative number and 0 a positive one. The <em>mantissa</em> represents a fractional magnitude. Similarly to 1.2345 we can have a binary <code>1.01110</code> where every digit is just a bit. The dot marks where the integer part ends and the fractional part starts. Note that there is nothing special in binary fractional numbers: <code>1.01110</code> is just 2<sup>0</sup> + 2<sup>-2</sup> + 2<sup>-3</sup> + 2<sup>-4</sup> = 1.43750<sub>(10</sub>. Usually numbers are normalized, which means that the mantissa is adjusted so the integer part is always 1, so instead of <em>0.00110101</em> we would represent <em>1.10101</em> (in fact a floating point number may be a <em>denormal</em> if this property does not hold, but such numbers lie in a very specific range so we can ignore them here). If the mantissa is adjusted so it always has a single 1 as the integer part, two things happen. First, we do not need to store the integer part (as it is always 1 in normalized numbers). Second, to make things sound we need an <em>exponent</em> which compensates for the mantissa being normalized. This means that the number -101.110111 (remember that it is a binary real number) will be represented by sign = 1, mantissa = 1.01110111 and exponent = 2 (because we moved the dot 2 digits to the left). Similarly, the number 0.0010110111 is represented by sign = 0, mantissa = 1.0110111 and exponent = -3 (we moved the dot 3 digits to the right).
</p>
<p>
In order for different computers to be able to share floating point numbers, IEEE 754 standardizes the format of a floating point number. VFPv2 supports two of the IEEE 754 formats: Binary32 and Binary64, usually known by their C types, <code>float</code> and <code>double</code>, or as single- and double-precision, respectively. In a <a href="http://en.wikipedia.org/wiki/Single_precision_floating-point_format" title="Single-precision floating-point format">single-precision floating point</a> number the mantissa is 23 bits (+1 for the implicit integer bit of normalized numbers) and the exponent is 8 bits (so the exponent ranges from -126 to 127). In a <a href="http://en.wikipedia.org/wiki/Double_precision_floating-point_format" title="Double-precision floating-point format">double-precision floating point</a> number the mantissa is 52 bits (+1) and the exponent is 11 bits (so the exponent ranges from -1022 to 1023). A single-precision floating point number occupies 32 bits and a double-precision floating point number occupies 64 bits. Operating on double-precision numbers is on average one and a half to two times slower than operating on single-precision ones.
</p>
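<p>
As a rough illustration of this layout, the following sketch splits a single-precision bit pattern into its three fields using only integer instructions. It assumes the usual Binary32 encoding (sign in bit 31, exponent in bits 30-23 stored with a bias of 127, mantissa in bits 22-0); the bias is part of IEEE 754 but is not discussed above. The constant is the encoding of the number -101.110111 used as an example earlier.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* Sketch only: decompose the Binary32 bit pattern held in r0 */
ldr r0, =0xC0BB8000        /* r0 ← bit pattern of -101.110111 (binary) */
mov r1, r0, LSR #31        /* r1 ← sign bit. Here 1, a negative number */
mov r2, r0, LSR #23        /* r2 ← sign and biased exponent */
and r2, r2, #0xFF          /* r2 ← biased exponent. Here 129 */
sub r2, r2, #127           /* r2 ← exponent. Here 2 */
bic r3, r0, #0xFF000000    /* r3 ← r0 with bits 31-24 cleared */
bic r3, r3, #0x00800000    /* r3 ← stored mantissa bits. Here 0x3B8000, this is 01110111 followed by zeros */</pre></td></tr></table></div>

<p>
Note that the leading 1 of the mantissa does not appear in <code>r3</code>, since it is not stored for normalized numbers.
</p>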
<p>
<a href="http://cr.yp.to/2005-590/goldberg.pdf" title="What Every Computer Scientist Should Know About Floating Point Arithmetic">Goldberg&#8217;s famous paper</a> is a classic reference that should be read by anyone serious about using floating point numbers.
</p>
<h2>Coprocessors</h2>
<p>
As I stated several times in earlier chapters, ARM was designed to be very flexible. We can see this in the fact that the ARM architecture provides a generic coprocessor interface. Manufacturers of system-on-chips may bundle additional coprocessors. Each coprocessor is identified by a number and provides specific instructions. For instance the Raspberry Pi SoC is a BCM2835 which provides a multimedia coprocessor (which we will not discuss here).
</p>
<p>
That said, there are two standard coprocessors in the ARMv6 architecture: 10 and 11. These two coprocessors provide floating point support for single and double precision, respectively. Although the floating point instructions have their own specific names, they are actually mapped to generic coprocessor instructions targeting coprocessor 10 and 11.
</p>
<h2>Vector Floating-point v2</h2>
<p>
ARMv6 defines a floating point subarchitecture called the Vector Floating-point v2 (VFPv2). It is version 2 because earlier ARM architectures supported a simpler form, now called v1. As stated above, the VFP is implemented on top of two standardized coprocessors, 10 and 11. ARMv6 does not require VFPv2 to be implemented in hardware (one can always resort to a slower software implementation). Fortunately, the Raspberry Pi does provide a hardware implementation of VFPv2.
</p>
<h2>VFPv2 Registers</h2>
<p>
We already know that the ARM architecture provides 16 general purpose registers <code>r0</code> to <code>r15</code>, where some of them play special roles: <code>r13</code>, <code>r14</code> and <code>r15</code>. Despite their name, these general purpose registers do not allow operating on floating point numbers, so VFPv2 provides us with some specific registers. These registers are named <code>s0</code> to <code>s31</code>, for single-precision, and <code>d0</code> to <code>d15</code> for double-precision. These are not 48 different registers. Instead every <code>d<sub>n</sub></code> is mapped to the two consecutive registers <code>s<sub>2n</sub></code> and <code>s<sub>2n+1</sub></code>, where <code>n</code> ranges from 0 to 15 (so <code>d0</code> overlaps <code>s0</code> and <code>s1</code>, and <code>d15</code> overlaps <code>s30</code> and <code>s31</code>).
</p>
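<p>
The overlap between single- and double-precision registers can be observed with a small sketch like the one below. It relies on <code>vmov</code>, an instruction not introduced so far in this chapter, which here is assumed to copy raw bits between general purpose and floating point registers without any conversion. The constants are arbitrary bit patterns chosen to be easy to recognize.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* Sketch only: d0 shares its storage with s0 and s1 */
ldr r0, =0x11111111
ldr r1, =0x22222222
vmov s0, r0            /* s0 ← r0. Raw bits, no conversion */
vmov s1, r1            /* s1 ← r1 */
vmov r2, r3, d0        /* r2 ← lower 32 bits of d0, r3 ← upper 32 bits of d0 */
/* Now r2 is 0x11111111 and r3 is 0x22222222:
   s0 is the lower half of d0 and s1 is the upper half */</pre></td></tr></table></div>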
<p>
These registers are structured in 4 banks: <code>s0</code>-<code>s7</code> (<code>d0</code>-<code>d3</code>), <code>s8</code>-<code>s15</code> (<code>d4</code>-<code>d7</code>), <code>s16</code>-<code>s23</code> (<code>d8</code>-<code>d11</code>) and <code>s24</code>-<code>s31</code> (<code>d12</code>-<code>d15</code>). We will call the first bank (bank 0, <code>s0</code>-<code>s7</code>, <code>d0</code>-<code>d3</code>) the <em>scalar</em> bank, while the remaining three are <em>vectorial</em> banks (below we will see why).
</p>
<p><img src="http://thinkingeek.com/wp-content/uploads/2013/04/vfp-registers.png" alt="vfp-registers" width="558" height="387" class="aligncenter size-full wp-image-976" /></p>
<p>
VFPv2 provides three control registers but we will only be interested in one called <code>fpscr</code>. This register is similar to the <code>cpsr</code> as it keeps the usual comparison flags <code>N</code>, <code>Z</code>, <code>C</code> and <code>V</code>. It also stores two fields that are very useful, <code>len</code> and <code>stride</code>. These two fields control how floating point instructions behave. We will not care very much about the remaining information in this register: status information of the floating point exceptions, the current rounding mode and whether denormal numbers are flushed to zero.
</p>
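<p>
A possible way to change these two fields is sketched below. It assumes the layout given in the ARM reference documentation, which should be double-checked before relying on it: <code>len</code> lives in bits 18-16 and holds the vector length minus one, while <code>stride</code> lives in bits 21-20, where the value 0 means a stride of 1 and the value 3 means a stride of 2. The sketch sets <code>len</code> = 4 and <code>stride</code> = 2 and leaves the rest of the register untouched.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* Sketch only: set len = 4 and stride = 2 in fpscr */
fmrx r4, fpscr                 /* r4 ← fpscr */
bic r4, r4, #0x00370000        /* clear the stride field (bits 21-20) and the len field (bits 18-16) */
orr r4, r4, #0x00030000        /* len field ← 3. This is len = 4 */
orr r4, r4, #0x00300000        /* stride field ← 3. This is stride = 2 */
fmxr fpscr, r4                 /* fpscr ← r4 */</pre></td></tr></table></div>

<p>
Code that changes <code>len</code> or <code>stride</code> should set them back to 1 before returning, since other code will normally expect scalar behaviour.
</p>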
<h2>Arithmetic operations</h2>
<p>
Most VFPv2 instructions are of the form <code>f<em>name</em> Rdest, Rsource1, Rsource2</code> or <code>f<em>name</em> Rdest, Rsource1</code>. They have three modes of operation.
</p>
<ul>
<li>Scalar. This mode is used when the destination register is in bank 0 (<code>s0</code>-<code>s7</code> or <code>d0</code>-<code>d3</code>). In this case, the instruction operates only on <code>Rsource1</code> and <code>Rsource2</code>. No other registers are involved.
<li>Vectorial. This mode is used when the destination register and Rsource2 (or Rsource1 for instructions with only one source register) are not in bank 0. In this case the instruction operates on as many registers (starting from the register given in the instruction and wrapping around within its bank) as defined by the <code>len</code> field of the <code>fpscr</code> (at least 1). The next register operated on is determined by the <code>stride</code> field of the <code>fpscr</code> (at least 1). If wrap-around happens, no register can be operated on twice.
<li>Scalar expanded (also called <em>mixed vector/scalar</em>). This mode is used if Rsource2 (or Rsource1 if the instruction only has one source register) is in bank 0, but the destination is not. In this case Rsource2 (or Rsource1 for instructions with only one source) is left fixed as the source. The remaining registers are operated on as in the vectorial case (that is, using <code>len</code> and <code>stride</code> from the <code>fpscr</code>).
</ul>
<p>
Ok, this looks pretty complicated, so let&#8217;s see some examples. Most instructions end in <code>.f32</code> if they operate on single-precision and in <code>.f64</code> if they operate on double-precision. We can add two single-precision numbers using <code>vadd.f32 Rdest, Rsource1, Rsource2</code> and two double-precision numbers using <code>vadd.f64 Rdest, Rsource1, Rsource2</code>. Note also that we can use predication in these instructions (but be aware that, as usual, predication uses the flags in <code>cpsr</code>, not those in <code>fpscr</code>). Predication is specified before the suffix, as in <code>vaddne.f32</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">//</span> For this example assume that len = <span style="color: #ff0000;">4</span><span style="color: #339933;">,</span> stride = <span style="color: #ff0000;">2</span>
vadd<span style="color: #339933;">.</span>f32 s1<span style="color: #339933;">,</span> s2<span style="color: #339933;">,</span> s3  <span style="color: #339933;">/*</span> s1 ← s2 <span style="color: #339933;">+</span> s3<span style="color: #339933;">.</span> Scalar operation because Rdest = s1 <span style="color: #00007f; font-weight: bold;">in</span> the bank <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
vadd<span style="color: #339933;">.</span>f32 s1<span style="color: #339933;">,</span> s8<span style="color: #339933;">,</span> s15 <span style="color: #339933;">/*</span> s1 ← s8 <span style="color: #339933;">+</span> s15<span style="color: #339933;">.</span> ditto <span style="color: #339933;">*/</span>
vadd<span style="color: #339933;">.</span>f32 s8<span style="color: #339933;">,</span> s16<span style="color: #339933;">,</span> s24 <span style="color: #339933;">/*</span> s8  ← s16 <span style="color: #339933;">+</span> s24
                      s10 ← s18 <span style="color: #339933;">+</span> s26
                      s12 ← s20 <span style="color: #339933;">+</span> s28
                      s14 ← s22 <span style="color: #339933;">+</span> s30
                      <span style="color: #00007f; font-weight: bold;">or</span> more compactly <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s12<span style="color: #339933;">,</span>s14<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s20<span style="color: #339933;">,</span>s22<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">,</span>s26<span style="color: #339933;">,</span>s28<span style="color: #339933;">,</span>s30<span style="color: #009900; font-weight: bold;">&#125;</span>
                      Vectorial<span style="color: #339933;">,</span> since Rdest <span style="color: #00007f; font-weight: bold;">and</span> Rsource2 are <span style="color: #00007f; font-weight: bold;">not</span> <span style="color: #00007f; font-weight: bold;">in</span> bank <span style="color: #ff0000;">0</span>
                   <span style="color: #339933;">*/</span>
vadd<span style="color: #339933;">.</span>f32 s10<span style="color: #339933;">,</span> s16<span style="color: #339933;">,</span> s24 <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s10<span style="color: #339933;">,</span>s12<span style="color: #339933;">,</span>s14<span style="color: #339933;">,</span>s8<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s20<span style="color: #339933;">,</span>s22<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s24<span style="color: #339933;">,</span>s26<span style="color: #339933;">,</span>s28<span style="color: #339933;">,</span>s30<span style="color: #009900; font-weight: bold;">&#125;</span><span style="color: #339933;">.</span>
                       Vectorial<span style="color: #339933;">,</span> but note the wraparound inside the bank after s14<span style="color: #339933;">.</span>
                     <span style="color: #339933;">*/</span>
vadd<span style="color: #339933;">.</span>f32 s8<span style="color: #339933;">,</span> s16<span style="color: #339933;">,</span> s3 <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s8<span style="color: #339933;">,</span>s10<span style="color: #339933;">,</span>s12<span style="color: #339933;">,</span>s14<span style="color: #009900; font-weight: bold;">&#125;</span> ← <span style="color: #009900; font-weight: bold;">&#123;</span>s16<span style="color: #339933;">,</span>s18<span style="color: #339933;">,</span>s20<span style="color: #339933;">,</span>s22<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s3<span style="color: #339933;">,</span>s3<span style="color: #339933;">,</span>s3<span style="color: #339933;">,</span>s3<span style="color: #009900; font-weight: bold;">&#125;</span>
                     Scalar expanded since Rsource2 is <span style="color: #00007f; font-weight: bold;">in</span> the bank <span style="color: #ff0000;">0</span>
                   <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<h2>Load and store</h2>
<p>
Once we have a rough idea of how to operate on floating point values in VFPv2, a question remains: how do we load and store floating point values from and to memory? VFPv2 provides several specific load/store instructions.
</p>
<p>
We load/store one single-precision floating point value using <code>vldr</code>/<code>vstr</code>. The base address of the load/store must already be in a general purpose register, although we can add an immediate offset in bytes, which must be a multiple of 4 (this applies to double-precision as well).
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vldr s1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r3<span style="color: #009900; font-weight: bold;">&#93;</span>         <span style="color: #339933;">/*</span> s1 ← <span style="color: #339933;">*</span>r3 <span style="color: #339933;">*/</span>
vldr s2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>     <span style="color: #339933;">/*</span> s2 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r3 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
vldr s3<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>     <span style="color: #339933;">/*</span> s3 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r3 <span style="color: #339933;">+</span> <span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
vldr s4<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">12</span><span style="color: #009900; font-weight: bold;">&#93;</span>    <span style="color: #339933;">/*</span> s4 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r3 <span style="color: #339933;">+</span> <span style="color: #ff0000;">12</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
&nbsp;
vstr s10<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r4 ← s10 <span style="color: #339933;">*/</span>
vstr s11<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>     <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> ← s11 <span style="color: #339933;">*/</span>
vstr s12<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>     <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#41;</span> ← s12 <span style="color: #339933;">*/</span>
vstr s13<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #339933;">,</span> #<span style="color: #ff0000;">12</span><span style="color: #009900; font-weight: bold;">&#93;</span>      <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">12</span><span style="color: #009900; font-weight: bold;">&#41;</span> ← s13 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
We can load/store several registers with a single instruction. In contrast to the load/store multiple instructions for general purpose registers, we cannot use an arbitrary set of registers: they must be consecutive.
</p>
<pre>
// Here precision can be s or d for single-precision and double-precision
// floating-point-register-set is {sFirst-sLast} for single-precision 
// and {dFirst-dLast} for double-precision
vldm indexing-mode precision Rbase{!}, floating-point-register-set
vstm indexing-mode precision Rbase{!}, floating-point-register-set
</pre>
<p>
The behaviour is similar to the indexing modes we saw in chapter 10. There is an Rbase register used as the base address of several loads/stores to/from floating point registers. There are only two indexing modes: increment after and decrement before. When using increment after, the address used to load/store each floating point register is increased after that load/store has happened, by 4 bytes for single-precision and by 8 bytes for double-precision. When using decrement before, the base address is first decremented by the total number of bytes that are going to be loaded/stored. Rbase is always updated in decrement before but it is optional to update it in increment after.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vldmias r4<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s3<span style="color: #339933;">-</span>s8<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span> s3 ← <span style="color: #339933;">*</span>r4
                       s4 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span>
                       s5 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#41;</span>
                       s6 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">12</span><span style="color: #009900; font-weight: bold;">&#41;</span>
                       s7 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span>
                       s8 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">20</span><span style="color: #009900; font-weight: bold;">&#41;</span>
                     <span style="color: #339933;">*/</span>
vldmias r4!<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s3<span style="color: #339933;">-</span>s8<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span> Like the previous instruction
                        but <span style="color: #0000ff; font-weight: bold;">at</span> the end r4 ← r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">24</span> 
                      <span style="color: #339933;">*/</span>
vstmdbs r5!<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s12<span style="color: #339933;">-</span>s13<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span>  <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r5 <span style="color: #339933;">-</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> <span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span> ← s12
                           <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r5 <span style="color: #339933;">-</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*</span> <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span> ← s13
                           r5 ← r5 <span style="color: #339933;">-</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span><span style="color: #ff0000;">2</span>
                       <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
For the usual stack operations, when we push several floating point registers onto the stack we will use <code>vstmdb</code> with <code>sp!</code> as the base register. To pop from the stack we will use <code>vldmia</code>, again with <code>sp!</code> as the base register. Given that these instruction names are hard to remember, we can use the mnemonics <code>vpush</code> and <code>vpop</code>, respectively.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vpush <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">-</span>s5<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">/*</span> Equivalent to vstmdb <span style="color: #46aa03; font-weight: bold;">sp</span>!<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">-</span>s5<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
vpop <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">-</span>s5<span style="color: #009900; font-weight: bold;">&#125;</span>  <span style="color: #339933;">/*</span> Equivalent to vldmia <span style="color: #46aa03; font-weight: bold;">sp</span>!<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>s0<span style="color: #339933;">-</span>s5<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<h2>Movements between registers</h2>
<p>
Another operation that may be required sometimes is moving values between registers. Similar to the <code>mov</code> instruction for general purpose registers there is the <code>vmov</code> instruction. Several movements are possible.
</p>
<p>We can move floating point values between two floating point registers of the same precision.</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vmov s2<span style="color: #339933;">,</span> s3  <span style="color: #339933;">/*</span> s2 ← s3 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
We can also move between one general purpose register and one single-precision register. Note that data is not converted: only bits are copied around, so be careful not to mix floating point values with integer instructions or the other way round.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vmov s2<span style="color: #339933;">,</span> r3  <span style="color: #339933;">/*</span> s2 ← r3 <span style="color: #339933;">*/</span>
vmov r4<span style="color: #339933;">,</span> s5  <span style="color: #339933;">/*</span> r4 ← s5 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
Like the previous case but between two general purpose registers and two consecutive single-precision registers.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vmov s2<span style="color: #339933;">,</span> s3<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r10</span> <span style="color: #339933;">/*</span> s2 ← r4
                        s3 ← <span style="color: #46aa03; font-weight: bold;">r10</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
Between two general purpose registers and one double-precision register. Again, note that data is not converted.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vmov d3<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r6  <span style="color: #339933;">/*</span> Lower32BitsOf<span style="color: #009900; font-weight: bold;">&#40;</span>d3<span style="color: #009900; font-weight: bold;">&#41;</span> ← r4
                    Higher32BitsOf<span style="color: #009900; font-weight: bold;">&#40;</span>d3<span style="color: #009900; font-weight: bold;">&#41;</span> ← r6
                 <span style="color: #339933;">*/</span>
vmov r5<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> d4 <span style="color: #339933;">/*</span> r5 ← Lower32BitsOf<span style="color: #009900; font-weight: bold;">&#40;</span>d4<span style="color: #009900; font-weight: bold;">&#41;</span>
                   r7 ← Higher32BitsOf<span style="color: #009900; font-weight: bold;">&#40;</span>d4<span style="color: #009900; font-weight: bold;">&#41;</span>
                 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<h2>Conversions</h2>
<p>
Sometimes we need to convert from an integer to a floating point value or the other way round. Note that some conversions may lose precision, in particular when a floating point value is converted to an integer. There is a single instruction <code>vcvt</code> with a suffix <code>.T.S</code> where <code>T</code> (target) and <code>S</code> (source) can be <code>u32</code>, <code>s32</code>, <code>f32</code> and <code>f64</code> (<code>S</code> must be different from <code>T</code> and at least one of them must be a floating point type).
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">vcvt<span style="color: #339933;">.</span>f64<span style="color: #339933;">.</span>f32 d0<span style="color: #339933;">,</span> s4  <span style="color: #339933;">/*</span> Converts s4 single<span style="color: #339933;">-</span>precision value 
                        to a double<span style="color: #339933;">-</span>precision value <span style="color: #00007f; font-weight: bold;">and</span> stores it <span style="color: #00007f; font-weight: bold;">in</span> d0 <span style="color: #339933;">*/</span>
vcvt<span style="color: #339933;">.</span>f32<span style="color: #339933;">.</span>f64 s4<span style="color: #339933;">,</span> d0  <span style="color: #339933;">/*</span> Converts d0 double<span style="color: #339933;">-</span>precision value 
                        to a single<span style="color: #339933;">-</span>precision value  <span style="color: #00007f; font-weight: bold;">and</span> stores it <span style="color: #00007f; font-weight: bold;">in</span> s4 <span style="color: #339933;">*/</span>
vcvt<span style="color: #339933;">.</span>f32<span style="color: #339933;">.</span>s32 s4<span style="color: #339933;">,</span> s5  <span style="color: #339933;">/*</span> Converts the signed integer value held <span style="color: #00007f; font-weight: bold;">in</span> s5 
                        to a single<span style="color: #339933;">-</span>precision value <span style="color: #00007f; font-weight: bold;">and</span> stores it <span style="color: #00007f; font-weight: bold;">in</span> s4 <span style="color: #339933;">*/</span>
vcvt<span style="color: #339933;">.</span>f32<span style="color: #339933;">.</span>u32 s4<span style="color: #339933;">,</span> s5  <span style="color: #339933;">/*</span> Converts the unsigned integer value held <span style="color: #00007f; font-weight: bold;">in</span> s5 
                        to a single<span style="color: #339933;">-</span>precision value <span style="color: #00007f; font-weight: bold;">and</span> stores it <span style="color: #00007f; font-weight: bold;">in</span> s4 <span style="color: #339933;">*/</span>
vcvt<span style="color: #339933;">.</span>f64<span style="color: #339933;">.</span>s32 d2<span style="color: #339933;">,</span> s5  <span style="color: #339933;">/*</span> Converts the signed integer value held <span style="color: #00007f; font-weight: bold;">in</span> s5 
                        to a double<span style="color: #339933;">-</span>precision value <span style="color: #00007f; font-weight: bold;">and</span> stores it <span style="color: #00007f; font-weight: bold;">in</span> d2 <span style="color: #339933;">*/</span>
vcvt<span style="color: #339933;">.</span>f64<span style="color: #339933;">.</span>u32 d2<span style="color: #339933;">,</span> s5  <span style="color: #339933;">/*</span> Converts the unsigned integer value held <span style="color: #00007f; font-weight: bold;">in</span> s5 
                        to a double<span style="color: #339933;">-</span>precision value <span style="color: #00007f; font-weight: bold;">and</span> stores it <span style="color: #00007f; font-weight: bold;">in</span> d2 <span style="color: #339933;">*/</span></pre></td></tr></table></div>
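
<p>
Note that <code>vcvt</code> only works on floating point registers: an integer that lives in a general purpose register has to be copied with <code>vmov</code> first, and the other way round for the result. The following is a minimal sketch of a round trip, where the registers <code>r3</code> and <code>s4</code> and the value 42 are arbitrary choices made just for this example.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">mov r3, #42            /* r3 ← 42 */
vmov s4, r3            /* s4 ← r3. Only bits are copied, s4 still holds an integer */
vcvt.f32.s32 s4, s4    /* s4 ← 42.0. Now s4 holds a single-precision value */
vadd.f32 s4, s4, s4    /* s4 ← s4 + s4, this is 84.0 */
vcvt.s32.f32 s4, s4    /* s4 ← 84 as a signed integer, rounding towards zero */
vmov r3, s4            /* r3 ← s4. Again only bits are copied, r3 is 84 */</pre></td></tr></table></div>
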

<h2>Modifying fpscr</h2>
<p>
The special register <code>fpscr</code>, where <code>len</code> and <code>stride</code> are set, cannot be modified directly. Instead we have to load <code>fpscr</code> into a general purpose register using the <code>vmrs</code> instruction. Then we operate on that register and move it back to <code>fpscr</code> using the <code>vmsr</code> instruction.
</p>
<p>
The value of <code>len</code> is stored in bits 16 to 18 of <code>fpscr</code>. The value of <code>len</code> is not stored directly in these bits. Instead, we have to subtract 1 before setting the bits. This is because <code>len</code> cannot be 0 (it does not make sense to operate on 0 floating point values). This way the value <code>000</code> in these bits means <code>len</code> = 1, <code>001</code> means <code>len</code> = 2, &#8230;, <code>111</code> means <code>len</code> = 8. The following code sets <code>len</code> to 8.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> Set the len field of fpscr to be <span style="color: #ff0000;">8</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff; font-weight: bold;">bits</span><span style="color: #339933;">:</span> <span style="color: #ff0000;">111</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> #<span style="color: #ff0000;">7</span>                            <span style="color: #339933;">/*</span> r5 ← <span style="color: #ff0000;">7</span><span style="color: #339933;">.</span> <span style="color: #ff0000;">7</span> is <span style="color: #ff0000;">111</span> <span style="color: #00007f; font-weight: bold;">in</span> binary <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">16</span>                   <span style="color: #339933;">/*</span> r5 ← r5 &lt;&lt; <span style="color: #ff0000;">16</span> <span style="color: #339933;">*/</span>
vmrs r4<span style="color: #339933;">,</span> fpscr                        <span style="color: #339933;">/*</span> r4 ← fpscr <span style="color: #339933;">*/</span>
orr r4<span style="color: #339933;">,</span> r4<span style="color: #339933;">,</span> r5                        <span style="color: #339933;">/*</span> r4 ← r4 | r5<span style="color: #339933;">.</span> Bitwise <span style="color: #00007f; font-weight: bold;">OR</span> <span style="color: #339933;">*/</span>
vmsr fpscr<span style="color: #339933;">,</span> r4                        <span style="color: #339933;">/*</span> fpscr ← r4 <span style="color: #339933;">*/</span></pre></td></tr></table></div>
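
<p>
A bitwise OR is enough for <code>len</code> = 8 because all three bits end up set. For any other value the old contents of the field should be cleared first, otherwise bits left over from a previous setting may survive. The following is a possible sketch, assuming we want <code>len</code> = 4 and that <code>r4</code> is free to use as a scratch register. It also restores <code>len</code> to 1 once we are done.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* Set the len field of fpscr to be 4 (bits: 011) whatever its previous value was */
vmrs r4, fpscr                        /* r4 ← fpscr */
bic r4, r4, #0x00070000               /* r4 ← r4 &amp; ~(7 &lt;&lt; 16). Clear bits 16 to 18 */
orr r4, r4, #0x00030000               /* r4 ← r4 | (3 &lt;&lt; 16). 3 is 011 in binary */
vmsr fpscr, r4                        /* fpscr ← r4 */

/* ... vectorial operations of length 4 would go here ... */

/* Set the len field of fpscr back to 1 (bits: 000) */
vmrs r4, fpscr                        /* r4 ← fpscr */
bic r4, r4, #0x00070000               /* r4 ← r4 &amp; ~(7 &lt;&lt; 16). Clear bits 16 to 18 */
vmsr fpscr, r4                        /* fpscr ← r4 */</pre></td></tr></table></div>
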

<p>
<code>stride</code> is stored in bits 20 to 21 of <code>fpscr</code>. Unlike <code>len</code>, only two encodings are valid: a value of <code>00</code> in these bits means <code>stride</code> = 1 and <code>11</code> means <code>stride</code> = 2. The values <code>01</code> and <code>10</code> must not be used.
</p>
<h2>Function call convention and floating-point registers</h2>
<p>
Since we have introduced new registers we should state how to use them when calling functions. The following rules apply for VFPv2 registers.
</p>
<ul>
<li>Fields <code>len</code> and <code>stride</code> of <code>fpscr</code> are zero at the entry of a function and must be zero when leaving it.
<li>We can pass floating point parameters using registers <code>s0</code>-<code>s15</code> and <code>d0</code>-<code>d7</code>. Note that passing a double-precision value after a single-precision one may involve discarding an odd-numbered single-precision register (for instance we can use <code>s0</code> and <code>d1</code>, but note that <code>s1</code> will be unused).
<li>All other floating point registers (<code>s16</code>-<code>s31</code> and <code>d8</code>-<code>d15</code>) must have their values preserved upon leaving the function. Instructions <code>vpush</code> and <code>vpop</code> can be used for that.
<li>If a function returns a floating-point value, the return register will be <code>s0</code> or <code>d0</code>.
</ul>
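<p>
As a sketch of these rules, a function that wants to use callee-saved registers could look like this. The function name <code>sum_of_squares</code> and the choice of <code>s16</code> and <code>s17</code> as temporaries are only assumptions made for illustration (using <code>s2</code> and <code>s3</code> instead would not require saving anything). It receives two single-precision values in <code>s0</code> and <code>s1</code>, returns its result in <code>s0</code> and does not touch <code>len</code> or <code>stride</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">sum_of_squares:            /* s0 ← s0*s0 + s1*s1 */
    vpush {s16-s17}        /* Keep the values the caller had in s16 and s17 */
    vmul.f32 s16, s0, s0   /* s16 ← s0 * s0 */
    vmul.f32 s17, s1, s1   /* s17 ← s1 * s1 */
    vadd.f32 s0, s16, s17  /* s0 ← s16 + s17. The result is returned in s0 */
    vpop {s16-s17}         /* Restore s16 and s17 before leaving */
    bx lr                  /* Return to the caller */</pre></td></tr></table></div>
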
<p>
Finally a note about variadic functions like <code>printf</code>: you cannot pass a single-precision floating point value to such functions. Only doubles can be passed, so you will need to convert single-precision values into double-precision values. Note also that the usual integer registers are used (<code>r0</code>-<code>r3</code>), so you will only be able to pass up to 2 double-precision values in registers; the remaining ones must be passed on the stack. In particular for <code>printf</code>, since <code>r0</code> contains the address of the format string, you will only be able to pass one double-precision value in registers, in <code>{r2,r3}</code>.
</p>
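<p>
The following small program is a sketch of how a single-precision value could be printed with <code>printf</code>. The labels <code>format</code>, <code>value</code>, <code>addr_of_format</code> and <code>addr_of_value</code>, as well as the number 3.5, are made up for this example.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">.data
format: .asciz "%f\n"
.balign 4
value: .float 3.5

.text
.global main
main:
    push {r4, lr}              /* Keep lr. r4 is pushed only to keep the stack 8-byte aligned */
    ldr r1, addr_of_value      /* r1 ← &amp;value */
    vldr s0, [r1]              /* s0 ← *r1 */
    vcvt.f64.f32 d1, s0        /* d1 ← s0 converted to double-precision */
    ldr r0, addr_of_format     /* r0 ← &amp;format. First parameter of printf */
    vmov r2, r3, d1            /* {r2,r3} ← d1. Only bits are copied */
    bl printf                  /* Call printf */
    mov r0, #0                 /* Return 0 from main */
    pop {r4, lr}
    bx lr

addr_of_format: .word format
addr_of_value: .word value</pre></td></tr></table></div>
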
<h2>Assembler</h2>
<p>
Make sure you pass the flag <code>-mfpu=vfpv2</code> to <code>as</code>, otherwise it will not recognize the VFPv2 instructions.
</p>
<h2>Colophon</h2>
<p>
You may want to check this official <a href="http://infocenter.arm.com/help/topic/com.arm.doc.qrc0007e/QRC0007_VFP.pdf">quick reference card of VFP</a>. Note that it also covers VFPv3, which is not available in the Raspberry Pi processor. Most of what is there has already been presented here although some minor details may have been omitted.
</p>
<p>
In the next chapter we will use these instructions in a full example.
</p>
<p>
That&#8217;s all for today.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/05/12/arm-assembler-raspberry-pi-chapter-13/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Capybara, pop up windows and the new PayPal sandbox</title>
		<link>http://thinkingeek.com/2013/04/27/capybara-pop-windows-paypal-sandbox/</link>
		<comments>http://thinkingeek.com/2013/04/27/capybara-pop-windows-paypal-sandbox/#comments</comments>
		<pubDate>Sat, 27 Apr 2013 10:55:46 +0000</pubDate>
		<dc:creator>brafales</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[capybara]]></category>
		<category><![CDATA[paypal]]></category>
		<category><![CDATA[ruby on rails]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=1002</guid>
		<description><![CDATA[These past weeks we have been doing a massive refactoring of our testing suite at work to get a nice CI server setup, proper factories, etc. Our tool-belt so far is basically a list of well-known Rails gems: Factory Girl for factories. RSpec as a testing framework (although we&#8217;ll switch back to Test::Unit [...]]]></description>
				<content:encoded><![CDATA[<p>These past weeks we have been doing a massive refactoring of our testing suite at work to get a nice CI server setup, proper factories, etc. Our tool-belt so far is basically a list of well-known Rails gems:</p>
<ul>
<li><span style="line-height: 16px;"><a href="https://github.com/thoughtbot/factory_girl" target="_blank">Factory Girl</a> for factories.</span></li>
<li>RSpec as a testing framework (although we&#8217;ll switch back to <a href="http://www.ruby-doc.org/stdlib-2.0/libdoc/test/unit/rdoc/Test/Unit.html" target="_blank">Test::Unit</a> soon).</li>
<li><a href="https://github.com/jnicklas/capybara" target="_blank">Capybara</a> for integration testing.</li>
</ul>
<p>For the CI server we decided to use a third party SaaS, as our dev team is small and we have neither the manpower nor the time to set it up ourselves. We went for <a href="https://circleci.com/" target="_blank">CircleCI</a>, which has given us good results so far: it is easy to set up (in fact it almost works out of the box), it has good integration with <a href="https://github.com/" target="_blank">GitHub</a>, it&#8217;s reasonably fast, and the guys are continuously improving it and are very receptive to clients&#8217; feedback.</p>
<p>Back to the post topic, when refactoring the integration tests, we discovered that PayPal decided recently to change the way their development sandbox works, and the tests we had in place broke because of it.</p>
<p>The basic workflow when having to test with PayPal involves a series of steps:</p>
<ul>
<li><span style="line-height: 16px;">Visit their sandbox page and log in with your testing credentials. This saves a cookie in the browser.</span></li>
<li>Go back to your test page and do the steps needed to perform a payment using PayPal.</li>
<li>Authenticate again with PayPal using your test buyer&#8217;s account and pay.</li>
<li>Catch the PayPal response and do whatever you need to finish your test.</li>
</ul>
<p>With the old PayPal sandbox, the login was pretty straightforward as you only needed to find the username and password fields in the login form of the sandbox page, fill them in, click the login button, and that was all. But with the new version it&#8217;s not that easy. The new sandbox has no login form at the main page. It has a login button which you have to click, then a popup window is shown with the login form. In there you have to input your credentials and click on the login button. Then this popup window does some server side magic, closes itself and triggers a reload on the main page, which will finally show you as logged in.</p>
<p>There&#8217;s probably a <code>POST</code> request that you can automatically do to simplify all this, but PayPal is not known as <em>developer documentation friendly</em> so I couldn&#8217;t find it. As a result, we had to modify our Capybara tests to handle this new scenario. As we&#8217;ve never worked with pop up windows before I thought it&#8217;d be nice to share how we did it in case you need to do something similar.</p>
<p>The basic workflow is as follows:</p>
<ul>
<li><span style="line-height: 16px;">Open the main PayPal sandbox window.</span></li>
<li>Click on the login button.</li>
<li>Find the new popup window.</li>
<li>Fill in the form in that new window.</li>
<li>Go back to your main window.</li>
<li>Continue with your usual testing.</li>
</ul>
<p>This assumes you are using the <code>Selenium</code> driver for Capybara. Here&#8217;s the code we used to get this done:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="ruby" style="font-family:monospace;">describe <span style="color:#996600;">&quot;a paypal express transaction&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:js</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#0000FF; font-weight:bold;">true</span> <span style="color:#9966CC; font-weight:bold;">do</span>
  it <span style="color:#996600;">&quot;should just work&quot;</span> <span style="color:#9966CC; font-weight:bold;">do</span>
    <span style="color:#008000; font-style:italic;"># Visit the PayPal sandbox url</span>
    visit <span style="color:#996600;">&quot;https://developer.paypal.com/&quot;</span>
&nbsp;
    <span style="color:#008000; font-style:italic;"># The link for the login button has no id...</span>
    find<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#ff3333; font-weight:bold;">:xpath</span>, <span style="color:#996600;">&quot;//a[contains(@class,'ppLogin_internal cleanslate scTrack:ppAccess-login ppAccessBtn')]&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">click</span>
&nbsp;
    <span style="color:#008000; font-style:italic;"># Here we have to use the driver to find the newly opened window using its name</span>
    <span style="color:#008000; font-style:italic;"># We also get the reference to the main window as later on we'll have to go back to it</span>
    login_window = page.<span style="color:#9900CC;">driver</span>.<span style="color:#9900CC;">find_window</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'PPA_identity_window'</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    main_window = page.<span style="color:#9900CC;">driver</span>.<span style="color:#9900CC;">find_window</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">''</span><span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
    <span style="color:#008000; font-style:italic;"># We use this to execute the next instructions in the popup window</span>
    page.<span style="color:#9900CC;">within_window</span><span style="color:#006600; font-weight:bold;">&#40;</span>login_window<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">do</span>
      <span style="color:#008000; font-style:italic;">#Normally fill in the form and log in</span>
      fill_in <span style="color:#996600;">'email'</span>, <span style="color:#ff3333; font-weight:bold;">:with</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;&lt;your paypal sandbox username&gt;&quot;</span>
      fill_in <span style="color:#996600;">'password'</span>, <span style="color:#ff3333; font-weight:bold;">:with</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> <span style="color:#996600;">&quot;&lt;your paypal sandbox password&gt;&quot;</span>
      click_button <span style="color:#996600;">'Log In'</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
    <span style="color:#008000; font-style:italic;">#More on this sleep later</span>
    <span style="color:#CC0066; font-weight:bold;">sleep</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006666;">30</span><span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
    <span style="color:#008000; font-style:italic;">#Switch back to the main window and do the rest of the test in it</span>
    page.<span style="color:#9900CC;">within_window</span><span style="color:#006600; font-weight:bold;">&#40;</span>main_window<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">do</span>
      <span style="color:#008000; font-style:italic;">#Here goes the rest of your test</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></td></tr></table></div>

<p>There is an important thing to note in the code above: the <code>sleep(30)</code> call. By now you have probably read in hundreds of places that using <code>sleep</code> is not good practice and that your tests should not rely on it. And that&#8217;s true. However, PayPal does a weird thing and this is the only way I found to make the tests pass. It turns out that after clicking the <em>Log In</em> button, the system does some behind-the-curtains magic, after which the popup window closes itself and triggers a reload of the main page. This reload makes things difficult. If you instruct Capybara to visit your page right after clicking the <em>Log In</em> button, you risk having the reload fire in between, and then your test will fail: the next selector you use will not be found because the browser will be on the PayPal sandbox page.</p>
<p>There are probably better and more elegant ways to get around this, such as adding code that re-triggers your original visit if it detects you are still on the PayPal page. Feel free to use the comments to suggest possible solutions to that particular problem.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/04/27/capybara-pop-windows-paypal-sandbox/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ARM assembler in Raspberry Pi – Chapter 12</title>
		<link>http://thinkingeek.com/2013/03/28/arm-assembler-raspberry-pi-chapter-12/</link>
		<comments>http://thinkingeek.com/2013/03/28/arm-assembler-raspberry-pi-chapter-12/#comments</comments>
		<pubDate>Thu, 28 Mar 2013 15:29:53 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>
		<category><![CDATA[arm]]></category>
		<category><![CDATA[assembler]]></category>
		<category><![CDATA[pi]]></category>
		<category><![CDATA[raspberry]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=823</guid>
		<description><![CDATA[We saw in chapter 6 some simple schemes to implement usual structured programming constructs like if-then-else and loops. In this chapter we will revisit these constructs and exploit a feature of the ARM instruction set that we have not learnt yet. Playing with loops The most generic form of loop is this one. while &#40;E&#41; [...]]]></description>
				<content:encoded><![CDATA[<p>
We saw in chapter 6 some simple schemes to implement usual structured programming constructs like if-then-else and loops. In this chapter we will revisit these constructs and exploit a feature of the ARM instruction set that we have not learnt yet.
</p>
<p><span id="more-823"></span></p>
<h2>Playing with loops</h2>
<p>
The most generic form of loop is this one.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span>
  S<span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
There are also two special forms, which are actually particular incarnations of the one shown above but are interesting as well.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i <span style="color: #339933;">=</span> lower<span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;=</span> upper<span style="color: #339933;">;</span> i <span style="color: #339933;">+=</span> step<span style="color: #009900;">&#41;</span>
  S<span style="color: #339933;">;</span></pre></td></tr></table></div>


<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">do</span> 
  S
<span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
Some languages, like Pascal, have constructs like this one.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="pascal" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">repeat</span>
  S
<span style="color: #000000; font-weight: bold;">until</span> E<span style="color: #000066;">;</span></pre></td></tr></table></div>

<p>
but this is like a <code>do S while (!E)</code>.
</p>
<p>
We can manipulate loops to get a form that may be more convenient. For instance.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;">   <span style="color: #b1b100;">do</span> 
     S
   <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* Can be rewritten as */</span>
&nbsp;
   S<span style="color: #339933;">;</span>
   <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span>
      S<span style="color: #339933;">;</span></pre></td></tr></table></div>


<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;">   <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span>
     S<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* Can be rewritten as */</span>
&nbsp;
   <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span>
   <span style="color: #009900;">&#123;</span>
      S<span style="color: #339933;">;</span>
      <span style="color: #b1b100;">do</span>
        S
      <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
   <span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>
The last manipulation is interesting, because we can avoid the <code>if-then</code> if we directly go to the <code>while</code> part.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">/* This is not valid C */</span>
<span style="color: #b1b100;">goto</span> check<span style="color: #339933;">;</span>
<span style="color: #b1b100;">do</span>
  S
check<span style="color: #339933;">:</span> <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
In valid C, the above transformation would be written as follows.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">goto</span> check<span style="color: #339933;">;</span>
loop<span style="color: #339933;">:</span>
  S<span style="color: #339933;">;</span>
check<span style="color: #339933;">:</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span> <span style="color: #b1b100;">goto</span> loop<span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
This looks much uglier than abusing C syntax a bit.
</p>
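<p>
As a quick sanity check of the transformation above, here is a minimal sketch in C. The function names <code>sum_while</code> and <code>sum_goto</code> are made up for this illustration: the first uses a plain <code>while</code>, the second enters the loop through the check, as in the valid C version above. Both should compute the same value, including when the body never runs.
</p>

```c
/* Sums 1..n with a plain while loop: while (E) S; */
int sum_while(int n)
{
    int i = 1, total = 0;
    while (i <= n) {
        total += i;
        i++;
    }
    return total;
}

/* Same loop, but entered through the check:
   goto check; loop: S; check: if (E) goto loop; */
int sum_goto(int n)
{
    int i = 1, total = 0;
    goto check;
loop:
    total += i;
    i++;
check:
    if (i <= n) goto loop;
    return total;
}
```

<p>
The second form has a single conditional branch at the bottom of the loop, which is the shape we will want when writing loops in assembler.
</p>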
<h2>The -s suffix</h2>
<p>
So far, when checking the condition of an <code>if</code> or <code>while</code>, we have evaluated the condition and then used the <code>cmp</code> instruction to update <code>cpsr</code>. Updating <code>cpsr</code> is mandatory for our condition codes, whether we use branching or predication. But <code>cmp</code> is not the only way to update <code>cpsr</code>. In fact, many instructions can update it.
</p>
<p>
By default an instruction does not update <code>cpsr</code> unless we append the suffix <code>-s</code>. So instead of the instruction <code>add</code> or <code>sub</code> we write <code>adds</code> or <code>subs</code>. The result of the instruction (what would be stored in the destination register) is used to update <code>cpsr</code>.
</p>
<p>
How can we use this? Well, consider this simple loop counting backwards.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> for <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #00007f; font-weight: bold;">int</span> i = <span style="color: #ff0000;">100</span> <span style="color: #666666; font-style: italic;">; i &gt;= 0; i--) */</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">100</span>
<span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">:</span>
  <span style="color: #339933;">/*</span> <span style="color: #0000ff; font-weight: bold;">do</span> something <span style="color: #339933;">*/</span>
  <span style="color: #00007f; font-weight: bold;">sub</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>      <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">-</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
  <span style="color: #00007f; font-weight: bold;">cmp</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>          <span style="color: #339933;">/*</span> update cpsr with r1 <span style="color: #339933;">-</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
  bge <span style="color: #00007f; font-weight: bold;">loop</span>            <span style="color: #339933;">/*</span> branch if r1 &gt;= <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
If we replace <code>sub</code> by <code>subs</code> then <code>cpsr</code> will be updated with the result of the subtraction. This means that the flags N, Z, C and V will be updated, so we can use a branch right after <code>subs</code>. In our case we want to jump back to loop only if <code>i &gt;= 0</code>, that is, when the result is non-negative. We can use <code>bpl</code> to achieve this.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> for <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #00007f; font-weight: bold;">int</span> i = <span style="color: #ff0000;">100</span> <span style="color: #666666; font-style: italic;">; i &gt;= 0; i--) */</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">100</span>
<span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">:</span>
  <span style="color: #339933;">/*</span> <span style="color: #0000ff; font-weight: bold;">do</span> something <span style="color: #339933;">*/</span>
  subs r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>      <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">-</span> <span style="color: #ff0000;">1</span>  <span style="color: #00007f; font-weight: bold;">and</span> update cpsr with the final r1 <span style="color: #339933;">*/</span>
  <span style="color: #46aa03; font-weight: bold;">bpl</span> <span style="color: #00007f; font-weight: bold;">loop</span>             <span style="color: #339933;">/*</span> branch if the previous <span style="color: #00007f; font-weight: bold;">sub</span> computed a non-negative number <span style="color: #009900; font-weight: bold;">&#40;</span>N flag <span style="color: #00007f; font-weight: bold;">in</span> cpsr is <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
It is a bit tricky to get these things right (this is why we use compilers). For instance this similar, but not identical, loop would use <code>bne</code> instead of <code>bpl</code>. Here the condition is <code>ne</code> (not equal). It would be nice to have an alias like <code>nz</code> (not zero) but, unfortunately, this does not exist in ARM.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> for <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #00007f; font-weight: bold;">int</span> i = <span style="color: #ff0000;">100</span> <span style="color: #666666; font-style: italic;">; i &gt; 0; i--). Note here i &gt; 0, not i &gt;= 0 as in the example above */</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">100</span>
<span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">:</span>
  <span style="color: #339933;">/*</span> <span style="color: #0000ff; font-weight: bold;">do</span> something <span style="color: #339933;">*/</span>
  subs r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>      <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">-</span> <span style="color: #ff0000;">1</span>  <span style="color: #00007f; font-weight: bold;">and</span> update cpsr with the final r1 <span style="color: #339933;">*/</span>
  bne <span style="color: #00007f; font-weight: bold;">loop</span>             <span style="color: #339933;">/*</span> branch if the previous <span style="color: #00007f; font-weight: bold;">sub</span> computed a number that is <span style="color: #00007f; font-weight: bold;">not</span> zero <span style="color: #009900; font-weight: bold;">&#40;</span>Z flag <span style="color: #00007f; font-weight: bold;">in</span> cpsr is <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>
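<p>
To see the difference between the two loops, the following sketch models them in C. It is only an illustration: <code>struct flags</code>, <code>subs_model</code>, <code>count_bpl</code> and <code>count_bne</code> are hypothetical names, and only the N and Z flags are modelled. Each function counts how many times the loop body would run.
</p>

```c
#include <stdint.h>

/* Hypothetical model of the two cpsr flags used by these loops. */
struct flags { int n; int z; };

/* Models "subs rd, rn, #imm": returns the result and sets N and Z from it. */
static int32_t subs_model(int32_t rn, int32_t imm, struct flags *f)
{
    int32_t rd = (int32_t)((uint32_t)rn - (uint32_t)imm);
    f->n = rd < 0;   /* N: the result is negative */
    f->z = rd == 0;  /* Z: the result is zero */
    return rd;
}

/* Loop closed by "bpl loop": repeat while N is 0. */
int count_bpl(int32_t start)
{
    struct flags f;
    int32_t r1 = start;
    int iterations = 0;
    do {
        iterations++;                 /* do something */
        r1 = subs_model(r1, 1, &f);   /* subs r1, r1, #1 */
    } while (f.n == 0);               /* bpl loop */
    return iterations;
}

/* Loop closed by "bne loop": repeat while Z is 0. */
int count_bne(int32_t start)
{
    struct flags f;
    int32_t r1 = start;
    int iterations = 0;
    do {
        iterations++;                 /* do something */
        r1 = subs_model(r1, 1, &f);   /* subs r1, r1, #1 */
    } while (f.z == 0);               /* bne loop */
    return iterations;
}
```

<p>
Starting from 100, the <code>bpl</code> version runs the body 101 times (for i from 100 down to 0) while the <code>bne</code> version runs it 100 times (for i from 100 down to 1).
</p>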

<p>
A rule of thumb: the -s suffix is worth considering in code of the following form.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;">s <span style="color: #339933;">=</span> ...
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>s @ <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>
where <code>@</code> means any comparison against 0 (equal, different, lower, etc.).
</p>
<h2>Operating on 64-bit numbers</h2>
<p>
As an example of using the -s suffix we will implement three 64-bit integer operations in ARM: addition, subtraction and multiplication. Remember that ARM is a 32-bit architecture, so everything is 32-bit minded. If we only use 32-bit numbers, this is not a problem, but if for some reason we need 64-bit numbers things get a bit more complicated. We will represent a 64-bit number as two 32-bit numbers, the lower and the higher part. This way a 64-bit number n represented using two 32-bit parts, n<sub>lower</sub> and n<sub>higher</sub>, will have the value n = 2<sup>32</sup> × n<sub>higher</sub> + n<sub>lower</sub>.
</p>
<p>
We will, obviously, need to keep the two 32-bit parts somewhere. When keeping them in registers, we will use two consecutive registers (e.g. r1 and r2, which we will write as <code>{r1,r2}</code>) and we will keep the higher part in the higher numbered register. When keeping a 64-bit number in memory, we will store the two parts in two consecutive addresses, with the lower part in the lower address. The address will be 8-byte aligned.
</p>
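<p>
The representation can be sketched in C as follows. The helper names <code>lower_part</code>, <code>higher_part</code> and <code>join_parts</code> are invented for this example. They simply split a 64-bit value into its two 32-bit parts and rebuild it using n = 2<sup>32</sup> × n<sub>higher</sub> + n<sub>lower</sub>.
</p>

```c
#include <stdint.h>

/* Lower 32 bits of a 64-bit number. */
uint32_t lower_part(uint64_t n)
{
    return (uint32_t)(n & 0xFFFFFFFFu);
}

/* Higher 32 bits of a 64-bit number. */
uint32_t higher_part(uint64_t n)
{
    return (uint32_t)(n >> 32);
}

/* Rebuilds n = 2^32 * higher + lower. */
uint64_t join_parts(uint32_t higher, uint32_t lower)
{
    return ((uint64_t)higher << 32) | lower;
}
```

<p>
For instance, 12345678901 splits into a higher part of 2 and a lower part of 3755744309.
</p>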
<h3>Addition</h3>
<p>
Adding two 64-bit numbers using 32-bit operands means adding first the lower part and then adding the higher parts but taking into account a possible carry from the lower part. With our current knowledge we could write something like this (assume the first number is in <code>{r2,r3}</code>, the second in <code>{r4,r5}</code> and the result will be in <code>{r0,r1}</code>).
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> r5      <span style="color: #339933;">/*</span> First we <span style="color: #00007f; font-weight: bold;">add</span> the higher part <span style="color: #339933;">*/</span>
                    <span style="color: #339933;">/*</span> r1 ← r3 <span style="color: #339933;">+</span> r5 <span style="color: #339933;">*/</span>
adds r0<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r4     <span style="color: #339933;">/*</span> Now we <span style="color: #00007f; font-weight: bold;">add</span> the lower part <span style="color: #00007f; font-weight: bold;">and</span> we update cpsr <span style="color: #339933;">*/</span>
                    <span style="color: #339933;">/*</span> r0 ← r2 <span style="color: #339933;">+</span> r4 <span style="color: #339933;">*/</span>
<span style="color: #adadad; font-style: italic;">addc</span>s r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>    <span style="color: #339933;">/*</span> If adding the lower part caused carry<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #ff0000;">1</span> to the higher part <span style="color: #339933;">*/</span>
                    <span style="color: #339933;">/*</span> if C = <span style="color: #ff0000;">1</span> then r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
                    <span style="color: #339933;">/*</span> Note that here the suffix <span style="color: #339933;">-</span>s is <span style="color: #00007f; font-weight: bold;">not</span> applied<span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #46aa03; font-weight: bold;">cs</span> means carry set <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
This would work. Fortunately ARM provides an instruction, <code>adc</code>, which adds two numbers and the carry flag. So we could rewrite the above code with just two instructions.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">adds r0<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r4     <span style="color: #339933;">/*</span> First <span style="color: #00007f; font-weight: bold;">add</span> the lower part <span style="color: #00007f; font-weight: bold;">and</span> update cpsr <span style="color: #339933;">*/</span>
                    <span style="color: #339933;">/*</span> r0 ← r2 <span style="color: #339933;">+</span> r4 <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">adc</span> r1<span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> r5      <span style="color: #339933;">/*</span> Now <span style="color: #00007f; font-weight: bold;">add</span> the higher part plus the carry from the lower one <span style="color: #339933;">*/</span>
                    <span style="color: #339933;">/*</span> r1 ← r3 <span style="color: #339933;">+</span> r5 <span style="color: #339933;">+</span> C <span style="color: #339933;">*/</span></pre></td></tr></table></div>
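<p>
The same two steps can be modelled in C with 32-bit halves. This is only a sketch: <code>struct u64_parts</code> and <code>add64</code> are names made up for the example, and the carry is recomputed by hand because C has no direct access to the C flag.
</p>

```c
#include <stdint.h>

/* A 64-bit number kept as two 32-bit parts (hypothetical helper type). */
struct u64_parts { uint32_t lower; uint32_t higher; };

struct u64_parts add64(struct u64_parts a, struct u64_parts b)
{
    struct u64_parts r;
    uint32_t carry;

    r.lower = a.lower + b.lower;             /* adds r0, r2, r4 */
    carry = r.lower < a.lower;               /* C is 1 if the 32-bit sum wrapped around */
    r.higher = a.higher + b.higher + carry;  /* adc r1, r3, r5 */
    return r;
}
```

<p>
Adding 1 to 2<sup>32</sup> - 1 (lower part <code>0xFFFFFFFF</code>, higher part 0) wraps the lower part to 0 and carries 1 into the higher part.
</p>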

<h3>Subtraction</h3>
<p>
Subtracting two numbers is similar to adding them. In ARM, when subtracting two numbers using <code>subs</code>, if we need to borrow (because the second operand is larger than the first) then C will be disabled (C will be 0). If we do not need to borrow, C will be enabled (C will be 1). This is a bit surprising but consistent with the rest of the architecture (check conditions CS/HS and CC/LO in chapter 5). Similar to <code>adc</code> there is <code>sbc</code>, which performs a normal subtraction if C is 1. Otherwise it subtracts one more. Again, this is consistent with how C works in the <code>subs</code> instruction.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">subs r0<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r4     <span style="color: #339933;">/*</span> First subtract the lower part <span style="color: #00007f; font-weight: bold;">and</span> update cpsr <span style="color: #339933;">*/</span>
                    <span style="color: #339933;">/*</span> r0 ← r2 <span style="color: #339933;">-</span> r4 <span style="color: #339933;">*/</span>
sbc r1<span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> r5      <span style="color: #339933;">/*</span> Now subtract the higher part <span style="color: #00007f; font-weight: bold;">and</span> the <span style="color: #00007f; font-weight: bold;">NOT</span> of the carry from the lower one <span style="color: #339933;">*/</span>
                    <span style="color: #339933;">/*</span> r1 ← r3 <span style="color: #339933;">-</span> r5 <span style="color: #339933;">-</span> ~C <span style="color: #339933;">*/</span></pre></td></tr></table></div>
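<p>
Here is a C sketch of the same idea, again with invented names (<code>struct u64_parts</code>, <code>sub64</code>). The variable <code>c</code> plays the role of the C flag after <code>subs</code>: it is 1 when no borrow was needed and 0 otherwise, so the higher part subtracts the NOT of it.
</p>

```c
#include <stdint.h>

/* A 64-bit number kept as two 32-bit parts (hypothetical helper type). */
struct u64_parts { uint32_t lower; uint32_t higher; };

struct u64_parts sub64(struct u64_parts a, struct u64_parts b)
{
    struct u64_parts r;
    uint32_t c;

    r.lower = a.lower - b.lower;                 /* subs r0, r2, r4 */
    c = a.lower >= b.lower;                      /* C is 1 when there is no borrow */
    r.higher = a.higher - b.higher - (1u - c);   /* sbc r1, r3, r5 */
    return r;
}
```

<p>
Subtracting 1 from 2<sup>32</sup> (lower part 0, higher part 1) needs a borrow, so the result has lower part <code>0xFFFFFFFF</code> and higher part 0.
</p>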

<h3>Multiplication</h3>
<p>
Multiplying two 64-bit numbers is a tricky thing. When we multiply two N-bit numbers the result may need up to 2×N bits. So when multiplying two 64-bit numbers we may need a 128-bit number. For the sake of simplicity we will assume that this does not happen and that 64 bits will be enough. Our 64-bit numbers are two 32-bit integers, so a 64-bit x is actually x = 2<sup>32</sup> × x<sub>1</sub> + x<sub>0</sub>, where x<sub>1</sub> and x<sub>0</sub> are two 32-bit numbers. Similarly another 64-bit number y would be y = 2<sup>32</sup> × y<sub>1</sub> + y<sub>0</sub>. Multiplying x and y yields z where z = 2<sup>64</sup> × x<sub>1</sub> × y<sub>1</sub> + 2<sup>32</sup> × (x<sub>0</sub> × y<sub>1</sub> + x<sub>1</sub> × y<sub>0</sub>) + x<sub>0</sub> × y<sub>0</sub>. Now our problem is multiplying each x<sub>i</sub> by each y<sub>j</sub>, but again we may need 64 bits to represent each of these values.
</p>
<p>
ARM provides a bunch of different instructions for multiplication. Today we will see just three of them. If we are multiplying 32-bit numbers and we do not care about the result not fitting in a 32-bit number we can use <code>mul Rd, Rsource1, Rsource2</code>. Unfortunately it does not set any flag in the <code>cpsr</code> useful for detecting an overflow of the multiplication (i.e. when the result does not fit in the 32-bit range). This instruction is the fastest one of the three. If we do want the full 64-bit result of the multiplication, we have two other instructions, <code>smull</code> and <code>umull</code>. The former is used when we multiply two numbers in two&#8217;s complement, the latter when we represent unsigned values. Their syntax is <code>{s,u}mull RdestLower, RdestHigher, Rsource1, Rsource2</code>. The lower part of the 64-bit result is kept in the register <code>RdestLower</code> and the higher part in the register <code>RdestHigher</code>.
</p>
<p>
In this example we have to use <code>umull</code>, otherwise the 32-bit lower parts might end up being interpreted as negative numbers, giving negative intermediate values. That said, we can now multiply x<sub>0</sub> and y<sub>0</sub>. Recall that we have the two 64-bit numbers in the <code>r2,r3</code> and <code>r4,r5</code> pairs of registers. So first multiply <code>r2</code> and <code>r4</code>. Note the usage of <code>r0</code> since this will be its final value. In contrast, register <code>r6</code> will be used later.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">umull r0<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r4</pre></td></tr></table></div>

<p>
Now let&#8217;s multiply x<sub>1</sub> by y<sub>0</sub> and x<sub>0</sub> by y<sub>1</sub>. This is <code>r3</code> by <code>r4</code> and <code>r2</code> by <code>r5</code>. Note how we overwrite <code>r4</code> and <code>r5</code> in the second multiplication. This is fine since we will not need them anymore.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">umull r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> r4
umull r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r5</pre></td></tr></table></div>

<p>
There is no need to multiply x<sub>1</sub> by y<sub>1</sub> because if it gives a nonzero value, it will always overflow a 64-bit number. This means that if both <code>r3</code> and <code>r5</code> are nonzero, the result will never fit in 64 bits. This is a sufficient condition for overflow, but not a necessary one. The number might also overflow when adding the intermediate values that will result in <code>r1</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">adds r2<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> r4
<span style="color: #00007f; font-weight: bold;">adc</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r6</pre></td></tr></table></div>
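<p>
The whole scheme can be checked against a native 64-bit multiplication with a small C model. This is a sketch under the same assumption as the text (the result fits in 64 bits), and the names <code>umull_model</code> and <code>mult64_model</code> are made up for the example. The higher part is computed modulo 2<sup>32</sup>, so anything that would overflow past bit 63 is simply dropped.
</p>

```c
#include <stdint.h>

/* Models "umull lo, hi, a, b": full 64-bit product of two unsigned 32-bit values. */
static void umull_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * b;
    *lo = (uint32_t)p;
    *hi = (uint32_t)(p >> 32);
}

/* Multiplies x by y using only 32-bit parts, keeping the low 64 bits of the result. */
uint64_t mult64_model(uint64_t x, uint64_t y)
{
    uint32_t x0 = (uint32_t)x, x1 = (uint32_t)(x >> 32);
    uint32_t y0 = (uint32_t)y, y1 = (uint32_t)(y >> 32);
    uint32_t z0, z1, hi00, lo10, hi10, lo01, hi01;

    umull_model(&z0, &hi00, x0, y0);    /* x0 * y0: z0 is the final lower part */
    umull_model(&lo10, &hi10, x1, y0);  /* x1 * y0: only its lower part is needed */
    umull_model(&lo01, &hi01, x0, y1);  /* x0 * y1: only its lower part is needed */
    (void)hi10;                         /* these higher parts would only matter */
    (void)hi01;                         /* for a result wider than 64 bits */

    z1 = lo10 + lo01 + hi00;            /* higher part, modulo 2^32 */
    return ((uint64_t)z1 << 32) | z0;
}
```

<p>
Calling <code>mult64_model(12345678901ULL, 12345678ULL)</code> should give the same value as the native expression <code>12345678901ULL * 12345678ULL</code>.
</p>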

<p>
Let&#8217;s package this code in a nice function in a program to see if it works. We will multiply numbers 12345678901 (this is 2×2<sup>32</sup> + 3755744309) and 12345678 and print the result.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> mult64<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.data</span>
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">align</span> <span style="color: #ff0000;">4</span>
message <span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;Multiplication of %lld by %lld is %lld\n&quot;</span>
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">align</span> <span style="color: #ff0000;">8</span>
number_a_low<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">3755744309</span>
number_a_high<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">2</span>
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">align</span> <span style="color: #ff0000;">8</span>
number_b_low<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">12345678</span>
number_b_high<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">0</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
<span style="color: #339933;">/*</span> Note<span style="color: #339933;">:</span> This is <span style="color: #00007f; font-weight: bold;">not</span> the most efficient way to do a <span style="color: #ff0000;">64</span><span style="color: #339933;">-</span>bit multiplication<span style="color: #339933;">.</span>
   This is for illustration purposes <span style="color: #339933;">*/</span>
mult64<span style="color: #339933;">:</span>
   <span style="color: #339933;">/*</span> The arguments will be passed <span style="color: #00007f; font-weight: bold;">in</span> r0<span style="color: #339933;">,</span> r1 <span style="color: #00007f; font-weight: bold;">and</span> r2<span style="color: #339933;">,</span> r3 <span style="color: #00007f; font-weight: bold;">and</span> the result returned <span style="color: #00007f; font-weight: bold;">in</span> r0<span style="color: #339933;">,</span> r1 <span style="color: #339933;">*/</span>
   <span style="color: #339933;">/*</span> Keep the registers that we are going to <span style="color: #0000ff; font-weight: bold;">write</span> <span style="color: #339933;">*/</span>
   <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>
   <span style="color: #339933;">/*</span> For convenience<span style="color: #339933;">,</span> move <span style="color: #009900; font-weight: bold;">&#123;</span>r0<span style="color: #339933;">,</span>r1<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #00007f; font-weight: bold;">into</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span>r5<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
   <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r0   <span style="color: #339933;">/*</span> r4 ← r0 <span style="color: #339933;">*/</span>
   <span style="color: #00007f; font-weight: bold;">mov</span> r5<span style="color: #339933;">,</span> r1   <span style="color: #339933;">/*</span> r5 ← r1 <span style="color: #339933;">*/</span>
&nbsp;
   umull r0<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r4    <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r0<span style="color: #339933;">,</span>r6<span style="color: #009900; font-weight: bold;">&#125;</span> ← r2 <span style="color: #339933;">*</span> r4 <span style="color: #339933;">*/</span>
   umull r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> r4    <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r7<span style="color: #339933;">,</span><span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #009900; font-weight: bold;">&#125;</span> ← r3 <span style="color: #339933;">*</span> r4 <span style="color: #339933;">*/</span>
   umull r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r5    <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span>r5<span style="color: #009900; font-weight: bold;">&#125;</span> ← r2 <span style="color: #339933;">*</span> r5 <span style="color: #339933;">*/</span>
   <span style="color: #00007f; font-weight: bold;">add</span> r2<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> r4          <span style="color: #339933;">/*</span> r2 ← r7 <span style="color: #339933;">+</span> r4<span style="color: #339933;">.</span> A carry here would belong to bit <span style="color: #ff0000;">64</span><span style="color: #339933;">,</span> so it is dropped <span style="color: #339933;">*/</span>
   <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r6          <span style="color: #339933;">/*</span> r1 ← r2 <span style="color: #339933;">+</span> r6 <span style="color: #339933;">*/</span>
&nbsp;
   <span style="color: #339933;">/*</span> Restore registers <span style="color: #339933;">*/</span>
   <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>
   <span style="color: #46aa03; font-weight: bold;">bx</span> lr                   <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> mult64 <span style="color: #339933;">*/</span>
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>       <span style="color: #339933;">/*</span> Keep the registers we are going to modify <span style="color: #339933;">*/</span>
                                        <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">r8</span> is <span style="color: #00007f; font-weight: bold;">not</span> actually used here<span style="color: #339933;">,</span> but this way 
                                           the <span style="color: #0000ff; font-weight: bold;">stack</span> is already <span style="color: #ff0000;">8</span><span style="color: #339933;">-</span><span style="color: #0000ff; font-weight: bold;">byte</span> aligned <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> Load the numbers from memory <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span>r5<span style="color: #009900; font-weight: bold;">&#125;</span> ← a <span style="color: #339933;">*/</span>
    ldr r4<span style="color: #339933;">,</span> addr_number_a_low       <span style="color: #339933;">/*</span> r4 ← &amp;a_low <span style="color: #339933;">*/</span>
    ldr r4<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span>                    <span style="color: #339933;">/*</span> r4 ← <span style="color: #339933;">*</span>r4 <span style="color: #339933;">*/</span>
    ldr r5<span style="color: #339933;">,</span> addr_number_a_high      <span style="color: #339933;">/*</span> r5 ← &amp;a_high  <span style="color: #339933;">*/</span>
    ldr r5<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r5<span style="color: #009900; font-weight: bold;">&#93;</span>                    <span style="color: #339933;">/*</span> r5 ← <span style="color: #339933;">*</span>r5 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r6<span style="color: #339933;">,</span>r7<span style="color: #009900; font-weight: bold;">&#125;</span> ← b <span style="color: #339933;">*/</span>
    ldr r6<span style="color: #339933;">,</span> addr_number_b_low       <span style="color: #339933;">/*</span> r6 ← &amp;b_low  <span style="color: #339933;">*/</span>
    ldr r6<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r6<span style="color: #009900; font-weight: bold;">&#93;</span>                    <span style="color: #339933;">/*</span> r6 ← <span style="color: #339933;">*</span>r6 <span style="color: #339933;">*/</span>
    ldr r7<span style="color: #339933;">,</span> addr_number_b_high      <span style="color: #339933;">/*</span> r7 ← &amp;b_high  <span style="color: #339933;">*/</span>
    ldr r7<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r7<span style="color: #009900; font-weight: bold;">&#93;</span>                    <span style="color: #339933;">/*</span> r7 ← <span style="color: #339933;">*</span>r7 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> Now prepare the <span style="color: #00007f; font-weight: bold;">call</span> to mult64 <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> 
       The first number is passed <span style="color: #00007f; font-weight: bold;">in</span> 
       registers <span style="color: #009900; font-weight: bold;">&#123;</span>r0<span style="color: #339933;">,</span>r1<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #00007f; font-weight: bold;">and</span> the second one <span style="color: #00007f; font-weight: bold;">in</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r2<span style="color: #339933;">,</span>r3<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> r4                  <span style="color: #339933;">/*</span> r0 ← r4 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r5                  <span style="color: #339933;">/*</span> r1 ← r5 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> r6                  <span style="color: #339933;">/*</span> r2 ← r6 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r3<span style="color: #339933;">,</span> r7                  <span style="color: #339933;">/*</span> r3 ← r7 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #46aa03; font-weight: bold;">bl</span> mult64                  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> mult64 function <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> The result of the multiplication is <span style="color: #00007f; font-weight: bold;">in</span> r0<span style="color: #339933;">,</span>r1 <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #339933;">/*</span> Now prepare the <span style="color: #00007f; font-weight: bold;">call</span> to printf <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> We have to pass &amp;message<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span>r5<span style="color: #009900; font-weight: bold;">&#125;</span><span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r6<span style="color: #339933;">,</span>r7<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r0<span style="color: #339933;">,</span>r1<span style="color: #009900; font-weight: bold;">&#125;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r1<span style="color: #009900; font-weight: bold;">&#125;</span>                   <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> r1 onto the <span style="color: #0000ff; font-weight: bold;">stack</span><span style="color: #339933;">.</span> 4th <span style="color: #009900; font-weight: bold;">&#40;</span>higher<span style="color: #009900; font-weight: bold;">&#41;</span> parameter <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r0<span style="color: #009900; font-weight: bold;">&#125;</span>                   <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> r0 onto the <span style="color: #0000ff; font-weight: bold;">stack</span><span style="color: #339933;">.</span> 4th <span style="color: #009900; font-weight: bold;">&#40;</span>lower<span style="color: #009900; font-weight: bold;">&#41;</span> parameter <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r7<span style="color: #009900; font-weight: bold;">&#125;</span>                   <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> r7 onto the <span style="color: #0000ff; font-weight: bold;">stack</span><span style="color: #339933;">.</span> 3rd <span style="color: #009900; font-weight: bold;">&#40;</span>higher<span style="color: #009900; font-weight: bold;">&#41;</span> parameter <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r6<span style="color: #009900; font-weight: bold;">&#125;</span>                   <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> r6 onto the <span style="color: #0000ff; font-weight: bold;">stack</span><span style="color: #339933;">.</span> 3rd <span style="color: #009900; font-weight: bold;">&#40;</span>lower<span style="color: #009900; font-weight: bold;">&#41;</span> parameter <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r3<span style="color: #339933;">,</span> r5                  <span style="color: #339933;">/*</span> r3 ← r5<span style="color: #339933;">.</span>                2nd <span style="color: #009900; font-weight: bold;">&#40;</span>higher<span style="color: #009900; font-weight: bold;">&#41;</span> parameter <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> r4                  <span style="color: #339933;">/*</span> r2 ← r4<span style="color: #339933;">.</span>                2nd <span style="color: #009900; font-weight: bold;">&#40;</span>lower<span style="color: #009900; font-weight: bold;">&#41;</span> parameter <span style="color: #339933;">*/</span>
    ldr r0<span style="color: #339933;">,</span> addr_of_message     <span style="color: #339933;">/*</span> r0 ← &amp;message           1st parameter <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf                   <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Call</span> printf <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">16</span>             <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">sp</span> ← <span style="color: #46aa03; font-weight: bold;">sp</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">16</span> <span style="color: #339933;">*/</span>
                                <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Pop</span> the four registers we pushed above <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                  <span style="color: #339933;">/*</span> r0 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> r5<span style="color: #339933;">,</span> r6<span style="color: #339933;">,</span> r7<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">r8</span><span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>        <span style="color: #339933;">/*</span> Restore the registers we kept <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr                       <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> main <span style="color: #339933;">*/</span>
&nbsp;
addr_of_message <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message
addr_number_a_low<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> number_a_low
addr_number_a_high<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> number_a_high
addr_number_b_low<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> number_b_low
addr_number_b_high<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> number_b_high</pre></td></tr></table></div>

<p>
Observe first that we keep a separate address for the lower and the upper part of each number. Instead, we could load the upper part by just using an offset from the address of the lower part, as we saw in chapter 8. So, in lines 41 to 44 we could have done the following.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>40
41
42
43
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">    <span style="color: #339933;">/*</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span>r5<span style="color: #009900; font-weight: bold;">&#125;</span> ← a <span style="color: #339933;">*/</span>
    ldr r4<span style="color: #339933;">,</span> addr_number_a_low       <span style="color: #339933;">/*</span> r4 ← &amp;a_low <span style="color: #339933;">*/</span>
    ldr r5<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #339933;">,</span> <span style="color: #339933;">+</span>#<span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>               <span style="color: #339933;">/*</span> r5 ← <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    ldr r4<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r4<span style="color: #009900; font-weight: bold;">&#93;</span>                    <span style="color: #339933;">/*</span> r4 ← <span style="color: #339933;">*</span>r4  <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
In the function <code>mult64</code> we pass the first value (x) as <code>r0,r1</code> and the second value (y) as <code>r2,r3</code>. The result is returned in <code>r0,r1</code>. We move the values to the appropriate registers for parameter passing in lines 57 to 61.
</p>
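<p>
To double check the arithmetic, here is a minimal C sketch of the same idea. It is only a model, not the code assembled above: the name <code>mult64_sketch</code> and the variable names are mine. Only the product of the two lower halves needs all its 64 bits. From each cross product only the lower 32 bits can reach the upper half of the result, and any carry out of their sum would belong to bit 64, so it is dropped.
</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of a 64-bit multiply built from 32-bit halves.
   It returns the lower 64 bits of x * y. */
uint64_t mult64_sketch(uint64_t x, uint64_t y)
{
    uint32_t x_lo = (uint32_t)x, x_hi = (uint32_t)(x >> 32);
    uint32_t y_lo = (uint32_t)y, y_hi = (uint32_t)(y >> 32);

    /* Like umull: 32 x 32 bits giving a full 64-bit product */
    uint64_t low_product = (uint64_t)x_lo * y_lo;

    /* Lower 32 bits of each cross product; the sum wraps modulo 2^32 */
    uint32_t cross = (uint32_t)((uint64_t)x_lo * y_hi)
                   + (uint32_t)((uint64_t)x_hi * y_lo);

    uint32_t result_lo = (uint32_t)low_product;
    uint32_t result_hi = (uint32_t)(low_product >> 32) + cross;
    uint64_t result = ((uint64_t)result_hi << 32) | result_lo;

    /* Sanity check against the compiler's own 64-bit multiplication */
    assert(result == x * y);
    return result;
}
```

<p>
For instance, <code>mult64_sketch(12345678901ULL, 12345678ULL)</code> returns <code>152415776403139878</code>.
</p>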
<p>
Printing the result is a bit complicated. 64-bit values must be passed in pairs of consecutive registers where the lower part is in an even-numbered register. Since we pass the address of the message in <code>r0</code> we cannot pass the first 64-bit integer starting in <code>r1</code>. So we skip <code>r1</code> and we use <code>r2</code> and <code>r3</code> for the first 64-bit argument. But now we have run out of registers for parameter passing. When this happens, we have to use the stack for parameter passing.
</p>
<p>
Two rules have to be taken into account when passing data in the stack.
</p>
<ol>
<li>You must ensure that the stack is aligned for the data you are going to pass (by adjusting the stack first). So, for 64-bit numbers, the stack must be 8-byte aligned. If you pass a 32-bit number and then a 64-bit number, you will have to skip 4 bytes before passing the 64-bit number. Do not forget to always keep the stack 8-byte aligned, as required by the Procedure Call Standard for the ARM Architecture (AAPCS).
<li>An argument with a lower position number in the call must have a lower address in the stack. So we have to push the arguments in reverse order.
</ol>
<p>
The second rule is what explains why we push first <code>r1</code> and then <code>r0</code>, when they are the registers containing the last 64-bit number (the result of the multiplication) we want to pass to <code>printf</code>.
</p>
<p>
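Note that in the example above we cannot pass the parameters in the stack using <code>push {r0,r1,r6,r7}</code>. A <code>push</code> with several registers always stores the lowest-numbered register at the lowest address, so it is equivalent to <code>push {r7}</code>, <code>push {r6}</code>, <code>push {r1}</code> and <code>push {r0}</code>. This would leave <code>r0,r1</code> (the result) below <code>r6,r7</code> in the stack, which is not the order required when passing the arguments.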
</p>
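<p>
The effect of the push order can be sketched with a toy model of the stack in C. This is only an illustration under my own assumptions: the stack is an array where a lower index stands for a lower address, and the names <code>toy_stack</code>, <code>toy_push</code> and <code>toy_push_list</code> are hypothetical.
</p>

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy model of a descending stack: sp is an index into words[],
   and a lower index stands for a lower address. */
struct toy_stack {
    uint32_t words[STACK_WORDS];
    int sp;
};

/* An empty stack starts just past the highest address */
void toy_init(struct toy_stack *s)
{
    s->sp = STACK_WORDS;
}

/* Models a push of a single register: sp moves down, then the value is stored */
void toy_push(struct toy_stack *s, uint32_t value)
{
    assert(s->sp > 0);
    s->sp -= 1;
    s->words[s->sp] = value;
}

/* Models a push of a register list. regs[] must be given in ascending
   register number; the lowest-numbered register lands at the lowest address. */
void toy_push_list(struct toy_stack *s, const uint32_t *regs, int count)
{
    assert(count >= 0 && s->sp >= count);
    s->sp -= count;
    for (int i = 0; i < count; i++)
        s->words[s->sp + i] = regs[i];
}
```

<p>
Calling <code>toy_push</code> with the values of <code>r1</code>, <code>r0</code>, <code>r7</code> and <code>r6</code>, in this order, leaves <code>r6,r7</code> at the two lowest addresses and <code>r0,r1</code> right above them. A single <code>toy_push_list</code> with <code>r0,r1,r6,r7</code> leaves <code>r0</code> at the lowest address instead.
</p>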
<p>
If we run the program we should see something like this.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="bash" style="font-family:monospace;">$ .<span style="color: #000000; font-weight: bold;">/</span>mult64_2
Multiplication of <span style="color: #000000;">12345678901</span> by <span style="color: #000000;">12345678</span> is <span style="color: #000000;">152415776403139878</span></pre></td></tr></table></div>

<p>
That&#8217;s all for today.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/03/28/arm-assembler-raspberry-pi-chapter-12/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>ARM assembler in Raspberry Pi &#8211; Chapter 11</title>
		<link>http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/</link>
		<comments>http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/#comments</comments>
		<pubDate>Sat, 16 Mar 2013 15:38:18 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>
		<category><![CDATA[arm]]></category>
		<category><![CDATA[assembler]]></category>
		<category><![CDATA[branches]]></category>
		<category><![CDATA[function]]></category>
		<category><![CDATA[function call]]></category>
		<category><![CDATA[functions]]></category>
		<category><![CDATA[pi]]></category>
		<category><![CDATA[predication]]></category>
		<category><![CDATA[raspberry]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=772</guid>
		<description><![CDATA[Several times, in earlier chapters, I stated that the ARM architecture was designed with the embedded world in mind. Although memory gets cheaper every day, it may still account for an important part of the budget of an embedded system. The ARM instruction set has several features meant to reduce the impact [...]]]></description>
				<content:encoded><![CDATA[<p>
Several times, in earlier chapters, I stated that the ARM architecture was designed with the embedded world in mind. Although memory gets cheaper every day, it may still account for an important part of the budget of an embedded system. The ARM instruction set has several features meant to reduce the impact of code size. One of the features that helps with this is <strong>predication</strong>.
</p>
<p><span id="more-772"></span></p>
<h2>Predication</h2>
<p>
We saw in chapters 6 and 7 how to use branches in our program in order to modify the execution flow of instructions and implement useful control structures. Branches can be unconditional, for instance when calling a function as we did in chapters 9 and 10, or conditional when we want to jump to some part of the code only when a previously tested condition is met.
</p>
<p>
Predication is related to conditional branches. What if, instead of branching to some part of the code meant to be executed only when a condition <code>C</code> holds, we were able to <em>turn</em> some instructions <em>off</em> when that condition <code>C</code> does not hold? Consider a case like this.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>C<span style="color: #009900;">&#41;</span>
  T<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #b1b100;">else</span>
  E<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
Using predication (and with some invented syntax to express it) we could write the above if as follows.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;">P <span style="color: #339933;">=</span> C<span style="color: #339933;">;</span>
<span style="color: #009900;">&#91;</span>P<span style="color: #009900;">&#93;</span>  T<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#91;</span><span style="color: #339933;">!</span>P<span style="color: #009900;">&#93;</span> E<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
This way we avoid branches. But why would we want to avoid branches? Well, executing a conditional branch involves a bit of uncertainty. But this deserves a bit of explanation.
</p>
<h3>The assembly line of instructions</h3>
<p>
Imagine an assembly line. In that assembly line there are 5 workers, each one fully specialized in a single task. That assembly line <em>executes</em> instructions. Every instruction enters the assembly line from the left and leaves it at the right. Each worker does some task on the instruction and passes it to the next worker to the right. Also, imagine all workers are more or less synchronized: each one finishes the task in at most <code>6</code> seconds. This means that every 6 seconds an instruction leaves the assembly line, fully executed. It also means that at any given time there may be up to 5 instructions being processed (although only one of them is fully executed every 6 seconds).
</p>
<p><img src="http://thinkingeek.com/wp-content/uploads/2013/03/pipeline.png" alt="The assembly line of instructions" width="537" height="124" class="aligncenter size-full wp-image-803" /></p>
<p>
The first worker <em>fetches</em> instructions and puts them in the assembly line. It fetches the instruction at the address specified by the register <code>pc</code>. By default, unless told otherwise, this worker <em>fetches</em> the instruction physically following the one he previously fetched (this is <em>implicit sequencing</em>).
</p>
<p>
In this assembly line, the worker that checks the condition of a conditional branch is not the first one but the third one. Now consider what happens when the first worker fetches a conditional branch and puts it in the assembly line. The second worker will process it and pass it to the third one. The third one will process it by checking the condition of the conditional branch. If it does not hold, nothing happens, the branch has no effect. But if the condition holds, the third worker must notify the first one that the next instruction fetched should be the instruction at the address of the branch.
</p>
<p>
But now there are two instructions in the assembly line that should not be fully executed (the ones that were physically after the conditional branch). There are several options here. The third worker may pick two stickers labeled <span style="font-variant: small-caps;">do nothing</span> and stick them on the next two instructions. Another approach would be for the third worker to tell the first and second workers «hey guys, stick a <span style="font-variant: small-caps;">do nothing</span> on your current instruction». Later workers, when they see these <span style="font-variant: small-caps;">do nothing</span> stickers will do, huh, nothing. This way a <span style="font-variant: small-caps;">do nothing</span> instruction will never be fully executed.
</p>
<p><img src="http://thinkingeek.com/wp-content/uploads/2013/03/bombolla.png" alt="The third worker realizes that a branch is taken. Next two instructions will get a DO NOTHING sticker" width="535" height="556" class="aligncenter size-full wp-image-821" /></p>
<p>
But by doing this, that nice property of our assembly line is gone: now we do not have a fully executed instruction every 6 seconds. In fact, after the conditional branch there are two <span style="font-variant: small-caps;">do nothing</span> instructions. A program that is constantly taking branches may well reduce the performance of our assembly line from one (useful) instruction every 6 seconds to one instruction every 18 seconds. This is three times slower!
</p>
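<p>
The 18 seconds figure can be reproduced with a little arithmetic. The sketch below is only a model of the toy assembly line, under the assumptions made in the text: every step takes 6 seconds and each taken branch wastes two slots. The function name <code>seconds_per_useful_instruction</code> is mine, and it only looks at the long-run average, ignoring the time needed to fill the assembly line at the start.
</p>

```c
#include <assert.h>

/* Average seconds between two useful instructions leaving the toy
   assembly line, given the fraction (0.0 to 1.0) of instructions that
   are taken branches. Assumes 6 seconds per step and 2 wasted slots
   (DO NOTHING stickers) after each taken branch. */
double seconds_per_useful_instruction(double taken_fraction)
{
    const double seconds_per_step = 6.0;
    const double wasted_slots_per_taken_branch = 2.0;

    assert(taken_fraction >= 0.0 && taken_fraction <= 1.0);

    return seconds_per_step
         * (1.0 + wasted_slots_per_taken_branch * taken_fraction);
}
```

<p>
With no taken branches the result is 6 seconds. If every single instruction were a taken branch it would be 18 seconds, and if half of them were, 12 seconds.
</p>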
<p>
The truth is that modern processors, including the one in the Raspberry Pi, have <em>branch predictors</em> which are able to mitigate these problems: they try to predict whether the condition will hold, that is, whether the branch will be taken or not. Branch predictors, though, predict the future like stock brokers: using the past and, when there is no past information, some sensible assumptions. So branch predictors may work very well with relatively predictable code but not so well if the code behaves unpredictably. Such behaviour is observed, for instance, when running decompressors. A compressor reduces the size of your files by removing redundancy. Redundant stuff is predictable and can be omitted (for instance in &#8220;he is wearing his coat&#8221; you could omit &#8220;he&#8221; or replace &#8220;his&#8221; by &#8220;its&#8221;, regardless of whether doing this is rude, because you know you are talking about a male). So a decompressor has to process a file with very little redundancy, which drives the predictor nuts.
</p>
<p>
Back to the assembly line example, it would be the first worker who attempts to predict whether the branch will be taken or not. It is the third worker who verifies whether the first worker made the right prediction. If the first worker mispredicted the branch, then we have to apply two stickers again and tell the first worker the right address of the next instruction. If the first worker predicted the branch right, nothing special has to be done, which is great.
</p>
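<p>
To get a feeling of how a predictor <em>uses the past</em>, here is a sketch of a 2-bit saturating counter. This is a common textbook scheme that I picked for illustration; I am not claiming it is what the processor of the Raspberry Pi actually implements. The function name <code>count_mispredictions</code> and the starting state of the counter are assumptions of the sketch.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Counts the mispredictions of a 2-bit saturating counter over a
   sequence of branch outcomes (nonzero = taken, 0 = not taken).
   Counter values 0 and 1 predict "not taken"; 2 and 3 predict "taken". */
size_t count_mispredictions(const int *outcomes, size_t n)
{
    int counter = 1;              /* assumed start: weakly "not taken" */
    size_t mispredictions = 0;

    assert(outcomes != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int taken = outcomes[i] != 0;
        int predicted_taken = counter >= 2;

        if (predicted_taken != taken)
            mispredictions++;

        /* Learn from the past: move one step towards the real outcome */
        if (taken && counter < 3)
            counter++;
        else if (!taken && counter > 0)
            counter--;
    }
    return mispredictions;
}
```

<p>
A branch that is taken eight times in a row (like the one closing a loop) is mispredicted only once by this sketch, while a branch that strictly alternates between taken and not taken eight times is mispredicted all eight times.
</p>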
<p>
If we avoid branches, we avoid the uncertainty of whether the branch is taken or not. So it looks like predication is the way to go. Not so fast. Processing a bunch of instructions that are actually turned off is not an efficient usage of a processor.
</p>
<p>
Back to our assembly line, the third worker will check the predicate. If it does not hold, the current instruction will get a <span style="font-variant: small-caps;">do nothing</span> sticker but in contrast to a branch, it does not notify the first worker.
</p>
<p>
So, as usual, it turns out that no approach is perfect on its own.
</p>
<h2>Predication in ARM</h2>
<p>
In ARM, predication is very simple to use: almost all instructions can be predicated. The predicate is specified as a suffix to the instruction name. The suffixes are exactly the same as those used in branches in chapter 5: <code>eq</code>, <code>ne</code>, <code>le</code>, <code>lt</code>, <code>ge</code> and <code>gt</code>. Instructions that are not predicated are assumed to have the suffix <code>al</code>, standing for <em><strong>al</strong>ways</em>. That predicate always holds and we do not write it for economy (it is valid, though). You can understand conditional branches as predicated branches if you like.
</p>
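<p>
What each suffix checks can be modelled in C. The sketch below computes the flags N, Z, C and V as a <code>cmp</code> would and then evaluates the predicate of each suffix using the condition rules of the ARM architecture. The names <code>compare</code>, <code>predicate_holds</code> and <code>predicated_mov</code> are hypothetical, they only serve this illustration.
</p>

```c
#include <assert.h>
#include <stdint.h>

/* Flags as left by `cmp a, b`, which computes a - b */
struct flags { int n, z, c, v; };

enum cond { COND_EQ, COND_NE, COND_GE, COND_LT, COND_GT, COND_LE, COND_AL };

struct flags compare(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t diff = ua - ub;
    struct flags f;

    f.n = (diff >> 31) & 1;                       /* result is negative */
    f.z = diff == 0;                              /* result is zero */
    f.c = ua >= ub;                               /* no borrow happened */
    f.v = (((ua ^ ub) & (ua ^ diff)) >> 31) & 1;  /* signed overflow */
    return f;
}

int predicate_holds(enum cond cc, struct flags f)
{
    switch (cc) {
    case COND_EQ: return f.z;
    case COND_NE: return !f.z;
    case COND_GE: return f.n == f.v;
    case COND_LT: return f.n != f.v;
    case COND_GT: return !f.z && f.n == f.v;
    case COND_LE: return f.z || f.n != f.v;
    case COND_AL: return 1;
    }
    assert(!"unknown condition");
    return 0;
}

/* Models a predicated mov: the register is written only if the
   predicate holds, otherwise the instruction does nothing */
void predicated_mov(int32_t *rd, int32_t value, enum cond cc, struct flags f)
{
    if (predicate_holds(cc, f))
        *rd = value;
}
```

<p>
For example, after <code>compare(3, 7)</code> a <code>predicated_mov</code> with <code>COND_GE</code> leaves the destination untouched while one with <code>COND_LT</code> writes it. This mimics computing a maximum with a <code>cmp</code> followed by a <code>movge</code> and a <code>movlt</code>, without any branch.
</p>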
<h2>Collatz conjecture revisited</h2>
<p>
In chapter 6 we implemented an algorithm that computed the length of the Hailstone sequence of a given number. Though it has not been proved yet, no number has been found that has an infinite Hailstone sequence. Given the knowledge of functions we gained in chapters 9 and 10, I encapsulated the code that computes the length of the Hailstone sequence in a function.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> collatz02<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.data</span>
&nbsp;
message<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;Type a number: &quot;</span>
scan_format <span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;%d&quot;</span>
message2<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;Length of the Hailstone sequence for %d is %d\n&quot;</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
collatz<span style="color: #339933;">:</span>
    <span style="color: #339933;">/*</span> r0 contains the first argument <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> Only r0<span style="color: #339933;">,</span> r1 <span style="color: #00007f; font-weight: bold;">and</span> r2 are modified<span style="color: #339933;">,</span> 
       so we <span style="color: #0000ff; font-weight: bold;">do</span> <span style="color: #00007f; font-weight: bold;">not</span> need to keep anything
       <span style="color: #00007f; font-weight: bold;">in</span> the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    <span style="color: #339933;">/*</span> Since we <span style="color: #0000ff; font-weight: bold;">do</span> <span style="color: #00007f; font-weight: bold;">not</span> <span style="color: #0000ff; font-weight: bold;">do</span> any <span style="color: #00007f; font-weight: bold;">call</span><span style="color: #339933;">,</span> we <span style="color: #0000ff; font-weight: bold;">do</span>
       <span style="color: #00007f; font-weight: bold;">not</span> have to keep lr either <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r0                 <span style="color: #339933;">/*</span> r1 ← r0 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                 <span style="color: #339933;">/*</span> r0 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
  collatz_loop<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>                 <span style="color: #339933;">/*</span> compare r1 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    beq collatz_end            <span style="color: #339933;">/*</span> if r1 == <span style="color: #ff0000;">1</span> branch to collatz_end <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">and</span> r2<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r2 ← r1 &amp; <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                 <span style="color: #339933;">/*</span> compare r2 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    bne collatz_odd            <span style="color: #339933;">/*</span> if r2 != <span style="color: #ff0000;">0</span> <span style="color: #009900; font-weight: bold;">&#40;</span>this is r1 <span style="color: #339933;">%</span> <span style="color: #ff0000;">2</span> != <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> branch to collatz_odd <span style="color: #339933;">*/</span>
  collatz_even<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> ASR #<span style="color: #ff0000;">1</span>         <span style="color: #339933;">/*</span> r1 ← r1 &gt;&gt; <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is r1 ← r1<span style="color: #339933;">/</span><span style="color: #ff0000;">2</span> <span style="color: #339933;">*/</span>
    b collatz_end_loop         <span style="color: #339933;">/*</span> branch to collatz_end_loop <span style="color: #339933;">*/</span>
  collatz_odd<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span>     <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r1 &lt;&lt; <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is r1 ← <span style="color: #ff0000;">3</span><span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> <span style="color: #339933;">*/</span>
  collatz_end_loop<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    b collatz_loop             <span style="color: #339933;">/*</span> branch back to collatz_loop <span style="color: #339933;">*/</span>
  collatz_end<span style="color: #339933;">:</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>lr<span style="color: #009900; font-weight: bold;">&#125;</span>                       <span style="color: #339933;">/*</span> keep lr <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">sub</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>                  <span style="color: #339933;">/*</span> make room for <span style="color: #ff0000;">4</span> bytes <span style="color: #00007f; font-weight: bold;">in</span> the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
                                    <span style="color: #339933;">/*</span> The <span style="color: #0000ff; font-weight: bold;">stack</span> is already <span style="color: #ff0000;">8</span> <span style="color: #0000ff; font-weight: bold;">byte</span> aligned <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_message      <span style="color: #339933;">/*</span> first parameter of printf<span style="color: #339933;">:</span> &amp;message <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf                       <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> printf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_scan_format  <span style="color: #339933;">/*</span> first parameter of scanf<span style="color: #339933;">:</span> &amp;scan_format <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span>                      <span style="color: #339933;">/*</span> second parameter of scanf<span style="color: #339933;">:</span> 
                                       address of the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> scanf                        <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> scanf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span>                    <span style="color: #339933;">/*</span> first parameter of collatz<span style="color: #339933;">:</span>
                                       the value stored <span style="color: #009900; font-weight: bold;">&#40;</span>by scanf<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #00007f; font-weight: bold;">in</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> collatz                      <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> collatz <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> r0                      <span style="color: #339933;">/*</span> third parameter of printf<span style="color: #339933;">:</span> 
                                       the result of collatz <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span>                    <span style="color: #339933;">/*</span> second parameter of printf<span style="color: #339933;">:</span>
                                       the value stored <span style="color: #009900; font-weight: bold;">&#40;</span>by scanf<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #00007f; font-weight: bold;">in</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    ldr r0<span style="color: #339933;">,</span> address_of_message2     <span style="color: #339933;">/*</span> first parameter of printf<span style="color: #339933;">:</span> &amp;message2 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf
&nbsp;
    <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>lr<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr
&nbsp;
&nbsp;
address_of_message<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message
address_of_scan_format<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> scan_format
address_of_message2<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message2</pre></td></tr></table></div>

<h2>Adding predication</h2>
<p>
OK, let&#8217;s add some predication. There is an <em>if-then-else</em> construct in lines 22 to 31. There we check if the number is even or odd. If it is even we divide it by 2, if it is odd we multiply it by 3 and add 1.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>22
23
24
25
26
27
28
29
30
31
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">    <span style="color: #00007f; font-weight: bold;">and</span> r2<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r2 ← r1 &amp; <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                 <span style="color: #339933;">/*</span> compare r2 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    bne collatz_odd            <span style="color: #339933;">/*</span> if r2 != <span style="color: #ff0000;">0</span> <span style="color: #009900; font-weight: bold;">&#40;</span>this is r1 <span style="color: #339933;">%</span> <span style="color: #ff0000;">2</span> != <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> branch to collatz_odd <span style="color: #339933;">*/</span>
  collatz_even<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> ASR #<span style="color: #ff0000;">1</span>         <span style="color: #339933;">/*</span> r1 ← r1 &gt;&gt; <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is r1 ← r1<span style="color: #339933;">/</span><span style="color: #ff0000;">2</span> <span style="color: #339933;">*/</span>
    b collatz_end_loop         <span style="color: #339933;">/*</span> branch to collatz_end_loop <span style="color: #339933;">*/</span>
  collatz_odd<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span>     <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r1 &lt;&lt; <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is r1 ← <span style="color: #ff0000;">3</span><span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> <span style="color: #339933;">*/</span>
  collatz_end_loop<span style="color: #339933;">:</span></pre></td></tr></table></div>

<p>
Note in line 24 that there is a <code>bne</code> (<strong>b</strong>ranch if <strong>n</strong>ot <strong>e</strong>qual). We can use this condition (and its opposite <code>eq</code>) to predicate this <em>if-then-else</em> construct. Instructions in the <em>then</em> part will be predicated using <code>eq</code>, instructions in the <em>else</em> part will be predicated using <code>ne</code>. The resulting code is shown below.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                 <span style="color: #339933;">/*</span> compare r2 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    moveq r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> ASR #<span style="color: #ff0000;">1</span>       <span style="color: #339933;">/*</span> if r2 == <span style="color: #ff0000;">0</span><span style="color: #339933;">,</span> r1 ← r1 &gt;&gt; <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is r1 ← r1<span style="color: #339933;">/</span><span style="color: #ff0000;">2</span> <span style="color: #339933;">*/</span>
    addne r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span>   <span style="color: #339933;">/*</span> if r2 != <span style="color: #ff0000;">0</span><span style="color: #339933;">,</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r1 &lt;&lt; <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is r1 ← <span style="color: #ff0000;">3</span><span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
    addne r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>           <span style="color: #339933;">/*</span> if r2 != <span style="color: #ff0000;">0</span><span style="color: #339933;">,</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
As you can see, there are no labels in the predicated version. We do not branch now, so they are not needed anymore. Note also that we actually removed two branches: the one that branches from the condition test code to the <em>else</em> part and the one that branches from the end of the <em>then</em> part to the instruction after the whole <em>if-then-else</em>. This leads to more compact code.
</p>
<h2>Does it make any difference in performance?</h2>
<p>
Taken as is, this program is too small for its running time to be measured reliably, so I modified it to run the same calculation inside the collatz function 4194304 (this is 2<sup>22</sup>) times. I chose the number after some tests, so that the execution did not take tediously long.
</p>
<p>
Sadly, while the Raspberry Pi processor provides some hardware performance counters, I have not been able to use any of them. The <code>perf</code> tool (from the package <code>linux-tools-3.2</code>) complains that the counter cannot be opened.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="shell" style="font-family:monospace;">$ perf_3.2 stat -e cpu-cycles ./collatz02
  Error: open_counter returned with 19 (No such device). /bin/dmesg may provide additional information.
&nbsp;
  Fatal: Not all events could be opened</pre></td></tr></table></div>

<p>
<code>dmesg</code> does not provide any additional information. We can see, though, that the performance counters driver was loaded by the kernel.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="shell" style="font-family:monospace;">$ dmesg | grep perf
[    0.061722] hw perfevents: enabled with v6 PMU driver, 3 counters available</pre></td></tr></table></div>

<p>
Supposedly I should be able to measure up to 3 hardware events at the same time. I think the Raspberry Pi processor, packaged in the BCM2835 SoC, does not provide a PMU (Performance Monitoring Unit), which is required for performance counters. Nevertheless, we can use <code>cpu-clock</code> to measure the time.
</p>
<p>
Below are the two versions I used for this comparison. The first one uses branches, the second one uses predication.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">collatz<span style="color: #339933;">:</span>
    <span style="color: #339933;">/*</span> r0 contains the first argument <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #00007f; font-weight: bold;">sub</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> Make sure the <span style="color: #0000ff; font-weight: bold;">stack</span> is <span style="color: #ff0000;">8</span> <span style="color: #0000ff; font-weight: bold;">byte</span> aligned <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r0
    <span style="color: #00007f; font-weight: bold;">mov</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4194304</span>
  collatz_repeat<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r4                 <span style="color: #339933;">/*</span> r1 ← r4 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                 <span style="color: #339933;">/*</span> r0 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
  collatz_loop<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>                 <span style="color: #339933;">/*</span> compare r1 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    beq collatz_end            <span style="color: #339933;">/*</span> if r1 == <span style="color: #ff0000;">1</span> branch to collatz_end <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">and</span> r2<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r2 ← r1 &amp; <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                 <span style="color: #339933;">/*</span> compare r2 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    bne collatz_odd            <span style="color: #339933;">/*</span> if r2 != <span style="color: #ff0000;">0</span> <span style="color: #009900; font-weight: bold;">&#40;</span>this is r1 <span style="color: #339933;">%</span> <span style="color: #ff0000;">2</span> != <span style="color: #ff0000;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span> branch to collatz_odd <span style="color: #339933;">*/</span>
  collatz_even<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> ASR #<span style="color: #ff0000;">1</span>         <span style="color: #339933;">/*</span> r1 ← r1 &gt;&gt; <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is r1 ← r1<span style="color: #339933;">/</span><span style="color: #ff0000;">2</span> <span style="color: #339933;">*/</span>
    b collatz_end_loop         <span style="color: #339933;">/*</span> branch to collatz_end_loop <span style="color: #339933;">*/</span>
  collatz_odd<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span>     <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r1 &lt;&lt; <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is r1 ← <span style="color: #ff0000;">3</span><span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> <span style="color: #339933;">*/</span>
  collatz_end_loop<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    b collatz_loop             <span style="color: #339933;">/*</span> branch back to collatz_loop <span style="color: #339933;">*/</span>
  collatz_end<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">sub</span> r3<span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>
    bne collatz_repeat
    <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> Free the <span style="color: #ff0000;">4</span> bytes used to keep the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #ff0000;">8</span> <span style="color: #0000ff; font-weight: bold;">byte</span> aligned <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr</pre></td></tr></table></div>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">collatz2<span style="color: #339933;">:</span>
    <span style="color: #339933;">/*</span> r0 contains the first argument <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #00007f; font-weight: bold;">sub</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> Make sure the <span style="color: #0000ff; font-weight: bold;">stack</span> is <span style="color: #ff0000;">8</span> <span style="color: #0000ff; font-weight: bold;">byte</span> aligned <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r0
    <span style="color: #00007f; font-weight: bold;">mov</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4194304</span>
  collatz_repeat<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r4                 <span style="color: #339933;">/*</span> r1 ← r4 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                 <span style="color: #339933;">/*</span> r0 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
  collatz2_loop<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>                 <span style="color: #339933;">/*</span> compare r1 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    beq collatz2_end           <span style="color: #339933;">/*</span> if r1 == <span style="color: #ff0000;">1</span> branch to collatz2_end <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">and</span> r2<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r2 ← r1 &amp; <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>                 <span style="color: #339933;">/*</span> compare r2 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    moveq r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> ASR #<span style="color: #ff0000;">1</span>       <span style="color: #339933;">/*</span> if r2 == <span style="color: #ff0000;">0</span><span style="color: #339933;">,</span> r1 ← r1 &gt;&gt; <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is r1 ← r1<span style="color: #339933;">/</span><span style="color: #ff0000;">2</span> <span style="color: #339933;">*/</span>
    addne r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span>   <span style="color: #339933;">/*</span> if r2 != <span style="color: #ff0000;">0</span><span style="color: #339933;">,</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r1 &lt;&lt; <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">.</span> This is r1 ← <span style="color: #ff0000;">3</span><span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
    addne r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>           <span style="color: #339933;">/*</span> if r2 != <span style="color: #ff0000;">0</span><span style="color: #339933;">,</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> <span style="color: #339933;">*/</span>
  collatz2_end_loop<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    b collatz2_loop            <span style="color: #339933;">/*</span> branch back to collatz2_loop <span style="color: #339933;">*/</span>
  collatz2_end<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">sub</span> r3<span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>
    bne collatz_repeat
    <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>             <span style="color: #339933;">/*</span> Restore the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #009900; font-weight: bold;">&#125;</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr</pre></td></tr></table></div>

<p>
The tool <code>perf</code> can be used to gather performance counters. We will run each version 5 times, always with the number 123 as input. We redirect the output of <code>yes 123</code> to the standard input of the tested program, so we do not have to type the number ourselves (typing it by hand would distort the timing).
</p>
<p>
The version with branches gives the following results:
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="shell" style="font-family:monospace;">$ yes 123 | perf_3.2 stat --log-fd=3 --repeat=5 -e cpu-clock ./collatz_branches 3&gt;&amp;1
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
&nbsp;
 Performance counter stats for './collatz_branches' (5 runs):
&nbsp;
       3359,953200 cpu-clock                  ( +-  0,01% )
&nbsp;
       3,365263737 seconds time elapsed                                          ( +-  0,01% )</pre></td></tr></table></div>

<p>
(When redirecting the input of <code>perf</code> one must specify the file descriptor for the output of <code>perf stat</code> itself. In this case we have used the file descriptor number 3 and then told the shell to redirect the file descriptor number 3 to the standard output, which is the file descriptor number 1).
</p>
<p>
The version with predication gives the following results:
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="shell" style="font-family:monospace;">$ yes 123 | perf_3.2 stat --log-fd=3 --repeat=5 -e cpu-clock ./collatz_predication 3&gt;&amp;1
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
Type a number: Length of the Hailstone sequence for 123 is 46
&nbsp;
 Performance counter stats for './collatz_predication' (5 runs):
&nbsp;
       2318,217200 cpu-clock                  ( +-  0,01% )
&nbsp;
       2,322732232 seconds time elapsed                                          ( +-  0,01% )</pre></td></tr></table></div>

<p>
So the answer is yes: in <strong>this case</strong> it does make a difference. The predicated version runs about 1.45 times faster than the version using branches (roughly 3.37 versus 2.32 seconds of elapsed time). It would be bold, though, to assume that predication outperforms branches in general. Always measure.
</p>
<p>
That&#8217;s all for today.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/03/16/arm-assembler-raspberry-pi-chapter-11/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>ARM assembler in Raspberry Pi – Chapter 10</title>
		<link>http://thinkingeek.com/2013/02/07/arm-assembler-raspberry-pi-chapter-10/</link>
		<comments>http://thinkingeek.com/2013/02/07/arm-assembler-raspberry-pi-chapter-10/#comments</comments>
		<pubDate>Thu, 07 Feb 2013 21:20:41 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>
		<category><![CDATA[arm]]></category>
		<category><![CDATA[assembler]]></category>
		<category><![CDATA[function]]></category>
		<category><![CDATA[function call]]></category>
		<category><![CDATA[functions]]></category>
		<category><![CDATA[pi]]></category>
		<category><![CDATA[raspberry]]></category>
		<category><![CDATA[stack]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=669</guid>
		<description><![CDATA[In chapter 9 we were introduced to functions and we saw that they have to follow a number of conventions in order to play nice with other functions. We also briefly mentioned the stack, as an area of memory owned solely by the function. In this chapter we will go in depth with the stack [...]]]></description>
				<content:encoded><![CDATA[<p>
In chapter 9 we were introduced to functions and we saw that they have to follow a number of conventions in order to play nice with other functions. We also briefly mentioned the stack, as an area of memory owned solely by the function. In this chapter we will go in depth with the stack and why it is important for functions.
</p>
<p><span id="more-669"></span></p>
<h2>Dynamic activation</h2>
<p>
One of the benefits of functions is being able to call them more than once. But that <em>more than once</em> hides a small trap. We are not restricting who will be able to call the function, so it may happen that the function calls itself. This is what happens when we use recursion.
</p>
<p>
A typical example of recursion is the factorial of a number <em>n</em>, usually written as <em>n!</em>. A factorial in C can be written as follows.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">int</span> factorial<span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> n<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
   <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>n <span style="color: #339933;">==</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>
      <span style="color: #b1b100;">return</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span>
   <span style="color: #b1b100;">else</span>
      <span style="color: #b1b100;">return</span> n <span style="color: #339933;">*</span> factorial<span style="color: #009900;">&#40;</span>n<span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>
Note that there is only one function <code>factorial</code>, but it may be called several times. For instance: <em>factorial(3) → factorial(2) → factorial(1) → factorial(0)</em>, where → means «it calls». A function, thus, is <em>dynamically activated</em> each time it is called. The span of a dynamic activation goes from the point where the function is called until it returns. At any given time more than one function may be dynamically activated, and the same function may be activated more than once. The whole set of dynamic activations includes the current function and the dynamic activation set of the function that called it.
</p>
<p>
Ok. We have a function that calls itself. No big deal, right? Well, this would not be a problem if it weren&#8217;t for the rules that a function must observe. Let&#8217;s quickly recall them.
</p>
<ul>
<li>Only <code>r0</code>, <code>r1</code>, <code>r2</code> and <code>r3</code> can be freely modified.
<li><code>lr</code> value at the entry of the function must be kept somewhere because we will need it to leave the function (to return to the caller).
<li>All other registers <code>r4</code> to <code>r11</code> and <code>sp</code> can be modified but they must be restored to their original values upon leaving the function.
</ul>
<p>
In chapter 9 we used a global variable to keep <code>lr</code>. But if we attempted to use a global variable in our <em>factorial(3)</em> example, it would be overwritten at the next dynamic activation of factorial. We would only be able to return from <em>factorial(0)</em> to <em>factorial(1)</em>. After that we would be stuck in <em>factorial(1)</em>, as <code>lr</code> would always have the same value.
</p>
<p>
So it looks like we need some way to keep at least the value of <code>lr</code> <strong>for each dynamic activation</strong>. And not only <code>lr</code>: if we wanted to use registers <code>r4</code> to <code>r11</code>, we would also have to keep their original values for each dynamic activation, so a global variable would not be enough for them either. This is where the stack comes into play.
</p>
<h2>The stack</h2>
<p>
In computing, a stack is a data structure (a way to organize data that provides some interesting properties). A stack typically has three operations: access the top of the stack, push onto the top, pop from the top. Depending on the context you may only be able to access the top of the stack; in our case we will be able to access more elements than just the top.
</p>
<p>
But what is the stack? I already said in chapter 9 that the stack is a region of memory owned solely by the function. We can now reword this a bit better: the stack is a region of memory owned solely by the current dynamic activation. And how do we control the stack? Well, in chapter 9 we said that the register <code>sp</code> stands for <em><strong>s</strong>tack <strong>p</strong>ointer</em>. This register contains the address of the top of the stack. The region of memory owned by the dynamic activation is the range of bytes between the current value of <code>sp</code> and the value that <code>sp</code> had at the beginning of the function. We will call that region the <strong>local memory</strong> of a function (more precisely, of a dynamic activation of it). We will put there whatever has to be saved at the beginning of a function and restored before leaving. We will also keep there the <strong>local variables</strong> of a function (dynamic activation).
</p>
<p>
Our function also has to adhere to some rules when handling the stack.
</p>
<ul>
<li>The stack pointer (<code>sp</code>) must always be 4 byte aligned. This is absolutely mandatory. In addition, the Procedure Call Standard for the ARM Architecture (AAPCS) requires the stack pointer to be 8 byte aligned at what it calls <em>public interfaces</em> (that is, when we call code written by other people); otherwise funny things may happen.
<li>The value of <code>sp</code> when leaving the function should be the same value it had upon entering the function.
</ul>
<p>
The first rule is consistent with the alignment constraints of ARM, where most of the time addresses must be 4 byte aligned. Because of the AAPCS we will stick to the stricter 8 byte alignment constraint. The second rule states that, no matter how large our local memory is, it will always disappear at the end of the function. This is important, because local variables of a dynamic activation need not have any storage after that dynamic activation ends.
</p>
<p>
How the stack, and thus the local memory, changes its size is a matter of convention. The stack can grow upwards or downwards. If it grows upwards we have to increase the value of the <code>sp</code> register in order to enlarge the local memory. If it grows downwards we have to do the opposite: we subtract from <code>sp</code> as many bytes as the size of the local storage. In Linux ARM, the stack grows downwards, towards zero (although it should never reach zero). Addresses of local variables have very large values in the 32 bit range. They are usually close to 2<sup>32</sup>.
</p>
<p>
Another convention when using the stack concerns whether the <code>sp</code> register contains the address of the top of the stack or some bytes above. In Linux ARM the <code>sp</code> register directly points to the top of the stack: in the memory addressed by <code>sp</code> there is useful information.
</p>
<p>
Ok, we know the stack grows downwards and the top of the stack must always be in <code>sp</code>. So to enlarge the local memory it should be enough to decrease <code>sp</code>. The local memory is then defined by the range of memory from the current <code>sp</code> value to the original value that <code>sp</code> had at the beginning of the function. One register we almost always have to keep is <code>lr</code>. Let&#8217;s see how we can keep it in the stack.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">sub</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">8</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">sp</span> ← <span style="color: #46aa03; font-weight: bold;">sp</span> <span style="color: #339933;">-</span> <span style="color: #ff0000;">8</span><span style="color: #339933;">.</span> This enlarges the <span style="color: #0000ff; font-weight: bold;">stack</span> by <span style="color: #ff0000;">8</span> bytes <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span>    <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span><span style="color: #46aa03; font-weight: bold;">sp</span> ← lr <span style="color: #339933;">*/</span>
<span style="color: #339933;">...</span> <span style="color: #339933;">//</span> <span style="color: #0000ff; font-weight: bold;">Code</span> of the function
ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span>    <span style="color: #339933;">/*</span> lr ← <span style="color: #339933;">*</span><span style="color: #46aa03; font-weight: bold;">sp</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">8</span>  <span style="color: #339933;">/*</span> <span style="color: #46aa03; font-weight: bold;">sp</span> ← <span style="color: #46aa03; font-weight: bold;">sp</span> <span style="color: #339933;">+</span> <span style="color: #ff0000;">8</span><span style="color: #339933;">.</span> This reduces the <span style="color: #0000ff; font-weight: bold;">stack</span> by <span style="color: #ff0000;">8</span> bytes
                                effectively restoring the <span style="color: #0000ff; font-weight: bold;">stack</span> 
                                pointer to its original value <span style="color: #339933;">*/</span>
<span style="color: #46aa03; font-weight: bold;">bx</span> lr</pre></td></tr></table></div>

<p>
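A well behaved function may modify <code>sp</code> but must ensure that at the end it has the same value it had when we entered the function. This is what we do here. We first subtract 8 bytes from <code>sp</code> and at the end we add back 8 bytes. Only 4 of those bytes are needed for <code>lr</code>; we reserve 8 to keep the stack 8 byte aligned.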
</p>
<p>
This sequence of instructions would do indeed. But maybe you remember chapter 8 and the indexing modes that you could use in load and store. Note that the first two instructions behave exactly like a preindexing. We first update <code>sp</code> and then we use <code>sp</code> as the address where we store <code>lr</code>. This is exactly a preindex! Likewise for the <code>ldr</code> and <code>add</code> near the end. We first load <code>lr</code> using the current address in <code>sp</code> and then we increase <code>sp</code>. This is exactly a postindex!
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #339933;">-</span><span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> preindex<span style="color: #339933;">:</span> <span style="color: #46aa03; font-weight: bold;">sp</span> ← <span style="color: #46aa03; font-weight: bold;">sp</span> <span style="color: #339933;">-</span> <span style="color: #ff0000;">8</span><span style="color: #666666; font-style: italic;">; *sp ← lr */</span>
<span style="color: #339933;">...</span> <span style="color: #339933;">//</span> <span style="color: #0000ff; font-weight: bold;">Code</span> of the function
ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">8</span>   <span style="color: #339933;">/*</span> postindex<span style="color: #666666; font-style: italic;">; lr ← *sp; sp ← sp + 8 */</span>
<span style="color: #46aa03; font-weight: bold;">bx</span> lr</pre></td></tr></table></div>

<p>
Yes, these addressing modes were invented to support this sort of thing. Using a single instruction is better in terms of code size. This may not seem relevant, but it is when we realize that the stack bookkeeping is required in almost every function we write!
</p>
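<p>
To see these conventions at work, here is a minimal sketch (not part of the original examples; the file name and the labels <code>message</code> and <code>address_of_message</code> are made up for illustration). It assumes we link against the C library with <code>gcc</code>, as we did in chapter 9 to be able to call <code>printf</code>. It prints the value of <code>sp</code> at the entry of <code>main</code> and again after keeping <code>lr</code> with a preindexed store.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* -- stack_grows.s (hypothetical file name) */
.data

message: .asciz "sp at entry of main: %p\nsp after keeping lr: %p\n"

.text

.globl main
main:
    mov r1, sp                  /* r1 ← sp at entry. Second parameter of printf */
    str lr, [sp, #-8]!          /* preindex: sp ← sp - 8; *sp ← lr */
                                /* sp remains 8 byte aligned */
    mov r2, sp                  /* r2 ← sp now. Third parameter of printf */
    ldr r0, address_of_message  /* r0 ← address of message. First parameter */
    bl printf                   /* Call printf */

    mov r0, #0                  /* main returns 0 */
    ldr lr, [sp], #+8           /* postindex: lr ← *sp; sp ← sp + 8 */
    bx lr                       /* Leave main */

address_of_message: .word message</pre></td></tr></table></div>

<p>
The actual addresses depend on the system and may change between runs, but the second one should always be exactly 8 bytes lower than the first: the stack has grown downwards, and <code>sp</code> is back to its original value when <code>main</code> returns.
</p>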
<h2>First approach</h2>
<p>
Let&#8217;s implement the factorial function above.
</p>
<p>
First we have to learn a new instruction to multiply two numbers: <code>mul Rdest, Rsource1, Rsource2</code>. Note that multiplying two 32 bit values may require up to 64 bits for the result. This instruction only computes the lower 32 bits. Because we are not going to use 64 bit values in this example, the maximum factorial we will be able to compute is 12! (13! is bigger than 2<sup>32</sup>). We will not check that the entered number is lower than 13 to keep the example simple (I encourage you to add this check to the example, though). In versions of the ARM architecture prior to ARMv6 this instruction could not have <code>Rdest</code> the same as <code>Rsource1</code>. GNU assembler may print a warning if you don&#8217;t pass <code>-march=armv6</code>.
</p>
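<p>
As a quick illustration of what «only the lower 32 bits» means, the following sketch (again not part of the original examples; file name and labels are made up) multiplies 12! = 479001600 by 13 using <code>mul</code>. The true value of 13! is 6227020800, which does not fit in 32 bits, so the instruction keeps 6227020800 − 2<sup>32</sup> = 1932053504.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* -- mul_low32.s (hypothetical file name) */
.data

message: .asciz "Lower 32 bits of 13 * 479001600: %u\n"

.text

.globl main
main:
    str lr, [sp, #-8]!          /* Keep lr. sp remains 8 byte aligned */
    ldr r1, twelve_factorial    /* r1 ← 479001600. This is 12! */
    mov r2, #13                 /* r2 ← 13 */
    mul r1, r2, r1              /* r1 ← lower 32 bits of r2 * r1 */
                                /* Second parameter of printf */
    ldr r0, address_of_message  /* r0 ← address of message. First parameter */
    bl printf                   /* Call printf */

    mov r0, #0                  /* main returns 0 */
    ldr lr, [sp], #+8           /* Restore lr and sp */
    bx lr                       /* Leave main */

twelve_factorial:   .word 479001600
address_of_message: .word message</pre></td></tr></table></div>

<p>
The program prints 1932053504, not 6227020800. This is the same value our factorial would silently return for 13, which is why a check on the entered number is worth adding.
</p>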

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> factorial01<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.data</span>
&nbsp;
message1<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;Type a number: &quot;</span>
format<span style="color: #339933;">:</span>   <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;%d&quot;</span>
message2<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;The factorial of %d is %d\n&quot;</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
factorial<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span>#<span style="color: #339933;">-</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> lr onto the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> r0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span>#<span style="color: #339933;">-</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> r0 onto the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
                       <span style="color: #339933;">/*</span> Note that after that<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span> is <span style="color: #ff0000;">8</span> <span style="color: #0000ff; font-weight: bold;">byte</span> aligned <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>         <span style="color: #339933;">/*</span> compare r0 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    bne is_nonzero     <span style="color: #339933;">/*</span> if r0 != <span style="color: #ff0000;">0</span> then branch <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>         <span style="color: #339933;">/*</span> r0 ← <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is the return <span style="color: #339933;">*/</span>
    b end
is_nonzero<span style="color: #339933;">:</span>
                       <span style="color: #339933;">/*</span> Prepare the <span style="color: #00007f; font-weight: bold;">call</span> to factorial<span style="color: #009900; font-weight: bold;">&#40;</span>n<span style="color: #339933;">-</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">sub</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>     <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">-</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> factorial
                       <span style="color: #339933;">/*</span> After the <span style="color: #00007f; font-weight: bold;">call</span> r0 contains factorial<span style="color: #009900; font-weight: bold;">&#40;</span>n<span style="color: #339933;">-</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
                       <span style="color: #339933;">/*</span> Load r0 <span style="color: #009900; font-weight: bold;">&#40;</span>that we kept <span style="color: #00007f; font-weight: bold;">in</span> the <span style="color: #0000ff; font-weight: bold;">stack</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #00007f; font-weight: bold;">into</span> r1 <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span>       <span style="color: #339933;">/*</span> r1 ← <span style="color: #339933;">*</span><span style="color: #46aa03; font-weight: bold;">sp</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mul</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> r1     <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">*</span> r1 <span style="color: #339933;">*/</span>
&nbsp;
end<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>    <span style="color: #339933;">/*</span> Discard the r0 we kept <span style="color: #00007f; font-weight: bold;">in</span> the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Pop</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #00007f; font-weight: bold;">and</span> put it <span style="color: #00007f; font-weight: bold;">in</span> lr <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr              <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> factorial <span style="color: #339933;">*/</span>
&nbsp;
<span style="color: #339933;">.</span>globl main
main<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span>#<span style="color: #339933;">-</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>!            <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> lr onto the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">sub</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>               <span style="color: #339933;">/*</span> Make room for one <span style="color: #ff0000;">4</span> <span style="color: #0000ff; font-weight: bold;">byte</span> integer <span style="color: #00007f; font-weight: bold;">in</span> the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
                                 <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">In</span> these <span style="color: #ff0000;">4</span> bytes we will keep the number <span style="color: #339933;">*/</span>
                                 <span style="color: #339933;">/*</span> entered by the user <span style="color: #339933;">*/</span>
                                 <span style="color: #339933;">/*</span> Note that after that the <span style="color: #0000ff; font-weight: bold;">stack</span> is <span style="color: #ff0000;">8</span><span style="color: #339933;">-</span><span style="color: #0000ff; font-weight: bold;">byte</span> aligned <span style="color: #339933;">*/</span>
    ldr r0<span style="color: #339933;">,</span> address_of_message1  <span style="color: #339933;">/*</span> Set &amp;message1 as the first parameter of printf <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf                    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Call</span> printf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_format    <span style="color: #339933;">/*</span> Set &amp;format as the first parameter of scanf <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span>                   <span style="color: #339933;">/*</span> Set the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> as the second parameter <span style="color: #339933;">*/</span>
                                 <span style="color: #339933;">/*</span> of scanf <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> scanf                     <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Call</span> scanf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span>                 <span style="color: #339933;">/*</span> Load the integer read by scanf <span style="color: #00007f; font-weight: bold;">into</span> r0 <span style="color: #339933;">*/</span>
                                 <span style="color: #339933;">/*</span> So we set it as the first parameter of factorial <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> factorial                 <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Call</span> factorial <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> r0                   <span style="color: #339933;">/*</span> Get the result of factorial <span style="color: #00007f; font-weight: bold;">and</span> move it to r2 <span style="color: #339933;">*/</span>
                                 <span style="color: #339933;">/*</span> So we set it as the third parameter of printf <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span>                 <span style="color: #339933;">/*</span> Load the integer read by scanf <span style="color: #00007f; font-weight: bold;">into</span> r1 <span style="color: #339933;">*/</span>
                                 <span style="color: #339933;">/*</span> So we set it as the second parameter of printf <span style="color: #339933;">*/</span>
    ldr r0<span style="color: #339933;">,</span> address_of_message2  <span style="color: #339933;">/*</span> Set &amp;message2 as the first parameter of printf <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf                    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Call</span> printf <span style="color: #339933;">*/</span>
&nbsp;
&nbsp;
    <span style="color: #00007f; font-weight: bold;">add</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> <span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>              <span style="color: #339933;">/*</span> Discard the integer read by scanf <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>            <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Pop</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #00007f; font-weight: bold;">and</span> put it <span style="color: #00007f; font-weight: bold;">in</span> lr <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr                        <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> main <span style="color: #339933;">*/</span>
&nbsp;
address_of_message1<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message1
address_of_message2<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message2
address_of_format<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> format</pre></td></tr></table></div>

<p>
Most of the code is pretty straightforward. In both functions, <code>main</code> and <code>factorial</code>, we allocate 4 extra bytes on the top of the stack. In <code>factorial</code> we use them to keep the value of <code>r0</code>, because it will be overwritten during the recursive call (twice, as the first parameter and as the result of the recursive function call). In <code>main</code> we use them to keep the value entered by the user (if you recall, in chapter 9 we used a global variable here).</p>
<p>It is important to bear in mind that the stack behaves like a real stack: the last element stacked (pushed onto the top) will be the first one to be taken out of the stack (popped from the top). We store <code>lr</code> and make room for a 4-byte integer. Since this is a stack, the opposite order must be used to return the stack to its original state: we first discard the integer and then we restore <code>lr</code>. Note that the same happens when we reserve the stack storage for the integer using a <code>sub</code> and later discard that storage with the opposite operation, <code>add</code>.
</p>
<h2>Can we do it better?</h2>
<p>
Note that the number of instructions that we need to push and pop data to and from the stack grows linearly with respect to the number of data items. Since ARM was designed for embedded systems, ARM designers devised a way to reduce the number of instructions we need for the «bookkeeping» of the stack. These instructions are load multiple, <code>ldm</code>, and store multiple, <code>stm</code>.
</p>
<p>
These two instructions are rather powerful and can do a lot of work in a single instruction. Their syntax is shown below. Elements enclosed in curly braces <code>{</code> and <code>}</code> may be omitted from the syntax (the effect of the instruction will vary, though).
</p>
<pre>
ldm addressing-mode Rbase{!}, register-set
stm addressing-mode Rbase{!}, register-set
</pre>
<p>
We will consider <code>addressing-mode</code> later. <code>Rbase</code> is the base address used to load to or store from the <code>register-set</code>. All 16 ARM registers may be specified in <code>register-set</code> (except <code>pc</code> in <code>stm</code>). A set of addresses is generated when executing these instructions. One address per register in the register-set. Then, each register, in ascending order, is paired with each of these addresses, also in ascending order. This way the lowest-numbered register gets the lowest memory address, and the highest-numbered register gets the highest memory address. Each pair register-address is then used to perform the memory operation: load or store. Specifying <code>!</code> means that <code>Rbase</code> will be updated. The updated value depends on <code>addressing-mode</code>.
</p>
<p>
Note that, if the registers are paired with addresses depending on their register number, it seems that they will always be loaded and stored in the same way. For instance a <code>register-set</code> containing <code>r4</code>, <code>r5</code> and <code>r6</code> will always store <code>r4</code> in the lowest address generated by the instruction and <code>r6</code> in the highest one. We can, though, specify what is considered the lowest address or the highest address. So, is <code>Rbase</code> actually the highest address or the lowest address of the multiple load/store? This is one of the two aspects that is controlled by <code>addressing-mode</code>. The second aspect relates to when the address of the memory operation changes between each memory operation.
</p>
<p>
If the value in <code>Rbase</code> is to be considered the highest address, we should first decrease <code>Rbase</code> by as many bytes as required by the number of registers in the <code>register-set</code> (this is 4 times the number of registers) to form the lowest address. Then we can load or store each register consecutively starting from that lowest address, always in ascending order of the register number. This addressing mode is called <em>decreasing</em> and is specified using a <q><code>d</code></q>. Conversely, if <code>Rbase</code> is to be considered the lowest address, then this is a bit easier as we can use its value as the lowest address already. We proceed as usual, loading or storing each register in ascending order of their register number. This addressing mode is called <em>increasing</em> and is specified using an <q><code>i</code></q>.
</p>
<p>
At each load or store, the address generated for the memory operation may be updated <em>after</em> or <em>before</em> the memory operation itself. We can specify this using <q><code>a</code></q> or <q><code>b</code></q>, respectively.
</p>
<p>
If we specify <code>!</code>, after the instruction, <code>Rbase</code> will have the highest address generated in the increasing mode and the lowest address generated in the decreasing mode. The final value of <code>Rbase</code> will include the final addition or subtraction if we use a mode that updates after (an <q><code>a</code></q> mode).
</p>
<p>
So we have four addressing modes, namely: <code>ia</code>, <code>ib</code>, <code>da</code> and <code>db</code>. These addressing modes are specified as <strong>suffixes</strong> of the <code>stm</code> and <code>ldm</code> instructions. So the full set of names is <code>stmia</code>, <code>stmib</code>, <code>stmda</code>, <code>stmdb</code>, <code>ldmia</code>, <code>ldmib</code>, <code>ldmda</code>, <code>ldmdb</code>. Now you may think that this is overly complicated, but we do not need all eight of them. Only two are of interest to us now.
</p>
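<p>
As a rough illustration of how the addresses are generated, here is a minimal sketch using <code>stmia</code>. The file name, the <code>buffer</code> variable and the values 10, 20 and 30 are made up for this example. Since <code>r1</code> is the lowest-numbered register in the set it should end in the lowest address, and, because of the <code>!</code>, <code>r0</code> should be left pointing just past the last word stored.
</p>
<pre>
/* -- stmia01.s (hypothetical example) */
.data
.balign 4
buffer: .skip 12                /* Room for three 4-byte integers */

.text
.balign 4
.global main
main:
    mov r1, #10                 /* r1 ← 10 */
    mov r2, #20                 /* r2 ← 20 */
    mov r3, #30                 /* r3 ← 30 */
    ldr r0, address_of_buffer   /* r0 ← &amp;buffer */
    stmia r0!, {r1, r2, r3}     /* r1 goes to &amp;buffer, r2 to &amp;buffer + 4 */
                                /* and r3 to &amp;buffer + 8 */
                                /* After the store, r0 ← &amp;buffer + 12 */
    ldr r0, [r0, #-4]           /* r0 ← the word at &amp;buffer + 8, stored from r3 */
    bx lr                       /* Leave main, r0 is the exit code */

address_of_buffer: .word buffer
</pre>
<p>
Assuming it is assembled and linked like the other examples of this series, running it and then checking the exit code with <code>echo $?</code> should print 30, the value that was in the highest-numbered register of the set.
</p>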
<p>
When we push something onto the stack we actually decrease the stack pointer (because in Linux the stack grows downwards). More precisely, we first decrease the stack pointer by as many bytes as needed and then do the actual store at the stack pointer we have just computed. So the appropriate instruction when pushing onto the stack is <code>stmdb</code>. Conversely, when popping from the stack we will use <code>ldmia</code>: we increment the stack pointer after we have performed the load.
</p>
<h2>Factorial again</h2>
<p>
Before illustrating these two instructions, we will first slightly rewrite our factorial.
</p>
<p>
If you go back to the code of our factorial, there is a moment, when computing <code>n * factorial(n-1)</code>, where the initial value of <code>r0</code> is required. The value of <code>n</code> was in <code>r0</code> at the beginning of the function, but <code>r0</code> can be freely modified by called functions. We chose, in the example above, to keep a copy of <code>r0</code> in the stack in line 12. Later, in line 24, we loaded it from the stack in <code>r1</code>, just before computing the multiplication.
</p>
<p>
In our second version of factorial, we will keep a copy of the initial value of <code>r0</code> in <code>r4</code>. But <code>r4</code> is a register whose value must be restored upon leaving a function. So we will save the value that <code>r4</code> has at the entry of the function in the stack, and at the end we will restore it from the stack. This way we can use <code>r4</code> without breaking the rules of <em>well-behaved functions</em>.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">factorial<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span>#<span style="color: #339933;">-</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> lr onto the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> r4<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span>#<span style="color: #339933;">-</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> r4 onto the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
                       <span style="color: #339933;">/*</span> The <span style="color: #0000ff; font-weight: bold;">stack</span> is now <span style="color: #ff0000;">8</span> <span style="color: #0000ff; font-weight: bold;">byte</span> aligned <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r4<span style="color: #339933;">,</span> r0         <span style="color: #339933;">/*</span> Keep a copy of the initial value of r0 <span style="color: #00007f; font-weight: bold;">in</span> r4 <span style="color: #339933;">*/</span>
&nbsp;
&nbsp;
    <span style="color: #00007f; font-weight: bold;">cmp</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>         <span style="color: #339933;">/*</span> compare r0 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    bne is_nonzero     <span style="color: #339933;">/*</span> if r0 != <span style="color: #ff0000;">0</span> then branch <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>         <span style="color: #339933;">/*</span> r0 ← <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is the return <span style="color: #339933;">*/</span>
    b end
is_nonzero<span style="color: #339933;">:</span>
                       <span style="color: #339933;">/*</span> Prepare the <span style="color: #00007f; font-weight: bold;">call</span> to factorial<span style="color: #009900; font-weight: bold;">&#40;</span>n<span style="color: #339933;">-</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">sub</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>     <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">-</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> factorial
                       <span style="color: #339933;">/*</span> After the <span style="color: #00007f; font-weight: bold;">call</span> r0 contains factorial<span style="color: #009900; font-weight: bold;">&#40;</span>n<span style="color: #339933;">-</span><span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
                       <span style="color: #339933;">/*</span> Load initial value of r0 <span style="color: #009900; font-weight: bold;">&#40;</span>that we kept <span style="color: #00007f; font-weight: bold;">in</span> r4<span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #00007f; font-weight: bold;">into</span> r1 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r4         <span style="color: #339933;">/*</span> r1 ← r4 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mul</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> r1     <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">*</span> r1 <span style="color: #339933;">*/</span>
&nbsp;
end<span style="color: #339933;">:</span>
    ldr r4<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Pop</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #00007f; font-weight: bold;">and</span> put it <span style="color: #00007f; font-weight: bold;">in</span> r4 <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Pop</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #00007f; font-weight: bold;">and</span> put it <span style="color: #00007f; font-weight: bold;">in</span> lr <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr              <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Leave</span> factorial <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
Note that the remainder of the program does not have to change. This is the cool thing about functions <img src='http://thinkingeek.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />
</p>
<p>
Ok, now pay attention to these two sequences in our new factorial version above.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>11
12
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span>#<span style="color: #339933;">-</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> lr onto the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> r4<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #339933;">,</span>#<span style="color: #339933;">-</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> r4 onto the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>30
31
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">    ldr r4<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Pop</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #00007f; font-weight: bold;">and</span> put it <span style="color: #00007f; font-weight: bold;">in</span> r4 <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #46aa03; font-weight: bold;">sp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>  <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Pop</span> the top of the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #00007f; font-weight: bold;">and</span> put it <span style="color: #00007f; font-weight: bold;">in</span> lr <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
Now, let&#8217;s replace them with <code>stmdb</code> and <code>ldmia</code> as explained a few paragraphs ago.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>11
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">    stmdb <span style="color: #46aa03; font-weight: bold;">sp</span>!<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Push</span> r4 <span style="color: #00007f; font-weight: bold;">and</span> lr onto the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>30
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">    ldmia <span style="color: #46aa03; font-weight: bold;">sp</span>!<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span>    <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Pop</span> lr <span style="color: #00007f; font-weight: bold;">and</span> r4 from the <span style="color: #0000ff; font-weight: bold;">stack</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
Note that the order in which we write the registers in the set is not relevant: the processor will always handle them in ascending order, so we should write them in ascending order too. GNU assembler will emit a warning otherwise. Since <code>lr</code> is actually <code>r14</code> it must go after <code>r4</code>. This means that our code is 100% equivalent to the previous one, since <code>r4</code> will end in a lower address than <code>lr</code>: remember that our stack grows toward lower addresses, thus <code>r4</code>, which is at the top of the stack in <code>factorial</code>, has the lowest address.
</p>
<p>
Remembering <code>stmdb sp!</code> and <code>ldmia sp!</code> may be a bit hard. Since these two instructions are relatively common when entering and leaving functions, GNU assembler provides two <em>mnemonics</em>, <code>push</code> and <code>pop</code>, for <code>stmdb sp!</code> and <code>ldmia sp!</code>, respectively. Note that these are not additional ARM instructions, just convenience names that are easier to remember.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>11
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">    <span style="color: #00007f; font-weight: bold;">push</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span></pre></td></tr></table></div>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>30
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">    <span style="color: #00007f; font-weight: bold;">pop</span> <span style="color: #009900; font-weight: bold;">&#123;</span>r4<span style="color: #339933;">,</span> lr<span style="color: #009900; font-weight: bold;">&#125;</span></pre></td></tr></table></div>
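<p>
For reference, this is how the whole function might look once both replacements are applied. It is simply the second version of <code>factorial</code> shown above with lines 11 to 12 and 30 to 31 swapped for <code>push</code> and <code>pop</code>; nothing else is meant to change.
</p>
<pre>
factorial:
    push {r4, lr}      /* Push r4 and lr onto the stack */
                       /* The stack is now 8 byte aligned */
    mov r4, r0         /* Keep a copy of the initial value of r0 in r4 */

    cmp r0, #0         /* compare r0 and 0 */
    bne is_nonzero     /* if r0 != 0 then branch */
    mov r0, #1         /* r0 ← 1. This is the return */
    b end
is_nonzero:
                       /* Prepare the call to factorial(n-1) */
    sub r0, r0, #1     /* r0 ← r0 - 1 */
    bl factorial
                       /* After the call r0 contains factorial(n-1) */
    mov r1, r4         /* r1 ← r4, the initial value of r0 */
    mul r0, r0, r1     /* r0 ← r0 * r1 */

end:
    pop {r4, lr}       /* Pop r4 and lr from the stack */
    bx lr              /* Leave factorial */
</pre>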

<p>
That&#8217;s all for today.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/02/07/arm-assembler-raspberry-pi-chapter-10/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ARM assembler in Raspberry Pi – Chapter 9</title>
		<link>http://thinkingeek.com/2013/02/02/arm-assembler-raspberry-pi-chapter-9/</link>
		<comments>http://thinkingeek.com/2013/02/02/arm-assembler-raspberry-pi-chapter-9/#comments</comments>
		<pubDate>Sat, 02 Feb 2013 19:14:13 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>
		<category><![CDATA[arm]]></category>
		<category><![CDATA[assembler]]></category>
		<category><![CDATA[function]]></category>
		<category><![CDATA[function call]]></category>
		<category><![CDATA[functions]]></category>
		<category><![CDATA[pi]]></category>
		<category><![CDATA[raspberry]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=622</guid>
		<description><![CDATA[In previous chapters we learnt the foundations of ARM assembler: registers, some arithmetic operations, loads and stores and branches. Now it is time to put everything together and add another level of abstraction to our assembler skills: functions. Why functions? Functions are a way to reuse code. If we have some code that will be [...]]]></description>
				<content:encoded><![CDATA[<p>
In previous chapters we learnt the foundations of ARM assembler: registers, some arithmetic operations, loads and stores and branches. Now it is time to put everything together and add another level of abstraction to our assembler skills: functions.
</p>
<p><span id="more-622"></span></p>
<h2>Why functions?</h2>
<p>
Functions are a way to reuse code. If we have some code that will be needed more than once, being able to reuse it is a Good Thing™. This way, we only have to ensure that the code being reused is correct. If we repeated the code we would have to verify that it is correct at every point. This clearly does not scale. Functions can also get parameters. This way not only do we reuse code, but we can use it in several ways by passing different parameters. All this magic, though, comes at some price. A function must be a <em>well-behaved</em> citizen.
</p>
<h2>Do&#8217;s and don&#8217;ts of a function</h2>
<p>
Assembler gives us a lot of power. But with a lot of power also comes a lot of responsibility. We can break lots of things in assembler, because we are at a very low level. A single error and nasty things may happen. In order to make all functions behave in the same way, there are <em>conventions</em> in every environment that dictate how a function must behave. Since we are on a Raspberry Pi running Linux we will use the <abbr title="Procedure Call Standard for ARM Architecture®">AAPCS</abbr> (chances are that other ARM operating systems like RISCOS or Windows RT follow it). You may find this document in the ARM documentation website but I will try to summarize it in this chapter.
</p>
<h3>New special named registers</h3>
<p>
When discussing branches we learnt that <code>r15</code> was also called <code>pc</code>, and we never called it <code>r15</code> again. Well, from now on let&#8217;s call <code>r14</code> <code>lr</code> and <code>r13</code> <code>sp</code>. <code>lr</code> stands for <em><strong>l</strong>ink <strong>r</strong>egister</em> and it holds the address of the instruction following the instruction that <em>called us</em> (we will see later what this means). <code>sp</code> stands for <em><strong>s</strong>tack <strong>p</strong>ointer</em>. The <em>stack</em> is an area of memory owned only by the current function; the <code>sp</code> register stores the top address of that stack. For now, let&#8217;s put the stack aside. We will get back to it in the next chapter.
</p>
<h3>Passing parameters</h3>
<p>
Functions can receive parameters. The first 4 parameters must be stored, sequentially, in the registers <code>r0</code>, <code>r1</code>, <code>r2</code> and <code>r3</code>. You may be wondering how to pass more than 4 parameters. We can, of course, but we need to use the stack, which we will discuss in the next chapter. Until then, we will only pass up to 4 parameters.
</p>
<h3><q>Well behaved</q> functions</h3>
<p>
A function must adhere, at least, to the following rules if we want it to be AAPCS compliant.
</p>
<ul>
<li>A function should not make any assumption on the contents of the <code>cpsr</code>. So, at the entry of a function, condition codes N, Z, C and V are unknown.
<li>A function can freely modify registers <code>r0</code>, <code>r1</code>, <code>r2</code> and <code>r3</code>.
<li>A function cannot assume anything on the contents of <code>r0</code>, <code>r1</code>, <code>r2</code> and <code>r3</code> unless they are playing the role of a parameter.
<li>A function can freely modify <code>lr</code> but the value upon entering the function will be needed when leaving the function (so such value must be kept somewhere).
<li>A function can modify all the remaining registers as long as their values are restored upon leaving the function. This includes <code>sp</code> and registers <code>r4</code> to <code>r11</code>.<br />
This means that, after calling a function, we have to assume that (only) registers <code>r0</code>, <code>r1</code>, <code>r2</code>, <code>r3</code> and <code>lr</code> have been overwritten.
</ul>
<h3>Calling a function</h3>
<p>
There are two ways to call a function. If the function is statically known (meaning we know exactly which function must be called) we will use <code>bl label</code>. That label must be a label defined in the <code>.text</code> section. This is called a direct (or immediate) call. We may do indirect calls by first storing the address of the function into a register and then using <code>blx Rsource1</code>.
</p>
<p>
In both cases the behaviour is as follows: the address of the function (immediately encoded in the <code>bl</code> or using the value of the register in <code>blx</code>) is stored in <code>pc</code>. The address of the instruction following the <code>bl</code> or <code>blx</code> instruction is kept in <code>lr</code>.
</p>
<h3>Leaving a function</h3>
<p>
A well behaved function, as stated above, will have to keep the initial value of <code>lr</code> somewhere. When leaving the function, we will retrieve that value and put it in some register (it can be <code>lr</code> again but this is not mandatory). Then we will <code>bx Rsource1</code> (we could use <code>blx</code> as well but the latter would update <code>lr</code> which is useless here).
</p>
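<p>
As a minimal sketch of the two kinds of call and of the return sequence described above, the following program calls the same function twice, first directly with <code>bl</code> and then indirectly with <code>blx</code>. The file name <code>call01.s</code>, the function <code>answer</code> and the label <code>address_of_answer</code> are made up for this illustration. Assuming it is assembled and linked like the other examples, the error code of the program should be the value that <code>answer</code> leaves in <code>r0</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* -- call01.s (illustrative sketch) */
.data

.balign 4
return: .word 0                  /* room to keep the lr of main */

.text

/* answer: hypothetical function that returns 42 in r0 */
answer:
    mov r0, #42                  /* r0 ← 42 */
    bx lr                        /* return to the caller */

.global main
main:
    ldr r1, address_of_return    /* r1 ← &amp;return */
    str lr, [r1]                 /* *r1 ← lr */

    bl answer                    /* direct call */
                                 /* lr ← address of next instruction */

    ldr r2, address_of_answer    /* r2 ← &amp;answer */
    blx r2                       /* indirect call through r2 */

    ldr lr, address_of_return    /* lr ← &amp;return */
    ldr lr, [lr]                 /* lr ← *lr */
    bx lr                        /* return from main, r0 is 42 */
address_of_return: .word return
address_of_answer: .word answer</pre></td></tr></table></div>

<p>
Note that <code>answer</code> itself does not call anything, so it can simply leave with <code>bx lr</code>, while <code>main</code> has to keep its own <code>lr</code> because both <code>bl</code> and <code>blx</code> overwrite it.
</p>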
<h3>Returning data from functions</h3>
<p>
Functions must use <code>r0</code> for data that fits in 32 bits (or less). That is, C types <code>char</code>, <code>short</code>, <code>int</code>, <code>long</code> (and <code>float</code> though we have not seen floating point yet) will be returned in <code>r0</code>. Basic types of 64 bits, like C types <code>long long</code> and <code>double</code>, will be returned in <code>r1</code> and <code>r0</code>. Any other data is returned through the stack unless it is 32 bits or less, in which case it will be returned in <code>r0</code>.
</p>
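<p>
To illustrate the 64 bit case, here is a sketch of a function that returns a 64 bit value in two registers. The file name <code>wide01.s</code> and the function <code>add_wide</code> are hypothetical. The sketch assumes the usual little-endian configuration of the Raspberry Pi, where <code>r0</code> holds the lower 32 bits and <code>r1</code> the upper 32 bits. It relies on <code>adds</code>, which updates the carry flag, and <code>adc</code>, which adds that carry.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* -- wide01.s (illustrative sketch) */
.data

.balign 4
return: .word 0                  /* room to keep the lr of main */

.text

/* add_wide: hypothetical function
   r0, r1: two unsigned 32 bit numbers
   returns their 64 bit sum, lower half in r0, upper half in r1 */
add_wide:
    adds r0, r0, r1              /* r0 ← r0 + r1, updating the carry flag */
    mov r1, #0                   /* r1 ← 0, flags are not modified */
    adc r1, r1, #0               /* r1 ← r1 + 0 + carry */
    bx lr                        /* return, lr was not modified */

.global main
main:
    ldr r1, address_of_return    /* r1 ← &amp;return */
    str lr, [r1]                 /* *r1 ← lr */

    mvn r0, #0                   /* r0 ← 0xFFFFFFFF */
    mov r1, #1                   /* r1 ← 1 */
    bl add_wide                  /* r0 ← 0 (lower half), r1 ← 1 (upper half) */

    mov r0, r1                   /* r0 ← upper half, used as error code */

    ldr lr, address_of_return    /* lr ← &amp;return */
    ldr lr, [lr]                 /* lr ← *lr */
    bx lr                        /* return from main */
address_of_return: .word return</pre></td></tr></table></div>

<p>
The sum of 0xFFFFFFFF and 1 does not fit in 32 bits, so the lower half ends up being 0 and the upper half 1, which is what <code>main</code> returns as its error code.
</p>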
<p>
In the examples in previous chapters we returned the error code of the program in <code>r0</code>. This now makes sense. C&#8217;s <code>main</code> returns an <code>int</code>, which is used as the value of the error code of our program.
</p>
<h2>Hello world</h2>
<p>
Usually this is the first program you write in any high level programming language. In our case we had to learn lots of things first. Anyway, here it is. A &#8220;Hello world&#8221; in ARM assembler.
</p>
<p>
(Note to experts: since we will not discuss the stack until the next chapter, this code may look very dumb to you)
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> hello01<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.data</span>
&nbsp;
greeting<span style="color: #339933;">:</span>
 <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;Hello world&quot;</span>
&nbsp;
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
return<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">0</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    ldr r1<span style="color: #339933;">,</span> address_of_return     <span style="color: #339933;">/*</span>   r1 ← &amp;return <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>                  <span style="color: #339933;">/*</span>   <span style="color: #339933;">*</span>r1 ← lr <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_greeting   <span style="color: #339933;">/*</span> r0 ← &amp;greeting <span style="color: #339933;">*/</span>
                                  <span style="color: #339933;">/*</span> First parameter of puts <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #46aa03; font-weight: bold;">bl</span> puts                       <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">Call</span> to puts <span style="color: #339933;">*/</span>
                                  <span style="color: #339933;">/*</span> lr ← address of next instruction <span style="color: #339933;">*/</span>
&nbsp;
    ldr r1<span style="color: #339933;">,</span> address_of_return     <span style="color: #339933;">/*</span> r1 ← &amp;return <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>                  <span style="color: #339933;">/*</span> lr ← <span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr                         <span style="color: #339933;">/*</span> return from main <span style="color: #339933;">*/</span>
address_of_greeting<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> greeting
address_of_return<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> return
&nbsp;
<span style="color: #339933;">/*</span> External <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> puts</pre></td></tr></table></div>

<p>
We are going to call the <code>puts</code> function. This function is defined in the C library and has the following prototype: <code>int puts(const char*)</code>. It receives, as a first parameter, the address of a C-string (that is, a sequence of bytes where no byte but the last is zero). When executed it outputs that string to <code>stdout</code> (so it should appear by default on our terminal). Finally it returns a non-negative value on success; with the C library used here, that value is the number of bytes written.
</p>
<p>
We start by defining, in the <code>.data</code> section, the label <code>greeting</code> in lines 4 and 5. This label stands for the address of our greeting message. GNU as provides a convenient <code>.asciz</code> directive for that purpose. This directive emits as many bytes as needed to represent the string plus the final zero byte. We could have used another directive, <code>.ascii</code>, as long as we explicitly added the final zero byte.
</p>
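<p>
As a small sketch of that equivalence, the two definitions below should emit the same 12 bytes: the 11 characters of the message plus the final zero byte. The labels <code>greeting_z</code> and <code>greeting_a</code> are hypothetical names used only here, and the zero byte is written as the octal escape <code>\000</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/* Illustrative sketch: two ways of defining the same C-string */
.data

greeting_z:
 .asciz "Hello world"        /* the directive adds the final zero byte */

greeting_a:
 .ascii "Hello world\000"    /* the final zero byte is written explicitly */</pre></td></tr></table></div>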
<p>
After the bytes of the greeting message, we make sure the next label will be aligned to 4 bytes and we define a <code>return</code> label in line 8. In that label we will keep the value of <code>lr</code> that we have in <code>main</code>. As stated above, this is a requirement for a well behaved function: being able to get back the value that <code>lr</code> had upon entering. So we make some room for it.
</p>
<p>
The first two instructions, lines 14 and 15, of our main function keep the value of <code>lr</code> in that <code>return</code> variable defined above. Then in line 17 we prepare the arguments for the call to <code>puts</code>. We load the address of the greeting message into the <code>r0</code> register. This register will hold the first (the only one actually) parameter of <code>puts</code>. Then in line 20 we call the function. Recall that <code>bl</code> will set in <code>lr</code> the address of the instruction following it (this is the instruction in line 23). This is the reason why we copied the value of <code>lr</code> in a variable at the beginning of the <code>main</code> function, because it was going to be overwritten by <code>bl</code>.
</p>
<p>
Ok, <code>puts</code> runs and the message is printed on the <code>stdout</code>. Time to get the initial value of <code>lr</code> so we can return successfully from main. Then we return.
</p>
<p>
Is our <code>main</code> function well behaved? Yes, it keeps and gets back <code>lr</code> to leave. It only modifies <code>r0</code> and <code>r1</code>. We can assume that <code>puts</code> is well behaved as well, so everything should work fine. As a bonus, since <code>main</code> does not change <code>r0</code> after the call, the error code of the program shows how many bytes have been written to the output.</p>
<p>
<div class="wp_syntax"><table><tr><td class="code"><pre class="bash" style="font-family:monospace;">$ .<span style="color: #000000; font-weight: bold;">/</span>hello01 
Hello world
$ <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #007800;">$?</span>
<span style="color: #000000;">12</span></pre></td></tr></table></div>

<p>
Note that &#8220;Hello world&#8221; is just 11 bytes (the final zero is not counted as it just plays the role of a finishing byte) but the program returns 12. This is because <code>puts</code> always adds a newline byte, which accounts for that extra byte.
</p>
<h2>Real interaction!</h2>
<p>
Now that we can call functions, we can glue them together. Let&#8217;s call <code>printf</code> and <code>scanf</code> to read a number and then print it back to the standard output.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> printf01<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.data</span>
&nbsp;
<span style="color: #339933;">/*</span> First message <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
message1<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;Hey, type a number: &quot;</span>
&nbsp;
<span style="color: #339933;">/*</span> Second message <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
message2<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;I read the number %d\n&quot;</span>
&nbsp;
<span style="color: #339933;">/*</span> Format pattern for scanf <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
scan_pattern <span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;%d&quot;</span>
&nbsp;
<span style="color: #339933;">/*</span> Where scanf will store the number read <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
number_read<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">0</span>
&nbsp;
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
return<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">0</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    ldr r1<span style="color: #339933;">,</span> address_of_return        <span style="color: #339933;">/*</span> r1 ← &amp;return <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r1 ← lr <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_message1      <span style="color: #339933;">/*</span> r0 ← &amp;message1 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf                        <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> to printf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_scan_pattern  <span style="color: #339933;">/*</span> r0 ← &amp;scan_pattern <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> address_of_number_read   <span style="color: #339933;">/*</span> r1 ← &amp;number_read <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> scanf                         <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> to scanf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_message2      <span style="color: #339933;">/*</span> r0 ← &amp;message2 <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> address_of_number_read   <span style="color: #339933;">/*</span> r1 ← &amp;number_read <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> r1 ← <span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf                        <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> to printf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_number_read   <span style="color: #339933;">/*</span> r0 ← &amp;number_read <span style="color: #339933;">*/</span>
    ldr r0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r0<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> r0 ← <span style="color: #339933;">*</span>r0 <span style="color: #339933;">*/</span>
&nbsp;
    ldr lr<span style="color: #339933;">,</span> address_of_return        <span style="color: #339933;">/*</span> lr ← &amp;return <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>lr<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> lr ← <span style="color: #339933;">*</span>lr <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr                            <span style="color: #339933;">/*</span> return from main using lr <span style="color: #339933;">*/</span>
address_of_message1 <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message1
address_of_message2 <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message2
address_of_scan_pattern <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> scan_pattern
address_of_number_read <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> number_read
address_of_return <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> return
&nbsp;
<span style="color: #339933;">/*</span> External <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> printf
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> scanf</pre></td></tr></table></div>

<p>
In this example we will ask the user to type a number and then we will print it back. We also return the number in the error code, so we can check twice if everything goes as expected. For the error code check, make sure your number is not greater than 255 (otherwise the error code will show only its lower 8 bits).
</p>
<pre escaped="1">
$ ./printf01 
Hey, type a number: <span style="color: blue;">123↴</span>
I read the number 123
$ ./printf01 ; echo $?
Hey, type a number: <span style="color: blue;">124↴</span>
I read the number 124
124
</pre>
<h2>Our first function</h2>
<p>
Let&#8217;s define our first function. Let&#8217;s extend the previous example but multiply the number by 5.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
return2<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">0</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
<span style="color: #339933;">/*</span>
mult_by_5 function
<span style="color: #339933;">*/</span>
mult_by_5<span style="color: #339933;">:</span> 
    ldr r1<span style="color: #339933;">,</span> address_of_return2       <span style="color: #339933;">/*</span> r1 ← &amp;return2 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r1 ← lr <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">add</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>           <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>r0 <span style="color: #339933;">*/</span>
&nbsp;
    ldr lr<span style="color: #339933;">,</span> address_of_return2       <span style="color: #339933;">/*</span> lr ← &amp;return2 <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>lr<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> lr ← <span style="color: #339933;">*</span>lr <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr                            <span style="color: #339933;">/*</span> return from mult_by_5 using lr <span style="color: #339933;">*/</span>
address_of_return2 <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> return2</pre></td></tr></table></div>

<p>
This function will need another &#8220;<code>return</code>&#8221; variable like the one <code>main</code> uses. But this is for the sake of the example. Actually this function does not call another function. In that case it does not need to keep <code>lr</code>, as no <code>bl</code> or <code>blx</code> instruction is going to modify it. If the function wanted to use <code>lr</code> as the <code>r14</code> general purpose register, the process of keeping the value would still be mandatory.
</p>
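<p>
Following that remark, a leaner sketch of the same function could skip the <code>return2</code> variable altogether. This is not the version used in the listings of this chapter, just an illustration of a function that calls no other function and does not touch <code>lr</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">/*
mult_by_5 function, sketch of a version that does not keep lr.
It calls no other function, so lr still holds the return
address when we reach bx.
*/
mult_by_5:
    add r0, r0, r0, LSL #2           /* r0 ← r0 + 4*r0 */
    bx lr                            /* return to the caller */</pre></td></tr></table></div>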
<p>
As you can see, once the function has computed the value, it is enough to leave it in <code>r0</code>. In this case it was pretty easy and a single instruction was enough.
</p>
<p>
The whole example follows.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> printf02<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.data</span>
&nbsp;
<span style="color: #339933;">/*</span> First message <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
message1<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;Hey, type a number: &quot;</span>
&nbsp;
<span style="color: #339933;">/*</span> Second message <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
message2<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;%d times 5 is %d\n&quot;</span>
&nbsp;
<span style="color: #339933;">/*</span> Format pattern for scanf <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
scan_pattern <span style="color: #339933;">:</span> <span style="color: #339933;">.</span>asciz <span style="color: #7f007f;">&quot;%d&quot;</span>
&nbsp;
<span style="color: #339933;">/*</span> Where scanf will store the number read <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
number_read<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">0</span>
&nbsp;
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
return<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">0</span>
&nbsp;
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
return2<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> <span style="color: #ff0000;">0</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
<span style="color: #339933;">/*</span>
mult_by_5 function
<span style="color: #339933;">*/</span>
mult_by_5<span style="color: #339933;">:</span> 
    ldr r1<span style="color: #339933;">,</span> address_of_return2       <span style="color: #339933;">/*</span> r1 ← &amp;return2 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r1 ← lr <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">add</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> r0<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>           <span style="color: #339933;">/*</span> r0 ← r0 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span><span style="color: #339933;">*</span>r0 <span style="color: #339933;">*/</span>
&nbsp;
    ldr lr<span style="color: #339933;">,</span> address_of_return2       <span style="color: #339933;">/*</span> lr ← &amp;return2 <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>lr<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> lr ← <span style="color: #339933;">*</span>lr <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr                            <span style="color: #339933;">/*</span> return from mult_by_5 using lr <span style="color: #339933;">*/</span>
address_of_return2 <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> return2
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    ldr r1<span style="color: #339933;">,</span> address_of_return        <span style="color: #339933;">/*</span> r1 ← &amp;return <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r1 ← lr <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_message1      <span style="color: #339933;">/*</span> r0 ← &amp;message1 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf                        <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> to printf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_scan_pattern  <span style="color: #339933;">/*</span> r0 ← &amp;scan_pattern <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> address_of_number_read   <span style="color: #339933;">/*</span> r1 ← &amp;number_read <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> scanf                         <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> to scanf <span style="color: #339933;">*/</span>
&nbsp;
    ldr r0<span style="color: #339933;">,</span> address_of_number_read   <span style="color: #339933;">/*</span> r0 ← &amp;number_read <span style="color: #339933;">*/</span>
    ldr r0<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r0<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> r0 ← <span style="color: #339933;">*</span>r0 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> mult_by_5
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> r0                       <span style="color: #339933;">/*</span> r2 ← r0 <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> address_of_number_read   <span style="color: #339933;">/*</span> r1 ← &amp;number_read <span style="color: #339933;">*/</span>
    ldr r1<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> r1 ← <span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
    ldr r0<span style="color: #339933;">,</span> address_of_message2      <span style="color: #339933;">/*</span> r0 ← &amp;message2 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bl</span> printf                        <span style="color: #339933;">/*</span> <span style="color: #00007f; font-weight: bold;">call</span> to printf <span style="color: #339933;">*/</span>
&nbsp;
    ldr lr<span style="color: #339933;">,</span> address_of_return        <span style="color: #339933;">/*</span> lr ← &amp;return <span style="color: #339933;">*/</span>
    ldr lr<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>lr<span style="color: #009900; font-weight: bold;">&#93;</span>                     <span style="color: #339933;">/*</span> lr ← <span style="color: #339933;">*</span>lr <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr                            <span style="color: #339933;">/*</span> return from main using lr <span style="color: #339933;">*/</span>
address_of_message1 <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message1
address_of_message2 <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> message2
address_of_scan_pattern <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> scan_pattern
address_of_number_read <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> number_read
address_of_return <span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> return
&nbsp;
<span style="color: #339933;">/*</span> External <span style="color: #339933;">*/</span>
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> printf
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> scanf</pre></td></tr></table></div>

<p>
I want you to notice lines 58 to 62. There we prepare the call to <code>printf</code>, which receives three parameters: the format and the two integers referenced in the format. We want the first integer to be the number entered by the user. The second one will be that same number multiplied by 5. After the call to <code>mult_by_5</code>, <code>r0</code> contains the number entered by the user multiplied by 5. We want it to be the third parameter so we move it to <code>r2</code>. Then we load the value of the number entered by the user into <code>r1</code>. Finally we load in <code>r0</code> the address of the format message of <code>printf</code>. Note that the order in which we prepare the arguments of a call is irrelevant as long as the values are correct at the point of the call. Here we know that we will have to overwrite <code>r0</code>, so we first copy <code>r0</code> to <code>r2</code>.
</p>
<pre escaped="1">
$ ./printf02
Hey, type a number: <span style="color: blue;">1234↴</span>
1234 times 5 is 6170
</pre>
<p>
That&#8217;s all for today.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/02/02/arm-assembler-raspberry-pi-chapter-9/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>ARM assembler in Raspberry Pi – Chapter 8</title>
		<link>http://thinkingeek.com/2013/01/27/arm-assembler-raspberry-pi-chapter-8/</link>
		<comments>http://thinkingeek.com/2013/01/27/arm-assembler-raspberry-pi-chapter-8/#comments</comments>
		<pubDate>Sun, 27 Jan 2013 21:29:21 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>
		<category><![CDATA[addresses]]></category>
		<category><![CDATA[arm]]></category>
		<category><![CDATA[assembler]]></category>
		<category><![CDATA[indexing modes]]></category>
		<category><![CDATA[pi]]></category>
		<category><![CDATA[postindex]]></category>
		<category><![CDATA[preindex]]></category>
		<category><![CDATA[raspberry]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=546</guid>
		<description><![CDATA[In the previous chapter we saw that the second operand of most arithmetic instructions can use a shift operator which allows us to shift and rotate bits. In this chapter we will continue learning the available indexing modes of ARM instructions. This time we will focus on load and store instructions. Arrays and structures So [...]]]></description>
				<content:encoded><![CDATA[<p>
In the previous chapter we saw that the second operand of most arithmetic instructions can use a <em>shift operator</em> which allows us to shift and rotate bits. In this chapter we will continue learning the available <em>indexing modes</em> of ARM instructions. This time we will focus on load and store instructions.
</p>
<p><span id="more-546"></span></p>
<h2>Arrays and structures</h2>
<p>
So far we have been able to move 32 bits from memory to registers (load) and back to memory (store). But working on single items of 32 bits (usually called scalars) is a bit limiting. Soon we would find ourselves working on arrays and structures, even if we did not know it.
</p>
<p>
An array is a sequence of items of the same kind in memory. Arrays are a foundational data structure in almost every low level language. Every array has a base address, usually denoted by the name of the array, and contains N items. Each of these items has an associated index, ranging from 0 to N-1 or from 1 to N depending on the language. Using the base address and the index we can access an item of the array. We mentioned in chapter 3 that memory could be viewed as an array of bytes. An array in memory is the same, but an item may take more than one single byte.
</p>
<p>
A structure (or record or tuple) is a sequence of items of possibly different kinds. Each item of a structure is usually called a field. Fields do not have an associated index but an offset with respect to the beginning of the structure. Structures are laid out in memory to ensure that the proper alignment is used in every field. The base address of a structure is the address of its first field. If the base address is aligned, the structure should be laid out in a way that all the fields are properly aligned as well.
</p>
<p>
What do arrays and structures have to do with <em>indexing modes</em> of load and store? Well, these indexing modes are designed to make accessing arrays and structs easier.</p>
<p><h2>Defining arrays and structs</h2>
<p>
To illustrate how to work with arrays and structures we will use the following C declarations and implement them in assembler.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">int</span> a<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">100</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
<span style="color: #993333;">struct</span> my_struct
<span style="color: #009900;">&#123;</span>
  <span style="color: #993333;">char</span> f0<span style="color: #339933;">;</span>
  <span style="color: #993333;">int</span> f1<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span> b<span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
Let&#8217;s first define in our assembler the array &#8216;a&#8217;. It is just 100 integers. An integer in ARM is 32 bits wide so in our assembler code we have to make room for 400 bytes (4 * 100).
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> array01<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.data</span>
&nbsp;
<span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
a<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>skip <span style="color: #ff0000;">400</span></pre></td></tr></table></div>

<p>
In line 5 we define the symbol <code>a</code> and then we make room for 400 bytes. The directive <code>.skip</code> tells the assembler to advance a given number of bytes before emitting the next datum. Here we are skipping 400 bytes because our array of integers takes 400 bytes (4 bytes for each of the 100 integers). Declaring a structure is not much different.</p>
<p>
<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>7
8
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">.</span>balign <span style="color: #ff0000;">4</span>
b<span style="color: #339933;">:</span> <span style="color: #339933;">.</span>skip <span style="color: #ff0000;">8</span></pre></td></tr></table></div>

<p>
You may be wondering why we skipped 8 bytes when the structure itself needs just 5 bytes to store useful information. The first field <code>f0</code> is a <code>char</code>. A <code>char</code> takes 1 byte of storage. The next field <code>f1</code> is an integer. An integer takes 4 bytes and it must be aligned at 4 bytes as well, so we have to leave 3 unused bytes between the field <code>f0</code> and the field <code>f1</code>. This unused storage, added just to fulfill the alignment, is called <em>padding</em>. Your program should never use padding to store data.
</p>
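<p>
If you want to double check this layout, a C compiler can report it for us. The following is just a sketch: the helper names (<code>offset_of_f1</code>, <code>padding_after_f0</code>, <code>size_of_my_struct</code>) are made up for this illustration, and the numbers in the comments assume an ABI where <code>int</code> takes 4 bytes and is aligned at 4 bytes, as is the case in the Raspberry Pi.
</p>

```c
#include <stddef.h>   /* offsetof, size_t */

struct my_struct
{
  char f0;
  int f1;
};

/* Offset in bytes of f1 from the base address of the structure.
   Expected to be 4 under the assumptions stated above. */
size_t offset_of_f1(void)
{
  return offsetof(struct my_struct, f1);
}

/* Bytes of padding left between the end of f0 and the start of f1.
   Expected to be 3 under the assumptions stated above. */
size_t padding_after_f0(void)
{
  return offsetof(struct my_struct, f1)
         - (offsetof(struct my_struct, f0) + sizeof(char));
}

/* Total storage of the structure, padding included.
   Expected to be 8, which is what we reserved with .skip 8. */
size_t size_of_my_struct(void)
{
  return sizeof(struct my_struct);
}
```

<p>
Under those assumptions the 1 byte of <code>f0</code>, the 3 bytes of padding and the 4 bytes of <code>f1</code> add up to the 8 bytes we skipped for <code>b</code>.
</p>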
<h2>Naive approach without indexing modes</h2>
<p>
Ok, let&#8217;s write some code to initialize every item of the array <code>a[i]</code>. We will do something equivalent to the following C code.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> <span style="color: #0000dd;">100</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
  a<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> i<span style="color: #339933;">;</span></pre></td></tr></table></div>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #0000ff; font-weight: bold;">.text</span>
&nbsp;
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    ldr r1<span style="color: #339933;">,</span> addr_of_a       <span style="color: #339933;">/*</span> r1 ← &amp;a <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>              <span style="color: #339933;">/*</span> r2 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">100</span>            <span style="color: #339933;">/*</span> Have we reached <span style="color: #ff0000;">100</span> yet? <span style="color: #339933;">*/</span>
    beq end                 <span style="color: #339933;">/*</span> If so<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">leave</span> the <span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">,</span> otherwise continue <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r3<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r3 ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r2<span style="color: #339933;">*</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r3<span style="color: #009900; font-weight: bold;">&#93;</span>            <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r3 ← r2 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>          <span style="color: #339933;">/*</span> r2 ← r2 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    b <span style="color: #00007f; font-weight: bold;">loop</span>                  <span style="color: #339933;">/*</span> Go to the beginning of the <span style="color: #00007f; font-weight: bold;">loop</span> <span style="color: #339933;">*/</span>
end<span style="color: #339933;">:</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr
addr_of_a<span style="color: #339933;">:</span> <span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">word</span> a</pre></td></tr></table></div>

<p>
Whew! We are using lots of things we have learnt from earlier chapters. In line 14 we load the base address of the array into <code>r1</code>. The address of the array will not change so we load it once. In register <code>r2</code> we will keep the index that will range from 0 to 99. In line 17 we compare it to 100 to see if we have reached the end of the loop.
</p>
<p>
Line 19 is an important one. Here we compute the address of the item. We have in <code>r1</code> the base address and we know each item is 4 bytes wide. We also know that <code>r2</code> keeps the index of the loop, which we will use to access the array element. Given an item with index <code>i</code> its address must be <code>&#038;a + 4*i</code>, since there are 4 bytes between every element of this array. So <code>r3</code> has the address of the current element in this step of the loop. In line 20 we store <code>r2</code>, that is <code>i</code>, into the memory pointed by <code>r3</code>, the <code>i</code>-th array item, that is <code>a[i]</code>.
</p>
<p>
Then we proceed to increase <code>r2</code> and jump back for the next step of the loop.
</p>
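<p>
The address computation of line 19 can be mirrored in C to convince ourselves that <code>base + 4*i</code> is really where the compiler puts <code>a[i]</code>. This is only a sketch: <code>item_address</code> and <code>matches_compiler</code> are names invented for this illustration, and it assumes a flat address space where a pointer can be converted to an integer and compared.
</p>

```c
#include <stdint.h>   /* int32_t, uint32_t, uintptr_t */

int32_t a[100];

/* Address of the item with the given index in an array of 4-byte items.
   Mirrors: add r3, r1, r2, LSL #2 */
uintptr_t item_address(uintptr_t base, uint32_t index)
{
  return base + ((uintptr_t)index << 2);   /* base + index*4 */
}

/* Returns 1 if our computation agrees with the address of a[i]
   computed by the compiler, 0 otherwise. */
int matches_compiler(uint32_t i)
{
  return item_address((uintptr_t)a, i) == (uintptr_t)&a[i];
}
```

<p>
For instance, with a base address of 1000 the item with index 3 lives at 1000 + 3*4 = 1012.
</p>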
<p>
As you can see, accessing an array involves calculating the address of the accessed item. Does the ARM instruction set provide a more compact way to do this? The answer is yes. In fact it provides several <em>indexing modes</em>.
</p>
<h2>Indexing modes</h2>
<p>
In the previous chapter the concept of <em>indexing mode</em> was a bit off because we were not indexing anything. Now it makes much more sense since we are indexing an array item. ARM provides <strong>nine</strong> of these indexing modes. I will distinguish two kinds of indexing modes, non updating and updating, depending on whether they feature a side-effect that we will discuss later, when dealing with updating indexing modes.
</p>
<h3>Non updating indexing modes</h3>
<ol>
<li value="1"> <code>[Rsource1, #+immediate]</code> or <code>[Rsource1, #-immediate]</code>
<p>
It just adds (or subtracts) the immediate value to form the address. This is very useful to access array items whose index is a constant in the code, or fields of a structure, since their offset is always constant. In <code>Rsource1</code> we put the base address and in <code>immediate</code> the offset we want in bytes. The immediate cannot be larger than 12 bits (0..4095). When the immediate is <code>#0</code> it is like the usual <code>[Rsource1]</code> we have been using.
</p>
<p>
For example, we can set <code>a[3]</code> to 3 this way (we assume that <code>r1</code> already contains the base address of <code>a</code>). Note that the offset is in bytes, thus we need an offset of 12 (4 bytes * 3 items skipped).
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">3</span>          <span style="color: #339933;">/*</span> r2 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">12</span><span style="color: #009900; font-weight: bold;">&#93;</span>  <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">12</span><span style="color: #009900; font-weight: bold;">&#41;</span> ← r2 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<li><code>[Rsource1, +Rsource2]</code> or <code>[Rsource1, -Rsource2]</code>
<p>
This is like the previous one, but the added (or subtracted) offset is the value in a register. This is useful when the offset is too big for the immediate. Note that for the <code>+Rsource2</code> case, the two registers can be swapped (as this would not affect the address computed).
</p>
<p>
Example. The same as above but using a register this time.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">3</span>         <span style="color: #339933;">/*</span> r2 ← <span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">12</span>        <span style="color: #339933;">/*</span> r3 ← <span style="color: #ff0000;">12</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #339933;">,+</span>r3<span style="color: #009900; font-weight: bold;">&#93;</span>   <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1 <span style="color: #339933;">+</span> r3<span style="color: #009900; font-weight: bold;">&#41;</span> ← r2 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<li><code>[Rsource1, +Rsource2, shift_operation #immediate]</code> or <code>[Rsource1, -Rsource2, shift_operation #immediate]</code>.
<p>
This one is similar to the usual shift operation we can do with other instructions. A shift operation (remember: <code>LSL</code>, <code>LSR</code>, <code>ASR</code> or <code>ROR</code>) is applied to <code>Rsource2</code>, and the result is then added to (or subtracted from) <code>Rsource1</code>. This is useful when we need to multiply the index by some fixed amount. When accessing the items of the integer array <code>a</code> we had to multiply the index by 4 to get a meaningful address.
</p>
<p>
For this example, let&#8217;s first recall how we computed above the address of a single item in the array.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>19
20
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">add</span> r3<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r3 ← r1 <span style="color: #339933;">+</span> r2<span style="color: #339933;">*</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r3<span style="color: #009900; font-weight: bold;">&#93;</span>            <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r3 ← r2 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
We can express this in a much more compact way (without the need of the register <code>r3</code>).
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #339933;">,</span> <span style="color: #339933;">+</span>r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#93;</span>  <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1 <span style="color: #339933;">+</span> r2<span style="color: #339933;">*</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> ← r2 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

</ol>
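<p>
To summarize the three non updating indexing modes, here is a rough C model of what each store does. Treat it as a sketch: the function names are invented for this illustration, memory is modelled as plain bytes, and only the <code>+</code> variants with <code>LSL #2</code> are shown. In all three cases <code>r1</code> is left untouched.
</p>

```c
#include <stdint.h>   /* int32_t, uint32_t */
#include <string.h>   /* memcpy */

/* Stores a 32-bit value at the given byte address. */
static void store_word(unsigned char *address, int32_t value)
{
  memcpy(address, &value, sizeof value);
}

/* Models: str value, [r1, #+12] */
void store_immediate(unsigned char *r1, int32_t value)
{
  store_word(r1 + 12, value);
}

/* Models: str value, [r1, +r3] */
void store_register(unsigned char *r1, uint32_t r3, int32_t value)
{
  store_word(r1 + r3, value);
}

/* Models: str value, [r1, +r2, LSL #2] */
void store_shifted(unsigned char *r1, uint32_t r2, int32_t value)
{
  store_word(r1 + (r2 << 2), value);
}
```

<p>
If <code>r1</code> holds the base address of an array of integers, <code>store_immediate</code> writes the item at byte offset 12 (the one with index 3), <code>store_register</code> with <code>r3</code> equal to 16 writes the item with index 4, and <code>store_shifted</code> with <code>r2</code> equal to 5 writes the item with index 5.
</p>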
<h3>Updating indexing modes</h3>
<p>
In these indexing modes the <code>Rsource1</code> register is updated with the address synthesized by the load or store instruction. You may be wondering why one would want to do this. A bit of a detour first. Recheck the code that initializes the array. Why do we have to keep around the base address of the array if we are always effectively moving 4 bytes away from the previous item? Would it not make much more sense to keep the address of the current item? So instead of
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>19
20
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">add</span> r3<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>  <span style="color: #339933;">/*</span> r3 ← r1 <span style="color: #339933;">+</span> r2<span style="color: #339933;">*</span><span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r3<span style="color: #009900; font-weight: bold;">&#93;</span>            <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r3 ← r2 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
we might want to do something like
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>        <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r1 ← r2 <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>      <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
because there is no need to compute the address of the next item from scratch every time (as we are accessing them sequentially). Even if this looks slightly better, it can still be improved a bit more. What if our instruction were able to update <code>r1</code> for us? Something like this (obviously the exact syntax is not as shown)
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> Wrong syntax <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #7f007f;">&quot;and then&quot;</span> <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span></pre></td></tr></table></div>

<p>
Such indexing modes exist. There are two kinds of updating indexing modes depending on when <code>Rsource1</code> is updated. If <code>Rsource1</code> is updated after the load or store itself (meaning that the address to load or store is the initial <code>Rsource1</code> value) this is a <em>post-indexing</em> accessing mode. If <code>Rsource1</code> is updated before the actual load or store (meaning that the address to load or store is the final value of <code>Rsource1</code>) this is a <em>pre-indexing</em> accessing mode. In all cases, at the end of the instruction <code>Rsource1</code> will have the value of the computation of the indexing mode. If this sounds a bit convoluted, just look at the example above: we first store using <code>r1</code> and then we do <code>r1 ← r1 + 4</code>. This is post-indexing: we first use the value of <code>r1</code> as the address where we store the value of <code>r2</code>. Then <code>r1</code> is updated with <code>r1 + 4</code>. Now consider another hypothetical syntax.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> Wrong syntax <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span></pre></td></tr></table></div>

<p>
This is pre-indexing: we first compute <code>r1 + 4</code> and use it as the address where we store the value of <code>r2</code>. At the end of the instruction <code>r1</code> has effectively been updated too, but the updated value has already been used as the address of the load or store.
</p>
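<p>
If you know C, the difference between post-indexing and pre-indexing is very similar to the difference between <code>*p++</code> and <code>*++p</code>. The sketch below uses two made-up functions, <code>store_post</code> and <code>store_pre</code>, which return the updated pointer so we can see what ends up in <code>r1</code>. Since the pointer is an <code>int32_t *</code>, advancing it by one item moves 4 bytes.
</p>

```c
#include <stdint.h>   /* int32_t */

/* Post-indexing: store at the current address, then r1 ← r1 + 4.
   Returns the updated r1. */
int32_t *store_post(int32_t *r1, int32_t r2)
{
  *r1++ = r2;
  return r1;
}

/* Pre-indexing: first r1 ← r1 + 4, then store at the updated address.
   Returns the updated r1. */
int32_t *store_pre(int32_t *r1, int32_t r2)
{
  *++r1 = r2;
  return r1;
}
```

<p>
Starting from the same address, both functions leave the same value in <code>r1</code>. What changes is the item that gets written: the one <code>r1</code> pointed to at the beginning (post-indexing) or the next one (pre-indexing).
</p>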
<h4>Post-indexing modes</h4>
<ol>
<li value="4"><code>[Rsource1], #+immediate</code> or <code>[Rsource1], #-immediate</code>
<p>
The value of <code>Rsource1</code> is used as the address for the load or store. Then <code>immediate</code> is added to (or subtracted from) <code>Rsource1</code>, which is updated with the result. Using this indexing mode we can rewrite the loop of our first example as follows:
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>16
17
18
19
20
21
22
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">100</span>            <span style="color: #339933;">/*</span> Have we reached <span style="color: #ff0000;">100</span> yet? <span style="color: #339933;">*/</span>
    beq end                 <span style="color: #339933;">/*</span> If so<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">leave</span> the <span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">,</span> otherwise continue <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span>       <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r1 ← r2 then r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>          <span style="color: #339933;">/*</span> r2 ← r2 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    b <span style="color: #00007f; font-weight: bold;">loop</span>                  <span style="color: #339933;">/*</span> Go to the beginning of the <span style="color: #00007f; font-weight: bold;">loop</span> <span style="color: #339933;">*/</span>
end<span style="color: #339933;">:</span></pre></td></tr></table></div>

<li><code>[Rsource1], +Rsource2</code> or <code>[Rsource1], -Rsource2</code>
<p>
Like the previous one but instead of an immediate, the value of <code>Rsource2</code> is used. As usual this can be used as a workaround when the offset is too big for the immediate value.
</p>
<li><code>[Rsource1], +Rsource2, shift_operation #immediate</code> or <code>[Rsource1], -Rsource2, shift_operation #immediate</code>
<p>
The value of <code>Rsource1</code> is used as the address for the load or store. Then a shift operation (<code>LSL</code>, <code>LSR</code>, <code>ASR</code> or <code>ROR</code>) is applied to <code>Rsource2</code>. The resulting value of that shift is added to (or subtracted from) <code>Rsource1</code>. <code>Rsource1</code> is finally updated with this last value.
</ol>
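<p>
The loop rewritten with the first post-indexing mode corresponds quite directly to a well-known C idiom. This is a sketch under the assumption that items are 32-bit integers. The function name <code>fill_with_index</code> and its parameter <code>n</code> are made up here, and the variables are named after the registers of the assembler version.
</p>

```c
#include <stdint.h>   /* int32_t */

/* Stores 0, 1, ..., n-1 in consecutive items starting at r1. */
void fill_with_index(int32_t *r1, int32_t n)
{
  int32_t r2;
  for (r2 = 0; r2 != n; r2++)   /* cmp r2, #100 / beq end / add r2, r2, #1 */
    *r1++ = r2;                 /* *r1 ← r2 then r1 ← r1 + 4 */
}
```

<p>
Calling <code>fill_with_index(a, 100)</code> on an array of 100 integers has the same effect as the loop above: at no point do we compute an address from the base and the index.
</p>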
<h4>Pre-indexing modes</h4>
<p>
Pre-indexing modes may look a bit weird at first but they are useful when the computed address is going to be reused soon. Instead of recomputing it we can reuse the updated <code>Rsource1</code>.<br />
Mind the <code>!</code> symbol in these indexing modes which distinguishes them from the non updating indexing modes. </p>
<ol>
<li value="7"><code>[Rsource1, #+immediate]!</code> or <code>[Rsource1, #-immediate]!</code>
<p>
It behaves like the similar non-updating indexing mode but <code>Rsource1</code> gets updated with the computed address. Imagine we want to compute <code>a[3] = a[3] + a[3]</code>. We could do this (we assume that <code>r1</code> already has the base address of the array).</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">ldr r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">12</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">12</span> then r2 ← <span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r2       <span style="color: #339933;">/*</span> r2 ← r2 <span style="color: #339933;">+</span> r2 <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>         <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r1 ← r2 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<li><code>[Rsource1, +Rsource2]!</code> or <code>[Rsource1, -Rsource2]!</code>
<p>
Similar to the previous one but using a register <code>Rsource2</code> instead of an immediate.</p>
<li><code>[Rsource1, +Rsource2, shift_operation #immediate]!</code> or <code>[Rsource1, -Rsource2, shift_operation #immediate]!</code>
<p>Like the non-updating equivalent but <code>Rsource1</code> will be updated with the address used for the load or store instruction.</p>
</ol>
<h2>Back to structures</h2>
<p>
All the examples in this chapter have used an array. Structures are a bit simpler because the offset of each field is always constant. Once we have the base address of the structure (the address of the first field), accessing a field is just an indexing mode with an offset (usually an immediate). Our current structure features, on purpose, a <code>char</code> as its first field <code>f0</code>. Currently we cannot work on scalars in memory whose size is different from 4 bytes, so we will postpone working on that first field for a future chapter.
</p>
<p>
For instance, imagine we wanted to increment the field <code>f1</code> like this.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;">b.<span style="color: #202020;">f1</span> <span style="color: #339933;">=</span> b.<span style="color: #202020;">f1</span> <span style="color: #339933;">+</span> <span style="color: #0000dd;">7</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
If <code>r1</code> contains the base address of our structure, accessing the field <code>f1</code> is pretty easy now that we know all the available indexing modes.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;">ldr r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #339933;">,</span> #<span style="color: #339933;">+</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>!  <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">4</span> then r2 ← <span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">7</span>     <span style="color: #339933;">/*</span> r2 ← r2 <span style="color: #339933;">+</span> <span style="color: #ff0000;">7</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">str</span> r2<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">&#91;</span>r1<span style="color: #009900; font-weight: bold;">&#93;</span>       <span style="color: #339933;">/*</span> <span style="color: #339933;">*</span>r1 ← r2 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
Note that we use a pre-indexing mode to keep in <code>r1</code> the address of the field <code>f1</code>. This way the subsequent store does not need to compute that address again.
</p>
<p>
That&#8217;s all for today.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/01/27/arm-assembler-raspberry-pi-chapter-8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ARM assembler in Raspberry Pi – Chapter 7</title>
		<link>http://thinkingeek.com/2013/01/26/arm-assembler-raspberry-pi-chapter-7/</link>
		<comments>http://thinkingeek.com/2013/01/26/arm-assembler-raspberry-pi-chapter-7/#comments</comments>
		<pubDate>Sat, 26 Jan 2013 18:24:12 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>
		<category><![CDATA[arm]]></category>
		<category><![CDATA[assembler]]></category>
		<category><![CDATA[indexing modes]]></category>
		<category><![CDATA[pi]]></category>
		<category><![CDATA[raspberry]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=513</guid>
		<description><![CDATA[The ARM architecture has long been targeted at embedded systems. Embedded systems usually end up in mass-manufactured products (dishwashers, mobile phones, TV sets, etc.). In this context margins are very tight, so a designer will always try to spare as many components as possible (a cent saved in hundreds of thousands or even [...]]]></description>
				<content:encoded><![CDATA[<p>
The ARM architecture has long been targeted at embedded systems. Embedded systems usually end up in mass-manufactured products (dishwashers, mobile phones, TV sets, etc.). In this context margins are very tight, so a designer will always try to spare as many components as possible (a cent saved in hundreds of thousands or even millions of appliances may pay off). One relatively expensive component is memory, although memory gets cheaper every day. In any case, in memory-constrained environments being able to save memory is good, and the ARM instruction set was designed with this goal in mind. It will take us several chapters to learn all of these techniques; today we will start with one feature usually named <em>shifted operand</em>.
</p>
<p><span id="more-513"></span></p>
<h2>Indexing modes</h2>
<p>
We have seen that, except for load (<code>ldr</code>), store (<code>str</code>) and branches (<code>b</code> and <code>bXX</code>), ARM instructions take as operands either registers or immediate values. We have also seen that the first operand is usually the destination register (<code>str</code> being a notable exception, as there it plays the role of source because the destination is memory). The instruction <code>mov</code> has one more operand, a register or an immediate value. Arithmetic instructions like <code>add</code> and <code>and</code> (and many others) have two more source operands: the first is always a register and the second can be a register or an immediate value.
</p>
<p>
These sets of allowed operands in instructions are collectively called <em>indexing modes</em>. Today this concept will look a bit off since we will not index anything. The name <em>indexing</em> makes sense for memory operands, but ARM instructions, except load and store, do not have memory operands. This is the nomenclature you will find in ARM documentation, so it seems sensible to use it.
</p>
<p>
We can summarize the syntax of most ARM instructions in the following pattern:
</p>
<pre>
instruction Rdest, Rsource1, source2
</pre>
<p>
There are some exceptions, mainly move (<code>mov</code>), branches, loads and stores. In fact, move is not so different.
</p>
<pre>
mov Rdest, source2
</pre>
<p>Both <code>Rdest</code> and <code>Rsource1</code> must be registers. In the next section we will talk about <code>source2</code>.</p>
<p>
We will discuss the indexing modes of load and store instructions in a future chapter. Branches, on the other hand, are surprisingly simple and their single operand is just a label of our program, so there is little to discuss on indexing modes for branches.
</p>
<h2>Shifted operand</h2>
<p>
What is this mysterious <code>source2</code> in the instruction patterns above? If you recall the previous chapters, we have used registers or immediate values there. So <code>source2</code> is at least this: a register or an immediate value. Some examples follow, although we have already used them in previous chapters.
</p>
<pre>
mov r0, #1
mov r1, r0
add r2, r1, r0
add r2, r3, #4
</pre>
<p>
But <code>source2</code> can be much more than just a simple register or an immediate. In fact, when it is a register we can combine it with a <em>shift operation</em>. We already saw one of these shift operations in chapter 6. Now it is time to unveil all of them.
</p>
<ul>
<li><code>LSL #n</code><br />
<strong>L</strong>ogical <strong>S</strong>hift <strong>L</strong>eft. Shifts the bits <code>n</code> positions to the left. The <code>n</code> leftmost bits are lost and the <code>n</code> rightmost bits are set to zero.</li>
<li><code>LSL Rsource3</code><br />
Like the previous one, but instead of an immediate the lower byte of a register specifies the amount of shifting.</li>
<li><code>LSR #n</code><br />
<strong>L</strong>ogical <strong>S</strong>hift <strong>R</strong>ight. Shifts the bits <code>n</code> positions to the right. The <code>n</code> rightmost bits are lost and the <code>n</code> leftmost bits are set to zero.</li>
<li><code>LSR Rsource3</code><br />
Like the previous one, but instead of an immediate the lower byte of a register specifies the amount of shifting.</li>
<li><code>ASR #n</code><br />
<strong>A</strong>rithmetic <strong>S</strong>hift <strong>R</strong>ight. Like LSR, but the <code>n</code> leftmost bits are set to the value the leftmost bit had before shifting, instead of zero.</li>
<li><code>ASR Rsource3</code><br />
Like the previous one, but using the lower byte of a register instead of an immediate.</li>
<li><code>ROR #n</code><br />
<strong>Ro</strong>tate <strong>R</strong>ight. Like LSR, but the <code>n</code> rightmost bits are not lost but pushed onto the <code>n</code> leftmost bits.</li>
<li><code>ROR Rsource3</code><br />
Like the previous one, but using the lower byte of a register instead of an immediate.</li>
</ul>
<p>
In the listing above, <code>n</code> is an immediate from 1 to 31. These extra operations may be applied to the value in the second source register (to the value, not to the register itself), so we can perform some more operations in a single instruction. For instance, ARM does not have any dedicated shift right or left instruction. You just use the <code>mov</code> instruction.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span></pre></td></tr></table></div>

<p>
You may be wondering why one would want to shift the value of a register left or right. If you recall chapter 6, we saw that shifting a value left (<code>LSL</code>) by one position gives the same result as multiplying it by 2. Conversely, shifting it right (<code>ASR</code> if we use two&#8217;s complement, <code>LSR</code> otherwise) is the same as dividing it by 2. Since a shift of <code>n</code> is the same as doing <code>n</code> shifts of 1, shifts actually multiply or divide a value by 2<sup>n</sup>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span>      <span style="color: #339933;">/*</span> r1 ← <span style="color: #009900; font-weight: bold;">&#40;</span>r2<span style="color: #339933;">*</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>      <span style="color: #339933;">/*</span> r1 ← <span style="color: #009900; font-weight: bold;">&#40;</span>r2<span style="color: #339933;">*</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> ASR #<span style="color: #ff0000;">3</span>      <span style="color: #339933;">/*</span> r1 ← <span style="color: #009900; font-weight: bold;">&#40;</span>r3<span style="color: #339933;">/</span><span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
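<span style="color: #00007f; font-weight: bold;">mov</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">4</span>           <span style="color: #339933;">/*</span> r3 ← <span style="color: #ff0000;">4</span> <span style="color: #339933;">*/</span>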
<span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> r3      <span style="color: #339933;">/*</span> r1 ← <span style="color: #009900; font-weight: bold;">&#40;</span>r2<span style="color: #339933;">*</span><span style="color: #ff0000;">16</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
We can combine it with <code>add</code> to get some useful cases.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span>   <span style="color: #339933;">/*</span> r1 ← r2 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r2<span style="color: #339933;">*</span><span style="color: #ff0000;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span> equivalent to r1 ← r2<span style="color: #339933;">*</span><span style="color: #ff0000;">3</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">2</span>   <span style="color: #339933;">/*</span> r1 ← r2 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r2<span style="color: #339933;">*</span><span style="color: #ff0000;">4</span><span style="color: #009900; font-weight: bold;">&#41;</span> equivalent to r1 ← r2<span style="color: #339933;">*</span><span style="color: #ff0000;">5</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
You can do something similar with <code>sub</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #00007f; font-weight: bold;">sub</span> r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">3</span>  <span style="color: #339933;">/*</span> r1 ← r2 <span style="color: #339933;">-</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r2<span style="color: #339933;">*</span><span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#41;</span> equivalent to r1 ← r2<span style="color: #339933;">*</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #339933;">-</span><span style="color: #ff0000;">7</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
ARM comes with a handy <code>rsb</code> (<strong>R</strong>everse <strong>S</strong>u<strong>b</strong>tract) instruction which computes <code>Rdest ← source2 - Rsource1</code> (compare it to <code>sub</code> which computes <code>Rdest ← Rsource1 - source2</code>).
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">rsb r1<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">3</span>      <span style="color: #339933;">/*</span> r1 ← <span style="color: #009900; font-weight: bold;">&#40;</span>r2<span style="color: #339933;">*</span><span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">-</span> r2 equivalent to r1 ← r2<span style="color: #339933;">*</span><span style="color: #ff0000;">7</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
Another example, a bit more contrived.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> Complicated way to multiply the initial value of r1 by <span style="color: #ff0000;">42</span> = <span style="color: #ff0000;">7</span><span style="color: #339933;">*</span><span style="color: #ff0000;">3</span><span style="color: #339933;">*</span><span style="color: #ff0000;">2</span> <span style="color: #339933;">*/</span>
rsb r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">3</span>  <span style="color: #339933;">/*</span> r1 ← <span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #339933;">*</span><span style="color: #ff0000;">8</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">-</span> r1 equivalent to r1 ← <span style="color: #ff0000;">7</span><span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span>  <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #ff0000;">2</span><span style="color: #339933;">*</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span> equivalent to r1 ← <span style="color: #ff0000;">3</span><span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1          <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> r1     equivalent to r1 ← <span style="color: #ff0000;">2</span><span style="color: #339933;">*</span>r1 <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
You are probably wondering why we would want to use shifts to perform multiplications. Well, the generic multiplication instruction always works, but it is usually much harder for our ARM processor to compute, so it may take more time. Sometimes there is no other option, but for many small constant values a single instruction with a shifted operand may be more efficient.
</p>
<p>
Rotations are less useful than shifts in everyday use. They are usually used in cryptography, to reorder bits and &#8220;scramble&#8221; them. ARM does not provide a way to rotate left, but we can rotate left by <code>n</code> by rotating right by <code>32-n</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> Assume r1 is <span style="color: #ff0000;">0x12345678</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">ROR</span> #<span style="color: #ff0000;">1</span>   <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #00007f; font-weight: bold;">ror</span> <span style="color: #ff0000;">1</span><span style="color: #339933;">.</span> This is r1 ← <span style="color: #ff0000;">0x91a2b3c</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">ROR</span> #<span style="color: #ff0000;">31</span>  <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #00007f; font-weight: bold;">ror</span> <span style="color: #ff0000;">31</span><span style="color: #339933;">.</span> This is r1 ← <span style="color: #ff0000;">0x12345678</span> <span style="color: #339933;">*/</span></pre></td></tr></table></div>

<p>
That&#8217;s all for today.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/01/26/arm-assembler-raspberry-pi-chapter-7/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>ARM assembler in Raspberry Pi – Chapter 6</title>
		<link>http://thinkingeek.com/2013/01/20/arm-assembler-raspberry-pi-chapter-6/</link>
		<comments>http://thinkingeek.com/2013/01/20/arm-assembler-raspberry-pi-chapter-6/#comments</comments>
		<pubDate>Sun, 20 Jan 2013 22:14:38 +0000</pubDate>
		<dc:creator>rferrer</dc:creator>
				<category><![CDATA[Rapsberry Pi]]></category>
		<category><![CDATA[arm]]></category>
		<category><![CDATA[assembler]]></category>
		<category><![CDATA[control structures]]></category>
		<category><![CDATA[pi]]></category>
		<category><![CDATA[raspberry]]></category>

		<guid isPermaLink="false">http://thinkingeek.com/?p=479</guid>
		<description><![CDATA[Control structures In the previous chapter we learnt branch instructions. They are really powerful tools because they allow us to express control structures. Structured programming is an important milestone in better computing engineering (a foundational one, but nonetheless an important one). So being able to map usual structured programming constructs in assembler, in our processor, [...]]]></description>
				<content:encoded><![CDATA[<h2>Control structures</h2>
<p>
In the previous chapter we learnt about branch instructions. They are really powerful tools because they allow us to express control structures. <em>Structured programming</em> is an important milestone in better computing engineering (a foundational one, but nonetheless an important one). So being able to map the usual structured programming constructs to assembler, in our processor, is a Good Thing™.
</p>
<p><span id="more-479"></span></p>
<h2>If, then, else</h2>
<p>
Well, this one is a basic one, and in fact we already used this structure in the previous chapter. Consider the following structure, where <code>E</code> is an expression and <code>S1</code> and <code>S2</code> are statements (they may be compound statements like <code>{ SA; SB; SC; }</code>)
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span> then
   S1
<span style="color: #b1b100;">else</span>
   S2</pre></td></tr></table></div>

<p>
A possible way to express this in ARM assembler could be the following
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">if_eval<span style="color: #339933;">:</span> 
    <span style="color: #339933;">/*</span> Assembler that evaluates E <span style="color: #00007f; font-weight: bold;">and</span> updates the cpsr accordingly <span style="color: #339933;">*/</span>
bXX else <span style="color: #339933;">/*</span> Here XX is the appropriate condition <span style="color: #339933;">*/</span>
then_part<span style="color: #339933;">:</span> 
   <span style="color: #339933;">/*</span> assembler for S1<span style="color: #339933;">,</span> the <span style="color: #7f007f;">&quot;then&quot;</span> part <span style="color: #339933;">*/</span>
   b end_of_if
else<span style="color: #339933;">:</span>
   <span style="color: #339933;">/*</span> assembler for S2<span style="color: #339933;">,</span> the <span style="color: #7f007f;">&quot;else&quot;</span> part <span style="color: #339933;">*/</span>
end_of_if<span style="color: #339933;">:</span></pre></td></tr></table></div>

<p>
If there is no else part, we can replace <code>bXX else</code> with <code>bXX end_of_if</code>.
</p>
<h2>Loops</h2>
<p>
This is another common construct in structured programming. While there are several types of loops, they all actually reduce to the following structure.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>E<span style="color: #009900;">&#41;</span>
  S</pre></td></tr></table></div>

<p>
Presumably <code>S</code> does something so that <code>E</code> eventually becomes false and the loop is left. Otherwise we would stay in the loop forever (sometimes this is what you want, but not in our examples). A way to implement these loops is as follows.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;">while_condition<span style="color: #339933;">:</span> <span style="color: #339933;">/*</span> assembler to evaluate E <span style="color: #00007f; font-weight: bold;">and</span> update cpsr <span style="color: #339933;">*/</span>
  bXX end_of_loop  <span style="color: #339933;">/*</span> If E is false<span style="color: #339933;">,</span> then <span style="color: #00007f; font-weight: bold;">leave</span> the <span style="color: #00007f; font-weight: bold;">loop</span> right now <span style="color: #339933;">*/</span>
  <span style="color: #339933;">/*</span> assembler of S <span style="color: #339933;">*/</span>
  b while_condition <span style="color: #339933;">/*</span> Unconditional branch to the beginning <span style="color: #339933;">*/</span>
end_of_loop<span style="color: #339933;">:</span></pre></td></tr></table></div>

<p>
A common loop involves iterating over a range of integers, like in
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i <span style="color: #339933;">=</span> L<span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> N<span style="color: #339933;">;</span> i <span style="color: #339933;">+=</span> K<span style="color: #009900;">&#41;</span>
  S</pre></td></tr></table></div>

<p>
But this is nothing but
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;">  i <span style="color: #339933;">=</span> L<span style="color: #339933;">;</span>
  <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>i <span style="color: #339933;">&lt;</span> N<span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
     S<span style="color: #339933;">;</span>
     i <span style="color: #339933;">+=</span> K<span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>
So we do not have to learn a new way to implement the loop itself.
</p>
<h2>1 + 2 + 3 + 4 + &#8230; + 22</h2>
<p>
As a first example let&#8217;s sum all the numbers from 1 to 22 (I&#8217;ll tell you later why I chose 22). The result of the sum is <code>253</code> (check it with a <a href="https://www.google.es/#q=1%2B2%2B3%2B4%2B5%2B6%2B7%2B8%2B9%2B10%2B11%2B12%2B13%2B14%2B15%2B16%2B17%2B18%2B19%2B20%2B21%2B22">calculator</a>). I know it makes little sense to compute something whose result we already know, but this is just an example.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> loop01<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.text</span>
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>       <span style="color: #339933;">/*</span> r1 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>       <span style="color: #339933;">/*</span> r2 ← <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">:</span> 
    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">22</span>      <span style="color: #339933;">/*</span> compare r2 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">22</span> <span style="color: #339933;">*/</span>
    bgt end          <span style="color: #339933;">/*</span> branch if r2 &gt; <span style="color: #ff0000;">22</span> to end <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r2   <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> r2 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>   <span style="color: #339933;">/*</span> r2 ← r2 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    b <span style="color: #00007f; font-weight: bold;">loop</span>
end<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> r1       <span style="color: #339933;">/*</span> r0 ← r1 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr</pre></td></tr></table></div>

<p>
Here we are counting from 1 to 22. We will use the register <code>r2</code> as the counter. As you can see in line 6 we initialize it to 1. The sum will be accumulated in the register <code>r1</code>; at the end of the program we move the contents of <code>r1</code> into <code>r0</code> to return the result of the sum as the error code of the program (we could have used <code>r0</code> in all the code and avoided this final <code>mov</code>, but I think it is clearer this way).
</p>
<p>
In line 8 we compare <code>r2</code> (remember, the counter that will go from 1 to 22) to 22. This updates the <code>cpsr</code>, so in line 9 we can check whether <code>r2</code> was greater than 22. If this is the case, we end the loop by branching to <code>end</code>. Otherwise we add the current value of <code>r2</code> to the current value of <code>r1</code> (remember, in <code>r1</code> we accumulate the sum from 1 to 22).
</p>
<p>
Line 11 is an important one. We increase the value of <code>r2</code>, because we are counting from 1 to 22 and we already added the current counter value in <code>r2</code> to the result of the sum in <code>r1</code>. Then at line 12 we branch back to the beginning of the loop. Note that if line 11 were not there we would hang: <code>r2</code> would stay at 1, so the condition checked in line 9 (<code>r2</code> greater than 22) would never be true and we would never leave the loop!
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="bash" style="font-family:monospace;">$ .<span style="color: #000000; font-weight: bold;">/</span>loop01; <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #007800;">$?</span>
<span style="color: #000000;">253</span></pre></td></tr></table></div>

<p>
Well, now you could change line 8 and try with, let&#8217;s say, <code>#100</code>. The result should be 5050.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="bash" style="font-family:monospace;">$ .<span style="color: #000000; font-weight: bold;">/</span>loop01; <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #007800;">$?</span>
<span style="color: #000000;">186</span></pre></td></tr></table></div>

<p>
What happened? Well, it happens that in Linux the error code of a program is a number from 0 to 255 (8 bits). If the result is 5050, only the lower 8 bits of the number are used. 5050 in binary is <code>1001110111010</code>; its lower 8 bits are <code>10111010</code>, which is exactly 186. How can we check that the computed <code>r1</code> is 5050 before ending the program? Let&#8217;s use GDB.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="gdb" style="font-family:monospace;">$ gdb <span style="color: #442886; font-weight:bold;">loop01
...</span>
<span style="font-weight:bold;">&#40;</span>gdb<span style="font-weight:bold;">&#41;</span> start
Temporary breakpoint <span style="">1</span> at <span style="color: #555;">0x8390</span>
Starting program: /home/roger/asm/chapter06/loop01 
&nbsp;
Temporary breakpoint <span style="">1</span>, <span style="color: #555;">0x00008390</span> in <span style="color: #442886; font-weight:bold;">main</span> <span style="font-weight:bold;">&#40;</span><span style="font-weight:bold;">&#41;</span>
<span style="font-weight:bold;">&#40;</span>gdb<span style="font-weight:bold;">&#41;</span> disas main,+<span style="font-weight:bold;">&#40;</span><span style="">9</span>*<span style="">4</span><span style="font-weight:bold;">&#41;</span>
Dump of assembler code from <span style="color: #0057AE; text-style:italic;"><span style="color: #555;">0x8390</span> to <span style="color: #555;">0x83b4</span>:</span>
   <span style="color: #555;">0x00008390</span> &lt;main+<span style="">0</span>&gt;:	mov	r1, #<span style="">0</span>
   <span style="color: #555;">0x00008394</span> &lt;main+<span style="">4</span>&gt;:	mov	r2, #<span style="">1</span>
   <span style="color: #555;">0x00008398</span> &lt;loop+<span style="">0</span>&gt;:	cmp	r2, #<span style="">100</span>	; <span style="color: #555;">0x64</span>
   <span style="color: #555;">0x0000839c</span> &lt;loop+<span style="">4</span>&gt;:	bgt	<span style="color: #555;">0x83ac</span> &lt;end&gt;
   <span style="color: #555;">0x000083a0</span> &lt;loop+<span style="">8</span>&gt;:	add	r1, r1, r2
   <span style="color: #555;">0x000083a4</span> &lt;loop+<span style="">12</span>&gt;:	add	r2, r2, #<span style="">1</span>
   <span style="color: #555;">0x000083a8</span> &lt;loop+<span style="">16</span>&gt;:	b	<span style="color: #555;">0x8398</span> &lt;loop&gt;
   <span style="color: #555;">0x000083ac</span> &lt;end+<span style="">0</span>&gt;:	mov	r0, r1
   <span style="color: #555;">0x000083b0</span> &lt;end+<span style="">4</span>&gt;:	bx	lr
End of assembler dump.</pre></td></tr></table></div>

<p>
Let&#8217;s tell gdb to stop at <code>0x000083ac</code>, right before executing <code>mov r0, r1</code>.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="gdb" style="font-family:monospace;"><span style="font-weight:bold;">&#40;</span>gdb<span style="font-weight:bold;">&#41;</span> break <span style="color: #442886; font-weight:bold;">*<span style="color: #555;">0x000083ac</span></span>
<span style="font-weight:bold;">&#40;</span>gdb<span style="font-weight:bold;">&#41;</span> cont
Continuing.
&nbsp;
Breakpoint <span style="">2</span>, <span style="color: #555;">0x000083ac</span> in <span style="color: #442886; font-weight:bold;">end</span> <span style="font-weight:bold;">&#40;</span><span style="font-weight:bold;">&#41;</span>
<span style="font-weight:bold;">&#40;</span>gdb<span style="font-weight:bold;">&#41;</span> disas
Dump of assembler code for function end:
=&gt; <span style="color: #555;">0x000083ac</span> &lt;+<span style="">0</span>&gt;:	mov	r0, r1
   <span style="color: #555;">0x000083b0</span> &lt;+<span style="">4</span>&gt;:	bx	lr
End of assembler <span style="color: #442886; font-weight:bold;">dump.</span>
<span style="font-weight:bold;">&#40;</span>gdb<span style="font-weight:bold;">&#41;</span> info register r1
r1             <span style="color: #555;">0x13ba</span>	<span style="">5050</span></pre></td></tr></table></div>

<p>
Great, this is what we expected but could not see because of the limited range of the error code.
</p>
<p>
Maybe you have noticed that something odd happens: our labels are being identified as functions. This is mostly harmless, and we will address it in a future chapter.
</p>
<h2>3n + 1</h2>
<p>
Let&#8217;s work through a slightly more complicated example. This is the famous <em>3n + 1</em> problem, also known as the <a href="http://en.wikipedia.org/wiki/Collatz_conjecture">Collatz conjecture</a>. Given a number <code>n</code>, we divide it by 2 if it is even, and we multiply it by 3 and add one if it is odd.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>n <span style="color: #339933;">%</span> <span style="color: #0000dd;">2</span> <span style="color: #339933;">==</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>
  n <span style="color: #339933;">=</span> n <span style="color: #339933;">/</span> <span style="color: #0000dd;">2</span><span style="color: #339933;">;</span>
<span style="color: #b1b100;">else</span>
  n <span style="color: #339933;">=</span> <span style="color: #0000dd;">3</span><span style="color: #339933;">*</span>n <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>
Before continuing: our ARM processor is able to multiply two numbers, but that would require learning a new instruction, <code>mul</code>, which would detour us a bit. Instead we will use the identity <code>3*n = 2*n + n</code>. We do not really know how to multiply or divide by two yet. We will study this in a future chapter, so for now just assume it works as shown in the assembler below.
</p>
<p>
The Collatz conjecture states that, for any positive integer <code>n</code>, repeatedly applying this procedure will eventually give us the number 1. Theoretically there could be a number for which this is not the case. So far no such number has been found, but the conjecture has not been proved either. If we want to repeatedly apply the previous procedure, our program has to do something like this.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;">n <span style="color: #339933;">=</span> ...<span style="color: #339933;">;</span>
<span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>n <span style="color: #339933;">!=</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>n <span style="color: #339933;">%</span> <span style="color: #0000dd;">2</span> <span style="color: #339933;">==</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>
     n <span style="color: #339933;">=</span> n <span style="color: #339933;">/</span> <span style="color: #0000dd;">2</span><span style="color: #339933;">;</span>
  <span style="color: #b1b100;">else</span>
     n <span style="color: #339933;">=</span> <span style="color: #0000dd;">3</span><span style="color: #339933;">*</span>n <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>
If the Collatz conjecture were false, there would exist some <code>n</code> for which the code above would hang, never reaching 1. But as I said, no such number has been found.
</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
</pre></td><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> collatz<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.text</span>
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">123</span>           <span style="color: #339933;">/*</span> r1 ← <span style="color: #ff0000;">123</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>             <span style="color: #339933;">/*</span> r2 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>             <span style="color: #339933;">/*</span> compare r1 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    beq end                <span style="color: #339933;">/*</span> branch to end if r1 == <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">and</span> r3<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>         <span style="color: #339933;">/*</span> r3 ← r1 &amp; <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r3<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>             <span style="color: #339933;">/*</span> compare r3 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    bne odd                <span style="color: #339933;">/*</span> branch to odd if r3 != <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
even<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> ASR #<span style="color: #ff0000;">1</span>     <span style="color: #339933;">/*</span> r1 ← <span style="color: #009900; font-weight: bold;">&#40;</span>r1 &gt;&gt; <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    b end_loop
odd<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> <span style="color: #00007f; font-weight: bold;">LSL</span> #<span style="color: #ff0000;">1</span> <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #009900; font-weight: bold;">&#40;</span>r1 &lt;&lt; <span style="color: #ff0000;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>         <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
&nbsp;
end_loop<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>         <span style="color: #339933;">/*</span> r2 ← r2 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    b <span style="color: #00007f; font-weight: bold;">loop</span>                 <span style="color: #339933;">/*</span> branch to <span style="color: #00007f; font-weight: bold;">loop</span> <span style="color: #339933;">*/</span>
&nbsp;
end<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> r2
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr</pre></td></tr></table></div>

<p>
In <code>r1</code> we will keep the number <code>n</code>. In this case we will use the number 123. 123 reaches 1 in 46 steps: [123, 370, 185, 556, 278, 139, 418, 209, 628, 314, 157, 472, 236, 118, 59, 178, 89, 268, 134, 67, 202, 101, 304, 152, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1]. We will count the number of steps in register <code>r2</code>. So we initialize <code>r1</code> with 123 and <code>r2</code> with 0 (no step has been performed yet).
</p>
<p>
At the beginning of the loop, in lines 8 and 9, we check if <code>r1</code> is 1. So we compare it with 1 and if it is equal we leave the loop branching to <code>end</code>.
</p>
<p>
Now we know that <code>r1</code> is not 1, so we proceed to check if it is even or odd. To do this we use a new instruction, <code>and</code>, which performs a <em>bitwise and operation</em>. An even number has its least significant bit (LSB) set to 0, while an odd number has its LSB set to 1. So a bitwise and with 1 returns 0 for even numbers and 1 for odd numbers. In line 11 we keep the result of the bitwise and in register <code>r3</code> and then, in line 12, we compare it against 0. If it is not zero we branch to <code>odd</code>, otherwise we continue with the <code>even</code> case.</p>
<p>
Now some magic happens in line 15. This is a combined operation that ARM allows us to do. It is a <code>mov</code>, but we do not move the value of <code>r1</code> directly to <code>r1</code> (which would do nothing). First we apply an <em>arithmetic shift right</em> (ASR) to the value of <code>r1</code> (to the value, not to the register itself), and then this shifted value is moved to register <code>r1</code>. An <em>arithmetic shift right</em> shifts all the bits of a value to the right: the rightmost bit is discarded and the leftmost bit is set to the same value it had prior to the shift. For the even, positive numbers that reach this line, shifting right by one bit is the same as dividing by 2. So this <code>mov r1, r1, ASR #1</code> is actually doing <code>r1 ← r1 / 2</code>.</p>
<p>
Some similar magic happens for the odd case in line 18. In this case we are doing an <code>add</code>. The first and second operands must be registers (the destination operand and the first source operand). The third operand is combined with a <em>logical shift left</em> (LSL). The value of the operand is shifted left by 1 bit: the leftmost bit is discarded and the rightmost bit is set to 0. This effectively multiplies the value by 2. So we are adding <code>r1</code> (which keeps the value of <code>n</code>) to <code>2*r1</code>. This is <code>3*r1</code>, so <code>3*n</code>. We keep this value in <code>r1</code> again. In line 19 we add 1 to that value, so <code>r1</code> ends up holding the value <code>3*n+1</code> that we wanted.
</p>
<p>
Do not worry too much about LSL and ASR for now. Just take them for granted; in a future chapter we will see them in more detail.
</p>
<p>
Finally, at the end of the loop, in line 22 we update <code>r2</code> (remember it keeps the counter of our steps) and then we branch back to the beginning of the loop. Before ending the program we move the counter to <code>r0</code> so we return the number of steps we did to reach 1.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="bash" style="font-family:monospace;">$ .<span style="color: #000000; font-weight: bold;">/</span>collatz; <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #007800;">$?</span>
<span style="color: #000000;">46</span></pre></td></tr></table></div>

<p>
Great.
</p>
<p>
That&#8217;s all for today.
</p>
<h2>Postscript</h2>
<p>
Kevin Millikin rightly pointed out (in a comment below) that a loop is usually not implemented in the way shown above. In fact, Kevin says that a better way to write the loop of <code>loop01.s</code> is as follows.
</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #339933;">/*</span> <span style="color: #339933;">--</span> loop02<span style="color: #339933;">.</span>s <span style="color: #339933;">*/</span>
<span style="color: #0000ff; font-weight: bold;">.text</span>
<span style="color: #339933;">.</span><span style="color: #0000ff; font-weight: bold;">global</span> main
main<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r1<span style="color: #339933;">,</span> #<span style="color: #ff0000;">0</span>       <span style="color: #339933;">/*</span> r1 ← <span style="color: #ff0000;">0</span> <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>       <span style="color: #339933;">/*</span> r2 ← <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
    b check_loop     <span style="color: #339933;">/*</span> unconditionally jump to the end of the <span style="color: #00007f; font-weight: bold;">loop</span> <span style="color: #339933;">*/</span>
<span style="color: #00007f; font-weight: bold;">loop</span><span style="color: #339933;">:</span> 
    <span style="color: #00007f; font-weight: bold;">add</span> r1<span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> r2   <span style="color: #339933;">/*</span> r1 ← r1 <span style="color: #339933;">+</span> r2 <span style="color: #339933;">*/</span>
    <span style="color: #00007f; font-weight: bold;">add</span> r2<span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">1</span>   <span style="color: #339933;">/*</span> r2 ← r2 <span style="color: #339933;">+</span> <span style="color: #ff0000;">1</span> <span style="color: #339933;">*/</span>
check_loop<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">cmp</span> r2<span style="color: #339933;">,</span> #<span style="color: #ff0000;">22</span>      <span style="color: #339933;">/*</span> compare r2 <span style="color: #00007f; font-weight: bold;">and</span> <span style="color: #ff0000;">22</span> <span style="color: #339933;">*/</span>
    ble <span style="color: #00007f; font-weight: bold;">loop</span>         <span style="color: #339933;">/*</span> branch if r2 &lt;= <span style="color: #ff0000;">22</span> to the beginning of the <span style="color: #00007f; font-weight: bold;">loop</span> <span style="color: #339933;">*/</span>
end<span style="color: #339933;">:</span>
    <span style="color: #00007f; font-weight: bold;">mov</span> r0<span style="color: #339933;">,</span> r1       <span style="color: #339933;">/*</span> r0 ← r1 <span style="color: #339933;">*/</span>
    <span style="color: #46aa03; font-weight: bold;">bx</span> lr</pre></td></tr></table></div>

<p>
If you count the number of instructions in the two programs, there are 9 in both. But if you look carefully at Kevin&#8217;s proposal you will see that, by unconditionally branching to the end of the loop and reversing the condition check, we can skip one branch, thus reducing the number of instructions in the loop itself from 5 to 4.
</p>
<p>
There is another advantage in this second version, though: there is only one branch in the loop itself, as we rely on <em>implicit sequencing</em> to reach the two instructions performing the check again. For reasons beyond the scope of this post, the execution of a branch instruction may negatively affect the performance of our programs. Processors have mechanisms to mitigate the performance loss due to branches (and in fact the processor in the Raspberry Pi does have them). But not executing a branch at all avoids its potential performance penalty entirely.
</p>
<p>
We do not care very much about the performance of our assembler right now. However, I thought it was worth elaborating a bit on Kevin&#8217;s comment.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkingeek.com/2013/01/20/arm-assembler-raspberry-pi-chapter-6/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>
