Forgotten memories (2)
In the first part of this experiment we implemented, in the RISC-V backend of LLVM, a simple strategy to spill general-purpose registers into floating-point registers. In this chapter, let's look at the results and a few other interesting facts learned in the process.
Recap
As a recap, we were using this code as an example.
enum { N = 64 };
void f(int (*a)[N], int (*b)[N], int (*c)[N]) {
for (int i = 0; i < N; i++)
#pragma clang loop unroll_count(16)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
c[i][j] += a[i][k] * b[k][j];
}
It exposes enough register pressure (forced by unrolling one of the loops) that the compiler needs to spill registers. Recall that LLVM uses the same infrastructure both to preserve and restore callee-saved registers and to handle the actual spills introduced by register allocation.
At the beginning of the function, in the prologue, we see a bunch of callee-saved registers being saved.
f: # @f
addi sp, sp, -256 # Grow the stack
sd ra, 248(sp)
sd s0, 240(sp)
sd s1, 232(sp)
sd s2, 224(sp)
sd s3, 216(sp)
sd s4, 208(sp)
sd s5, 200(sp)
sd s6, 192(sp)
sd s7, 184(sp)
sd s8, 176(sp)
sd s9, 168(sp)
sd s10, 160(sp)
sd s11, 152(sp)
A number of addresses computed by the program are spilled by register allocation.
sd a2, 128(sp)
sd zero, 136(sp)
addi a2, a1, 4
sd a2, 112(sp)
addi a2, a1, 8
sd a2, 104(sp)
addi a2, a1, 12
sd a2, 96(sp)
addi a2, a1, 16
sd a2, 88(sp)
addi a2, a1, 20
sd a2, 80(sp)
addi a2, a1, 24
sd a2, 72(sp)
addi a2, a1, 28
sd a2, 64(sp)
addi a2, a1, 32
sd a2, 56(sp)
addi a2, a1, 36
sd a2, 48(sp)
addi a2, a1, 40
sd a2, 40(sp)
addi a2, a1, 44
sd a2, 32(sp)
addi a2, a1, 48
sd a2, 24(sp)
addi a2, a1, 52
sd a2, 16(sp)
addi a2, a1, 56
sd a2, 8(sp)
sd a1, 120(sp)
addi a1, a1, 60
sd a1, 0(sp)
The outermost loop reloads those values.
mv s10, zero
ld a4, 0(sp)
ld s11, 8(sp)
ld ra, 16(sp)
ld s6, 24(sp)
ld s7, 32(sp)
ld s2, 40(sp)
ld t6, 48(sp)
ld t5, 56(sp)
ld t4, 64(sp)
ld t3, 72(sp)
ld t2, 80(sp)
ld t1, 88(sp)
ld t0, 96(sp)
ld a7, 104(sp)
ld a6, 112(sp)
ld s8, 120(sp)
There are also a couple more spills stored and reloaded inside the loop (not shown).
In the epilogue of the function we restore the callee-saved registers.
ld s11, 152(sp)
ld s10, 160(sp)
ld s9, 168(sp)
ld s8, 176(sp)
ld s7, 184(sp)
ld s6, 192(sp)
ld s5, 200(sp)
ld s4, 208(sp)
ld s3, 216(sp)
ld s2, 224(sp)
ld s1, 232(sp)
ld s0, 240(sp)
ld ra, 248(sp)
addi sp, sp, 256
ret
Results
Ok, let's compile with -mllvm -riscv-soften-spills (for instance clang --target=riscv64 -O2 -S -mllvm -riscv-soften-spills matmul.c, where matmul.c is just an illustrative name for the example above).
The callee-saved registers are now all kept in floating-point registers.
f: # @f
# %bb.0: # %entry
addi sp, sp, -96
fmv.d.x ft0, ra
fmv.d.x ft1, s0
fmv.d.x ft2, s1
fmv.d.x ft3, s2
fmv.d.x ft4, s3
fmv.d.x ft5, s4
fmv.d.x ft6, s5
fmv.d.x ft7, s6
fmv.d.x fa0, s7
fmv.d.x fa1, s8
fmv.d.x fa2, s9
fmv.d.x fa3, s10
fmv.d.x fa4, s11
So are some of the spilled values, but not all of them: we run out of FPRs before we can map every spill.
fmv.d.x fa5, a2
fmv.d.x fa6, zero
addi a2, a1, 4
fmv.d.x fa7, a2
addi a2, a1, 8
fmv.d.x ft8, a2
addi a2, a1, 12
fmv.d.x ft9, a2
addi a2, a1, 16
fmv.d.x ft10, a2
addi a2, a1, 20
fmv.d.x ft11, a2
addi a2, a1, 24
sd a2, 72(sp)
addi a2, a1, 28
sd a2, 64(sp)
addi a2, a1, 32
sd a2, 56(sp)
addi a2, a1, 36
sd a2, 48(sp)
addi a2, a1, 40
sd a2, 40(sp)
addi a2, a1, 44
sd a2, 32(sp)
addi a2, a1, 48
sd a2, 24(sp)
addi a2, a1, 52
sd a2, 16(sp)
addi a2, a1, 56
sd a2, 8(sp)
sd a1, 80(sp)
addi a1, a1, 60
sd a1, 0(sp)
Now the loop reloads some of the values from the stack and others from FPRs.
mv s10, zero
ld a4, 0(sp)
ld s11, 8(sp)
ld ra, 16(sp)
ld s6, 24(sp)
ld s7, 32(sp)
ld s2, 40(sp)
ld t6, 48(sp)
ld t5, 56(sp)
ld t4, 64(sp)
ld t3, 72(sp)
fmv.x.d t2, ft11
fmv.x.d t1, ft10
fmv.x.d t0, ft9
fmv.x.d a7, ft8
fmv.x.d a6, fa7
ld s8, 80(sp)
Finally, we restore the callee-saved registers in the epilogue, all of them from FPRs.
fmv.x.d s11, fa4
fmv.x.d s10, fa3
fmv.x.d s9, fa2
fmv.x.d s8, fa1
fmv.x.d s7, fa0
fmv.x.d s6, ft7
fmv.x.d s5, ft6
fmv.x.d s4, ft5
fmv.x.d s3, ft4
fmv.x.d s2, ft3
fmv.x.d s1, ft2
fmv.x.d s0, ft1
fmv.x.d ra, ft0
addi sp, sp, 96
ret
Possible improvement
Callee-saved registers are stored and loaded just once, in the prologue and the epilogue respectively (though this may not hold when shrink-wrapping is used). So it may be worth assigning true spills/reloads to FPRs first, and only then the frame indexes related to callee-saved registers.
Our algorithm is very simple and works on a first-found, first-mapped basis. If we want to do something smarter, we will have to do a first pass to gather all the eligible spills/reloads and then prioritize those that we know are true spills, as sketched below.
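Here is a minimal sketch of that two-pass idea in plain C++. Note that Candidate, prioritizeCandidates and NumAvailableFPRs are made-up names for illustration, not part of the actual patch or of LLVM.

#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical summary of an eligible spill slot gathered in a first pass.
struct Candidate {
  int FrameIndex;    // Frame index of the spill slot.
  bool IsCalleeSave; // True if the slot holds a callee-saved register.
};

// Give FPRs to true spills (touched on every loop iteration) before
// callee-saved slots, which are only touched in the prologue and epilogue.
std::vector<Candidate> prioritizeCandidates(std::vector<Candidate> Cands,
                                            unsigned NumAvailableFPRs) {
  std::stable_partition(Cands.begin(), Cands.end(),
                        [](const Candidate &C) { return !C.IsCalleeSave; });
  if (Cands.size() > NumAvailableFPRs)
    Cands.resize(NumAvailableFPRs); // The rest keeps using the stack.
  return Cands;
}

int main() {
  // Two callee-saved slots and two true spills; only three FPRs available.
  std::vector<Candidate> Cands = {
      {0, true}, {1, true}, {2, false}, {3, false}};
  for (const Candidate &C : prioritizeCandidates(Cands, 3))
    std::printf("frame index %d -> FPR\n", C.FrameIndex);
  // Prints frame indexes 2 and 3 (the true spills) and then 0.
}

A real implementation would probably also weigh the loop depth of each reload, but the partitioning above captures the basic priority.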
Issues found
Marking frame indexes dead this late in the LLVM pipeline was the source of a few surprises. I am still pondering whether some of these might be actual bugs, so take my evaluation with a pinch of salt.
PrologEpilogInserter.cpp needed a couple of fixes. The first one, around line 815:
// If there are fixed sized objects that are preallocated in the local area,
// non-fixed objects can't be allocated right at the start of local area.
// Adjust 'Offset' to point to the end of last fixed sized preallocated
// object.
for (int i = MFI.getObjectIndexBegin(); i != 0; ++i) {
if (MFI.getStackID(i) !=
TargetStackID::Default) // Only allocate objects on the default stack.
continue;
+ if (MFI.isDeadObjectIndex(i))
+ continue;
And another one a bit later, around line 842:
// First assign frame offsets to stack objects that are used to spill
// callee saved registers.
if (StackGrowsDown) {
for (unsigned i = MinCSFrameIndex; i <= MaxCSFrameIndex; ++i) {
if (MFI.getStackID(i) !=
TargetStackID::Default) // Only allocate objects on the default stack.
continue;
+ if (MFI.isDeadObjectIndex(i))
+ continue;
I think those two are latent bugs in LLVM that are never hit because no existing code marks frame indexes associated with callee-saved registers as dead.
In RISCVFrameLowering.cpp we have broken the assumption that all the callee-saved registers are on the stack, so the CFI directives (used for debugging and stack unwinding) are now wrong. For now, let's just skip them (around line 303).
// Iterate over list of callee-saved registers and emit .cfi_offset
// directives.
for (const auto &Entry : CSI) {
int FrameIdx = Entry.getFrameIdx();
+ // FIXME: We should emit CFI directives in case the callee is now in another
+ // register.
+ if (FrameIdx >= 0 && MFI.isDeadObjectIndex(FrameIdx))
+ continue;
This is a bug in this experiment.
Finally, in RISCVInstrInfo.cpp (around line 65), the function that tells us whether a store is a store to a stack slot seems wrong: a RISC-V store machine instruction has operands (value, base, offset), yet the code looked for the frame index in operand 0, which actually holds the register being stored.
unsigned RISCVInstrInfo::isStoreToStackSlot(const MachineInstr &MI,
int &FrameIndex) const {
switch (MI.getOpcode()) {
default:
return 0;
case RISCV::SB:
case RISCV::SH:
case RISCV::SW:
case RISCV::FSW:
case RISCV::SD:
case RISCV::FSD:
break;
}
- if (MI.getOperand(0).isFI() && MI.getOperand(1).isImm() &&
- MI.getOperand(1).getImm() == 0) {
- FrameIndex = MI.getOperand(0).getIndex();
- return MI.getOperand(2).getReg();
+ if (MI.getOperand(1).isFI() && MI.getOperand(2).isImm() &&
+ MI.getOperand(2).getImm() == 0) {
+ FrameIndex = MI.getOperand(1).getIndex();
+ return MI.getOperand(0).getReg();
}
return 0;
}
I think this is a legitimate bug in the RISC-V backend of LLVM. I am working on a patch for it, but I first need to find a case that exposes an observable difference so I can write a test, and so far none of the regression tests show any change. It looks like I will have to dig deeper, for instance into the LLVM test-suite.
Discussion
I think it may be worth exploring how to exploit these kinds of mini-memories, materialized in the form of registers, that have popped up in modern architectures. For example, the vector registers of SIMD ISAs might also be usable to stash data: a function may be able to stash several values in the different lanes, as in the sketch below.
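As a rough illustration (this is not something the experiment implements), the following sketch stashes two 64-bit scalars into the lanes of a single vector register and reads them back using the RVV C intrinsics. It assumes a toolchain implementing the v1.0 intrinsics (the __riscv_ prefix) and -march=rv64gcv; whether this is ever profitable is exactly the open question.

#include <riscv_vector.h>
#include <stdint.h>

// Stash two 64-bit scalars into lanes 0 and 1 of a single vector register,
// then recover them, without touching memory.
void stash_demo(int64_t a, int64_t b, int64_t *out_a, int64_t *out_b) {
  size_t vl = __riscv_vsetvl_e64m1(2);                // Operate on two lanes.
  vint64m1_t stash = __riscv_vmv_v_x_i64m1(0, vl);    // Start from zeroes.
  stash = __riscv_vslide1down_vx_i64m1(stash, a, vl); // a enters the top lane.
  stash = __riscv_vslide1down_vx_i64m1(stash, b, vl); // Lane 0 = a, lane 1 = b.
  *out_a = __riscv_vmv_x_s_i64m1_i64(stash);          // Read lane 0.
  *out_b = __riscv_vmv_x_s_i64m1_i64(
      __riscv_vslidedown_vx_i64m1(stash, 1, vl));     // Slide lane 1 to lane 0.
}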
Minimizing stack traffic may also reduce the pressure on the first-level cache (the stack has very high locality), making it possible to devote more cache lines to other parts of the working set.
This, of course, is only realistic if cross-register-bank copies can be performed with a latency lower than that of a load or store.
Finally, when there are function calls we need a more sophisticated approach than the one sketched in this basic exercise: the temporary FPRs we stash values into are caller-saved, so any call would clobber them.
An alternative approach might be to teach the register allocator that it can consider places other than the stack to spill and reload data. However, this seems non-obvious: here we are spilling GPRs to FPRs, but we might want the opposite in other scenarios. Register allocation is already a complex enough concern, so softening the spills at the same time might be harder than doing it in a later pass. In fact, in hindsight, after register allocation it is easier to observe whether a function leans towards spills of one kind of register (e.g. a majority of GPRs vs a majority of FPRs) and so choose the best direction when spilling into the other register bank, as in the sketch below.
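For instance, a post-register-allocation heuristic could simply count the spills of each bank and pick a direction. This is a plain C++ sketch with made-up names (SpillStats, SoftenDirection, chooseDirection), not actual LLVM code:

// Hypothetical post-RA heuristic: stash spills of the majority register
// bank into the registers of the other (presumably underused) bank.
struct SpillStats {
  unsigned NumGPRSpills = 0; // Spills of general-purpose registers.
  unsigned NumFPRSpills = 0; // Spills of floating-point registers.
};

enum class SoftenDirection { GPRToFPR, FPRToGPR, None };

SoftenDirection chooseDirection(const SpillStats &S) {
  if (S.NumGPRSpills > S.NumFPRSpills)
    return SoftenDirection::GPRToFPR; // Free FPRs can hold GPR spills.
  if (S.NumFPRSpills > S.NumGPRSpills)
    return SoftenDirection::FPRToGPR; // The opposite scenario.
  return SoftenDirection::None;       // No clear winner: keep using the stack.
}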