In the last two instalments of this series, we saw a few alternatives to compute the last bit of a mask. Turns out there is still a new one, but it relies on a policy of RVV. Let’s see this approach and use it to talk about why RVV has policies.

The last bit

Recall from the first part, that we want to compute the last bit in a mask, that is, the highest numbered element that is set (enabled) in the mask.

       | element number   |
       |                  |
       | 7 6 5 4 3 2 1 0  | vlast.m
----------+---------------+-----------
 mask  | 0 0 0 0 0 0 0 0  | -1
       | 0 0 0 0 0 0 0 1  |  0
       | 0 0 0 0 0 0 1 0  |  1
       | 0 0 0 0 0 1 1 0  |  2
       | 1 0 0 0 0 0 0 0  |  7
       | 1 0 0 0 0 1 0 0  |  7
       | 0 1 1 1 0 1 0 0  |  6
       | 1 1 1 1 0 1 0 0  |  7

An LLVM intrinsic

Due to my day job, I discovered that LLVM has an intrinsic function called llvm.experimental.vector.extract.last.active. This intrinsic receives a vector of values and a mask (a mask, in LLVM parlance, is a vector of i1, where i1 stands for integer of 1 bit). This function is extracts the element from the vector of values that corresponds to the highest numbered element that is active in the mask. That highest numbered element is exactly the last bit we want to compute! So I wondered how it does that. Turns out it uses yet another approach!

I will first present the approach using C intrinsics.

vlast-using-vid-vredmaxu.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
int vlast_vid_vredmaxu(vbool32_t m, size_t vl) {
    // Compute the population count.
    uint64_t popc = __riscv_vcpop_m_b32(m, vl);
    // If no bit is set we're done: there is no last bit.
    if (!popc) return -1;
    // Initialize sequence to all zeroes.
    vuint32m1_t sequence = __riscv_vmv_v_x_u32m1(0, vl);
    // Compute the vector [0, 1, 2, 3, … vl-1] but using the mask
    sequence = __riscv_vid_v_u32m1_mu(m, sequence, vl);
    // Now compute the maximum element in the vector.
    vuint32m1_t maxvalue = __riscv_vredmaxu_vs_u32m1_u32m1(sequence, sequence, vl);
    // Extract the maximum element (stored in the first element)
    int result = __riscv_vmv_x_s_u32m1_u32(maxvalue);
    return result;
}

First we check if no bit is set, in that case the result is trivially -1. Otherwise, we first initialize a vector with all zeroes. Then we compute the sequence vector [0, 1, 2, 3, …, vl-1] using the vid instruction. Note, though, that this is using the mask of which we want to compute its vlast.m. So the result is [0, 1, 2, 3, …, vl-1] but where the corresponding elements of m is not set (i.e., cleared, a zero) we say the element is disabled by the mask and instead we set it to the corresponding element of sequence. Due to the earlier initialization, this means the disabled elements are going to be zero.

This behaviour is crucial for the next step, where we compute the maximum element of the whole vector. This is called a reduction operation because we reduce all the elements of the vector into a single value using some accumulator-like operation (e.g. an addition to get a sum or a multiplication to get a product). In this case we compute the maximum value of the whole vector. The result of reductions is another vector where only the first element is set (this is called a scalar vector in RVV terminology) so there is a final extraction for the result.

(If you wonder why __riscv_vredmaxu_vs_u32m1_u32m1 receives an additional operand, the reason is because the second operand is a vector whose first element acts as the initial value of the accumulation. In this case it is very convenient as sequence will always have a zero in its first element (even if the mask disabled that element), which acts as the neutral value of the maximum in this case. This is useful when chaining consecutive reductions. As already mentioned, the result of a reduction is precisely a vector where the reduction result is stored in the first element.)

This approach work because we have turned into zero all the elements in the vector [0, 1, 2, 3, …, vl-1] that were disabled by the mask. set. So the maximum index corresponds to the last bit index.

This approach relies on a feature of RVV called policies.

Policies in RVV

The RISC-V Vector Extension has several features that make it a bit unusual comparing to the existing SIMD extensions/subsets of more mainstream architectures.

One of them is the policies. There are two different policy types: tail policy and mask policy. Both can be agnostic or undisturbed. So there are four possible combinations:

  • tail agnostic, mask agnostic
  • tail agnostic, mask undisturbed
  • tail undisturbed, mask agnostic
  • tail undisturbed, mask undisturbed

The tail policy is applied to the elements that are past the vector length. The mask policy is applied to the elements that are disabled because of a masked instruction.

Vector length

Another feature of RVV is the vector length.

Most SIMD ISAs assume that a vector instruction will compute values for all the elements of the vector. For most instructions this is fine. Memory accesses are problematic: imagine your vectors can hold 8 elements, but your program only operates 5 contiguous elements in memory. Storing 8 elements is unsafe, so we need a way to limit the number of elements stored.

Most ISAs solve this using masks. Masks are very general and sometimes impractical. If we want to load only 5 elements we will need a mask [0, 0, 0, 1, 1, 1, 1, 1] and so a way to create this mask easily. Often ISAs provide specialised instructions for this case.

RVV solves this common case using a vector length. Without entering in too many details, before using vector instructions a vector configuration must be set. Among a number of things to set (including the tail and mask policy) there is a vector length value. In our example where vectors can hold 8 elements, we can set the vector length to 5 and all loads and stores will just operate with the first 5 elements. But not just memory accesses, any instruction will do that.

Now the question we need to address is: what is the value of the elements in the result that are past the vector length? In our example, what about the last 3 elements of the vector that are not operated because we are operating just 5?

This is where the tail policy enters into play. If the tail policy is undisturbed, the vector register used for the result will have those elements left untouched (undisturbed).

If the tail policy is agnostic, two implementations are possible: a behaviour like undisturbed is allowed or, alternatively, the bits of the elements of the tail get set to 1. This is so to discourage the use of those elements (because it is likely to be semantically incorrect).

Masked execution

A same observation is possible when using masks. Elements disabled by the mask can be left untouched, this is mask undisturbed, or optionally set all their bits to 1, this is mask agnostic.

Needless to say, that the mask policy is only relevant for masked instructions.

But why?

The reason is a bit complicated and is due to how RVV evolved. The first realistic version of RVV 0.7.1 (which was infamously implemented in commercial hardware) did not have policies. Well, it did, but they were kind of hardcoded and never given a name. We can call them tail zeroing, mask undisturbed.

In the current ratified RVV 1.0, tail agnostic may set all bits to 1 in the tail. Imagine that instead, we mandated (i.e., not optional) to set all the bits to 0. This was never given a name but that would be tail-zeroing. This was what RVV 0.7.1 did.

When RVV 0.8 was proposed, the policy was changed to be hardcoded to tail undisturbed, mask undisturbed (so, again, there was no policy as such). But this comes with drawbacks: an undisturbed policy forces us to say what are going to be the values in the result for elements past the vector length or that are disabled by the mask. And we often do not care about those. So an undisturbed policy means that our instructions have an extra operand which represents the values that are undisturbed. For some hardware implementations, in particular those using register renaming, this implies doing an extra copy of the destination register.

When RVV 0.9 was proposed, the policies as presented earlier were introduced. This meant that everything got multiplied by 4 in the ISA in exchange of flexibility. Software should prioritize agnostic policies where possible and hardware can opt-in to benefit from the agnostic policy. For instance, implementations that use register renaming can avoid copying the destination and if they remember the vector length of the operation, the rest of the bits are presented to the software as if they were 1s.

Most implementations definitely implement mask agnostic as mask undisturbed (but this should only be relevant for masked instructions). Some implementations have chosen to implement a proper tail agnostic policy.

One downside of the current policy mechanism is that at least one combination (tail undisturbed, mask agnostic) is of very little use to the software. Hardware implementors need to validate it though (even if it means mask undisturbed).

As already stated above, ideally, software should use agnostic policies when possible, but masked code will likely always use mask undisturbed because of its merge semantics.

Merge semantics

One interesting effect of mask undisturbed is that allows us to fold a merge operation. RVV has a vmerge.vvm instruction:

vmerge.vvm vd, vs2, vs1, v0  # vd[i] ← v0.mask[i] ? vs1[i] : vs2[i]

Earlier we saw we computed 0, 1, 2, 3, …, vl-1 but setting zeroes where the element is disabled by the mask. We could do this in two steps.

                      # 0 ≤ i < vl
vid.v v1              # Sets v1[i] ← i
vmv.v.i v2, 0         # Sets v2[i] ← 0
vmerge v3, v2, v1, v0 # The mask is always in v0
                      # if v0.mask[i] = 0 then v3[i] ← v2[i] else v3[i] ← v1[i]

but using a mask undisturbed policy, we can just use a masked instruction

# Something here sets the mask policy to undisturbed.
               # 0 ≤ i < vl
vmv.v.i v3, 0  # Sets v3[i] ← 0
vid.v v3, v0.t # if v0.mask[i] = 1 then v3[i] ← i else v3[i] ← v3[i]

(If you wonder why v0.t in this case, it is because the ISA reserves the possibility that in the future v0.f can be used as the mask operand. For v0.f the condition to update the result would be v0[i] = 0)

An extra operand, conceptually

Even if the hardware does not have to update the destination, now the result we want to have as part of the undisturbed policy is an operand of the instruction. And this is the reason why our intrinsic __riscv_vid_v_u32m1_mu (mu for mask undisturbed) needs an additional operand.