RVV provides the __riscv_vset_v_*_* and __riscv_vget_v_*_* intrinsics not only for tuple types but also for vector groups since v0.11. For example:
vint16m4_t __riscv_vset_v_i16m1_i16m4(vint16m4_t dest, size_t index, vint16m1_t value);
// __riscv_vset_v_i16m1_i16m4(dest, 2, value) copies the register `value` into the third register of the `dest` vector group
They are usually translated to whole register move instructions (e.g., vmv1r), which are efficient on most microarchitectures and could potentially be eliminated by the register allocator as compilers become more advanced.
These operations are useful for implementing concat operators like ConcatUpperLower when LMUL is not fractional. For example, ConcatUpperLower is currently implemented as follows:
template <class D, class V>
HWY_API V ConcatUpperLower(D d, const V hi, const V lo) {
  const size_t half = Lanes(d) / 2;
  const V hi_down = detail::SlideDown(hi, half);
  return detail::SlideUp(lo, hi_down, half);
}
For V=vuint8m2_t, each of the two slide operations takes 4 cycles on the x280. We can instead implement it with vget and vset.
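As a rough sketch of the vget/vset idea for V=vuint8m2_t: the upper half of an LMUL=2 group is its register 1, so it can be taken from `hi` with vget and written into `lo` with vset. The function name `ConcatUpperLowerU8M2` is hypothetical (it is not an existing Highway function), and the intrinsics form assumes all lanes of the group are active. Because the intrinsics only compile for RISC-V targets, the block also includes a small portable model, with a made-up register width of 4 lanes, that checks the vget/vset form against the slide-based form.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__riscv_vector)
#include <riscv_vector.h>
// Intrinsics form (hypothetical helper name). Assumes all lanes are active,
// so the upper half of the group is exactly register 1.
inline vuint8m2_t ConcatUpperLowerU8M2(vuint8m2_t hi, vuint8m2_t lo) {
  return __riscv_vset_v_u8m1_u8m2(lo, 1, __riscv_vget_v_u8m2_u8m1(hi, 1));
}
#endif

// Portable model: an LMUL=2 group is two registers of kLanesPerReg lanes.
// kLanesPerReg = 4 is an illustrative assumption, not a real VLEN.
constexpr std::size_t kLanesPerReg = 4;
constexpr std::size_t kLanes = 2 * kLanesPerReg;
using Reg = std::array<std::uint8_t, kLanesPerReg>;
using Group2 = std::array<Reg, 2>;

// Models vget: returns one register of the group.
inline Reg ModelVGet(const Group2& src, std::size_t index) { return src[index]; }

// Models vset: returns a copy of dest with one register replaced.
inline Group2 ModelVSet(Group2 dest, std::size_t index, const Reg& value) {
  dest[index] = value;
  return dest;
}

// vget/vset form: lower register from lo, upper register from hi.
inline Group2 ModelConcatUpperLower(const Group2& hi, const Group2& lo) {
  return ModelVSet(lo, 1, ModelVGet(hi, 1));
}

// Flat lane access across the two registers of a group.
inline std::uint8_t& Lane(Group2& g, std::size_t i) {
  return g[i / kLanesPerReg][i % kLanesPerReg];
}
inline std::uint8_t Lane(const Group2& g, std::size_t i) {
  return g[i / kLanesPerReg][i % kLanesPerReg];
}

// Models a slide down: lane i receives src lane i + amount, zero-filled above.
inline Group2 ModelSlideDown(const Group2& src, std::size_t amount) {
  Group2 out{};
  for (std::size_t i = 0; i + amount < kLanes; ++i) Lane(out, i) = Lane(src, i + amount);
  return out;
}

// Models a slide up: lanes below amount keep dst, the rest take src lane i - amount.
inline Group2 ModelSlideUp(const Group2& dst, const Group2& src, std::size_t amount) {
  Group2 out = dst;
  for (std::size_t i = amount; i < kLanes; ++i) Lane(out, i) = Lane(src, i - amount);
  return out;
}

// Slide-based form, mirroring the structure of the current implementation.
inline Group2 ModelConcatUpperLowerSlides(const Group2& hi, const Group2& lo) {
  const std::size_t half = kLanes / 2;
  const Group2 hi_down = ModelSlideDown(hi, half);
  return ModelSlideUp(lo, hi_down, half);
}
```

In the model, both forms produce a group whose register 0 comes from `lo` and whose register 1 comes from `hi`. The vget/vset form only applies when half of the lane count is a whole number of registers, which is why it is limited to non-fractional LMUL.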
The vget/vset version is translated by clang (trunk version) to a program that takes 2 cycles.
However, I have no idea how to deal with all the macros needed to add these operations to Highway. Any ideas or instructions on this?