
[JVMCI] Libgraal can deadlock in blocking compilation mode #16

Closed

Conversation


@dougxc dougxc commented Sep 3, 2020


bridgekeeper bot commented Sep 3, 2020

Welcome to the OpenJDK organization on GitHub!

This repository is currently a read-only git mirror of the official Mercurial repository (located at https://hg.openjdk.java.net/). As such, we are not currently accepting pull requests here. If you would like to contribute to the OpenJDK project, please see https://openjdk.java.net/contribute/ on how to proceed.

This pull request will be automatically closed.

@bridgekeeper bridgekeeper bot closed this Sep 3, 2020
fisk pushed a commit to fisk/jdk that referenced this pull request Oct 28, 2020
…ner_virtual_thread

8246039: SSLSocket HandshakeCompletedListeners are run on virtual threads
e1iu pushed a commit to e1iu/jdk that referenced this pull request Mar 10, 2021
Like a scalar shift, a vector shift does nothing when the shift count is
zero.

This patch implements the 'Identity' method for all kinds of vector
shift nodes to optimize out shifts by a zero count ('ShiftVCntNode 0'),
which typically shows up as a redundant 'mov' in the final generated
code, like below:

```
        add     x17, x12, x14
        ldr     q16, [x17, #16]
        mov     v16.16b, v16.16b
        add     x14, x13, x14
        str     q16, [x14, #16]
```

With this patch, the code above could be optimized as below:

```
        add     x17, x12, x14
        ldr     q16, [x17, #16]
        add     x14, x13, x14
        str     q16, [x14, #16]
```
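
As an illustration of the approach (a minimal sketch, not the actual
HotSpot change; the node class, opcode and accessor names below are
assumptions), such an Identity override could look like:

```
// Hedged sketch: fold a vector shift whose count is the constant zero.
// Assumes in(1) is the vector input and in(2) the ShiftCntV-style count node.
Node* LShiftVINode::Identity(PhaseGVN* phase) {
  Node* cnt = in(2);
  if (cnt->Opcode() == Op_LShiftCntV) {                 // count wrapped in a CntV node
    const TypeInt* t = phase->type(cnt->in(1))->isa_int();
    if (t != nullptr && t->is_con() && t->get_con() == 0) {
      return in(1);                                     // shift by 0: the vector itself
    }
  }
  return this;
}
```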

[TESTS]
compiler/vectorapi/TestVectorShiftImm.java, jdk/incubator/vector,
hotspot::tier1 passed without new failure.

Change-Id: I7657c0daaa5f758966936b9ede670c8b9ad94c48
cushon pushed a commit to cushon/jdk that referenced this pull request Apr 2, 2021
e1iu pushed a commit to e1iu/jdk that referenced this pull request Apr 7, 2021
The vector shift count was defined by two separate nodes (LShiftCntV and
RShiftCntV), which prevents them from being shared even when the shift
counts are the same.

```
public static void test_shiftv(int sh) {
    for (int i = 0; i < N; i+=1) {
        a0[i] = a1[i] << sh;
        b0[i] = b1[i] >> sh;
    }
}
```

Given the example above, by merging the same shift counts into one
node, they can be shared by the shift nodes (RShiftV or LShiftV), as
below:

```
Before:
1184  LShiftCntV  === _  1189  [[ 1185  ... ]]
1190  RShiftCntV  === _  1189  [[ 1191  ... ]]
1185  LShiftVI  === _  1181  1184  [[ 1186 ]]
1191  RShiftVI  === _  1187  1190  [[ 1192 ]]

After:
1190  ShiftCntV  === _  1189  [[ 1191 1204  ... ]]
1204  LShiftVI  === _  1211  1190  [[ 1203 ]]
1191  RShiftVI  === _  1187  1190  [[ 1192 ]]
```

The final code removes one redundant “dup” (scalar->vector) and saves
one register.

```
Before:
        dup     v16.16b, w12
        dup     v17.16b, w12
        ...
        ldr     q18, [x13, #16]
        sshl    v18.4s, v18.4s, v16.4s
        add     x18, x16, x12           ; iaload

        add     x4, x15, x12
        str     q18, [x4, #16]          ; iastore

        ldr     q18, [x18, #16]
        add     x12, x14, x12
        neg     v19.16b, v17.16b
        sshl    v18.4s, v18.4s, v19.4s
        str     q18, [x12, #16]         ; iastore

After:
        dup     v16.16b, w11
        ...
        ldr     q17, [x13, #16]
        sshl    v17.4s, v17.4s, v16.4s
        add     x2, x22, x11            ; iaload

        add     x4, x16, x11
        str     q17, [x4, #16]          ; iastore

        ldr     q17, [x2, #16]
        add     x11, x21, x11
        neg     v18.16b, v16.16b
        sshl    v17.4s, v17.4s, v18.4s
        str     q17, [x11, #16]         ; iastore

```

Change-Id: I047f3f32df9535d706a9920857d212610e8ce315
openjdk-notifier bot pushed a commit that referenced this pull request Oct 5, 2021
r18 should not be used, as it is reserved as a platform register. Linux is
fine with userspace using it, but Windows and, more recently, macOS
(openjdk/jdk11u-dev#301 (comment))
actually use it on the kernel side.

The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to
specify which registers to spill; fortunately this helper is only used
here:
https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404
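
As a rough illustration of the idea (a hedged sketch with illustrative
constants; the real fix changes the RegSet used by `::pusha`/`::popa`
rather than a literal mask), excluding r18 from that pattern is a
one-bit change:

```
// Hedged sketch: drop r18 (the platform register) from the spill mask.
#include <cstdint>

const uint32_t ALL_GP_REGS  = 0x7fffffff;       // r0-r30
const uint32_t PLATFORM_REG = 1u << 18;         // r18, kernel-reserved on Windows/macOS
const uint32_t SPILL_REGS   = ALL_GP_REGS & ~PLATFORM_REG;  // == 0x7ffbffff
```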

I haven't seen this particular instance causing any issues in practice
_yet_, presumably because it is hard to align the stars in order to
trigger a problem (between the stp and ldp of r18, a transition to kernel
space must happen *and* the kernel needs to do something with r18). But
jdk11u-dev has more usages of the `::pusha`/`::popa` macros, and that
causes trouble, as explained in the link above.

Output of `-XX:+PrintInterpreter` before this change:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes
--------------------------------------------------------------------------------
  0x0000000138809b00:   ldr x2, [x12, #16]
  0x0000000138809b04:   ldrh    w2, [x2, #44]
  0x0000000138809b08:   add x24, x20, x2, uxtx #3
  0x0000000138809b0c:   sub x24, x24, #0x8
[...]
  0x0000000138809fa4:   stp x16, x17, [sp, #128]
  0x0000000138809fa8:   stp x18, x19, [sp, #144]
  0x0000000138809fac:   stp x20, x21, [sp, #160]
[...]
  0x0000000138809fc0:   stp x30, xzr, [sp, #240]
  0x0000000138809fc4:   mov x0, x28
 ;; 0x10864ACCC
  0x0000000138809fc8:   mov x9, #0xaccc                 // #44236
  0x0000000138809fcc:   movk    x9, #0x864, lsl #16
  0x0000000138809fd0:   movk    x9, #0x1, lsl #32
  0x0000000138809fd4:   blr x9
  0x0000000138809fd8:   ldp x2, x3, [sp, #16]
[...]
  0x0000000138809ff4:   ldp x16, x17, [sp, #128]
  0x0000000138809ff8:   ldp x18, x19, [sp, #144]
  0x0000000138809ffc:   ldp x20, x21, [sp, #160]
```

After:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes

--------------------------------------------------------------------------------
  0x0000000108e4db00:   ldr x2, [x12, #16]
  0x0000000108e4db04:   ldrh    w2, [x2, #44]
  0x0000000108e4db08:   add x24, x20, x2, uxtx #3
  0x0000000108e4db0c:   sub x24, x24, #0x8
[...]
  0x0000000108e4dfa4:   stp x16, x17, [sp, #128]
  0x0000000108e4dfa8:   stp x19, x20, [sp, #144]
  0x0000000108e4dfac:   stp x21, x22, [sp, #160]
[...]
  0x0000000108e4dfbc:   stp x29, x30, [sp, #224]
  0x0000000108e4dfc0:   mov x0, x28
 ;; 0x107E4A06C
  0x0000000108e4dfc4:   mov x9, #0xa06c                 // #41068
  0x0000000108e4dfc8:   movk    x9, #0x7e4, lsl #16
  0x0000000108e4dfcc:   movk    x9, #0x1, lsl #32
  0x0000000108e4dfd0:   blr x9
  0x0000000108e4dfd4:   ldp x2, x3, [sp, #16]
[...]
  0x0000000108e4dff0:   ldp x16, x17, [sp, #128]
  0x0000000108e4dff4:   ldp x19, x20, [sp, #144]
  0x0000000108e4dff8:   ldp x21, x22, [sp, #160]
[...]
```
lewurm added a commit to lewurm/openjdk that referenced this pull request Oct 6, 2021
The restore sequence looks like this now:
```
  0x0000000106e4dfcc:   movk    x9, #0x5e4, lsl #16
  0x0000000106e4dfd0:   movk    x9, #0x1, lsl #32
  0x0000000106e4dfd4:   blr x9
  0x0000000106e4dfd8:   ldp x2, x3, [sp, #16]
  0x0000000106e4dfdc:   ldp x4, x5, [sp, #32]
  0x0000000106e4dfe0:   ldp x6, x7, [sp, #48]
  0x0000000106e4dfe4:   ldp x8, x9, [sp, #64]
  0x0000000106e4dfe8:   ldp x10, x11, [sp, #80]
  0x0000000106e4dfec:   ldp x12, x13, [sp, #96]
  0x0000000106e4dff0:   ldp x14, x15, [sp, #112]
  0x0000000106e4dff4:   ldp x16, x17, [sp, #128]
  0x0000000106e4dff8:   ldp x0, x1, [sp], #144
  0x0000000106e4dffc:   ldp xzr, x19, [sp], #16
  0x0000000106e4e000:   ldp x22, x23, [sp, #16]
  0x0000000106e4e004:   ldp x24, x25, [sp, #32]
  0x0000000106e4e008:   ldp x26, x27, [sp, #48]
  0x0000000106e4e00c:   ldp x28, x29, [sp, #64]
  0x0000000106e4e010:   ldp x30, xzr, [sp, #80]
  0x0000000106e4e014:   ldp x20, x21, [sp], #96
  0x0000000106e4e018:   ldur    x12, [x29, #-24]
  0x0000000106e4e01c:   ldr x22, [x12, #16]
  0x0000000106e4e020:   add x22, x22, #0x30
  0x0000000106e4e024:   ldr x8, [x28, #8]
```
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Dec 8, 2021
The patch aims to optimize Math.abs() mainly in these three ways:
1) Remove redundant instructions for abs with constant values
2) Remove redundant instructions for abs with char type
3) Convert some common abs operations to ideal forms

1. Remove redundant instructions for abs with constant values

If we can determine the value of the input node of Math.abs()
at compile time, we can substitute the Abs node with the absolute
value of the constant and don't have to calculate it at runtime.

For example,
  int[] a
  for (int i = 0; i < SIZE; i++) {
    a[i] = Math.abs(-38);
  }

Before the patch, the generated code for the testcase above is:
...
  mov   w10, #0xffffffda
  cmp   w10, wzr
  cneg  w17, w10, lt
  dup   v16.8h, w17
...
After the patch, the generated code for the testcase above is:
...
  movi  v16.4s, #0x26
...

2. Remove redundant instructions for abs with char type

In Java semantics, as the char type is always non-negative, we
can simply remove the AbsI node in the C2 middle end.

As for the vectorization part, in the current SLP, vectorization of
Math.abs() with char type was intentionally disabled after
JDK-8261022 because it previously generated incorrect results. After
removing the AbsI node in the middle end, Math.abs(char) can be
vectorized naturally.

For example,

  char[] a;
  char[] b;
  for (int i = 0; i < SIZE; i++) {
    b[i] = (char) Math.abs(a[i]);
  }

Before the patch, the generated assembly code for the testcase
above is:

B15:
  add   x13, x21, w20, sxtw #1
  ldrh  w11, [x13, #16]
  cmp   w11, wzr
  cneg  w10, w11, lt
  strh  w10, [x13, #16]
  ldrh  w10, [x13, #18]
  cmp   w10, wzr
  cneg  w10, w10, lt
  strh  w10, [x13, #18]
  ...
  add   w20, w20, #0x1
  cmp   w20, w17
  b.lt  B15

After the patch, the generated assembly code is:
B15:
  sbfiz x18, x19, #1, #32
  add   x0, x14, x18
  ldr   q16, [x0, #16]
  add   x18, x21, x18
  str   q16, [x18, #16]
  ldr   q16, [x0, #32]
  str   q16, [x18, #32]
  ...
  add   w19, w19, #0x40
  cmp   w19, w17
  b.lt  B15

3. Convert some common abs operations to ideal forms

The patch overrides some virtual support functions of AbsNode
so that GVN optimizations can work on it. Here are the optimizable
forms:

a) abs(0 - x) => abs(x)

Before the patch:
  ...
  ldr   w13, [x13, #16]
  neg   w13, w13
  cmp   w13, wzr
  cneg  w14, w13, lt
  ...
After the patch:
  ...
  ldr   w13, [x13, #16]
  cmp   w13, wzr
  cneg  w13, w13, lt
  ...

b) abs(abs(x))  => abs(x)

Before the patch:
  ...
  ldr   w12, [x12, #16]
  cmp   w12, wzr
  cneg  w12, w12, lt
  cmp   w12, wzr
  cneg  w12, w12, lt
  ...
After the patch:
  ...
  ldr   w13, [x13, #16]
  cmp   w13, wzr
  cneg  w13, w13, lt
  ...
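
A minimal sketch of how the two rewrites could hook into C2's GVN (class,
opcode and helper names are assumptions for illustration, not the actual
patch):

```
// Hedged sketch of the two listed rewrites for the 32-bit case.
// abs(abs(x)) => abs(x): Identity returns an existing equivalent node.
Node* AbsINode::Identity(PhaseGVN* phase) {
  if (in(1)->Opcode() == Op_AbsI) {
    return in(1);                 // absolute value is idempotent
  }
  return this;
}

// abs(0 - x) => abs(x): Ideal rewires the input to skip the negation.
Node* AbsINode::Ideal(PhaseGVN* phase, bool can_reshape) {
  Node* sub = in(1);
  if (sub->Opcode() == Op_SubI && phase->type(sub->in(1)) == TypeInt::ZERO) {
    set_req(1, sub->in(2));       // |0 - x| == |x|
    return this;
  }
  return nullptr;
}
```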

Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd
shqking added a commit to shqking/jdk that referenced this pull request Mar 7, 2022
*** Implementation

In AArch64 NEON, vector shift right is implemented by the vector shift
left instructions (SSHL[1] and USHL[2]) with a negative shift count. In
the C2 backend, we generate a `neg` of the given shift value followed by
an `sshl` or `ushl` instruction.

For vector shift right, the vector shift count has two origins:
1) it can be duplicated from a scalar variable/immediate (case-1),
2) it can be loaded directly from one vector (case-2).

This patch aims to optimize case-1. Specifically, we move the negate
from the RShiftV* rules to the RShiftCntV rule. As a result, the negate
can be hoisted out of the loop if it is a loop invariant.

In this patch,
1) we split the vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
shift left and shift right respectively. Compared to the vslcnt* rules,
the negate is performed in vsrcnt*.
2) for each vsra* and vsrl* rule, we create one variant, i.e. vsra*_var
and vsrl*_var. We use the vsra* and vsrl* rules to handle case-1, and use
the vsra*_var and vsrl*_var rules to handle case-2. Note that
ShiftVNode::is_var_shift() can be used to distinguish case-1 from
case-2.
3) we add one assertion for the vs*_imm rules as we have done on
ARM32[3].
4) several style issues are resolved.

*** Example

Take function `rShiftInt()` in the newly added micro benchmark
VectorShiftRight.java as an example.

```
public void rShiftInt() {
    for (int i = 0; i < SIZE; i++) {
        intsB[i] = intsA[i] >> count;
    }
}
```

Arithmetic shift right is conducted inside a big loop. The following
code snippet shows the disassembly generated by auto-vectorization
before applying the current patch. We can see that the `neg` is
performed in the loop body.

```
0x0000ffff89057a64:   dup     v16.16b, w13              <-- dup
0x0000ffff89057a68:   mov     w12, #0x7d00                    // #32000
0x0000ffff89057a6c:   sub     w13, w2, w10
0x0000ffff89057a70:   cmp     w2, w10
0x0000ffff89057a74:   csel    w13, wzr, w13, lt
0x0000ffff89057a78:   mov     w8, #0x7d00                     // #32000
0x0000ffff89057a7c:   cmp     w13, w8
0x0000ffff89057a80:   csel    w13, w12, w13, hi
0x0000ffff89057a84:   add     w14, w13, w10
0x0000ffff89057a88:   nop
0x0000ffff89057a8c:   nop
0x0000ffff89057a90:   sbfiz   x13, x10, #2, #32         <-- loop entry
0x0000ffff89057a94:   add     x15, x17, x13
0x0000ffff89057a98:   ldr     q17, [x15,#16]
0x0000ffff89057a9c:   add     x13, x0, x13
0x0000ffff89057aa0:   neg     v18.16b, v16.16b          <-- neg
0x0000ffff89057aa4:   sshl    v17.4s, v17.4s, v18.4s    <-- shift right
0x0000ffff89057aa8:   str     q17, [x13,#16]
0x0000ffff89057aac:   ...
0x0000ffff89057b1c:   add     w10, w10, #0x20
0x0000ffff89057b20:   cmp     w10, w14
0x0000ffff89057b24:   b.lt    0x0000ffff89057a90        <-- loop end
```

Here is the disassembly after applying the current patch. The negate
is no longer performed inside the loop; it has been hoisted to the
outside.

```
0x0000ffff8d053a68:   neg     w14, w13                  <---- neg
0x0000ffff8d053a6c:   dup     v16.16b, w14              <---- dup
0x0000ffff8d053a70:   sub     w14, w2, w10
0x0000ffff8d053a74:   cmp     w2, w10
0x0000ffff8d053a78:   csel    w14, wzr, w14, lt
0x0000ffff8d053a7c:   mov     w8, #0x7d00                     // #32000
0x0000ffff8d053a80:   cmp     w14, w8
0x0000ffff8d053a84:   csel    w14, w12, w14, hi
0x0000ffff8d053a88:   add     w13, w14, w10
0x0000ffff8d053a8c:   nop
0x0000ffff8d053a90:   sbfiz   x14, x10, #2, #32         <-- loop entry
0x0000ffff8d053a94:   add     x15, x17, x14
0x0000ffff8d053a98:   ldr     q17, [x15,#16]
0x0000ffff8d053a9c:   sshl    v17.4s, v17.4s, v16.4s    <-- shift right
0x0000ffff8d053aa0:   add     x14, x0, x14
0x0000ffff8d053aa4:   str     q17, [x14,#16]
0x0000ffff8d053aa8:   ...
0x0000ffff8d053afc:   add     w10, w10, #0x20
0x0000ffff8d053b00:   cmp     w10, w13
0x0000ffff8d053b04:   b.lt    0x0000ffff8d053a90        <-- loop end
```

*** Testing

Tier1~3 tests passed on Linux/AArch64 platform.

*** Performance Evaluation

- Auto-vectorization

One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.

The following table shows the result. Column `Score-1` shows the score
before applying the current patch, and column `Score-2` shows the score
after applying it.

We observe about 30% ~ 53% improvement on the microbenchmarks.

```
Benchmark                      Units    Score-1    Score-2
VectorShiftRight.rShiftByte   ops/ms  10601.980  13816.353
VectorShiftRight.rShiftInt    ops/ms   3592.831   5502.941
VectorShiftRight.rShiftLong   ops/ms   1584.012   2425.247
VectorShiftRight.rShiftShort  ops/ms   6643.414   9728.762
VectorShiftRight.urShiftByte  ops/ms   2066.965   2048.336 (*)
VectorShiftRight.urShiftChar  ops/ms   6660.805   9728.478
VectorShiftRight.urShiftInt   ops/ms   3592.909   5514.928
VectorShiftRight.urShiftLong  ops/ms   1583.995   2422.991

*: Logical shift right for Byte type (urShiftByte) is not vectorized, as
discussed in [4].
```

- VectorAPI

Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
`Score-1` and `Score-2` show the scores before and after applying
current patch.

```
Benchmark                  Units    Score-1    Score-2
Byte128Vector.LSHL        ops/ms  10867.666  10873.993
Byte128Vector.LSHLShift   ops/ms  10945.729  10945.741
Byte128Vector.LSHR        ops/ms   8629.305   8629.343
Byte128Vector.LSHRShift   ops/ms   8245.864  10303.521   <--
Byte128Vector.ASHR        ops/ms   8619.691   8629.438
Byte128Vector.ASHRShift   ops/ms   8245.860  10305.027   <--
Int128Vector.LSHL         ops/ms   3104.213   3103.702
Int128Vector.LSHLShift    ops/ms   3114.354   3114.371
Int128Vector.LSHR         ops/ms   2380.717   2380.693
Int128Vector.LSHRShift    ops/ms   2312.871   2992.377   <--
Int128Vector.ASHR         ops/ms   2380.668   2380.647
Int128Vector.ASHRShift    ops/ms   2312.894   2992.332   <--
Long128Vector.LSHL        ops/ms   1586.907   1587.591
Long128Vector.LSHLShift   ops/ms   1589.469   1589.540
Long128Vector.LSHR        ops/ms   1209.754   1209.687
Long128Vector.LSHRShift   ops/ms   1174.718   1527.502   <--
Long128Vector.ASHR        ops/ms   1209.713   1209.669
Long128Vector.ASHRShift   ops/ms   1174.712   1527.174   <--
Short128Vector.LSHL       ops/ms   5945.542   5943.770
Short128Vector.LSHLShift  ops/ms   5984.743   5984.640
Short128Vector.LSHR       ops/ms   4613.378   4613.577
Short128Vector.LSHRShift  ops/ms   4486.023   5746.466   <--
Short128Vector.ASHR       ops/ms   4613.389   4613.478
Short128Vector.ASHRShift  ops/ms   4486.019   5746.368   <--
```

1) For logical shift left (LSHL and LSHLShift) and shift right with a
variable vector shift count (LSHR and ASHR), we didn't find much
change, which is expected.

2) For shift right with a scalar shift count (LSHRShift and ASHRShift),
about 25% ~ 30% improvement can be observed, and this benefit is
introduced by the current patch.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Mar 14, 2022
After JDK-8275317, C2's SLP vectorizer has supported type conversion
between the same data size. We can also support conversions between
different data sizes like:
int <-> double
float <-> long
int <-> long
float <-> double

A typical test case:

int[] a;
double[] b;
for (int i = start; i < limit; i++) {
    b[i] = (double) a[i];
}

Our expected OptoAssembly code for one iteration is like below:

add R12, R2, R11, LShiftL #2
vector_load   V16,[R12, #16]
vectorcast_i2d  V16, V16  # convert I to D vector
add R11, R1, R11, LShiftL #3	# ptr
add R13, R11, #16	# ptr
vector_store [R13], V16

To enable the vectorization, the patch solves the following problems
in the SLP.

There are three main operations in the case above: LoadI, ConvI2D and
StoreD. Assuming that the vector length is 128 bits, how many scalar
nodes should be packed together into a vector? If we decide it
separately for each operation node, as we did before the patch in
SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI
or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes
in a vector node sequence, like loading 4 elements to a vector, then
typecasting 2 elements and lastly storing these 2 elements, they become
invalid. As a result, we should look through the whole def-use chain
and pick the minimum of these element counts, as the function
SuperWord::max_vector_size_in_ud_chain() does in superword.cpp.
In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then
generate a valid vector node sequence: loading 2 elements,
converting the 2 elements to another type and storing the 2 elements
with the new type.
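
A small standalone sketch of the counting argument (function name assumed
from the description above):

```
#include <algorithm>
#include <cstdio>
#include <vector>

// Hedged sketch: the pack size for the whole def-use chain is limited by
// the widest type it touches (LoadI = 4 bytes, ConvI2D/StoreD = 8 bytes).
int max_elems_in_ud_chain(int vector_bytes, const std::vector<int>& type_bytes) {
  int widest = *std::max_element(type_bytes.begin(), type_bytes.end());
  return vector_bytes / widest;
}

int main() {
  // 128-bit (16-byte) vector, chain LoadI -> ConvI2D -> StoreD.
  printf("%d\n", max_elems_in_ud_chain(16, {4, 8, 8}));  // prints 2, not 4
  return 0;
}
```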

After this, LoadI nodes don't make full use of the whole vector and
only occupy part of it. So we adapt the code in
SuperWord::get_vw_bytes_special() to the situation.

In SLP, we calculate a kind of alignment as a position trace for each
scalar node in the whole vector. In this case, the alignments for the 2
LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes
are 0 and 8. Here, 4 for LoadI and 8 for ConvI2D mean the same thing,
namely that the node is the second node in the whole vector; the
difference between 4 and 8 is just due to their own data sizes. In
this situation, we should try to remove the impact caused by different
data sizes in SLP. For example, in the stage of
SuperWord::extend_packlist(), while determining whether it's possible to
pack a pair of def nodes in the function SuperWord::follow_use_defs(),
we remove the side effect of different data sizes by transforming the
target alignment from the use node. We believe that, assuming
the vector length is 512 bits, if the ConvI2D use nodes have
alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12,
these two LoadI nodes should be packed as a pair as well.

Similarly, when determining whether the vectorization is profitable,
type conversion between different data sizes takes a type of one size
and produces a type of another size, hence special checks on alignment
and size should be applied, as we do in SuperWord::is_vector_use().

After solving these problems, we successfully implemented the
vectorization of type conversion between different data sizes.

Here is the test data on NEON:

Before the patch:
Benchmark              (length)  Mode  Cnt    Score   Error  Units
  VectorLoop.convertD2F       523  avgt   15  216.431 ± 0.131  ns/op
  VectorLoop.convertD2I       523  avgt   15  220.522 ± 0.311  ns/op
  VectorLoop.convertF2D       523  avgt   15  217.034 ± 0.292  ns/op
  VectorLoop.convertF2L       523  avgt   15  231.634 ± 1.881  ns/op
  VectorLoop.convertI2D       523  avgt   15  229.538 ± 0.095  ns/op
  VectorLoop.convertI2L       523  avgt   15  214.822 ± 0.131  ns/op
  VectorLoop.convertL2F       523  avgt   15  230.188 ± 0.217  ns/op
  VectorLoop.convertL2I       523  avgt   15  162.234 ± 0.235  ns/op

After the patch:
Benchmark              (length)  Mode  Cnt    Score    Error  Units
  VectorLoop.convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
  VectorLoop.convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
  VectorLoop.convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
  VectorLoop.convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
  VectorLoop.convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
  VectorLoop.convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
  VectorLoop.convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
  VectorLoop.convertL2I       523  avgt   15  122.339 ±  2.738  ns/op

perf data on X86:
Before the patch:
Benchmark              (length)  Mode  Cnt    Score   Error  Units
  VectorLoop.convertD2F       523  avgt   15  279.466 ± 0.069  ns/op
  VectorLoop.convertD2I       523  avgt   15  551.009 ± 7.459  ns/op
  VectorLoop.convertF2D       523  avgt   15  276.066 ± 0.117  ns/op
  VectorLoop.convertF2L       523  avgt   15  545.108 ± 5.697  ns/op
  VectorLoop.convertI2D       523  avgt   15  745.303 ± 0.185  ns/op
  VectorLoop.convertI2L       523  avgt   15  260.878 ± 0.044  ns/op
  VectorLoop.convertL2F       523  avgt   15  502.016 ± 0.172  ns/op
  VectorLoop.convertL2I       523  avgt   15  261.654 ± 3.326  ns/op

After the patch:
Benchmark              (length)  Mode  Cnt    Score   Error  Units
  VectorLoop.convertD2F       523  avgt   15  106.975 ± 0.045  ns/op
  VectorLoop.convertD2I       523  avgt   15  546.866 ± 9.287  ns/op
  VectorLoop.convertF2D       523  avgt   15   82.414 ± 0.340  ns/op
  VectorLoop.convertF2L       523  avgt   15  542.235 ± 2.785  ns/op
  VectorLoop.convertI2D       523  avgt   15   92.966 ± 1.400  ns/op
  VectorLoop.convertI2L       523  avgt   15   79.960 ± 0.528  ns/op
  VectorLoop.convertL2F       523  avgt   15  504.712 ± 4.794  ns/op
  VectorLoop.convertL2I       523  avgt   15  129.753 ± 0.094  ns/op

perf data on AVX512:
Before the patch:
Benchmark              (length)  Mode  Cnt    Score   Error  Units
  VectorLoop.convertD2F       523  avgt   15  282.984 ± 4.022  ns/op
  VectorLoop.convertD2I       523  avgt   15  543.080 ± 3.873  ns/op
  VectorLoop.convertF2D       523  avgt   15  273.950 ± 0.131  ns/op
  VectorLoop.convertF2L       523  avgt   15  539.568 ± 2.747  ns/op
  VectorLoop.convertI2D       523  avgt   15  745.238 ± 0.069  ns/op
  VectorLoop.convertI2L       523  avgt   15  260.935 ± 0.169  ns/op
  VectorLoop.convertL2F       523  avgt   15  501.870 ± 0.359  ns/op
  VectorLoop.convertL2I       523  avgt   15  257.508 ± 0.174  ns/op

After the patch:
Benchmark              (length)  Mode  Cnt    Score   Error  Units
  VectorLoop.convertD2F       523  avgt   15   76.687 ± 0.530  ns/op
  VectorLoop.convertD2I       523  avgt   15  545.408 ± 4.657  ns/op
  VectorLoop.convertF2D       523  avgt   15  273.935 ± 0.099  ns/op
  VectorLoop.convertF2L       523  avgt   15  540.534 ± 3.032  ns/op
  VectorLoop.convertI2D       523  avgt   15  745.234 ± 0.053  ns/op
  VectorLoop.convertI2L       523  avgt   15  260.865 ± 0.104  ns/op
  VectorLoop.convertL2F       523  avgt   15   63.834 ± 4.777  ns/op
  VectorLoop.convertL2I       523  avgt   15   48.183 ± 0.990  ns/op

Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
e1iu pushed a commit to e1iu/jdk that referenced this pull request Mar 24, 2022
This patch fixes the wrong matching rule of replicate2L_zero. It
matched "ReplicateI" by mistake, so long immediates (not only zero)
had to be moved to a register first and finally matched to replicate2L.
To fix this trivial bug, this patch corrects the typo and extends the
rule of replicate2L_zero to replicate2L_imm, which now supports all
possible long immediate values.

The final code changes are shown below:

replicate2L_imm:

        mov   x13, #0xff
        movk  x13, #0xff, lsl #16
        movk  x13, #0xff, lsl #32
        dup   v16.2d, x13

        =>

        movi  v16.2d, #0xff00ff00ff

[Test]
test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi
passed without failure.

Change-Id: Ieac92820dea560239a968de3d7430003f01726bd
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Mar 28, 2022
```
public short[] vectorUnsignedShiftRight(short[] shorts) {
    short[] res = new short[SIZE];
    for (int i = 0; i < SIZE; i++) {
        res[i] = (short) (shorts[i] >>> 3);
    }
    return res;
}
```
In C2's SLP, vectorization of unsigned shift right on signed
subword types (byte/short) like the case above is intentionally
disabled[1], because the vector unsigned shift on signed
subword types behaves differently from the Java spec. It's
worthwhile to vectorize more such cases at quite low cost. Also,
unsigned shift right on signed subwords is not uncommon, and we
may find similar cases in the Lucene benchmark[2].

Taking unsigned right shift on short type as an example,

Short:
    | <- 16 bits  -> |  <- 16 bits ->  |
    | 1 1 1 ... 1  1 |      data       |

when the shift amount is a constant not greater than the number
of sign-extended bits (the 16 higher bits for short type, shown
above), the unsigned shift on signed subword types can be
transformed into a signed shift and hence becomes vectorizable.
Here is the transformation:

For T_SHORT (shift <= 16):
  src    RShiftCntV shift          src    RShiftCntV shift
   \      /                  ==>    \       /
   URShiftVS                         RShiftVS

This patch does the transformation in SuperWord::implemented() and
SuperWord::output(). It helps vectorize the short cases above. We
can handle unsigned right shift on byte type in a similar way. The
generated assembly code for one iteration on aarch64 is like:
```
...
sbfiz   x13, x10, #1, #32
add     x15, x11, x13
ldr     q16, [x15, #16]
sshr    v16.8h, v16.8h, #3
add     x13, x17, x13
str     q16, [x13, #16]
...
```
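
The legality argument can be checked exhaustively. Below is a small
standalone sketch (mirroring Java's promotion of short to int; arithmetic
right shift on negative values is assumed, as on mainstream compilers):

```
#include <cassert>
#include <cstdint>

// Hedged sketch: for every short value and every shift count <= 16, an
// unsigned and a signed right shift agree on the low 16 bits, which is
// why the URShiftVS -> RShiftVS rewrite is legal for such constants.
int main() {
  for (int v = -32768; v <= 32767; v++) {
    int32_t x = (int16_t)v;                                // sign-extended, as in Java
    for (int sh = 0; sh <= 16; sh++) {
      int16_t unsigned_sr = (int16_t)((uint32_t)x >> sh);  // Java >>>
      int16_t signed_sr   = (int16_t)(x >> sh);            // Java >>
      assert(unsigned_sr == signed_sr);
    }
  }
  return 0;
}
```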

Here is the performance data for the micro-benchmarks before and after
this patch on both AArch64 and x64 machines. We observe about 80%
improvement with this patch.

The perf data on AArch64:
Before the patch:
Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
urShiftImmByte    1024         3       avgt    5  295.711 ± 0.117  ns/op
urShiftImmShort   1024         3       avgt    5  284.559 ± 0.148  ns/op

After the patch:
Benchmark         (SIZE) (shiftCount)  Mode  Cnt    Score   Error  Units
urShiftImmByte     1024        3       avgt    5   45.111 ± 0.047  ns/op
urShiftImmShort    1024        3       avgt    5   55.294 ± 0.072  ns/op

The perf data on X86:
Before the patch:
Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
urShiftImmByte    1024        3       avgt    5  361.374 ±  4.621  ns/op
urShiftImmShort   1024        3       avgt    5  365.390 ±  3.595  ns/op

After the patch:
Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
urShiftImmByte    1024        3       avgt    5  105.489 ±  0.488  ns/op
urShiftImmShort   1024        3       avgt    5   43.400 ±  0.394  ns/op

[1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
[2] https://github.com/jpountz/decode-128-ints-benchmark/

Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
e1iu pushed a commit to e1iu/jdk that referenced this pull request Mar 29, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64, providing a more efficient approach to move mask bits from a
predicate register to a general-purpose register, as x86 PMOVMSK[1]
does, by using BEXT[2], which is available in SVE2.

With this patch, the final code (input mask is byte type with
SPECIES_512, generated on a QEMU emulator with a 512-bit SVE vector
register size) changes as below:

Before:

        mov     z16.b, p0/z, #1
        fmov    x0, d16
        orr     x0, x0, x0, lsr #7
        orr     x0, x0, x0, lsr #14
        orr     x0, x0, x0, lsr #28
        and     x0, x0, #0xff
        fmov    x8, v16.d[1]
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #8

        orr     x8, xzr, #0x2
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #16

        orr     x8, xzr, #0x3
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #24

        orr     x8, xzr, #0x4
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #32

        mov     x8, #0x5
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #40

        orr     x8, xzr, #0x6
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #48

        orr     x8, xzr, #0x7
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #56

After:

        mov     z16.b, p0/z, #1
        mov     z17.b, #1
        bext    z16.d, z16.d, z17.d
        mov     z17.d, #0
        uzp1    z16.s, z16.s, z17.s
        uzp1    z16.h, z16.h, z17.h
        uzp1    z16.b, z16.b, z17.b
        mov     x0, v16.d[0]

[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-

Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
franferrax added a commit to franferrax/jdk that referenced this pull request Aug 11, 2022
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Aug 17, 2022
After JDK-8283091, the loop below can be vectorized partially.
Statement 1 can be vectorized but statement 2 can't.
```
// int[] iArr; long[] lArrFld; int i1,i2;
for (i1 = 6; i1 < 227; i1++) {
  iArr[i1] += lArrFld[i1]++; // statement 1
  iArr[i1 + 1] -= (i2++); // statement 2
}
```

But we got incorrect results because the vector packs of iArr are
scheduled incorrectly like:
```
...
load_vector XMM1,[R8 + #16 + R11 << #2]
movl    RDI, [R8 + #20 + R11 << #2] # int
load_vector XMM2,[R9 + #8 + R11 << #3]
subl    RDI, R11    # int
vpaddq  XMM3,XMM2,XMM0  ! add packedL
store_vector [R9 + #8 + R11 << #3],XMM3
vector_cast_l2x  XMM2,XMM2  !
vpaddd  XMM1,XMM2,XMM1  ! add packedI
addl    RDI, #228   # int
movl    [R8 + #20 + R11 << #2], RDI # int
movl    RBX, [R8 + #24 + R11 << #2] # int
subl    RBX, R11    # int
addl    RBX, #227   # int
movl    [R8 + #24 + R11 << #2], RBX # int
...
movl    RBX, [R8 + #40 + R11 << #2] # int
subl    RBX, R11    # int
addl    RBX, #223   # int
movl    [R8 + #40 + R11 << #2], RBX # int
movl    RDI, [R8 + #44 + R11 << #2] # int
subl    RDI, R11    # int
addl    RDI, #222   # int
movl    [R8 + #44 + R11 << #2], RDI # int
store_vector [R8 + #16 + R11 << #2],XMM1
...
```
simplified as:
```
load_vector iArr in statement 1
unvectorized loads/stores in statement 2
store_vector iArr in statement 1
```
We cannot pick the memory state from the first load for the LoadI pack
here, as the LoadI vector operation must load the new values in memory
after iArr writes 'iArr[i1 + 1] - (i2++)' to 'iArr[i1 + 1]' (statement 2).
We must take the memory state of the last load, where we have assigned
the new values ('iArr[i1 + 1] - (i2++)') to the iArr array.

In JDK-8240281, we picked the memory state of the first load. Different
from the scenario in JDK-8240281, the store, which depends on an
earlier load here, is in a pack to be scheduled, and the LoadI pack
depends on the last_mem. As designed[2], to schedule the StoreI pack,
all memory operations in another single pack should be moved in the same
direction. We know that the store in the pack depends on one of the
loads in the LoadI pack, so the LoadI pack should be scheduled before
the StoreI pack. And the LoadI pack depends on the last_mem, so the
last_mem must be scheduled before the LoadI pack and also before the
store pack. Therefore, we need to take the memory state of the last
load for the LoadI pack here.

To fix it, the patch adds additional checks while picking the memory
state of the first load. When the store is located in a pack and the
load pack relies on the last_mem, we shouldn't choose the memory state
of the first load but the memory state of the last load.

[1]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2380
[2]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2232

Jira: ENTLLT-5482
Change-Id: I341d10b91957b60a1b4aff8116723e54083a5fb8
CustomizedGitHooks: yes
Bhavana-Kilambi added a commit to Bhavana-Kilambi/jdk that referenced this pull request Sep 5, 2022
…nodes

Recently we found that the rotate left/right benchmarks with the Vector
API emit a redundant "and" instruction on both aarch64 and x86_64
machines, which can be done away with. For example, and(and(a, b), b)
generates two "and" instructions, which can be reduced to a single "and"
operation, and(a, b), since "and" (and "or") operations are commutative
and idempotent in nature. This can help improve performance for
workloads that have multiple "and"/"or" operations on the same value by
reducing them to fewer "and"/"or" operations accordingly.

This patch adds the following transformations for vector logical
operations, AndV and OrV:

(OpV (OpV a b) b) => (OpV a b)
(OpV (OpV a b) a) => (OpV a b)
(OpV (OpV a b m1) b m1) => (OpV a b m1)
(OpV (OpV a b m1) a m1) => (OpV a b m1)
(OpV a (OpV a b)) => (OpV a b)
(OpV b (OpV a b)) => (OpV a b)
(OpV a (OpV a b m) m) => (OpV a b m)
where Op = "And", "Or"
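
A minimal sketch of how the first pair of rules could be implemented as
an Identity-style simplification (class and accessor names are
assumptions; the masked forms are omitted for brevity):

```
// Hedged sketch: (OpV (OpV a b) b) => (OpV a b) and the mirrored forms,
// valid because vector AND/OR are commutative and idempotent.
Node* AndVNode::Identity(PhaseGVN* phase) {
  Node* lhs = in(1);
  Node* rhs = in(2);
  if (lhs->Opcode() == Opcode() && (lhs->in(1) == rhs || lhs->in(2) == rhs)) {
    return lhs;                   // outer AndV repeats an operand: reuse inner node
  }
  if (rhs->Opcode() == Opcode() && (rhs->in(1) == lhs || rhs->in(2) == lhs)) {
    return rhs;
  }
  return this;
}
```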

Links for the benchmarks tested are given below:
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L728
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L764
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L728
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L764

Before this patch, the disassembly for one of these testcases
(IntMaxVector.ROR) on Neon is shown below:
  ldr   q16, [x12, #16]
  and   v16.16b, v16.16b, v20.16b
  and   v16.16b, v16.16b, v20.16b
  add   x12, x16, x11
  sub   v17.4s, v21.4s, v16.4s
  ldr   q18, [x12, #16]
  sshl  v17.4s, v18.4s, v17.4s
  add   x11, x18, x11
  neg   v19.16b, v16.16b
  ushl  v19.4s, v18.4s, v19.4s
  orr   v16.16b, v17.16b, v19.16b
  str   q16, [x11, #16]

After this patch, the disassembly for the same testcase above is shown
below:
  ldr   q16, [x12, #16]
  and   v16.16b, v16.16b, v20.16b
  add   x12, x16, x11
  sub   v17.4s, v21.4s, v16.4s
  ldr   q18, [x12, #16]
  sshl  v17.4s, v18.4s, v17.4s
  add   x11, x18, x11
  neg   v19.16b, v16.16b
  ushl  v19.4s, v18.4s, v19.4s
  orr   v16.16b, v17.16b, v19.16b
  str   q16, [x11, #16]

The other tests also emit an extra "and" instruction as shown above for
the vector ROR/ROL operations.

Below are the performance results for the Vector API rotate tests (tests
given in the links above) with this patch on aarch64 and x86_64 machines
(for int and long types):
Benchmark                aarch64   x86_64
IntMaxVector.ROL         25.57%    26.09%
IntMaxVector.ROR         23.75%    24.15%
LongMaxVector.ROL        28.91%    28.51%
LongMaxVector.ROR        16.51%    29.11%

The percentage indicates the gain in performance (ops/ms) with this
patch over the master build without it. The machine descriptions are
given below:
aarch64 - 128-bit aarch64 machine
x86_64  - 256-bit x86 machine
openjdk-notifier bot pushed a commit that referenced this pull request Nov 9, 2022
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Nov 29, 2022
…erOfTrailingZeros/numberOfLeadingZeros()`

Background:

The Java API[1] for `Long.bitCount/numberOfTrailingZeros/
numberOfLeadingZeros()` returns int type, while the Vector API[2]
for them returns long type. Currently, to support
auto-vectorization of the Java API and the Vector API at the same
time, some vector platforms, namely aarch64 and x86, provide two
types of vector nodes taking long type: one produces a long vector
type for the Vector API, and the other produces an int vector type
by casting the long-type result of the first one.

We can move the casting work for auto-vectorization of the Java API
to the mid-end so that we can unify the vector implementation in
the backend, reducing extra code. The patch does the refactoring
and also fixes the several issues below.

1. Refine the auto-vectorization of
`Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`

In the patch, during the stage of generating a vector node for the
candidate pack, to implement the complete behavior of these
Java APIs, superword will make two consecutive vector nodes:
the first one, the same as the Vector API, does the real execution
to produce a long-type result, and the second one casts the result
to an int vector type.

For those platforms which already vectorized these Java APIs
correctly before, the patch has no real impact on the final
generated assembly code and, consequently, causes no performance
regression.

2. Fix the IR check failure of
`compiler/vectorization/TestPopCountVectorLong.java` on
128-bit sve platform

These Java APIs take a long type and produce an int type, like
conversion nodes between different data sizes do. In superword,
the alignment of their input nodes is different from their own.
As a result, these APIs can't be vectorized when
`-XX:MaxVectorSize=16`, so the IR check for vector nodes in
`compiler/vectorization/TestPopCountVectorLong.java` would fail.
To fix the alignment issue, the patch corrects their related
alignment, just like it did for conversion nodes between
different data sizes. After the patch, these Java APIs can be
vectorized on 128-bit platforms, as long as the auto-
vectorization is profitable.

3. Fix the incorrect vectorization of
`numberOfTrailingZeros/numberOfLeadingZeros()` in aarch64
platforms with more than 128 bits

Although `Long.numberOfLeadingZeros/numberOfTrailingZeros()` could
be vectorized on sve platforms when `-XX:MaxVectorSize=32` or
`-XX:MaxVectorSize=64` even before the patch, the aarch64 backend
didn't provide a special vector implementation for the Java API, and
thus the generated code was incorrect, like:
```
LOOP:
  sxtw  x13, w12
  add   x14, x15, x13, uxtx #3
  add   x17, x14, #0x10
  ld1d  {z16.d}, p7/z, [x17]
  // Incorrectly use integer rbit/clz insn for long type vector
 *rbit  z16.s, p7/m, z16.s
 *clz   z16.s, p7/m, z16.s
  add   x13, x16, x13, uxtx #2
  str   q16, [x13, #16]
  ...
  add   w12, w12, #0x20
  cmp   w12, w3
  b.lt  LOOP
```

This caused a runtime failure of the testcase
`compiler/vectorization/TestNumberOfContinuousZeros.java` added
in the patch. After the refactoring, the testcase passes and
the code is correct:
```
LOOP:
  sxtw  x13, w12
  add   x14, x15, x13, uxtx #3
  add   x17, x14, #0x10
  ld1d  {z16.d}, p7/z, [x17]
  // Compute with long vector type and convert to int vector type
 *rbit  z16.d, p7/m, z16.d
 *clz   z16.d, p7/m, z16.d
 *mov   z24.d, #0
 *uzp1  z25.s, z16.s, z24.s
  add   x13, x16, x13, uxtx #2
  str   q25, [x13, #16]
  ...
  add   w12, w12, #0x20
  cmp   w12, w3
  b.lt  LOOP
```

4. Fix an assertion failure on x86 avx2 platform

Before, on the x86 avx2 platform, there was an assertion failure when
C2 tried to vectorize loops like:
```
//  long[] ia;
//  int[] ic;
    for (int i = 0; i < LENGTH; ++i) {
      ic[i] = Long.numberOfLeadingZeros(ia[i]);
    }
```

The x86 backend supports vectorizing `numberOfLeadingZeros()` on
the avx2 platform, but it uses `evpmovqd()` to do the casting for
`CountLeadingZerosV`[3], which can only be used when
`UseAVX > 2`[4]. After the refactoring, the failure is fixed
naturally.

Tier 1~3 tests passed with no new failures on Linux AArch64/X86 platforms.

[1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long)
    https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long)
    https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long)
[2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687
[3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418
[4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239
stefank pushed a commit to stefank/jdk that referenced this pull request Mar 24, 2023
…dk#16)

* Only use conditional far branch in copy_memory for zgc

* Remove unused code
caojoshua pushed a commit to caojoshua/jdk that referenced this pull request Mar 29, 2023
gnu-andrew pushed a commit to gnu-andrew/jdk that referenced this pull request Apr 4, 2023
robehn pushed a commit to robehn/jdk that referenced this pull request Aug 15, 2023
gnu-andrew pushed a commit to gnu-andrew/jdk that referenced this pull request Aug 18, 2023
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Nov 21, 2023
…ng into ldp/stp on AArch64

The macro-assembler on aarch64 can merge adjacent loads or stores
into ldp/stp[1]. For example, it can merge:
```
str     w20, [sp, #16]
str     w10, [sp, #20]
```
into
```
stp     w20, w10, [sp, #16]
```

But C2 may generate a sequence like:
```
str     x21, [sp, #8]
str     w20, [sp, #16]
str     x19, [sp, #24] <---
str     w10, [sp, #20] <--- Before sorting
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]
```
We can't do any merging for non-adjacent loads or stores.

The patch sorts the spilling or unspilling sequence in order of
offset during the instruction scheduling and bundling phase.
After that, we get a new sequence:
```
str     x21, [sp, #8]
str     w20, [sp, #16]
str     w10, [sp, #20] <---
str     x19, [sp, #24] <--- After sorting
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]
```

Then the macro-assembler can do ld/st merging:
```
str     x21, [sp, #8]
stp     w20, w10, [sp, #16] <--- Merged
str     x19, [sp, #24]
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]
```
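
The sorting step itself is straightforward. A simplified standalone
sketch (data structures assumed, not the actual C2 scheduling code):

```
#include <algorithm>
#include <vector>

// Hedged sketch: order a spill/unspill sequence by stack offset so that
// slot-adjacent accesses become instruction-adjacent and can be merged
// into stp/ldp by the macro-assembler.
struct SpillOp {
  int reg_encoding;   // which register is spilled
  int sp_offset;      // e.g. 8, 16, 24, 20, 40, 48, 56 before sorting
};

void sort_spills_by_offset(std::vector<SpillOp>& seq) {
  std::stable_sort(seq.begin(), seq.end(),
                   [](const SpillOp& a, const SpillOp& b) {
                     return a.sp_offset < b.sp_offset;
                   });
}
```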

To justify the patch, we run `HelloWorld.java`
```
public class HelloWorld {
    public static void main(String [] args) {
        System.out.println("Hello World!");
    }
}
```
with `java -Xcomp -XX:-TieredCompilation HelloWorld`.

Before the patch, the macro-assembler performed ld/st merging
3688 times. After the patch, the number of merges increases to
3871, i.e. by ~5%.

Tested tier1~3 on x86 and AArch64.

[1] https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079
openjdk-notifier bot pushed a commit that referenced this pull request Apr 11, 2024
Add framework for other platforms.  Moved fill_to_memory_atomic back to the .cpp from the .hpp in order to get 32-bit fixed.