Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8339771: RISC-V: Reduce icache flushes #20913

Closed
wants to merge 6 commits into from
Closed

Conversation

robehn
Copy link
Contributor

@robehn robehn commented Sep 9, 2024

Hey, please consider,

All code which is offline (behind a barrier) do not need global icache flushes.
As we can instead in slow path locally (thread and hart) emit fence.i.
But if we were to be context switch do a hart which have not had fence.i emitted we can still fetch stale instructions.
To handle this case new now have kernel support:
https://docs.kernel.org/arch/riscv/cmodx.html

It's not perfect as we will be emitting fence.i on any context switch for any thread with this patch, even if that thread do not execute on code heap (non attached native thread), and even if there was no changes to code heap.
But this is in many cases much faster as the icache flush global IPI is very intrusive.
Particular cases are running a concurrent gc with small head room.
In such scenario I measured 15% increased throughput on VF2.
A large CPU or less head room (faster GC cycles) will yield even more performance boost.

Note that this requires 6.10 kernel.

I'm running VF2 with 6.11-rc3 kernel and this passed tier1-3. (With setting on)

Later we probably want this default on, but as it's hard to test I'll leave default off.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8339771: RISC-V: Reduce icache flushes (Enhancement - P5)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20913/head:pull/20913
$ git checkout pull/20913

Update a local copy of the PR:
$ git checkout pull/20913
$ git pull https://git.openjdk.org/jdk.git pull/20913/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20913

View PR using the GUI difftool:
$ git pr show -t 20913

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20913.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 9, 2024

👋 Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Sep 9, 2024

@robehn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8339771: RISC-V: Reduce icache flushes

Reviewed-by: fyang, mli, luhenry

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk
Copy link

openjdk bot commented Sep 9, 2024

@robehn The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 9, 2024
@mlbridge
Copy link

mlbridge bot commented Sep 9, 2024

Webrevs

@robehn
Copy link
Contributor Author

robehn commented Sep 12, 2024

Thank you @luhenry !

@RealFYang
Copy link
Member

This is interesting. Thanks for trying this out.
BTW: Will this reflect on performance numbers of popular benchmarks workloads?

@RealFYang
Copy link
Member

RealFYang commented Sep 17, 2024

I am trying to compare this with #9770 which removes unnecessary fence.i used in user space. I was thinking that this PR might add those fence.i back under UseCtxFencei. But seems there are differences there. Previously, we emitted fence.i in patch_callers_callsite() and MacroAssembler::emit_static_call_stub(), which looks similar like what aarch64 does (usage of isb). But this PR adds fence.i in different places like in generate_method_entry_barrier() and ZBarrierSetAssembler::patch_barrier_relocation(). Do you have more details to help understand? Thanks.

@robehn
Copy link
Contributor Author

robehn commented Sep 17, 2024

default -XX:-UseCtxFencei -XX:+UseCtxFencei
fop                  2830              2731  3.63%
h2                  25361             25403 -0.17%
jython              32482             32006  1.49%
luindex              4231              4369 -3.16%
lusearch             6142              5867  4.69%
lusearch-fix         6337              6243  1.51%
pmd                  9171              8970  2.24%
                                             1.46%
zgc xmx512m
fop                  2583              2376  8.71%
h2                  34520             33518  2.99%
jython              36494             35243  3.55%
luindex              4615              4497  2.62%
lusearch             6705              6732 -0.40%
lusearch-fix         6827              6485  5.27%
pmd                  8385              8128  3.16%
                                             3.70%
zgc xmx128m
fop                  3336              2718 22.74%
jython              44385             35957 23.44%
luindex              4552              4460  2.06%
lusearch             9604              7272 32.07%
lusearch-fix        10095              7489 34.80%
pmd                 11238              9169 22.57%
                                            22.95%

@robehn
Copy link
Contributor Author

robehn commented Sep 17, 2024

We have two major categories of cmodx:

  1. Executioner is not aware of any changes.
    1.1 Writer can icache_flush_all or executioner can always emit fence.i AND UseCtxFencei.
    If the code is not changed often the inline fence.i is just costly and icache_flush_all is better.
    1.2. We can enhance this by signaling a thread handshake which will force all threads to emit a fence.i.
    Unclear if this is worth it, because under some situations it can take a while to flush that out.
  2. Executioner is aware of any changes made:
    2.1 After safepoint. No need to do icache_flush_all in a safepoint. Just emit fence.i when leaving + UseCtxFencei.
    2.2 Nmethod entry barrier, same here.

The patch you are referring to was dealing with 1, which we shouldn't IMHO.
I'm dealing with 2, in a safepoint any code changes do not require icache_flush_all and all entries into nmethod is handle through the barrier. Changes to nmethod do not need icache_flush_all if barrier used.

1: When we updates oops in code stream during safepoint or with nmethod barrier locked in Relocation::pd_set_data_value we do not need to flush.
2: When ZGC updates the color in code stream with nmethod barrier lock we do not need to flush.

@luhenry
Copy link
Member

luhenry commented Sep 17, 2024

default -XX:-UseCtxFencei -XX:+UseCtxFencei
fop                  2830              2731  3.63%
h2                  25361             25403 -0.17%
jython              32482             32006  1.49%
luindex              4231              4369 -3.16%
lusearch             6142              5867  4.69%
lusearch-fix         6337              6243  1.51%
pmd                  9171              8970  2.24%
                                             1.46%
zgc xmx512m
fop                  2583              2376  8.71%
h2                  34520             33518  2.99%
jython              36494             35243  3.55%
luindex              4615              4497  2.62%
lusearch             6705              6732 -0.40%
lusearch-fix         6827              6485  5.27%
pmd                  8385              8128  3.16%
                                             3.70%
zgc xmx128m
fop                  3336              2718 22.74%
jython              44385             35957 23.44%
luindex              4552              4460  2.06%
lusearch             9604              7272 32.07%
lusearch-fix        10095              7489 34.80%
pmd                 11238              9169 22.57%
                                            22.95%

What's the unit of measurement? Also, lower is better? higher is better?

@robehn
Copy link
Contributor Author

robehn commented Sep 17, 2024

What's the unit of measurement? Also, lower is better? higher is better?

Those are measured in milliseconds to completion, so lower is better.

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing the performance numbers. It's a pity that I can't try this on my big RV machine for now which runs an older customized 6.1 kernel. BTW: Should we add similar checking like verify_cross_modify_fence_not_required() which was called in MacroAssembler::read_polling_page() & MacroAssembler::build_frame() and removed by #9770? (-XX:+VerifyCrossModifyFence)

@robehn
Copy link
Contributor Author

robehn commented Sep 18, 2024

Thanks for sharing the performance numbers. It's a pity that I can't try this on my big RV machine for now which runs an older customized 6.1 kernel. BTW: Should we add similar checking like verify_cross_modify_fence_not_required() which was called in MacroAssembler::read_polling_page() & MacroAssembler::build_frame() and removed by #9770? (-XX:+VerifyCrossModifyFence)

As we have two cases:
Code stream changed during a safepoint you must emit before leaving the safepoint, which means e.g. a thread in native that returns to Java must emit it. Polls are only disarmed by threads them self and they always do a cmodx_fence after a disarm. (when they update the poll word)
This case is already covered by VerifyCrossModifyFence.

The second case, writing in the code stream in methods with nmethod barrier locked, would be essentially be a verfication of the barrier and the _patching_epoch.
If this needs verification, we need that regardless of this patch since do need both of them working, otherwise we may enter a nmethod with bad oop and we do need the loadload fence before entering, otherwise we may load stale data.

So I don't believe this patch require any additional verification, but we may need verification.

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated change looks reasonable to me. Thanks for the answers and details.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 20, 2024
@robehn
Copy link
Contributor Author

robehn commented Sep 20, 2024

Thanks @Hamlin-Li, @luhenry and @RealFYang !
I'll let this sit until next week, doing some extras test runs.

@robehn
Copy link
Contributor Author

robehn commented Sep 25, 2024

/integrate

@openjdk
Copy link

openjdk bot commented Sep 25, 2024

Going to push as commit 97a3933.
Since your change was applied there have been 66 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 25, 2024
@openjdk openjdk bot closed this Sep 25, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 25, 2024
@openjdk
Copy link

openjdk bot commented Sep 25, 2024

@robehn Pushed as commit 97a3933.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@robehn robehn deleted the cmodx-fence branch November 27, 2024 11:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot [email protected] integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants