Inconsistent results when using different numbers of MPI ranks in ALI #688
A smaller problem for comparison, after one iteration:
The Humboldt results look reasonable to me.
I checked the difference between the Jacobians in the 2-rank case vs. the 20-rank case:
The results are similar if I use a homogeneous DBC on the basal side with no BC on the lateral boundary: 3.7253e-09 and 1.6862e-08
Is this an absolute difference or a relative one?
Absolute.
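For reference, a minimal sketch (plain Python/SciPy, not Albany code) of one way such a comparison can be made, assuming the two Jacobians were dumped to MatrixMarket files. The file names and the max-entry metric are assumptions for illustration, not necessarily what was used above:

```python
import numpy as np
from scipy.io import mmread
from scipy.sparse import csr_matrix

# Hypothetical file names: assume the 2-rank and 20-rank Jacobians were
# written out in MatrixMarket format.
J_2 = csr_matrix(mmread("jac_2ranks.mm"))
J_20 = csr_matrix(mmread("jac_20ranks.mm"))

diff = (J_2 - J_20).tocsr()
abs_diff = np.max(np.abs(diff.data)) if diff.nnz else 0.0  # largest entry-wise difference
rel_diff = abs_diff / np.max(np.abs(J_2.data))             # scaled by the largest entry of J_2
print(f"absolute: {abs_diff:.4e}   relative: {rel_diff:.4e}")
```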
Ok. It might be pointing to a wrongly assembled Jacobian. Meanwhile, I double-checked a basic problem (SteadyHeat2D), running it with 1 and 32 ranks: the NOX residuals start to differ (1e-3 relative difference) only at the 3rd nonlinear iteration. HOWEVER, the linear solver history is quite different already with the 1st Jacobian: 16 (serial) vs 43 (32 ranks) iterations. Edit: this is a Tpetra run, with a Belos solver and an Ifpack2 preconditioner (ILUT with fill-in 1).
However, J_serial - J_parallel ~ 1e-16. So this is interesting: the Jacobian is basically the same, and yet we get very different convergence histories... I'm not an expert on Ifpack, but when using several ranks, some subdomains contain no Dirichlet rows. Could this deteriorate solver performance (due to singular or almost-singular local matrices)?
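This kind of degradation is also what one-level domain-decomposition preconditioning predicts even with a perfectly assembled matrix. Below is a small, self-contained illustration (SciPy, not Albany/Trilinos; the matrix, parameters, and block partitioning are made up): the same 2D Poisson matrix preconditioned by one global ILU vs. a block-diagonal ILU that mimics per-rank ILU with no overlap. GMRES iteration counts grow with the number of blocks, much like the 16-vs-43 observation above.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# 2D Poisson (5-point stencil) on an n x n grid, Dirichlet BCs folded in.
n = 40
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsc()
b = np.ones(A.shape[0])

def gmres_iterations(A, b, M):
    """Count GMRES iterations (via the residual-norm callback) until convergence."""
    count = [0]
    def cb(_):
        count[0] += 1
    _, info = spla.gmres(A, b, M=M, restart=200, callback=cb, callback_type='pr_norm')
    return count[0]

# (a) One global ILU: the "serial" picture.
ilu = spla.spilu(A, fill_factor=10, drop_tol=1e-4)
M_global = spla.LinearOperator(A.shape, ilu.solve)

# (b) Block-diagonal ILU over nb contiguous row blocks: a crude stand-in for
#     per-rank ILU with no overlap (the "many ranks" picture).
def block_ilu(A, nb):
    edges = np.linspace(0, A.shape[0], nb + 1, dtype=int)
    blocks = list(zip(edges[:-1], edges[1:]))
    facs = [spla.spilu(A[s:e, s:e].tocsc(), fill_factor=10, drop_tol=1e-4)
            for s, e in blocks]
    def apply(r):
        z = np.empty_like(r)
        for (s, e), f in zip(blocks, facs):
            z[s:e] = f.solve(r[s:e])
        return z
    return spla.LinearOperator(A.shape, apply)

print("global ILU:", gmres_iterations(A, b, M_global), "iterations")
for nb in (4, 16, 32):
    print(f"{nb:2d} blocks :", gmres_iterations(A, b, block_ilu(A, nb)), "iterations")
```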
Using MueLu (instead of Ifpack2) gives the same number of linear iterations in serial and with 32 ranks. Serial:
vs 32 ranks:
This does not explain the ALI test failures, where the preconditioner was ML or MueLu, but it at least confirms that the core part of Albany is not buggy. It also suggests that the choice of preconditioner type might be affecting convergence in some cases. Edit: for completeness, these are the Ifpack2 and MueLu params: Ifpack2:
MueLu:
Thanks for those results. FYI, if I use the relative difference, it's 1.0446e-13 and 2.8920e-13.
1e-13 is not quite machine precision, but it might be due to some carries, so it might still be a roundoff difference. This is good, since it suggests that we're not messing up the assembly (phew). My best guess is that we need to be careful with our preconditioner specifications. I just tried. @jewatkins, could you remind me which input file and input mesh you used?
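A quick illustration of the roundoff point (generic Python, nothing Albany-specific): merely changing the order in which the same contributions are summed, as a different rank count does during assembly and reductions, already produces small nonzero differences in double precision.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)          # stand-in for per-element contributions

s_serial = np.sum(x)                                        # one summation order
s_split = sum(np.sum(c) for c in np.array_split(x, 32))     # "32-rank" reduction order

print(abs(s_serial - s_split) / abs(s_serial))  # small but generally nonzero
```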
I'm using a modified version of https://github.com/ikalash/ali-perf-tests/blob/master/perf_tests/humboldt-3-20km/input_albany_Velocity_MueLu_Wedge.yaml where I switch to "Use Serial Mesh: true" and Epetra/AztecOO/ML. Mesh files are here: https://github.com/ikalash/ali-perf-tests/tree/master/meshes/humboldt-3-20km. I could add the exact input file I'm using if you want to try running it.
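In case it helps others reproduce the A/B setup, here is a hedged convenience sketch (PyYAML; not part of Albany, and it assumes only that the input file parses with a standard YAML loader, not any particular nesting) that flips `Use Serial Mesh` wherever it appears and writes a copy, so the two runs differ only in that setting:

```python
import yaml  # pip install pyyaml

def set_key_everywhere(node, key, value):
    """Recursively set `key: value` in every mapping that already contains `key`."""
    if isinstance(node, dict):
        if key in node:
            node[key] = value
        for child in node.values():
            set_key_everywhere(child, key, value)
    elif isinstance(node, list):
        for child in node:
            set_key_everywhere(child, key, value)

with open("input_albany_Velocity_MueLu_Wedge.yaml") as f:
    doc = yaml.safe_load(f)

set_key_everywhere(doc, "Use Serial Mesh", True)

with open("input_albany_Velocity_MueLu_Wedge_serial_mesh.yaml", "w") as f:
    yaml.safe_dump(doc, f, sort_keys=False)
```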
Thanks! No need to add the exact input file.
So, I tried that input file with the meshes from that repo. Is this what you are observing for that mesh/input file? Note: I'm not using SFad; I'm using DFad (the default in Albany).
OK, so we don't get significant changes on the Humboldt mesh.
Yes, that sounds about right. I used 2 and 20 ranks and these are the results I got:
On the Greenland mesh, I've tried Belos/AztecOO and MueLu/ML and it didn't make a difference (I still had an inconsistency).
I tried 8 ranks and 80 ranks with Thwaites and did not see a large discrepancy (linear iterations: 95/96, solution average: -162.729919149/-162.729919148). I've only seen a large difference when running the Greenland 1-7km and 1-10km meshes. Maybe the large variation in the domain of those meshes causes the error to be more noticeable? There's still a small difference in the Humboldt case, so the error could still be there, just not as noticeable.
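For concreteness, the Thwaites solution averages above correspond to a relative difference of roughly 6e-12, i.e. well within roundoff territory for this kind of comparison:

```python
avg_8, avg_80 = -162.729919149, -162.729919148  # solution averages quoted above
rel = abs(avg_8 - avg_80) / abs(avg_8)
print(f"relative difference: {rel:.2e}")         # ~6.1e-12
```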
Uhm, Thwaites is a reasonably sized ice sheet with non-trivial dynamics. So I suspect there's something wrong with those meshes that makes the problem very ill-conditioned.
Another interesting bit of information. For the Humboldt case, if I use boundary conditions and ML settings similar to those of the nightly test (it must be both), I get the exact same number of iterations and solution average. I.e., if I add this:
I get this for both:
What BC were you using before for Humboldt? (B.t.w., when I was telling you not to prescribe any Dirichlet BC on the lateral boundary, I was referring to Greenland, not to Humboldt.)
It might be that some ranks pick up a high-speed region, while others pick up slow-speed ones. This is especially possible for highly non-uniform meshes. |
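One way to check that hypothesis is to look at per-rank statistics of the velocity magnitude. A small diagnostic sketch (mpi4py; `local_speed` is a placeholder for however you extract |u| on the rank-local part of the mesh, which is not shown here):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Placeholder data: replace with the rank-local velocity magnitudes from your run.
local_speed = np.abs(np.random.default_rng(comm.rank).normal(size=1000))

stats = (float(local_speed.min()), float(local_speed.mean()), float(local_speed.max()))
all_stats = comm.gather(stats, root=0)

if comm.rank == 0:
    for r, (lo, mean, hi) in enumerate(all_stats):
        print(f"rank {r:3d}: min {lo:.3e}  mean {mean:.3e}  max {hi:.3e}")
```

Run with something like `mpirun -np 20 python check_partition_speeds.py` (script name hypothetical); if some ranks report maxima orders of magnitude below others, the partitioner is indeed splitting the flow regimes very unevenly.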
I'm okay with this. Since there are cases where I'm able to get the same number of iterations/average solution between a low-rank and high-rank run, I'll assume this is some sort of issue with the mesh and continue trying to improve the GPU runs (working on cases that behave well, like Thwaites). I recall seeing the same issue there, but maybe not; I can't find the runs.
I was using this:
Something similar is done in the nightly input file too but it has that extra term on |
The Humboldt mesh is obtained by cutting out a piece of the Greenland ice sheet. On the internal (artificial) boundary (node_set_3) we prescribe the velocity obtained from a full Greenland simulation. On the real ice margin (side_set_1) we prescribe the usual lateral condition, which accounts for ocean back-pressure if the ice is partially submerged in the ocean.
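For readers unfamiliar with that condition, it is commonly written for the first-order (Blatter-Pattyn) model roughly as below; this is a sketch only, and the sign conventions, sea-level datum, and exact form used in Albany Land Ice may differ:

$$
2\mu\,\dot{\boldsymbol{\varepsilon}}_i\cdot\mathbf{n} \;=\; \big[\,\rho_{\mathrm{ice}}\,g\,(s-z)\;-\;\rho_{\mathrm{ocean}}\,g\,\max(0,\,-z)\,\big]\,n_i,\qquad i=1,2,
$$

where $s$ is the upper surface elevation and $z$ is measured positive upward from sea level, so the second (back-pressure) term is active only on the submerged part of the margin.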
Yeah, I think it's probably the best course of action. |
Ah okay, thanks for the info. I should update my Humboldt cases. |
Original issue description:
I ran into an interesting result when comparing linear convergence and solution average for different numbers of MPI ranks in Albany Land Ice:
The difference in linear convergence and solution average seems pretty large. My nonlinear tolerance is set to 1.0e-6 and my linear tolerance is set to 1.0e-8, and I didn't see much of a difference when tightening them. I also didn't see much of a difference when I tried this same experiment on smaller problems. I'm still investigating to see what might cause this, but let me know if anyone has any thoughts.