
Panicking with STATUS_INTEGER_DIVIDE_BY_ZERO #60

Closed
fasteinke opened this issue Dec 5, 2024 · 30 comments

@fasteinke

fasteinke commented Dec 5, 2024

Still having issues with both the integer divide by zero and, mainly, tokio panicking - downloaded and compiled the latest source, and immediately got the following, without a single training step run:

...
...
Compiling brush-desktop v0.1.0 (D:\Apps\brush-main\crates\brush-desktop)
Finished dev profile [unoptimized + debuginfo] target(s) in 20m 52s
Running target\debug\brush_bin.exe
thread 'tokio-runtime-worker' panicked at C:\Users\fstei\.cargo\git\checkouts\cubecl-aa41a28b39b598f9\1c4e003\crates\cubecl-linalg\src\matmul\base.rs:51:14:
Accelerated strategy should be available on your device: Unable to launch matmul because a required feature is unavailable: Cmma on inputs Float(F16) and outputs Float(F32) with shape m=16, n=16, k=16 not supported.
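
For what it's worth, a minimal Rust sketch of the failure mode: the message above reads like an .expect() on a strategy-selection Result inside cubecl's matmul. The names here (Strategy, try_accelerated, pick_strategy, cmma_f16_supported) are purely illustrative assumptions, not cubecl's real API; a defensive version would fall back to a plain kernel instead of panicking when CMMA on F16 isn't available.

```rust
// Hypothetical sketch only - not cubecl's actual types or logic.
#[derive(Debug)]
enum Strategy {
    Accelerated, // uses CMMA / tensor cores on f16 inputs
    Plain,       // portable fallback without the hardware feature
}

fn try_accelerated(cmma_f16_supported: bool) -> Result<Strategy, String> {
    if cmma_f16_supported {
        Ok(Strategy::Accelerated)
    } else {
        Err("Cmma on F16 inputs with F32 outputs not supported".into())
    }
}

fn pick_strategy(cmma_f16_supported: bool) -> Strategy {
    // The panic above corresponds to the .expect(...) version of this call;
    // degrading to Plain avoids taking the whole runtime down.
    try_accelerated(cmma_f16_supported).unwrap_or(Strategy::Plain)
}

fn main() {
    println!("{:?}", pick_strategy(false)); // prints "Plain" instead of panicking
}
```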

This is on a test set of 29 images of a hall, with a training resolution of 702 - which seems to be a 'magic number' for getting this set to crash.

Edit: even the hotdog dataset now causes a panic, but in this case the training did start ...

@fasteinke
Author

Confirming that 702 above is 'magic' - the image size is 5616 - when doing the same run with the default res. of 1920, the training works without a glitch ...

@ArthurBrussee
Owner

Hiya! The crash is known - it's this #53 which is just waiting on Burn to fix something. Maybe I should downgrade burn... will do if they still haven't landed the fix tomorrow.

The magic resolution is a bit eh.... strange haha. I suspect it's not directly related, but it's a matter of finding the right params to create some degenerate gaussian that then takes everything down.

I still don't have a lot of leads but working on something that should defend against one possible source of this. Very frustrating!

If you're comfortable sharing your dataset, that might help track it down. It's hard to say why it doesn't seem to happen for me on other datasets, but yeah, it could be the stars aligning just right.

@fasteinke
Author

fasteinke commented Dec 6, 2024

Thanks for getting back so quickly ... no mystery about the dataset, this is the standard COLMAP example, at https://colmap.github.io/datasets.html. I use a subset of the Gerrard-Hall images, the 29 which image the end wall where there is a single handrail set leading up to a door. What I look for is how well that handrail is 'understood', by the training; the other thing I check on is how many windows on the separate building on the left are resolved, as shown in IMG_2334.JPG.

@ArthurBrussee
Owner

#63 may or may not fix some things.

@fasteinke
Author

fasteinke commented Dec 7, 2024

Hope so ... yesterday was frustrating: was getting some very clean training done on the hall dataset, using latest brush; then tried different resolution, and from there the crashing seemed unstoppable; magic numbers were everywhere!

@fasteinke
Author

fasteinke commented Dec 7, 2024

Nope ... full recompile of the latest code - and the dreaded integer divide by zero popped up again. The hall subset of 29 images; resolution of 992, SH of zero, Normal. Got to the end of the warm-up steps on the first run, the middle of warm-up on the next - no consistency there ...

Edit: a friend mentioned FP traps, and initialisation of everything, as things to look at.

@fasteinke
Author

And, as a workaround, as mentioned before, first do a run with 'good' numbers, say a res of 351, pause at 3000 or so steps, adjust params to what is actually wanted, and load file again - training then proceeds normally ...

@fasteinke
Author

Playing with brush is like wrestling with a greasy snake ... just when you think you've got a decent grasp of how to make it behave, then it goes into a fit of crashing, with every small variation of trying something, not getting anywhere - oh, well ... :-D

@fasteinke
Author

Some good news ... for whatever reasons, the updates a day or so ago have stabilised the program; unless I really push it by requesting very high resolution, the training runs fine - on test runs, with low res, the speed of training is almost astonishing; I'm impressed!

Thanks!

@fasteinke
Author

It's baaaack ... unfortunately, the divide by zero issue is back with a vengeance. Recent updates have worked well - I've been able to push Brush to the point of getting (this time valid) out of memory messages - but the latest source, using the latest Burn, crashes as soon as a higher res is asked for; it doesn't even get through the warm-up phase.

Shame after the recent work you've done - hopefully, something like reverting to a previous burn resolves this ...

Last thing - doesn't like a folder with a '-' character in the name; won't pick up the files within.

@fasteinke
Author

To get more info, I rolled back the Cargo files in the head directory to yesterday's release. That happily compiled the latest source, and now a training run has no problem passing the glitching point ...

@ArthurBrussee
Owner

Ahh shit... thought we were getting somewhere! It's quite interesting that the recent updates have helped.

I still have no reproduction on 3 separate machines with different datasets :/ To be fair, I have gotten another mysterious crash, but no STATUS_INTEGER_DIVIDE_BY_ZERO - such a frustrating bug.

If reverting helps though that might be a good clue! Is it possible to jump between the commits and see where things break? Is it the Burn upgrade?

That would help massively, thank you for all the help and you continuing to try things!

@ArthurBrussee ArthurBrussee changed the title Still panicking at certain resolutions, etc Panicking with STATUS_INTEGER_DIVIDE_BY_ZERO Dec 13, 2024
@fasteinke
Author

fasteinke commented Dec 13, 2024

Okay, specifically: the commits done on Dec 13 were left in, but I rolled back the Cargo.lock and Cargo.toml in the /brush folder to their Dec 12 state. So the "Fix bench" up to "Fix wasm, re-enable ssim on wasm" commits were compiled in, but the Cargo files were those per the "Fix tests" commit.

Note: the binary doesn't actually panic; what I get is:

error: process didn't exit successfully: target\debug\brush_bin.exe (exit code: 0xc0000094, STATUS_INTEGER_DIVIDE_BY_ZERO)
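
Side note on why the raw exit code is interesting: checked Rust code normally turns an integer division by zero into a readable panic ("attempt to divide by zero"), so an unhandled STATUS_INTEGER_DIVIDE_BY_ZERO with no panic message suggests the division happens somewhere the Rust checks don't reach (native or driver code, or a path that bypasses the panic machinery). A tiny sketch of the distinction, nothing Brush-specific:

```rust
fn main() {
    // 0 when the program is run with no extra arguments.
    let divisor = std::env::args().count() as i32 - 1;

    // Checked division: returns None instead of faulting.
    match 10i32.checked_div(divisor) {
        Some(q) => println!("10 / {divisor} = {q}"),
        None => println!("division by zero caught in Rust"),
    }

    // A plain `10 / divisor` would panic with "attempt to divide by zero",
    // i.e. a Rust panic with a message rather than the bare 0xc0000094 exit
    // code seen above - which is why the raw status points away from
    // ordinary checked Rust arithmetic.
}
```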

@ArthurBrussee
Owner

Ok! So that does sound like the Burn upgrade. That's a good clue, though it might not be 100% Burn's fault - that update includes a different way of handling some out-of-bounds memory accesses. I've been trying to see if the Vulkan validation layers come up with anything, but no luck so far; let's see.

@fasteinke
Author

A bit more feedback ... I can provoke a binary using the older version of Burn into throwing the divide by zero error - but it's harder to do. Using the bicycle preset dataset, with a larger 'refine every' parameter, more frequent opacity resets, and an init.ply checkpoint - I tried to find a pattern, but it ended up confusing me. Strangely, it almost seemed that running the binary identically, in succession, made the error go away - some bizarre OS memory management issue?

@ArthurBrussee
Owner

Tell me about it 😅 I still can't repro STATUS_INTEGER_DIVIDE_BY_ZERO - but I have seen some other weird glitches on a larger training run (specifically, YouTube videos going haywire - GPU memory corruption?). I've tried many things to get a pattern out of it, but it's so random and stochastic... Very occasionally I would get a crash (albeit without STATUS_INTEGER_DIVIDE_BY_ZERO), so I do imagine it's the same problem, maybe.

To the extent I got anywhere, I also found that frequent refines & opacity resets seem to make things worse... Refines cause some bad GPU memory allocation patterns, so maybe that is related? Maybe it's all related to being near OOM?

So tricky... Thanks for the digging!

@fasteinke
Author

Can't see it being OOM ... the crashes I get when I am pushing memory are classic panics, with suitable messages ("Can't allocate ..."). The divide by zero crashes often appear right when starting, with miles of room for the program to do its thing ...

@ArthurBrussee
Owner

Ah thank you, that is an interesting data point - I've only ever gotten the glitchy behaviour near OOM (4M+ splats, 1245-pixel-wide images).

Have you seen it on:

  • low splat counts
  • low resolutions (say, less than full HD)
  • window focused / not focused.
  • updating live / not updating live
  • any combination of the above

@fasteinke
Author

Low splat counts: Yes - best splat count I've got has barely squeezed into 3M; approaching these numbers gives me an OOM message

Low resolutions: Yes - though higher res makes it much more likely

Window focus: Can't say

Updating live: Normally always updating - will try this off, to see

One other thing: have had a couple of crashes during export; a ply is created, but with no colour information!

@ArthurBrussee
Owner

Ohh yes that is also interesting. The SH buffer is MUCH larger than the other parameters, so perhaps that's messing with some memory allocation? Can you try seeing if things are better without SH enabled? Thank you this is really helpful!
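
To put rough numbers on "MUCH larger" (a back-of-the-envelope sketch, not Brush's exact parameter layout): a 3D Gaussian splat typically stores position (3) + scale (3) + rotation (4) + opacity (1) = 11 floats, while spherical harmonics colour at degree d needs 3 * (d + 1)^2 floats, so the SH buffer dominates per-splat memory as soon as the degree goes up.

```rust
// Illustrative per-splat accounting; the exact layout in Brush may differ.
fn sh_floats(degree: u32) -> u32 {
    3 * (degree + 1).pow(2) // 3 colour channels * (d + 1)^2 coefficients
}

fn main() {
    const OTHER_PARAMS: u32 = 3 + 3 + 4 + 1; // position, scale, rotation, opacity

    for degree in 0..=3 {
        println!(
            "SH degree {degree}: {} floats vs {} for everything else",
            sh_floats(degree),
            OTHER_PARAMS
        );
    }
    // Degree 3 -> 48 floats of SH per splat, roughly 4x the rest combined,
    // which is why allocating or exporting the SH buffer is the heavy part.
}
```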

@fasteinke
Author

fasteinke commented Dec 16, 2024

Another data point ... just tried SH 0, full res, with an init.ply on the bicycle set - got to 2M splats, after 4000 steps; tried to export ... crash! Panic with "Parent is lost", etc.

Again, a valid PLY was written and the viewer can navigate it. But all splats are grey ...

@fasteinke
Author

Finally!! ... light at the end of, etc ...

Tried your latest "extreme clippy" version (but not later commits), per the zip - no cargo changes. First signs were very encouraging - no immediate error. But then the magic number syndrome: using a different init.ply, consistent crashing at around 190 steps; try again, this time 180; and yet again, gets to just over 200! Switch off live update - ah hah!! Clean running through to an OOM ... momentarily switching it on, then off again, doesn't trigger a crash.

Haven't sampled the very latest, in case anything there helps ... later today.

@ArthurBrussee
Owner

That is interesting! Makes it sound like it's a multithreading issue - the rendering does a separate queue.submit() vs. the training. I tried wgpu 23 before it was out and found a somewhat similar crash ages ago: gfx-rs/wgpu#6279

I just tried the latest wgpu 23, and I couldn't get that crash now, so maybe they have fixed something in that area.

I've pushed a version that updates to wgpu 23. If that crashes, I'll try a version that forces a lock between the training and rendering and see if that helps at all.
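
A minimal sketch of the "force a lock between the training and rendering" idea, assuming wgpu - this is not Brush's actual code, just one way to guarantee the two threads never call queue.submit() concurrently:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical wrapper: both the render thread and the training thread go
// through this, so submissions are strictly serialized by the mutex.
#[derive(Clone)]
struct LockedQueue {
    inner: Arc<Mutex<wgpu::Queue>>,
}

impl LockedQueue {
    fn new(queue: wgpu::Queue) -> Self {
        Self { inner: Arc::new(Mutex::new(queue)) }
    }

    fn submit(&self, cmd: wgpu::CommandBuffer) -> wgpu::SubmissionIndex {
        // Only one thread can be inside submit() at a time.
        let queue = self.inner.lock().expect("queue lock poisoned");
        queue.submit(std::iter::once(cmd))
    }
}
```

If the crash really is a cross-thread submission race, routing everything through something like this should make it disappear; if it still fires, the divide by zero is probably coming from somewhere else entirely.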

@fasteinke
Author

Likewise here ... I have updated to just prior to the "[More seperation of process loop, ... " commit, and the training is indeed solid. Okay, I had a single divide by zero crash; but it was a peculiar situation and I can't recall precisely the circumstances - if it turns up again, I'll mention it.

One thing: if I have live update switched off, and then pause and load for new training run, the scene window doesn't update, at all. I have to close and restart the app, to restore the functionality.

@ArthurBrussee
Owner

Ah just to double check - "likewise" as in, same as before? So crashing with live update on, not crashing with it off?

And oh yes I've hit that other bug too. A GH issue is always appreciated to not forget about such things! Do hope to fix it soon, the rework to make the CLI should make some of these things a bit cleaner.

@fasteinke
Author

Well, I thought I was in better shape! Now, not so sure ...

Likewise meant that overall it was running clean - so, live update on or off seemed not to matter. But, the next day, crash after crash after crash! Live update off didn't even seem to solve things! Even a completely standard training of bicycle failed ... talk about being confused ...

So many variables to play with ... I see a pattern, and then it disappears. Next round, I will take the latest brush, and work through a variety of scenarios, starting from simple, and building up - see if anything makes sense.

@ArthurBrussee
Owner

This is also quite interesting: KhronosGroup/Vulkan-Tools#1059

Can you run vulkaninfo and paste the output here? Curious what version you are on! Plus seeing some other stats about your setup might help.

Also, I just added a way to train from the CLI as well. In theory that should fix things too, as there's no multithreaded submission of GPU commands at all anymore in there. Curious to see if that's indeed the case!

@ArthurBrussee
Owner

Hurr someone else on the Brush discord had this issue and it was solved after getting the latest nVidia drivers 😅 So just in case, do give that a go as well!

@fasteinke
Author

fasteinke commented Dec 24, 2024

Will try the vulkaninfo, etc, things soon ... in the meantime, installed latest nVidia driver, and so far haven't triggered the error - fingers crossed!

Brush discord? Sounds good ... but, can't see it anywhere ... ??

@fasteinke
Author

fasteinke commented Dec 24, 2024

Duuhhhh ... was there all the time, but brain glitched every time I looked at it!!

Next,
<rant>I loathe and despise sign-on procedures ... anywhere!!! They go wrong, for some weird and wonderful reason - and then you're stuck!! You can't unset the process - and so you're in limbo! I curse all programmers who don't debug these routines enough ... !!!
</rant>

Umm, I can't "claim my account" - discord says I am there, as fas42, and just stares at me, in the Finish Signing Up box ...
