Panicking with STATUS_INTEGER_DIVIDE_BY_ZERO #60
Confirming that 702 above is 'magic' - the image size is 5616 - when doing the same run with the default res. of 1920, the training works without a glitch ...
Hiya! The crash is known - it's this #53, which is just waiting on Burn to fix something. Maybe I should downgrade burn... will do if they still haven't landed the fix tomorrow. The magic resolution is a bit, eh... strange, haha. I suspect it's not directly related, but it's a matter of finding the right params to create some degenerate gaussian that then takes everything down. I still don't have a lot of leads, but I'm working on something that should defend against one possible source of this. Very frustrating! If you're comfortable sharing your dataset, that might help track it down; it's hard to say why it doesn't seem to happen for me on other datasets, but yeah, it could be aligning some stars just right.
Thanks for getting back so quickly ... no mystery about the dataset, this is the standard COLMAP example, at https://colmap.github.io/datasets.html. I use a subset of the Gerrard-Hall images, the 29 which image the end wall where there is a single handrail set leading up to a door. What I look for is how well that handrail is 'understood' by the training; the other thing I check is how many windows on the separate building on the left are resolved, as shown in IMG_2334.JPG.
#63 may or may not fix some things.
Hope so ... yesterday was frustrating: was getting some very clean training done on the hall dataset, using the latest brush; then tried a different resolution, and from there the crashing seemed unstoppable; magic numbers were everywhere!
Nope ... full recompile of the latest code - and the dreaded integer divide by zero popped up again. The hall subset of 29 images; resolution of 992, SH of zero, Normal. Got to the end of the warm-up steps on the first run, middle of warm-up on the next - no consistency there ... Edit: a friend mentioned FP traps, and initialisation of everything, as things to look at.
And, as a workaround, as mentioned before: first do a run with 'good' numbers, say a res of 351, pause at 3000 or so steps, adjust params to what is actually wanted, and load the file again - training then proceeds normally ...
Playing with brush is like wrestling with a greasy snake ... just when you think you've got a decent grasp of how to make it behave, then it goes into a fit of crashing, with every small variation of trying something, not getting anywhere - oh, well ... :-D
Some good news ... for whatever reasons, the updates a day or so ago have stabilised the program; unless I really push it by requesting very high resolution, the training runs fine - on test runs, with low res, the speed of training is almost astonishing; I'm impressed! Thanks!
It's baaaack ... unfortunately, the divide by zero issue is back with a vengeance. Recent updates have worked well - I've been able to push brush to the point of getting (this time valid) out of memory messages; but the latest source, using the latest burn, crashes as soon as higher res is asked for - doesn't even get through the warmup phase. Shame after the recent work you've done - hopefully something like reverting to a previous burn resolves this ... Last thing - it doesn't like a folder with a '-' character in the name; it won't pick up the files within.
To get more info, I rolled back the cargo files in the head directory to yesterday's release, which happily compiled the latest source, and now a training run has no problem passing the glitching point ...
Ahh shit... thought we were getting somewhere! It's quite interesting that the recent updates have helped. I still have no reproduction on 3 separate machines with different datasets :/ To be fair, I have gotten another mysterious crash, but no STATUS_INTEGER_DIVIDE_BY_ZERO - such a frustrating bug. If reverting helps though, that might be a good clue! Is it possible to jump between the commits and see where things break? Is it the Burn upgrade? That would help massively - thank you for all the help and for continuing to try things!
Okay, specifically: the commits done on Dec 13 were left in, but I rolled back the Cargo.lock and Cargo.toml in the /brush folder to Dec 12. So the commits from "Fix bench" up to "Fix wasm, re-enable ssim on wasm" were compiled in, but the cargo files were those per the "Fix tests" commit. Note: the binary doesn't actually panic; what I get is: error: process didn't exit successfully:
Ok! So that does sound like the Burn upgrade. That's a good clue, though it might not be 100% Burn's fault - that update includes a different way of handling some out of bounds memory accesses. I've been trying to see if the Vulkan validation layers come up with anything, but no luck so far; let's see.
A bit more feedback ... I can provoke a binary using the older version of burn into throwing the divide by zero error - but it's harder to do. Using the bicycle preset dataset, with a larger 'refine every' parameter, more frequent opacity resets, and an init.ply checkpoint - trying to find a pattern, but it ended up confusing me. Strangely, it almost seemed that running the binary identically, in succession, made the error go away - some bizarre OS memory management issue?
Tell me about it 😅 I still can't repro STATUS_INTEGER_DIVIDE_BY_ZERO - but I have seen some other weird glitches on a larger training run (specifically, youtube videos going haywire - GPU memory corruption?). I've tried many things to get a pattern out of it, but it's so random and stochastic... Very occasionally I would get a crash (albeit without STATUS_INTEGER_DIVIDE_BY_ZERO), so I do imagine it's the same problem, maybe. As far as I did get anywhere, I also found that frequent refines & opacity resets seem to make things worse... Refines cause some bad GPU memory allocation patterns, so maybe that is related? Maybe it's all related to being near OOM? So tricky... Thanks for the digging!
Can't see it being OOM ... the crashes I get when I am pushing memory are classic panics, with suitable messages, "Can't allocate ...". The divides by zero often appear right when starting, with miles of room for the program to do its thing ...
Ah, thank you, that is an interesting data point - I've only ever gotten the glitchy behaviour near OOM (4M+ splats, 1245 pixels wide images). Have you seen it with:
- low splat counts?
- low resolutions?
- the window out of focus?
- live updating off?
Low splat counts: Yes - best splat count I've got has barely squeezed into 3M; approaching these numbers gives me an OOM message.
Low resolutions: Yes - though higher res makes it much more likely.
Window focus: Can't say.
Updating live: Normally always updating - will try this off, to see.
One other thing: have had a couple of crashes during export; a ply is created, but with no colour information!
Ohh yes, that is also interesting. The SH buffer is MUCH larger than the other parameters, so perhaps that's messing with some memory allocation? Can you try seeing if things are better without SH enabled? Thank you, this is really helpful!
Another data point ... just tried SH 0, full res, with an init.ply on the bicycle set - got to 2M splats after 4000 steps; tried to export ... crash! Panic with "Parent is lost", etc. Again, a valid ply is written, and the viewer can navigate it. But all splats are grey ...
Finally!! ... light at the end of the tunnel, etc. ... Tried your latest "extreme clippy" version (but not later commits), per the zip - no cargo changes. First signs very encouraging - no immediate error. But then the magic number syndrome: using a different init.ply, consistent crashing at around 190 steps; try again, this time 180; and yet again - gets to just over 200! Switch off live update - ah hah!! Clean running through to an OOM ... switching it momentarily on, then off, doesn't trigger a crash. Haven't sampled the very latest, in case anything there helps ... later today.
That is interesting! Makes it sound like it's a multithreading issue - the rendering does a separate queue.submit() vs. the training. I tried wgpu 23 before it was out and found a somewhat similar crash ages ago: gfx-rs/wgpu#6279. I just tried the latest wgpu 23, and I couldn't get that crash now, so maybe they have fixed something in that area. I've pushed a version that updates to wgpu 23. If that crashes, I'll try a version that forces a lock between the training and rendering and see if that helps at all.
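As an aside on that last idea, here is a minimal sketch of what forcing a lock between training and rendering could look like. This is not brush's actual code - the LockedQueue type and its field names are made up for illustration; the only real API used is wgpu's Queue::submit.

```rust
// Minimal sketch, assuming wgpu: serialise GPU submissions from the training
// and rendering threads behind a single lock, so their queue.submit() calls
// can never interleave. Not brush's real implementation; names are illustrative.
use std::sync::{Arc, Mutex};

#[derive(Clone)]
struct LockedQueue {
    queue: Arc<wgpu::Queue>,
    // Guards submission only; encoding command buffers can still run in parallel.
    submit_lock: Arc<Mutex<()>>,
}

impl LockedQueue {
    fn submit(&self, commands: wgpu::CommandBuffer) {
        // Both the render loop and the training loop would call this instead of
        // submitting to the raw queue directly.
        let _guard = self.submit_lock.lock().unwrap();
        self.queue.submit(std::iter::once(commands));
    }
}
```

The trade-off of a coarse lock like this is that the renderer stalls while a training submission is in flight, which is presumably why it is only mentioned as a fallback plan here.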
Likewise here ... have updated to just prior to the "[More seperation of process loop, ... " commit, and the training is indeed solid. Okay, I had a single divide by zero crash; but it was a peculiar situation, and I can't recall precisely the circumstances - if it turns up again, I'll mention it. One thing: if I have live update switched off, and then pause and load for a new training run, the scene window doesn't update at all. I have to close and restart the app to restore the functionality.
Ah, just to double check - "likewise" as in, same as before? So crashing with live update on, not crashing with it off? And oh yes, I've hit that other bug too. A GH issue is always appreciated, so as not to forget about such things! Do hope to fix it soon; the rework to make the CLI should make some of these things a bit cleaner.
Well, I thought I was in better shape! Now, not so sure ... "Likewise" meant that overall it was running clean - so live update on or off seemed not to matter. But the next day: crash after crash after crash! Live update off didn't even seem to solve things! Even a completely standard training of bicycle failed ... talk about being confused ... So many variables to play with ... I see a pattern, and then it disappears. Next round, I will take the latest brush and work through a variety of scenarios, starting from simple and building up - see if anything makes sense.
This is also quite interesting: KhronosGroup/Vulkan-Tools#1059. Can you run vulkaninfo and paste the output here? Curious what version you are on! Plus, seeing some other stats about your setup might help. Also, I just added a way to train from the CLI as well. In theory that should fix things too, as there's no multithreaded submission of GPU commands at all anymore in there. Curious to see if that's indeed the case!
Hurr, someone else on the Brush Discord had this issue and it was solved after getting the latest nVidia drivers 😅 So just in case, do give that a go as well!
Will try the vulkaninfo, etc. things soon ... in the meantime, installed the latest nVidia driver, and so far haven't triggered the error - fingers crossed! Brush Discord? Sounds good ... but I can't see it anywhere ... ??
Duuhhhh ... it was there all the time, but my brain glitched every time I looked at it!! Next: umm, I can't "claim my account" - Discord says I am there, as fas42, and just stares at me in the Finish Signing Up box ...
Still having issues with both integer divide by zero and, mainly, tokio panicking - latest source downloaded, compiled, and I immediately get, with not a single training step run:
...
...
Compiling brush-desktop v0.1.0 (D:\Apps\brush-main\crates\brush-desktop)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 20m 52s
Running `target\debug\brush_bin.exe`
thread 'tokio-runtime-worker' panicked at C:\Users\fstei.cargo\git\checkouts\cubecl-aa41a28b39b598f9\1c4e003\crates\cubecl-linalg\src\matmul\base.rs:51:14:
Accelerated strategy should be available on your device: Unable to launch matmul because a required feature is unavailable: Cmma on inputs Float(F16) and outputs Float(F32) with shape m=16, n=16, k=16 not supported.
This is on a test set of 29 images of a hall, with a training resolution of 702 - which seems to be a 'magic number' for getting this set to crash.
Edit: even the hotdog dataset now causes a panic, but in this case the training did start ...
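For context on what that panic text means: cubecl's accelerated matmul path expects tensor-core style CMMA support for f16 inputs with f32 accumulation, and the launch panics rather than falling back when the device (or driver/backend combination) doesn't report it. A rough sketch of the kind of selection logic implied - purely illustrative, not cubecl's actual API or types:

```rust
// Illustrative only - these names are made up, not cubecl's real API.
enum MatmulStrategy {
    Accelerated, // CMMA / tensor cores, e.g. f16 x f16 -> f32 at 16x16x16 tiles
    Plain,       // ordinary shader matmul, always available
}

fn pick_strategy(device_supports_cmma_f16_f32: bool) -> MatmulStrategy {
    if device_supports_cmma_f16_f32 {
        MatmulStrategy::Accelerated
    } else {
        // Panicking at this point is what the log above shows; a graceful
        // fallback would instead degrade to the plain strategy.
        MatmulStrategy::Plain
    }
}
```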