Memory allocation error due to Longtext column #43
Hello, I can look at this later, but your report has been very helpful so far. May I draw your attention to the batch size: I guess in your case that is too much. You can of course lower the batch size. That is likely to work, because modern machines have stupid amounts of memory, but it will be slow, since you need a roundtrip for each row of the database. Currently there is no option for you to provide the domain knowledge of the maximum value size via a command line parameter (would this help you?). Sometimes casting to a fixed VARCHAR() in the select statement helps (if you try it, please tell me how it goes). The very least I can do is to provide a better error message. I am totally open to suggestions and ideas from your side.
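To put the arithmetic behind that advice into concrete terms, here is a minimal sketch of how a columnar fetch buffer ends up being sized when it has to reserve room for the largest possible value in every row of a batch. The concrete numbers (reported maximum length, bytes per code unit, batch size) are illustrative assumptions, not values read from odbc2parquet or the MariaDB driver.

```rust
// Back-of-envelope sketch of the buffer sizing described above. All the
// concrete numbers below are assumptions for illustration only.

fn required_buffer_bytes(max_value_len: u64, bytes_per_unit: u64, batch_size: u64) -> u64 {
    // A columnar fetch buffer reserves space for the largest *possible*
    // value in every row of the batch, regardless of the actual data.
    max_value_len * bytes_per_unit * batch_size
}

fn main() {
    // LONGTEXT reports a maximum length in the billions.
    let reported_max_len: u64 = 4_294_967_295;
    // Assume one byte per code unit; a wide (UTF-16) buffer would double this.
    let bytes_per_unit: u64 = 1;
    // Hypothetical batch size in rows.
    let batch_size: u64 = 100_000;

    let bytes = required_buffer_bytes(reported_max_len, bytes_per_unit, batch_size);
    println!(
        "Fetch buffer for one LONGTEXT column: {:.1} TiB",
        bytes as f64 / (1u64 << 40) as f64
    );
    // Even though the largest real value is only ~366k characters, the buffer
    // is sized for the declared maximum, hence the failed allocation.
}
```

Lowering the batch size shrinks only the last factor, which is why it helps as a stopgap but degrades throughput.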
Well, do I feel stupid now :D

The domain knowledge bit could be interesting, but I'm guessing it's a pain in the ass to implement, and given the, uhm, let's call it "questionable" design of my source db here, I'm not sure it'll really work out great. This isn't the only field that's completely off in sizing, just the largest one of those that are off. And as what the largest data in there is could technically change (given that the data type obviously allows it), that could be tricky for exports that run more frequently, unless there's some way of determining something like that automatically.

I'll have to look into the cast option just out of curiosity, but the smaller batch size is fine for me now, so this would just be to see if I can make it work with that at all. Currently the cast yells at me because of unicode/ascii issues, so I'm not entirely sure yet if that's a feasible way or not.

Yeah, not sure what options you have regarding the error message. Can you "catch" the allocation error and print something along the lines of what you mentioned here regarding largest column times batch size? It completely makes sense, it just didn't occur to me at all.
I think you are far from stupid, and I am happy you raised the issue. I do not know how much I can do to make this work out of the box, but at the very least the tool has a UX problem in case of out-of-memory errors. I could calculate the required size beforehand and give a fair warning if it goes over a certain amount. I also came across a gem in the ODBC 4.0 specification that would allow me to set an upper bound and fetch truncated values later. Documentation is sparse, though, and this probably needs support from the driver, too (I don't know, just a guess). Yet it is worth checking out.
How would you feel about specifying the desired memory usage instead?
That sounds like a great idea; that would make it really easy to plan for.
So far my strategy for handling these large columns and memory allocations in general is:
The newest version allows specifying the desired memory usage. It defaults to 2 GiB on 64-bit platforms. There is still more that could be done, both in terms of streaming large values and of being able to increase the buffer size upon encountering larger values. At least the latter requires driver support, and some research on my side is needed on whether this would work with MariaDB.
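The idea behind such an option is presumably the inverse calculation: derive the number of rows per batch from a memory budget rather than asking the user to guess a row count. The sketch below illustrates that idea under assumed numbers; it is not the actual odbc2parquet implementation, and only the 2 GiB default is taken from the comment above.

```rust
// Sketch of deriving a batch size (in rows) from a memory budget. Not the
// actual odbc2parquet code; only the 2 GiB default comes from the thread,
// everything else is assumed for illustration.

/// Bytes one row occupies in the fetch buffers: the sum of each column's
/// maximum value size as reported by the driver.
fn bytes_per_row(column_max_sizes: &[u64]) -> u64 {
    column_max_sizes.iter().sum()
}

/// How many rows fit into the budget, with a floor of one row so that a
/// single huge column still yields a usable (if slow) batch size.
fn batch_size_for_budget(memory_budget: u64, column_max_sizes: &[u64]) -> u64 {
    let per_row = bytes_per_row(column_max_sizes).max(1);
    (memory_budget / per_row).max(1)
}

fn main() {
    let two_gib: u64 = 2 * 1024 * 1024 * 1024;

    // Hypothetical table: a couple of small columns plus one LONGTEXT-like
    // column that reports a multi-gigabyte maximum length.
    let columns = [255, 1_024, 4_294_967_295u64];

    let rows = batch_size_for_budget(two_gib, &columns);
    println!("Rows per batch within a 2 GiB budget: {rows}");
    // With a ~4 GiB maximum-size column even one row exceeds the budget,
    // so the floor kicks in and fetching degrades to row-by-row.
}
```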
Hey, I'm not entirely sure this issue will make sense, but I'm an absolute Parquet noob and found your tool as a way of dumping stuff from a MariaDB database to Parquet to provide it to other folks.
I'm encountering a memory allocation error:
which I'm fairly certain should be connected to the following:
which in MySQL is this:
The factor between the column's `max_str_len` and the memory allocation is a bit more than 100000, so this appears too connected to be random to me.

I have no influence over the source data, so I will not be able to convince anyone to change the type of this field from LONGTEXT to something more reasonable. The largest entry in this column is 366211 characters, so there's definitely no data in there that would require a memory allocation of 143 TB.
I'm not entirely sure why this happens though, hence this issue.
The maximum length for an entry in a LONGTEXT column is 4.3 GB, which, again, none of the entries are even close to reaching, but no one will be touching this type either. But how could this lead to a memory allocation of a bit more than 100000 times that?
I'm guessing the allocation happens somewhere around https://github.com/pacman82/odbc2parquet/blob/master/src/query.rs#L417-L428, given that this is a field of type "other"? The entire loop runs through though, as you can see above: the table in question has 182 columns, and we see column/buffer descriptions for every column. The memory allocation error happens after that.
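As an aside for readers following the linked lines: the snippet below is a heavily simplified, hypothetical sketch of how a per-column buffer description can drive an allocation like this. The type and function names are invented for illustration and do not mirror the actual code in query.rs.

```rust
// Hypothetical illustration only; names and structure are invented and do
// not correspond to odbc2parquet's real query.rs.

#[allow(dead_code)]
enum ColumnBufferDescription {
    // Fixed-size types carry their own size.
    I64,
    F64,
    // Text-like ("other") columns fall back to the maximum length the
    // driver reports for the column.
    Text { max_str_len: usize },
}

fn allocate_fetch_buffer(desc: &ColumnBufferDescription, batch_size: usize) -> Vec<u8> {
    let element_size = match desc {
        ColumnBufferDescription::I64 | ColumnBufferDescription::F64 => 8,
        // This is where a LONGTEXT column hurts: the driver-reported maximum,
        // not the actual data, determines the element size.
        ColumnBufferDescription::Text { max_str_len } => *max_str_len + 1, // + terminating zero
    };
    // batch_size elements of element_size bytes each; for a multi-gigabyte
    // max_str_len this multiplication is what makes the allocation fail.
    vec![0u8; element_size * batch_size]
}

fn main() {
    let desc = ColumnBufferDescription::Text { max_str_len: 512 };
    let buffer = allocate_fetch_buffer(&desc, 4);
    println!("Allocated {} bytes", buffer.len());
}
```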
Do you have any ideas of what could be done about this? It would be really nice to dump this data into Parquet, but with it randomly crashing right now I'm entirely at a loss :)
I'm running odbc2parquet 0.5.9, installed via cargo on Debian Buster.

If I can provide any more data that could help here, I'm completely up for that!