[BugFix] Fix memory limit exceeded problem when writing a partitioned Parquet table #43672
base: main
bfef844 to 473ad00
@mxdzs0612 I think we should try to merge small files before committing the final results, to improve later read performance.
@@ -48,7 +49,13 @@ std::future<Status> ParquetFileWriter::write(ChunkPtr chunk) {
if (auto status = _rowgroup_writer->write(chunk.get()); !status.ok()) {
return make_ready_future(std::move(status));
}
if (_rowgroup_writer->estimated_buffered_bytes() >= _writer_options->rowgroup_size) {
double mem_usage = 0.0;
It would be better to implement this in ConnectorSinkOperator so that all file formats can benefit from it.
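One way to picture this suggestion is the rough sketch below. It is not StarRocks code: `BufferedFileWriter`, `FakeWriter`, and `PartitionSinkSketch` are hypothetical names invented for illustration, and the sketch assumes the sink holds one writer per open partition and is handed a memory-usage ratio by its caller. The point is only that the flush decision sits above the format-specific writers.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical format-agnostic writer interface; not the actual StarRocks API.
class BufferedFileWriter {
public:
    virtual ~BufferedFileWriter() = default;
    virtual int64_t buffered_bytes() const = 0; // bytes held in memory, not yet written
    virtual void flush() = 0;                   // write buffered data out and release it
};

// Stand-in writer used only to make the sketch runnable.
class FakeWriter : public BufferedFileWriter {
public:
    explicit FakeWriter(int64_t bytes) : _bytes(bytes) {}
    int64_t buffered_bytes() const override { return _bytes; }
    void flush() override { _bytes = 0; }

private:
    int64_t _bytes;
};

// Hypothetical sink-side helper that owns one writer per open partition.
class PartitionSinkSketch {
public:
    void add_writer(std::unique_ptr<BufferedFileWriter> writer) {
        _writers.push_back(std::move(writer));
    }

    // Flushes every writer that holds data once mem_usage (0.0 to 1.0)
    // reaches the threshold. Returns the number of writers flushed.
    int flush_if_under_pressure(double mem_usage, double threshold) {
        if (mem_usage < threshold) return 0;
        int flushed = 0;
        for (auto& writer : _writers) {
            if (writer->buffered_bytes() > 0) {
                writer->flush();
                ++flushed;
            }
        }
        return flushed;
    }

    // Sum of the bytes still buffered across all writers.
    int64_t total_buffered_bytes() const {
        int64_t total = 0;
        for (const auto& writer : _writers) total += writer->buffered_bytes();
        return total;
    }

private:
    std::vector<std::unique_ptr<BufferedFileWriter>> _writers;
};
```

Because the helper only depends on the abstract interface, a Parquet writer and any other format writer would be treated the same way, which is the benefit the comment points at.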
Sensing the memory state inside the operator is not a good choice; this should be decided before scheduling.
By the way, what you want to do seems similar to #25053; you may want to refer to it for more context.
Signed-off-by: Jiao Mingye <[email protected]>
[FE Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[BE Incremental Coverage Report] ✅ pass : 6 / 6 (100.00%)
Why I'm doing:
We may hit a memory limit exceeded error when writing a Parquet table with many partitions through the connector sink (e.g. partitioned by dayofyear(xxxx)).
This PR fixes the problem by flushing row groups when the memory usage of a BE approaches the threshold, which is 80% of the total memory by default.
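The flush condition described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `FlushPolicy` and `should_flush` are hypothetical names, and the memory figures are passed in as plain parameters rather than read from the BE's memory tracker.

```cpp
#include <cstdint>

// Hypothetical policy object; the field names are assumptions for illustration.
struct FlushPolicy {
    int64_t rowgroup_size_bytes = 0;  // normal per-row-group flush threshold
    double mem_usage_threshold = 0.8; // fraction of total memory (80% per the description)
};

// Returns true when the buffered row group should be flushed, either because
// it reached the configured row group size or because process memory usage
// reached the threshold while some data is still buffered.
inline bool should_flush(const FlushPolicy& policy, int64_t buffered_bytes,
                         int64_t process_mem_used, int64_t process_mem_limit) {
    if (buffered_bytes >= policy.rowgroup_size_bytes) return true;
    if (process_mem_limit <= 0) return false; // no usable limit, fall back to size only
    const double mem_usage = static_cast<double>(process_mem_used) /
                             static_cast<double>(process_mem_limit);
    return buffered_bytes > 0 && mem_usage >= policy.mem_usage_threshold;
}
```

With many open partitions, each writer may hold a row group far below `rowgroup_size_bytes`, so the size check alone never fires; the second check is what lets those small buffers drain before the limit is hit.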
What I'm doing:
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: