Repartition and generate a specified num of files #625
Comments
I'm not certain what the …
Just a custom parameter, like this:
I think this makes sense if you want to control the number of files written out; you will still have small files if there is data skew, however.
Another implementation would be to add a synthetic key to the repartition. Note: I did previously comment the above with some logic, but it looks to have been removed or didn't submit correctly. I will happily regenerate it if required.
In fact, I think users need to be able to control the generated files more precisely, but for now I use Spark adaptive execution (AQE).
That sortKey helped me reduce the number of touched files (found by findTouchedFiles); it is very useful.
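For reference, a minimal sketch of the adaptive-execution settings this refers to; the keys are standard Spark 3.x configs, and the advisory partition size is only an illustrative value:

```scala
import org.apache.spark.sql.SparkSession

// Standard Spark 3.x AQE settings; the 128m advisory size below is an example value.
val spark = SparkSession.builder()
  .appName("aqe-small-files-example")
  // Enable adaptive query execution and post-shuffle partition coalescing.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // Target size per post-shuffle partition, which roughly bounds the size of written files.
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
  .getOrCreate()
```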
Your implementation may still lead to many small files if your output is partitioned by a column or columns, and one thing to be aware of is that it may also hinder or bottleneck write performance. For example, say the output is partitioned by country and we have 50 countries: 49 countries have an output size of 5 GB each and 1 country has 20 GB. Rather than the skewed country having one 20 GB part file, we want twenty 1 GB part files. Your implementation will look something like a plain repartition to a fixed number of files before the partitioned write.
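A minimal sketch of the two behaviours being contrasted, assuming a table partitioned by country; the DataFrame, file count and output paths are placeholders:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Illustrative only: the column name, file count and paths are assumptions.
def writeVariants(df: DataFrame, numFiles: Int, basePath: String): Unit = {
  // Repartitioning by the partition column alone: every row of a given country goes to one
  // shuffle partition, so the skewed 20 GB country is written by a single task as one big file.
  df.repartition(col("country"))
    .write.format("delta").partitionBy("country").mode("overwrite")
    .save(s"$basePath/by_column")

  // Hash-repartitioning to a fixed number of partitions instead: the skewed country is split
  // across tasks, but each task may now hold rows for many countries, so every country
  // directory can receive up to `numFiles` (mostly small) part files.
  df.repartition(numFiles)
    .write.format("delta").partitionBy("country").mode("overwrite")
    .save(s"$basePath/by_count")
}
```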
Yes, you are right, but I cannot find a better way right now. Maybe we should reduce the number of hit partitions, or balance the data across partitions and turn on Spark adaptive execution to generate the outputs.
You could do something like this?

```scala
protected def repartitionIfNeeded(
    spark: SparkSession,
    df: DataFrame,
    partitionColumns: Seq[String]): DataFrame = {
  if (partitionColumns.nonEmpty && spark.conf.get(DeltaSQLConf.MERGE_REPARTITION_BEFORE_WRITE)) {
    // MERGE_MAX_PARTITION_FILES default = 1, meaning no change to the original functionality.
    // When MERGE_MAX_PARTITION_FILES > 1, syntheticCol helps split each partition into smaller
    // chunks, which mitigates skewed partitions.
    val maxFiles = spark.conf.get(DeltaSQLConf.MERGE_MAX_PARTITION_FILES)
    val syntheticCol = (rand() * maxFiles).cast(IntegerType)
    df.repartition(syntheticCol +: partitionColumns.map(col): _*)
  } else if (partitionColumns.isEmpty &&
      spark.conf.get(DeltaSQLConf.MERGE_REPARTITION_BEFORE_WRITE)) {
    df.repartition(spark.conf.get(DeltaSQLConf.MERGE_MAX_PARTITION_FILES))
  } else {
    df
  }
}
```

This should keep the parallelism in the write phase when partitioning by a column or columns. When not partitioning by a column we still have the potential to reduce the number of output files, or we could remove that branch and let AQE do what it needs to. NB: this might not be the final implementation, just an idea :)
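For the sketch above to compile, DeltaSQLConf would also need the proposed MERGE_MAX_PARTITION_FILES entry. A possible shape, modelled on the existing MERGE_REPARTITION_BEFORE_WRITE flag; the key name, doc text and default here are assumptions:

```scala
// Hypothetical config entry for the proposal above; key name, doc and default are assumptions.
val MERGE_MAX_PARTITION_FILES =
  buildConf("merge.repartitionBeforeWrite.maxPartitionFiles")
    .internal()
    .doc("Maximum number of files to write per table partition when " +
      "merge.repartitionBeforeWrite.enabled is set. The default of 1 keeps the current behaviour.")
    .intConf
    .checkValue(_ >= 1, "maxPartitionFiles must be at least 1")
    .createWithDefault(1)
```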
Well, your idea is worth trying; I will run some tests for it. I have been looking for a solution to the small-files problem for a long time, and it is great to discuss it with you.
Ah yeah it’s an interesting area and I commonly see people accept the issue rather than reduce it. |
Thanks for bringing this to our attention. We will take a look. |
I do not know why the Delta repartition repartitions by partitionColumn; it may lead to data skew, like this:
so I made some changes, like this:
It can repartition by hash and generate a specified number of files, which helps me better control the size and number of small files.
Does this seem reasonable?
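A minimal sketch of the contrast being described; the method names and parameters below are placeholders, not the actual patch:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Current behaviour: hash only on the table's partition columns, so every row of a given
// partition value lands in one task, and a skewed value becomes a single large file.
def repartitionByPartitionColumns(df: DataFrame, partitionColumns: Seq[String]): DataFrame =
  df.repartition(partitionColumns.map(col): _*)

// Described change: hash-repartition into a caller-specified number of partitions so the
// subsequent write produces roughly `numFiles` output files, independent of skew.
def repartitionToFileCount(df: DataFrame, numFiles: Int): DataFrame =
  df.repartition(numFiles)
```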