[GLUTEN-8330][VL] Improve convert the viewfs path to hdfs path #8331

Merged
JkSelf merged 4 commits into apache:main from the GLUTEN-8330 branch on Dec 26, 2024

@wangyum wangyum commented Dec 24, 2024

What changes were proposed in this pull request?

  1. Avoid RPC calls when converting viewfs paths to HDFS paths.
  2. Add a cache for the viewfs-to-HDFS path mapping.

(Fixes: #8330)

How was this patch tested?

Manual test converting up to 50,000 paths; before: 225007 ms, after: 1330 ms:

import org.apache.hadoop.fs.viewfs.ViewFileSystemUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val paths = fs.listStatus(new Path("viewfs://my-cluster/path/to/table")).take(50000).map(_.getPath.toString)

val start1 = System.currentTimeMillis
paths.map(p => FileSystem.get(new Path(p).toUri, spark.sparkContext.hadoopConfiguration).resolvePath(new Path(p)))
println(s"Convert viewfs to HDFS using resolvePath took: ${System.currentTimeMillis - start1} ms")

val start2 = System.currentTimeMillis
paths.map(p => ViewFileSystemUtils.convertViewfsToHdfs(p, spark.sparkContext.hadoopConfiguration))
println(s"Convert viewfs to HDFS using convertViewfsToHdfs took: ${System.currentTimeMillis - start2} ms")

Output:

Convert viewfs to HDFS using resolvePath took: 225007 ms
Convert viewfs to HDFS using convertViewfsToHdfs took: 1330 ms
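
A rough sketch of why the second approach can be so much faster: resolvePath generally costs a NameNode round trip per path, while a viewfs path can in principle be rewritten locally from the client-side mount table. The code below is a hypothetical illustration in plain Scala, not the implementation of ViewFileSystemUtils.convertViewfsToHdfs. The object name ViewfsMountSketch, the mount table contents, and the hdfs://nn1 and hdfs://nn2 targets are all assumptions made for the example.

```scala
object ViewfsMountSketch {
  // Hypothetical mount table: mount point -> target HDFS URI.
  // In a real deployment such links usually come from Hadoop configuration keys
  // of the form fs.viewfs.mounttable.<name>.link.<mount point>.
  val mountTable: Map[String, String] = Map(
    "/warehouse" -> "hdfs://nn1/warehouse",
    "/user" -> "hdfs://nn2/user"
  )

  // Rewrites a viewfs path using only the in-memory mount table (no RPC).
  // Paths that are not viewfs paths, or match no mount point, are returned unchanged.
  def convert(viewfsPath: String, cluster: String): String = {
    val prefix = s"viewfs://$cluster"
    if (!viewfsPath.startsWith(prefix + "/")) {
      viewfsPath
    } else {
      val rest = viewfsPath.substring(prefix.length)
      val matches = mountTable.keys.filter(mp => rest == mp || rest.startsWith(mp + "/"))
      if (matches.isEmpty) {
        viewfsPath
      } else {
        // Prefer the longest matching mount point.
        val mountPoint = matches.maxBy(_.length)
        mountTable(mountPoint) + rest.substring(mountPoint.length)
      }
    }
  }
}
```

Because the lookup is a string operation on data already held by the client, its cost does not depend on NameNode latency, which is consistent with the gap between the two timings above.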

@github-actions github-actions bot added the CORE (works for Gluten Core) and VELOX labels Dec 24, 2024
@zhouyuan zhouyuan changed the title [GLUTEN-8330] Improve convert the viewfs path to hdfs path [GLUTEN-8330][VL] Improve convert the viewfs path to hdfs path Dec 25, 2024
@@ -377,26 +378,28 @@ case class WholeStageTransformer(child: SparkPlan, materializeInput: Boolean = f
val allScanSplitInfos =
getSplitInfosFromPartitions(basicScanExecTransformers, allScanPartitions)
if (GlutenConfig.getConf.enableHdfsViewfs) {
val viewfsToHdfsCache: mutable.Map[String, String] = mutable.Map.empty
Contributor

Would this cache take too much memory?

Member Author

No, the size of this map is at most equal to the total number of partitions of these tables.
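
The bound described in this reply can be sketched as follows, assuming the cache is keyed by each file's parent directory (that is, the partition directory) rather than by the full file path. This is a hypothetical illustration: ViewfsCacheSketch, resolveDir, and the hdfs://nn1 target are made-up names, and the real keying scheme in WholeStageTransformer may differ.

```scala
import scala.collection.mutable

object ViewfsCacheSketch {
  // Hypothetical stand-in for the real (more expensive) directory conversion.
  def resolveDir(dir: String): String =
    dir.replaceFirst("^viewfs://my-cluster", "hdfs://nn1")

  // Converts every file path, resolving each distinct parent directory only once.
  // Returns the converted paths and the final number of cache entries.
  def convertAll(paths: Seq[String]): (Seq[String], Int) = {
    val cache = mutable.Map.empty[String, String]
    val converted = paths.map { p =>
      val idx = p.lastIndexOf('/')
      val dir = p.substring(0, idx)
      val name = p.substring(idx) // keeps the leading '/'
      cache.getOrElseUpdate(dir, resolveDir(dir)) + name
    }
    (converted, cache.size)
  }
}
```

With this keying, many files in the same partition share one entry, so the map grows with the number of partition directories and not with the number of files.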

Contributor

@zhouyuan zhouyuan left a comment

👍

Contributor

@JkSelf JkSelf left a comment

LGTM except for two small comments. Thanks for your work.

Contributor

@JkSelf JkSelf left a comment

LGTM. Thanks.

JkSelf commented Dec 25, 2024

@wangyum Could you help resolve the following compile error? Thanks.

Error:  /__w/incubator-gluten/incubator-gluten/gluten-substrait/src/main/scala/org/apache/gluten/execution/WholeStageTransformer.scala:386: error: type mismatch;
Error:   found   : scala.collection.mutable.Buffer[String]
Error:   required: Seq[String]
Error:                    splitInfo.getPaths.asScala,
Error:                                       ^
Warning: [WARNING] 1 warning
Error: [ERROR] 1 error
Error:  exception compilation error occurred!!!
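
For context, this kind of mismatch typically shows up on Scala 2.13, where the default Seq is scala.collection.immutable.Seq and a mutable.Buffer returned by asScala no longer conforms to it. One common fix, shown here as a minimal sketch and not necessarily the change made in this PR, is to call toSeq on the converted collection. AsScalaToSeqSketch and countPaths are hypothetical names standing in for the real call site.

```scala
// JavaConverters is deprecated on 2.13 but still compiles on both 2.12 and 2.13.
import scala.collection.JavaConverters._

object AsScalaToSeqSketch {
  // Hypothetical stand-in for a method whose parameter type is Seq[String].
  def countPaths(paths: Seq[String]): Int = paths.size

  def fromJava(javaPaths: java.util.List[String]): Int =
    // asScala yields a mutable.Buffer[String]; toSeq turns it into a Seq[String]
    // that type-checks on both Scala 2.12 and 2.13.
    countPaths(javaPaths.asScala.toSeq)
}
```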

@JkSelf JkSelf merged commit 5d1d0ba into apache:main Dec 26, 2024
46 checks passed
@wangyum wangyum deleted the GLUTEN-8330 branch December 26, 2024 02:04
yikf pushed a commit to yikf/incubator-gluten that referenced this pull request Dec 26, 2024
Successfully merging this pull request may close these issues.

[VL] Improve the performance of converting from viewfs to hdfs