
wip: Support UDF for spark-rapids-tools #1347

Draft · wants to merge 17 commits into base: dev

Conversation

@Heatao Heatao commented Sep 18, 2024

No description provided.

@tgravescs (Collaborator)

Hello @Heatao, could you please file an issue and fill in the description with details on what feature is being added here?
I'd like to make sure we are on the same page about the feature and how it's being implemented.

@amahussein amahussein left a comment

I assume that the title of the PR does not correctly reflect the changes here.
IIUC, this change aims at running the Qualification tool as a UDF from within Spark.
I am not sure why we would need to make that change in the tools core. Why not simply submit qualification as a Spark job?
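
For reference, a minimal sketch of that alternative using Spark's SparkLauncher; the jar path, master setting, output path, and event-log path below are placeholders, not values from this PR:

// Hypothetical sketch: submit the existing Qualification entry point as an
// ordinary Spark job instead of embedding it in tools core.
import org.apache.spark.launcher.SparkLauncher

val handle = new SparkLauncher()
  .setAppResource("/path/to/rapids-4-spark-tools_2.12.jar") // placeholder jar
  .setMainClass("com.nvidia.spark.rapids.tool.qualification.QualificationMain")
  .setMaster("yarn") // placeholder master
  .addAppArgs("--output-directory", "hdfs:///tools-output", "/path/to/eventlogs")
  .startApplication()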

Comment on lines +613 to +634
<!-- <exclusions>-->
<!-- <exclusion>-->
<!-- <groupId>org.codehaus.jackson</groupId>-->
<!-- <artifactId>jackson-core-asl</artifactId>-->
<!-- </exclusion>-->
<!-- <exclusion>-->
<!-- <groupId>org.codehaus.jackson</groupId>-->
<!-- <artifactId>jackson-mapper-asl</artifactId>-->
<!-- </exclusion>-->
<!-- <exclusion>-->
<!-- <groupId>org.codehaus.jackson</groupId>-->
<!-- <artifactId>jackson-jaxrs</artifactId>-->
<!-- </exclusion>-->
<!-- <exclusion>-->
<!-- <groupId>org.codehaus.jackson</groupId>-->
<!-- <artifactId>jackson-xc</artifactId>-->
<!-- </exclusion>-->
<!-- <exclusion>-->
<!-- <groupId>com.google.protobuf</groupId>-->
<!-- <artifactId>protobuf-java</artifactId>-->
<!-- </exclusion>-->
<!-- </exclusions>-->

Remove the commented-out code.

@@ -443,6 +458,7 @@
<spark314.version>3.1.4-SNAPSHOT</spark314.version>
<spark320.version>3.2.0</spark320.version>
<spark321.version>3.2.1</spark321.version>
<bdspark321.version>3.2.1-bd1-SNAPSHOT</bdspark321.version>

Is this available on the public Maven repo?

@@ -502,6 +518,7 @@
<project.build.sourceEncoding>${platform-encoding}</project.build.sourceEncoding>
<project.reporting.sourceEncoding>${platform-encoding}</project.reporting.sourceEncoding>
<project.reporting.outputEncoding>${platform-encoding}</project.reporting.outputEncoding>
<hive.version>1.2.2-bd103</hive.version>

Is this available on public Maven? And why do we need to define a custom Hive version for basic functionality? It will only run in a specific environment.

RandomForestRegressor-pyspark,3.66
RandomForestRegressor-scala,1
XGBoost-pyspark,1
XGBoost-scala,3.31

This file is going to be deprecated and removed anyway, since we rely on qualx for speedups.
FWIW, it is missing some expressions added recently to the qualification tool, e.g.:

RoundCeil,4
RoundFloor,4
BloomFilterMightContain,4
BloomFilterAggregate,4
EphemeralSubstring,4
KnownNullable,4
InSubqueryExec,4
AQEShuffleReadExec,4
CheckOverflowInTableInsert,4
ArrayFilter,1.5
BoundReference,1.5
HiveHash,1.5
MapFromArrays,1.5
DecimalSum,1.5
MaxBy,1.5
MinBy,1.5
ArrayJoin,1.5

None, None, List(eventPath), hadoopConf)
val filteredLogs = eventLogFsFiltered

// uiEnabled = false

Remove unnecessary code; uiEnabled has been removed from the code base.

val filteredLogs = eventLogFsFiltered

// uiEnabled = false
val qual = new Qualification(

It is possible to call QualificationMain.mainInternal instead
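
A rough sketch of that suggestion, assuming the entry point can be driven through QualificationArgs as in the tool's tests; the exact mainInternal signature should be verified against the current code base, and the flag values are illustrative:

// Sketch: reuse the existing entry point instead of constructing
// Qualification by hand.
import com.nvidia.spark.rapids.tool.qualification.{QualificationArgs, QualificationMain}

val appArgs = new QualificationArgs(Array(
  "--output-directory", outputDirectory, // assumed in scope, as in the diff
  eventPath))
val result = QualificationMain.mainInternal(appArgs)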

}
}

private def execMLPredict(applicationId: String, outputDirectory: String): String = {

The architecture of the tools is that the jar is executed by the Python wrapper.
Putting an exec to run the predict command from within the jar means the jar and the Python wrapper each call the other, which is inconsistent and duplicates work.

metadata, metrics) {
val stageId: Option[Int],
planId: Int = 0) extends SparkPlanInfo(nodeName, simpleString, children,
metadata, metrics, planId) {

Does this extra argument break other environments?
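
If it does, one option (a sketch inferred from the diff above, not the PR's approach) is to keep planId as a field on the subclass, so the parent call still matches the upstream SparkPlanInfo constructor:

// Illustrative only: planId stays on the subclass, so the call to the parent
// constructor compiles against upstream Spark, which has no planId parameter.
import org.apache.spark.sql.execution.SparkPlanInfo
import org.apache.spark.sql.execution.metric.SQLMetricInfo

class SparkPlanInfoWithStage(
    nodeName: String,
    simpleString: String,
    children: Seq[SparkPlanInfo],
    metadata: Map[String, String],
    metrics: Seq[SQLMetricInfo],
    val stageId: Option[Int],
    val planId: Int = 0)
  extends SparkPlanInfo(nodeName, simpleString, children, metadata, metrics)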

@@ -0,0 +1,148 @@
/*
* Copyright (c) 2021-2023, NVIDIA CORPORATION.

Nit: Update copyright year in this file and all other files where applicable.

+1, it should be Copyright (c) 2024 for newly added files.

@wjxiz1992 (Collaborator) commented Sep 19, 2024

Some background:

  1. The customer is running in their k8s cluster with a customized Docker container and cannot mount an external volume, so local files are lost after the container is destroyed.
  2. Their Spark is highly customized (not any public community version), as are some other dependencies.
  3. Target is to get the "prediction speedup" produced by qualx for their 30,000+ Spark jobs.
  4. Blocked by [FEA] Tool output file support for writing to HDFS file system #1348.

This is more of a description of what the customer is trying to do, so that we can understand and help accordingly.

@tgravescs (Collaborator)

Target is to get the "prediction speedup" produced by qualx for their 30,000+ Spark jobs.

The Qual tool has been explicitly moving away from speedup numbers, so I think we want to be careful about adding any tool that emphasizes them again. I realize this may fit some customers better, but we need to figure out how to expose it to users if we choose to, or whether it's really a customer-specific tool.

@amahussein amahussein marked this pull request as draft October 15, 2024 14:44