[SPARK-50792][SQL] Format binary data as a binary literal in JDBC. #49452

Closed
Changes from 8 commits
@@ -986,4 +986,23 @@ private[v2] trait V2JDBCTest extends SharedSparkSession with DockerIntegrationFu
test("scan with filter push-down with date time functions") {
testDatetime(s"$catalogAndNamespace.${caseConvert("datetime")}")
}

test("SPARK-50792 Format binary data as a binary literal in JDBC.") {
val tableName = s"$catalogName.test_binary_literal"
withTable(tableName) {
// Unfortunately, Oracle can only compare two BLOBs with the special function dbms_lob.compare,
// and V2ExpressionSQLBuilder cannot rewrite the '=' operator to use it.
// So we neither test nor support binary data on Oracle.
if (!this.isInstanceOf[OracleIntegrationSuite]) {
Contributor

The test case skips Oracle, but users may still hit this case on Oracle.

Contributor Author
@sunxiaoguang sunxiaoguang Jan 12, 2025

Yes, that requires a significant change in how V2ExpressionSQLBuilder works. The binary literal was not working before this PR anyway, so we can at least make it work on the other databases this time, and then propose a separate design change to make V2ExpressionSQLBuilder flexible enough to support cases like this.

Contributor Author

As Oracle is supported, the test is now enabled for Oracle.

Contributor

That's not a good idea if the bug still exists when using the Oracle dialect. Is there a way to fix it?

Contributor

Yes, that requires a significant change in how V2ExpressionSQLBuilder works. The binary literal was not working before this PR anyway. We can make it work at least on other databases this time. And propose another design change to V2ExpressionSQLBuilder to make it flexible enough to support cases like this.

Why do we need to change V2ExpressionSQLBuilder? Please describe the details.

Contributor Author
@sunxiaoguang sunxiaoguang Jan 13, 2025

Sorry for the confusion; I didn't check the inheritance structure initially. After realizing that each dialect embeds a builder that inherits from V2ExpressionSQLBuilder, I made all the changes in OracleDialect. PTAL

Contributor Author

BTW: I just changed the test to also verify the returned result. It's going to run for a while.

Contributor Author

All tests passed, but downloading the report failed. We can rerun the whole test run to clear all the checks, but it takes quite some time to finish.
https://github.com/sunxiaoguang/spark/actions/runs/12744731853

// Create a table with a binary column
val binary = "X'123456'"

sql(s"CREATE TABLE $tableName (binary_col BINARY)")
sql(s"INSERT INTO $tableName VALUES ($binary)")

val select = s"SELECT * FROM $tableName WHERE binary_col = $binary"
assert(spark.sql(select).collect().length === 1, s"Binary literal test failed: $select")
}
}
}
Contributor

Please add a similar test case to JDBCV2Suite.

Contributor

Please prepare the data in tablePreparation.

Contributor Author
@sunxiaoguang sunxiaoguang Jan 11, 2025

tablePreparation

Sorry, I'm not very familiar with the test infrastructure, so let me confirm this question in case I make mistakes.

To mix in tablePreparation and dataPreparation from the trait defined in V2JDBCTest.scala, we need to update all the integration tests to call these functions defined in the trait. And duplicating that extra call across multiple integration tests is OK, am I right?

Contributor Author
@sunxiaoguang sunxiaoguang Jan 11, 2025

Hm, I just realized I have to use Spark SQL to create the table and use the types defined in Spark SQL. If I prepare the table and data in tablePreparation and dataPreparation, they will have to be database specific, so the code would definitely have to be duplicated across the connectors for all the databases.

Contributor
@beliefer beliefer Jan 11, 2025

Yes. If we update the base class JdbcDialect, we should test all the built-in integration tests.
tablePreparation is used to customize the DDL; I'm afraid Spark SQL cannot cover all the built-in integration tests. But you could make your best effort; let's see the result and then make the decision.

Contributor Author

Wait, I just realized that each dialect embeds a builder which can override the implementation. Let me have a try.

Contributor Author

Oracle support is ready for review, PTAL. Thanks.

Contributor

You can override visitBinaryComparison in OracleSQLBuilder.

Contributor Author
@sunxiaoguang sunxiaoguang Jan 13, 2025

There are a couple of threads discussing this topic, so let me copy the comment here in case it's missed.

We can only rewrite the comparison when one of the arguments is a BLOB; for other cases, we have to use the existing implementation. But unfortunately, the signature of visitBinaryComparison accepts everything as strings, which loses the type information needed to tell whether one of the arguments is of binary type.

  protected String visitBinaryComparison(String name, String l, String r) {
    if (name.equals("<=>")) {
      return "((" + l + " IS NOT NULL AND " + r + " IS NOT NULL AND " + l + " = " + r + ") " +
              "OR (" + l + " IS NULL AND " + r + " IS NULL))";
    }
    return l + " " + name + " " + r;
  }
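A type-aware variant of this hook is one way around the limitation. The sketch below is hypothetical, not Spark's actual API: the extra `operandsAreBlob` flag does not exist in `V2ExpressionSQLBuilder`, but it shows how an Oracle-specific builder could rewrite BLOB equality through `dbms_lob.compare` if the operand types were available.

```java
// Hypothetical sketch only: Spark's real visitBinaryComparison receives plain
// strings, which is exactly the limitation described above. The
// operandsAreBlob flag below is an assumption, not Spark's API.
public class OracleComparisonSketch {

    // Oracle cannot compare two BLOBs with '=', so a type-aware builder could
    // rewrite the comparison as dbms_lob.compare(l, r) = 0 instead.
    static String visitBinaryComparison(String name, String l, String r,
                                        boolean operandsAreBlob) {
        if (operandsAreBlob && name.equals("=")) {
            return "dbms_lob.compare(" + l + ", " + r + ") = 0";
        }
        return l + " " + name + " " + r;
    }

    public static void main(String[] args) {
        // BLOB operands get the Oracle-specific rewrite...
        System.out.println(visitBinaryComparison("=", "binary_col", "X'123456'", true));
        // ...everything else keeps the default infix form.
        System.out.println(visitBinaryComparison("=", "id", "1", false));
    }
}
```

With type information in hand, only the BLOB case needs special handling; all other comparisons fall through to the existing string concatenation.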

Contributor Author

All tests passed, but downloading the report failed. We can rerun the whole test run to clear all the checks, but it takes quite some time to finish.
https://github.com/sunxiaoguang/spark/actions/runs/12744731853

}
@@ -82,6 +82,12 @@ private case class DB2Dialect() extends JdbcDialect with SQLConfHelper with NoLe
}
}

override def compileValue(value: Any): Any = value match {
case binaryValue: Array[Byte] =>
binaryValue.map("%02X".format(_)).mkString("BLOB(X'", "", "')")
case other => super.compileValue(other)
}

override def getCatalystType(
sqlType: Int,
typeName: String,
@@ -374,6 +374,7 @@ abstract class JdbcDialect extends Serializable with Logging {
case dateValue: Date => "'" + dateValue + "'"
case dateValue: LocalDate => s"'${DateFormatter().format(dateValue)}'"
case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
case binaryValue: Array[Byte] => binaryValue.map("%02X".format(_)).mkString("X'", "", "'")
case _ => value
}

@@ -45,6 +45,7 @@ private case class MsSqlServerDialect() extends JdbcDialect with NoLegacyJDBCErr
// scalastyle:on line.size.limit
override def compileValue(value: Any): Any = value match {
case booleanValue: Boolean => if (booleanValue) 1 else 0
case binaryValue: Array[Byte] => binaryValue.map("%02X".format(_)).mkString("0x", "", "")
case other => super.compileValue(other)
}

@@ -327,6 +327,12 @@ private case class PostgresDialect()
}
}

override def compileValue(value: Any): Any = value match {
case binaryValue: Array[Byte] =>
binaryValue.map("%02X".format(_)).mkString("'\\x", "", "'::bytea")
case other => super.compileValue(other)
}

override def supportsLimit: Boolean = true

override def supportsOffset: Boolean = true
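Taken together, the dialect overrides in this PR emit four different textual forms for the same bytes. The standalone Java sketch below mirrors the Scala `%02X` formatting from the diffs above (class and method names are ours, not Spark's) and prints the literal each dialect would produce for the bytes 0x12 0x34 0x56:

```java
// Illustrative mirror of the Scala "%02X".format(_) logic in the diffs above;
// not Spark code. Prints the binary literal each dialect would emit.
public class BinaryLiteralDemo {

    // Hex-encode bytes as uppercase two-digit pairs, like "%02X" in Scala.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {0x12, 0x34, 0x56};
        String h = hex(bytes);
        System.out.println("default:   X'" + h + "'");          // JdbcDialect base class
        System.out.println("DB2:       BLOB(X'" + h + "')");    // DB2Dialect
        System.out.println("SQLServer: 0x" + h);                // MsSqlServerDialect
        System.out.println("Postgres:  '\\x" + h + "'::bytea"); // PostgresDialect
    }
}
```

The shared `compileValue` fallback covers engines that accept the standard `X'…'` form (such as H2 in the suite below), while DB2, SQL Server, and PostgreSQL each override it with their native syntax.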
@@ -3097,4 +3097,19 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
assert(rows.contains(Row(null)))
assert(rows.contains(Row("a a a")))
}

test("SPARK-50792 Format binary data as a binary literal in JDBC.") {
Contributor

I think for dialect-specific fixes, it's sufficient to put the tests in V2JDBCTest.scala. This test suite is for testing the shared code paths for all dialects.

Contributor Author
@sunxiaoguang sunxiaoguang Jan 15, 2025

Right, this PR is somewhat complicated: it has changes in shared code paths to support binary literals, and it also has changes in MsSqlServerDialect, OracleDialect, and PostgresDialect. If this test is not necessary, we can remove it.

val tableName = "h2.test.binary_literal"
withTable(tableName) {
// Create a table with a binary column
val binary = "X'123456'"

sql(s"CREATE TABLE $tableName (binary_col BINARY)")
sql(s"INSERT INTO $tableName VALUES ($binary)")

val select = s"SELECT * FROM $tableName WHERE binary_col = $binary"
Contributor

Let's verify that this is actually pushed down, e.g. verify that there is no FilterExec present in the plan.

Contributor Author

Verification added, PTAL. Thanks

assert(spark.sql(select).collect().length === 1, s"Binary literal test failed: $select")
}
}

}