Comet Fuzz

Comet Fuzz is a standalone project for generating random data and queries, executing the queries against Spark with Comet disabled and then enabled, and checking for incompatibilities between the two runs.

Although it is a simple tool, it has already proven useful in finding many bugs.

Comet Fuzz is inspired by the SparkFuzz paper from Databricks and CWI.

Roadmap

Planned areas of improvement:

  • ANSI mode
  • Support for all data types, expressions, and operators supported by Comet
  • IF and CASE WHEN expressions
  • Complex (nested) expressions
  • Literal scalar values in queries
  • Add option to avoid grouping and sorting on floating-point columns
  • Improve join query support:
    • Support joins without join keys
    • Support composite join keys
    • Support multiple join keys
    • Support join conditions that use expressions

Usage

Build the jar file first.

mvn package
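
After a successful build, the jar with bundled dependencies should appear under target/. The exact filename depends on the Spark and Scala versions the project is built against; one way to locate it:

ls target/comet-fuzz-*-jar-with-dependencies.jar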

Set appropriate values for the SPARK_HOME, SPARK_MASTER, and COMET_JAR environment variables, then use spark-submit to run Comet Fuzz against a Spark cluster.
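
For example, for a local standalone cluster the settings might look like the following (the paths and URL below are placeholders; substitute your own installation locations):

export SPARK_HOME=/opt/spark
export SPARK_MASTER=spark://localhost:7077
export COMET_JAR=/path/to/comet.jar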

Generating Data Files

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --class org.apache.comet.fuzz.Main \
    target/comet-fuzz-spark3.4_2.12-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
    data --num-files=2 --num-rows=200 --num-columns=100

There is an optional --exclude-negative-zero flag for excluding -0.0 from the generated data. This is sometimes useful because Rust and Java handle this value differently, so we already know that this edge case often produces different behavior.
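
For example, the same data-generation command as above with the flag appended:

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --class org.apache.comet.fuzz.Main \
    target/comet-fuzz-spark3.4_2.12-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
    data --num-files=2 --num-rows=200 --num-columns=100 --exclude-negative-zero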

Generating Queries

Generate random queries based on the test data files generated in the previous step.

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --class org.apache.comet.fuzz.Main \
    target/comet-fuzz-spark3.4_2.12-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
    queries --num-files=2 --num-queries=500

Note that the output filename is currently hard-coded as queries.sql.
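
Because the filename is fixed, one simple way to keep several generated query sets around is to rename the file after each run; the run step below accepts any filename via --filename:

mv queries.sql queries-$(date +%Y%m%d-%H%M%S).sql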

Executing Queries

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    --conf spark.comet.exec.all.enabled=true \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --class org.apache.comet.fuzz.Main \
    target/comet-fuzz-spark3.4_2.12-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
    run --num-files=2 --filename=queries.sql

Note that the output filename is currently hard-coded as results-${System.currentTimeMillis()}.md.
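
Since each run writes a new timestamped report, a quick way to find the most recent one is:

ls -t results-*.md | head -n 1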