A blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

BLAZE

The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing. It combines the power of the Apache Arrow-DataFusion library and the scale of the Spark distributed computing framework.

Blaze takes a fully optimized physical plan from Spark, maps it into DataFusion's execution plan, and performs the native plan computation in Spark executors.

Blaze is composed of the following high-level components:

  • Spark Extension: hooks the whole accelerator into Spark's execution lifecycle.
  • Spark Shims: version-specific code for different versions of Spark.
  • Native Engine: implements the native engine in Rust, including:
    • ExecutionPlan protobuf specification
    • JNI gateway
    • Customized operators, expressions, functions

Based on the inherent well-defined extensibility of DataFusion, Blaze can be easily extended to support:

  • Various object stores.
  • Operators.
  • Simple and Aggregate functions.
  • File formats.

We encourage you to extend DataFusion's capabilities directly and then add support in Blaze with simple modifications to plan-serde and extension translation.

Build from source

To build Blaze, please follow the steps below:

  1. Install Rust

The native execution library is written in Rust, so you need to install Rust (nightly) before compiling. We recommend using rustup.

  2. Install JDK+Maven

Blaze has been well tested on JDK 8 and Maven 3.5, and should work fine with higher versions.

  3. Check out the source code.
git clone git@github.com:blaze-init/blaze.git
cd blaze
  4. Build the project.

Specify the shims package for the Spark version you would like to run on. The following shims are currently supported:

  • spark303 - for spark3.0.x
  • spark313 - for spark3.1.x
  • spark324 - for spark3.2.x
  • spark333 - for spark3.3.x
  • spark351 - for spark3.5.x

You can build Blaze either in dev mode (the pre profile) for debugging or in release mode to unlock its full potential.

SHIM=spark333 # or spark303/spark313/spark320/spark324/spark333/spark351
MODE=release # or pre
mvn package -P"${SHIM}" -P"${MODE}"

After the build finishes, a fat JAR containing all the dependencies is generated in the target directory.
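
The build steps above can be wrapped in a few small shell helpers that check prerequisites and validate the shim and mode before Maven is invoked. This is only a sketch: the helper names (`missing_tools`, `in_list`, `blaze_build_cmd`) are hypothetical and not part of Blaze, and the shim list is copied from the bullets above, so adjust it to match your checkout.

```shell
# Sketch only: these helpers are illustrative and not shipped with Blaze.
# Shim and mode lists are copied from this README; edit them as needed.
SUPPORTED_SHIMS="spark303 spark313 spark324 spark333 spark351"
SUPPORTED_MODES="release pre"

# Print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Return 0 if $1 appears in the space-separated list $2.
in_list() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the Maven command for a shim and mode (mode defaults to release),
# or print an error to stderr and return 1 if either value is unknown.
blaze_build_cmd() {
  shim="$1"
  mode="${2:-release}"
  in_list "$shim" "$SUPPORTED_SHIMS" || { echo "unknown shim: $shim" >&2; return 1; }
  in_list "$mode" "$SUPPORTED_MODES" || { echo "unknown mode: $mode" >&2; return 1; }
  echo "mvn package -P$shim -P$mode"
}
```

Running `missing_tools rustc cargo mvn java` lists anything that still needs installing, and `blaze_build_cmd spark333 release` prints the Maven command so it can be inspected before it is run.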

Build with docker

You can use the following command to build a CentOS 7-compatible release:

SHIM=spark333 MODE=release ./release-docker.sh

Run Spark Job with Blaze Accelerator

This section describes how to submit and configure a Spark Job with Blaze support.

  1. Move the Blaze JAR package to the Spark client classpath (normally spark-xx.xx.xx/jars/).

  2. Add the following settings to the Spark configuration in spark-xx.xx.xx/conf/spark-defaults.conf:

spark.blaze.enable true
spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
spark.memory.offHeap.enabled false

# suggested executor memory configuration
spark.executor.memory 4g
spark.executor.memoryOverhead 4096
  3. Submit a query with spark-sql, or with other tools such as spark-thriftserver:
spark-sql -f tpcds/q01.sql
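
If you would rather not edit the shared configuration file, the same settings can be passed per job using Spark's standard `--conf key=value` flag. The sketch below emits the settings from step 2 in that form. The helper names `blaze_conf_args` and `blaze_sql_cmd` are hypothetical, and the memory values are only the suggestions given above.

```shell
# Sketch only: emit the settings from step 2 as --conf flags.
# Keys and values are copied from the configuration above.
blaze_conf_args() {
  cat <<'EOF'
--conf spark.blaze.enable=true
--conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension
--conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
--conf spark.memory.offHeap.enabled=false
--conf spark.executor.memory=4g
--conf spark.executor.memoryOverhead=4096
EOF
}

# Print the full spark-sql command line for a query file without running it.
blaze_sql_cmd() {
  echo "spark-sql $(blaze_conf_args | tr '\n' ' ')-f $1"
}
```

Because none of the values contain whitespace, the flags can be spliced in unquoted, for example `spark-sql $(blaze_conf_args) -f tpcds/q01.sql`.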

Performance

See the most recent Benchmark Results for a performance comparison with vanilla Spark 3.3.3. The results show that Blaze saves about 50% of query time on the TPC-DS and TPC-H 1TB datasets. Stay tuned and join us for more numbers to come.

TPC-DS query time (see "How can I run TPC-DS benchmark?"): [figure: 20240701-query-time-tpcds]

TPC-H query time: [figure: 20240701-query-time-tpch]

We also encourage you to benchmark Blaze and share the results with us. 🤗
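
For a quick comparison of your own, a small timing wrapper is often enough. The sketch below is not the method used for the published benchmarks. `time_cmd` is a hypothetical helper that measures wall-clock time in whole seconds, which is coarse but usually adequate for multi-minute queries.

```shell
# Sketch only: run a command and print its wall-clock time in whole seconds.
# The command's own output is discarded; its exit status is passed through.
time_cmd() {
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  status=$?
  end=$(date +%s)
  echo $((end - start))
  return $status
}
```

For example, `time_cmd spark-sql -f tpcds/q01.sql` prints the elapsed seconds for one query. Running the same command against a Spark installation without the Blaze settings gives a baseline to compare against.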

Community

We're using Discussions to connect with other members of our community. We hope that you:

  • Ask questions you're wondering about.
  • Share ideas.
  • Engage with other community members.
  • Welcome others who are open-minded. Remember that this is a community we build together 💪 .

License

Blaze is licensed under the Apache 2.0 License. A copy of the license can be found here.