Quick Start

Spark SST Data Source lets you use Spark to decode SST files generated by a RawKV backup into key-value pairs.

Install tikv-client-java

git clone git@github.com:tikv/client-java.git
mvn --file client-java/pom.xml clean install -DskipTests

Build sst-data-source project

git clone git@github.com:tikv/migration.git
cd migration
mvn clean package -DskipTests -am -pl sst-data-source

Export SST

br backup raw \
--pd 127.0.0.1:2379 \
--storage "hdfs:///path/to/sst/" \
--start s \
--end t \
--format raw \
--cf default

Run SSTDataSourceExample

spark-submit \
--master local[*] \
--jars /path/to/tikv-client-java-3.3.0-SNAPSHOT.jar \
--class org.tikv.datasources.sst.example.SSTDataSourceExample \
sst-data-source/target/sst-data-source-0.0.1-SNAPSHOT.jar \
hdfs:///path/to/sst/

Call Spark SST Data Source

You can also write a self-contained application to decode SST files.

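  import org.apache.spark.sql.SparkSession

  def main(args: Array[String]): Unit = {
    // Create (or reuse) the Spark session; the master is supplied by spark-submit.
    val spark = SparkSession.builder().appName("SSTDataSourceExample").getOrCreate()
    val sstFilePath = "hdfs:///path/to/sst/"
    // Load the SST files through the "sst" data source.
    val df = spark.read
      .format("sst")
      .load(sstFilePath)
    df.printSchema()
    // Print the number of decoded key-value pairs.
    println(df.count())
    df.show(false)
  }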

The output of df.printSchema() is as follows:

root
 |-- key: binary (nullable = false)
 |-- value: binary (nullable = true)
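
Because both columns are binary, `df.show(false)` prints raw byte arrays. The following is a minimal sketch of one way to make them readable, using Spark's built-in `hex` function and a cast to string. The object name `ReadableSst`, the helper `readable`, and the output column names are hypothetical, and the cast assumes the stored values are UTF-8 text, which may not hold for your data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hex}

object ReadableSst {
  // Adds printable views of the binary columns:
  // keys as upper-case hex, values interpreted as UTF-8 text (an assumption).
  def readable(df: DataFrame): DataFrame =
    df.select(
      hex(col("key")).as("key_hex"),
      col("value").cast("string").as("value_utf8")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadableSst").getOrCreate()
    // Placeholder path, as in the example above.
    val df = spark.read.format("sst").load("hdfs:///path/to/sst/")
    readable(df).show(false)
    spark.stop()
  }
}
```

If the values are not text, applying `hex` to the `value` column as well is the safer choice.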

Parameters

| Key | Default Value | Description |
| --- | --- | --- |
| path | - | The path to the SST files, e.g. hdfs:///path/to/sst/ |
| enable-ttl | false | Whether TTL is enabled on the TiKV cluster |
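
As a sketch of how these parameters might be supplied, the example below passes `enable-ttl` through Spark's standard reader options. It assumes the data source reads the keys listed above from the `DataFrameReader` options, which this document does not state explicitly. The names `SstWithTtl`, `sstOptions`, and `loadSst` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SstWithTtl {
  // Builds the reader options; "enable-ttl" is the key from the Parameters list.
  def sstOptions(enableTtl: Boolean): Map[String, String] =
    Map("enable-ttl" -> enableTtl.toString)

  // Assumption: the "sst" data source picks up "enable-ttl" from the reader options.
  def loadSst(spark: SparkSession, path: String, enableTtl: Boolean): DataFrame =
    spark.read.format("sst").options(sstOptions(enableTtl)).load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SstWithTtl").getOrCreate()
    // Placeholder path; set enableTtl to match the TiKV cluster that produced the backup.
    val df = loadSst(spark, "hdfs:///path/to/sst/", enableTtl = true)
    df.show(false)
    spark.stop()
  }
}
```

The `path` parameter is passed as the argument to `load`, as in the earlier example.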

Spark Version

The default Spark version is 3.0.2. To build against a different Spark version, set spark.version.compile when compiling, for example for Spark 3.1.1:

mvn clean package -DskipTests -Dspark.version.compile=3.1.1

Develop

To format the code, please run mvn mvn-scalafmt_2.12:format or mvn clean package -DskipTests.

Documents