With the inclusion of the Cassandra Data Source, PySpark can now be used with the Connector to access Cassandra data. This does not require DataStax Enterprise, but you are limited to DataFrame-only operations.
To enable Cassandra access, the Spark Cassandra Connector assembly jar must be included on both the driver and executor classpath for the PySpark Java Gateway. This can be done by starting the PySpark shell in the same way the Spark shell is started. The preferred method is now to use the Maven artifact.
./bin/pyspark \
--packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
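When a script creates its own session instead of using the shell, the same two settings can be supplied as Spark properties. The following is a minimal sketch rather than the Connector's documented method: the helper names `connector_conf` and `build_session` are hypothetical, and it assumes the standard Spark property `spark.jars.packages` as the configuration counterpart of `--packages`. These properties need to be set before the session and its JVM are started.

```python
# Maven coordinates and extension class copied from the shell command above.
CONNECTOR_PACKAGE = "com.datastax.spark:spark-cassandra-connector_2.12:3.2.0"
CONNECTOR_EXTENSIONS = "com.datastax.spark.connector.CassandraSparkExtensions"


def connector_conf(package=CONNECTOR_PACKAGE, extensions=CONNECTOR_EXTENSIONS):
    """Return Spark properties mirroring the --packages and --conf flags (hypothetical helper)."""
    return {
        "spark.jars.packages": package,
        "spark.sql.extensions": extensions,
    }


def build_session(app_name="cassandra-example"):
    """Create a SparkSession with the connector settings applied.

    pyspark is imported lazily so connector_conf can be used without Spark installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in connector_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session()` requires a working PySpark installation and network access to resolve the Maven artifact.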
Spark allows you to manipulate external data with and without a Catalog. For a short intro and more details about Catalogs see Quick Start and Data Frames.
Loading a data set with DatasourceV2 requires creating a Catalog Reference to your Cassandra Cluster.
spark.conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.read.table("myCatalog.myKs.myTab").show()
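The catalog name in the example is a label of your choosing. It only has to match between the configuration key and the table identifier. A small sketch of that relationship follows. The helper names `catalog_conf` and `qualified_table` are hypothetical and not part of the Connector. Only the `spark.sql.catalog.<name>` key and the `CassandraCatalog` class come from the example above.

```python
CASSANDRA_CATALOG_CLASS = "com.datastax.spark.connector.datasource.CassandraCatalog"


def catalog_conf(catalog_name):
    """Return the (key, value) pair that registers a Cassandra catalog under catalog_name."""
    return ("spark.sql.catalog." + catalog_name, CASSANDRA_CATALOG_CLASS)


def qualified_table(catalog_name, keyspace, table):
    """Build the three-part identifier catalog.keyspace.table."""
    return ".".join([catalog_name, keyspace, table])


key, value = catalog_conf("myCatalog")
table_id = qualified_table("myCatalog", "myKs", "myTab")
# With a live session: spark.conf.set(key, value); spark.read.table(table_id).show()
```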
A DataFrame can be saved to an existing Cassandra table by using the saveAsTable method with a catalog, keyspace, and table name specified.
spark.range(1, 10)\
.selectExpr("id as k")\
.write\
.mode("append")\
.partitionBy("k")\
.saveAsTable("myCatalog.myKs.myTab")
A DataFrame can be created which links to Cassandra by using the org.apache.spark.sql.cassandra source and by specifying keyword arguments for keyspace and table.
spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
+-+-+
|k|v|
+-+-+
|5|5|
|1|1|
|2|2|
|4|4|
|3|3|
+-+-+
A DataFrame can be saved to an existing Cassandra table by using the org.apache.spark.sql.cassandra source and by specifying keyword arguments for keyspace and table, along with the saving mode (append, overwrite, error, or ignore; see the Data Sources API doc).
df.write\
.format("org.apache.spark.sql.cassandra")\
.mode('append')\
.options(table="kv", keyspace="test")\
.save()
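For repeated writes, the call chain above can be wrapped in a small function that rejects an unexpected mode before anything is handed to Spark. This is only a sketch: `check_save_mode` and `save_to_cassandra` are hypothetical names, and the accepted modes are deliberately limited to the four listed above, even though Spark itself may accept other spellings.

```python
SAVE_MODES = ("append", "overwrite", "error", "ignore")


def check_save_mode(mode):
    """Return mode unchanged if it is one of the four listed modes, else raise ValueError."""
    if mode not in SAVE_MODES:
        raise ValueError(
            "unsupported save mode: %r (expected one of %s)" % (mode, ", ".join(SAVE_MODES))
        )
    return mode


def save_to_cassandra(df, keyspace, table, mode="append"):
    """Write df to an existing Cassandra table through the DataFrame writer."""
    (df.write
        .format("org.apache.spark.sql.cassandra")
        .mode(check_save_mode(mode))
        .options(table=table, keyspace=keyspace)
        .save())
```

With a live session, `save_to_cassandra(df, "test", "kv")` would perform the same append as the example above.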
The options and parameters are identical to the Scala Data Frames API, so please see Data Frames for more information.
Python does not support using periods (".") in variable names. This makes it
slightly more difficult to pass SCC options to the DataFrameReader. The options
function takes **kwargs, which means you can't pass keys containing periods
directly as keyword arguments. There is a workaround though: Python allows you
to unpack a dictionary into **kwargs, and dictionary keys can contain periods.
load_options = { "table": "kv", "keyspace": "test", "spark.cassandra.input.split.size_in_mb": "10"}
spark.read.format("org.apache.spark.sql.cassandra").options(**load_options).load().show()
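The unpacking behaviour can be checked without Spark. In this sketch, `collect_options` is a hypothetical stand-in for any function that, like options, accepts `**kwargs`. It is not part of PySpark.

```python
def collect_options(**kwargs):
    """Stand-in for a **kwargs-taking function; returns the options it received."""
    return dict(kwargs)


load_options = {
    "table": "kv",
    "keyspace": "test",
    "spark.cassandra.input.split.size_in_mb": "10",
}

# collect_options(spark.cassandra.input.split.size_in_mb="10") would be a SyntaxError,
# but unpacking a dictionary passes the dotted key through untouched.
received = collect_options(**load_options)
```

Another route that avoids the problem is the singular `option(key, value)` method on the reader, which takes the key as an ordinary string.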