Getting Started with Apache Spark and Apache Polaris

This getting started guide provides a docker-compose file to set up Apache Spark with Apache Polaris. Apache Polaris is configured as an Iceberg REST Catalog in Spark. A Jupyter notebook is used to run PySpark.

Run the docker-compose file

To start the docker-compose file, run this command from the repo's root directory:

docker-compose -f getting-started/spark/docker-compose.yml up 

This will spin up two container services (a quick connectivity check is sketched after the list):

  • The polaris service for running Apache Polaris using an in-memory metastore
  • The jupyter service for running Jupyter notebook with PySpark
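Before moving on, you can optionally confirm that both containers are reachable from the host. The sketch below is a minimal check, not part of the guide's required steps; it assumes the compose file maps Polaris to localhost:8181 and Jupyter to localhost:8888, so adjust the ports if your docker-compose.yml differs.

```python
# Minimal sketch: verify the two services are listening on the host.
# The port numbers are assumptions; check the port mappings in
# getting-started/spark/docker-compose.yml if they differ.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in [("polaris", 8181), ("jupyter", 8888)]:
    status = "up" if port_open("127.0.0.1", port) else "not reachable"
    print(f"{name}: {status}")
```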

Access the Jupyter notebook interface

In the Jupyter notebook container log, look for the URL to access the Jupyter notebook. The URL should be in the format http://127.0.0.1:8888/lab?token=<token>.

Open the Jupyter notebook in a browser and navigate to notebooks/SparkPolaris.ipynb.

Change the Polaris credential

The Polaris service creates a new root credential on startup. Find this credential in the Polaris service log and use it to update the polaris_credential variable in the first cell of the Jupyter notebook.
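The exact contents of that first cell depend on the notebook version, but as a rough sketch, assuming the log prints the root principal's credential as a client ID and client secret pair, the variable might be set like this (the values are placeholders, not real credentials):

```python
# Hypothetical sketch of the notebook's first cell. Replace the placeholders
# with the client ID and client secret printed in the Polaris service log.
polaris_credential = "<client-id>:<client-secret>"

# Later cells can split the pair apart, e.g. for an OAuth token request.
client_id, client_secret = polaris_credential.split(":")
```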

Run the Jupyter notebook

You can now run all cells in the notebook or write your own code!
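If you want to write your own cells, the sketch below shows roughly how a PySpark session can be pointed at Polaris through the Iceberg REST catalog API. Everything in it is an assumption to adapt to your setup: the Polaris URI (http://polaris:8181/api/catalog, i.e. the compose service name and a default port), the catalog name polaris_demo (which must already exist in Polaris), the iceberg-spark-runtime version, the credential placeholder, and the demo namespace and table names.

```python
# Sketch of a PySpark cell that talks to Polaris through the Iceberg REST
# catalog API. URI, catalog name, package version, and credential are
# assumptions; adjust them to match your environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("polaris-demo")
    # Iceberg runtime and SQL extensions (version must match your Spark/Scala build).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a Spark catalog named "polaris" backed by the Polaris REST endpoint.
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
    # Warehouse is the name of a catalog created in Polaris beforehand.
    .config("spark.sql.catalog.polaris.warehouse", "polaris_demo")
    # Root principal credential from the Polaris log, as "client_id:client_secret".
    .config("spark.sql.catalog.polaris.credential", "<client-id>:<client-secret>")
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    .getOrCreate()
)

# Simple smoke test: create a namespace and a table, insert a row, read it back.
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.demo_ns")
spark.sql("""
    CREATE TABLE IF NOT EXISTS polaris.demo_ns.toy (id BIGINT, name STRING)
    USING iceberg
""")
spark.sql("INSERT INTO polaris.demo_ns.toy VALUES (1, 'hello polaris')")
spark.sql("SELECT * FROM polaris.demo_ns.toy").show()
```

The provided notebook performs a similar setup in its own cells, so treat this sketch only as a reference if you want to experiment with your own configuration.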