Sample command-line programs for interacting with the Cloud Dataproc API.
Note that while this sample demonstrates interacting with Dataproc via the API, the functionality demonstrated here could also be accomplished using the Cloud Console or the gcloud CLI.
list_clusters.py is a simple command-line program to demonstrate connecting to the Dataproc API and listing the clusters in a region.
create_cluster_and_submit_job.py demonstrates how to create a cluster, submit the pyspark_sort.py job, download the output from Google Cloud Storage, and output the result.
Go to the Google Cloud Console.
Under API Manager, search for the Google Cloud Dataproc API and enable it.
To install, run the following commands. If you want to use virtualenv (recommended), run the commands within a virtualenv.
* pip install -r requirements.txt
Create local credentials by running the following command and following the OAuth2 flow:
gcloud beta auth application-default login
To run list_clusters.py:
python list_clusters.py --project_id=<your-project-id> --zone=us-central1-b
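The command takes a zone, while clusters are listed per region. A zone name is its region name followed by a letter suffix, so the region can be derived from the zone. A minimal sketch, assuming that naming convention; `region_from_zone` is a hypothetical helper name and not necessarily what list_clusters.py uses:

```python
def region_from_zone(zone):
    """Return the region for a Compute Engine zone name.

    Assumes zone names have the form '<region>-<letter>',
    e.g. 'us-central1-b' is in region 'us-central1'.
    """
    # Split on the last hyphen: everything before it is the region.
    region, sep, suffix = zone.rpartition('-')
    if not sep or not region or not suffix:
        raise ValueError('not a zone name: {!r}'.format(zone))
    return region


print(region_from_zone('us-central1-b'))  # us-central1
```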
To run create_cluster_and_submit_job.py, first create a Google Cloud Storage (GCS) bucket from the Cloud Console or with gsutil:
gsutil mb gs://<your-input-bucket-name>
Then run:
python create_cluster_and_submit_job.py --project_id=<your-project-id> --zone=us-central1-b --cluster_name=testcluster --gcs_bucket=<your-input-bucket-name>
This will set up a cluster, upload the PySpark file, submit the job, print the result, then delete the cluster.
You can optionally specify a --pyspark_file argument to run your own script instead of the default pyspark_sort.py included with this sample.
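The create and submit steps each send a JSON request body to the Dataproc API. A minimal sketch of how those bodies might be built, assuming the Dataproc v1 REST field names (`projectId`, `clusterName`, `config.gceClusterConfig.zoneUri`, `job.placement.clusterName`, `job.pysparkJob.mainPythonFileUri`). The function names are hypothetical, and the sample script may structure this differently:

```python
def cluster_body(project_id, zone, cluster_name):
    """Build a request body for creating a cluster (assumed v1 field names)."""
    zone_uri = (
        'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'
        .format(project_id, zone)
    )
    return {
        'projectId': project_id,
        'clusterName': cluster_name,
        'config': {
            'gceClusterConfig': {'zoneUri': zone_uri},
        },
    }


def pyspark_job_body(project_id, cluster_name, gcs_bucket,
                     filename='pyspark_sort.py'):
    """Build a request body for submitting a PySpark job stored in GCS."""
    return {
        'projectId': project_id,
        'job': {
            # Run the job on the cluster created earlier.
            'placement': {'clusterName': cluster_name},
            'pysparkJob': {
                'mainPythonFileUri': 'gs://{}/{}'.format(gcs_bucket, filename),
            },
        },
    }
```

Keeping the bodies in small pure functions makes them easy to inspect or print before any API call is made.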
On Google App Engine, the credentials should be found automatically.
On Google Compute Engine, the credentials should also be found automatically, but only if you create the instance with the correct scopes:
gcloud compute instances create --scopes="https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/compute,https://www.googleapis.com/auth/compute.readonly" test-instance
If you did not create the instance with the right scopes, you can still upload a JSON service account key and set GOOGLE_APPLICATION_CREDENTIALS to its path. See Google Application Default Credentials for more details.
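Before running the samples, it can help to confirm whether a key file is configured through the environment. A rough sketch using only the standard library; `credential_key_path` is a hypothetical helper, and it only inspects `GOOGLE_APPLICATION_CREDENTIALS` rather than performing the full Application Default Credentials lookup:

```python
import os


def credential_key_path(environ=None):
    """Return the configured key file path, or None if the variable is unset.

    Raises FileNotFoundError if the variable is set but the file is missing.
    """
    environ = os.environ if environ is None else environ
    path = environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    if not path:
        # Nothing set: credentials would have to come from gcloud,
        # App Engine, or the Compute Engine instance itself.
        return None
    if not os.path.isfile(path):
        raise FileNotFoundError(
            'GOOGLE_APPLICATION_CREDENTIALS points to a missing file: '
            + path)
    return path
```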