This initialization action installs Apache Drill on a Google Cloud Dataproc cluster. The script will also start drillbits on all nodes of the cluster.
Check the variables set in the script to ensure they're to your liking.
- Use the gcloud command to create a new cluster with Drill installed. Run one of the following commands, depending on your desired cluster type.
Standard cluster (requires Zookeeper init action)
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/zookeeper/zookeeper.sh,gs://goog-dataproc-initialization-actions-${REGION}/drill/drill.sh
High availability cluster (Zookeeper comes pre-installed)
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --num-masters 3 \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/drill/drill.sh
Single node cluster (Zookeeper is unnecessary)
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --single-node \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/drill/drill.sh
- Once the cluster has been created, drillbits will start on all nodes. You can log into any node of the cluster to run Drill queries. Drill is installed in /usr/lib/drill (unless you change the setting), which contains a bin directory with sqlline.
You can run the following to get into sqlline, the Drill CLI query tool:
/usr/lib/drill/bin/sqlline -u jdbc:drill:
Once in sqlline, you can see which storage plugins are available. Out of the box, this initialization action supports GCS (gs), HDFS (hdfs), the local Linux file system (dfs), and Hive (hive):
$ /usr/lib/drill/bin/sqlline -u jdbc:drill:
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
apache drill 1.9.0
"just drill it"
0: jdbc:drill:> show databases;
+---------------------+
| SCHEMA_NAME |
+---------------------+
| INFORMATION_SCHEMA |
| cp.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| gs.default |
| gs.root |
| hdfs.default |
| hdfs.root |
| hdfs.tmp |
| hive.default |
| sys |
+---------------------+
12 rows selected (3.943 seconds)
To use Drill on a single node cluster, run /usr/lib/drill/bin/drill-embedded, or run sqlline with zk=local (/usr/lib/drill/bin/sqlline -u jdbc:drill:zk=local).
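The choice between the two sqlline connection strings can be scripted. The following is a minimal sketch, not part of the initialization action: the function name drill_jdbc_url, the CLUSTER_TYPE variable, and the file /tmp/query.sql are hypothetical, and it assumes the default install path /usr/lib/drill. It relies on sqlline's --run option, which executes a script file and then exits, and on the employee.json sample data that Drill ships on its classpath (the cp plugin).

```shell
#!/usr/bin/env bash
# Sketch only: drill_jdbc_url, CLUSTER_TYPE and /tmp/query.sql are
# hypothetical names, not something drill.sh defines.
DRILL_HOME="${DRILL_HOME:-/usr/lib/drill}"   # default install path from this README

# Print the JDBC URL for a cluster type: "single" connects without an
# external Zookeeper (zk=local); anything else uses the cluster's Zookeeper.
drill_jdbc_url() {
  case "$1" in
    single) echo "jdbc:drill:zk=local" ;;
    *)      echo "jdbc:drill:" ;;
  esac
}

# Only run a query when sqlline is actually installed on this machine.
if [ -x "${DRILL_HOME}/bin/sqlline" ]; then
  # Write a one-line query against Drill's bundled sample data.
  echo 'SELECT full_name FROM cp.`employee.json` LIMIT 3;' > /tmp/query.sql
  "${DRILL_HOME}/bin/sqlline" \
    -u "$(drill_jdbc_url "${CLUSTER_TYPE:-standard}")" \
    --run=/tmp/query.sql
fi
```

On a single node cluster you would set CLUSTER_TYPE=single before running the script; on standard and high availability clusters the default applies.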
- This script must be updated based on which Drill version you wish to install
- This script must be updated based on your Cloud Dataproc cluster
- Access to the Drill UI is possible via SSH forwarding to port 8047 on any node, or with a SOCKS proxy via SSH.
- By default, your Drill query profiles are stored in GCS in your cluster's Dataproc bucket, as returned by /usr/share/google/get_metadata_value attributes/dataproc-bucket. You can change this in drill.sh.
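For the SSH forwarding mentioned above, one possible approach is gcloud compute ssh with ssh arguments passed after "--". This is a sketch under assumptions, not something the initialization action provides: drill_ui_tunnel_cmd is a hypothetical helper, my-cluster and us-central1-a are placeholder values, and the node name assumes the usual Dataproc convention of <cluster_name>-m for the master of a standard or single node cluster. The helper only prints the command so you can review it before running it.

```shell
#!/usr/bin/env bash
# Sketch only: drill_ui_tunnel_cmd is a hypothetical helper. It prints the
# tunnel command instead of running it.
drill_ui_tunnel_cmd() {
  node="$1"   # node to connect to, e.g. "<cluster_name>-m"
  zone="$2"   # Compute Engine zone of that node
  # Arguments after "--" go to ssh: -N opens no remote shell, and
  # -L forwards local port 8047 to port 8047 (the Drill UI) on the node.
  echo "gcloud compute ssh ${node} --zone ${zone} -- -N -L 8047:localhost:8047"
}

# Placeholder defaults; override CLUSTER_NAME and ZONE for your cluster.
drill_ui_tunnel_cmd "${CLUSTER_NAME:-my-cluster}-m" "${ZONE:-us-central1-a}"
```

While the printed command is running, the Drill UI should be reachable at http://localhost:8047 on your local machine.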