- Overview
- Quick Links
- Quick Start
- Prerequisites
- Getting Started
- Getting Help
- Docker Image Management
- Interact with Hive on Spark
- Only Need Spark?
- Web Interfaces
Quick and easy way to get Hive on Spark (on YARN) with Docker. See Apache Hive on Spark docs for more information.
NOTE: Now with Livy support.
Lots happening here, but in short this repository will build you a Docker image that allows you to run Hive with Spark as the compute engine. Spark itself uses YARN as the resource manager, which we leverage from the underlying Hadoop install. See documentation on the Hive base Docker image for details on how Hadoop/YARN has been configured.
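As a rough illustration of the Livy support noted above, a Spark session can be created and inspected over Livy's REST API. This assumes the container publishes Livy's default port 8998 to the host; adjust the host and port for your setup.

# Illustrative only: create an interactive PySpark session via the Livy REST API.
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"kind": "pyspark"}' http://localhost:8998/sessions

# List Livy sessions and their state.
curl -s http://localhost:8998/sessions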
Impatient and just want Hive on Spark quickly?
docker run --rm -d --name hive-on-spark loum/hive-on-spark:latest
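The quick-start command above does not publish any container ports, so nothing is reachable from the host. If you want to reach HiveServer2 or the web UIs listed below, a sketch with explicit port mappings (assuming the image exposes these ports) could look like:

# Illustrative only: publish HiveServer2 (10000), its web UI (10002),
# the YARN ResourceManager UI (8088) and the Spark History Server UI (18080).
docker run --rm -d --name hive-on-spark \
  -p 10000:10000 -p 10002:10002 -p 8088:8088 -p 18080:18080 \
  loum/hive-on-spark:latest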
Get the code and change into the top-level git project directory:
git clone https://github.com/loum/hive-on-spark.git && cd hive-on-spark
NOTE: Run all commands from the top-level directory of the git repository.
For first-time setup, get the Makester project:
git submodule update --init
Keep Makester project up-to-date with:
make submodule-update
Setup the environment:
make init
There should be a make target to get most things done. Check the help for more information:
make help
The image build compiles Spark from scratch to ensure we get the correct version without the YARN libraries. More info can be found at the Spark build page.
To build the Docker image:
make build-image
Search for existing Docker image tags with the command:
make search-image
By default, makester will tag the new Docker image with the current branch hash. This provides a degree of uniqueness but is not very intuitive. That's where the tag-version Makefile target can help. To apply a tag as per the project tagging convention <hive-version>-<spark-version>-<image-release-number>:
make tag-version
To tag the image as latest:
make tag-latest
To start the container and wait for all Hadoop services to initiate:
make controlled-run
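As a quick sanity check that the Hadoop/YARN daemons came up, one option is to list the Java processes inside the running container. This assumes the container is named hive-on-spark (as in the quick-start command) and that a JDK with jps is on its PATH.

# Illustrative check: NameNode, DataNode, ResourceManager and NodeManager
# should all appear once the container has settled.
docker exec hive-on-spark jps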
To stop the container:
make stop
To login to the running container:
make login
NOTE: Check the Beeline Command Reference for more information.
Login to beeline (!q to exit the CLI):
make beeline
Create a Hive table named test:
make beeline-create
To show tables:
make beeline-show
To insert a row of data into Hive table test:
NOTE: This will invoke the Spark execution engine through YARN.
make beeline-insert
To select all rows in Hive table test:
make beeline-select
To drop the Hive table test:
make beeline-drop
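For reference, the beeline-* targets map onto ordinary HiveQL. A purely illustrative equivalent, driven non-interactively through Beeline's -e flag from inside the container, might look like the following; the actual schema used by the targets may differ.

# Illustrative HiveQL only -- the real test table definition may differ.
beeline -u jdbc:hive2://localhost:10000 -e "CREATE TABLE test (col1 INT, col2 STRING);"
beeline -u jdbc:hive2://localhost:10000 -e "SHOW TABLES;"
beeline -u jdbc:hive2://localhost:10000 -e "INSERT INTO test VALUES (1, 'hello');"
beeline -u jdbc:hive2://localhost:10000 -e "SELECT * FROM test;"
beeline -u jdbc:hive2://localhost:10000 -e "DROP TABLE test;"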
Alternatively, port 10000 is exposed to allow connectivity from JDBC clients.
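For example, with Beeline (or any other JDBC client) installed on the host, and assuming port 10000 is published as sketched in the docker run example above, a connection looks like:

# Connect from the host to HiveServer2 over JDBC.
beeline -u "jdbc:hive2://localhost:10000"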
The Spark computing system is available and can be invoked as normal. More information on submitting applications to Spark can be found here.
The sample SparkPi application can be launched with:
make pi
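Under the hood this is the standard SparkPi example submitted to YARN in cluster mode. A rough manual equivalent from inside the container (the exact examples jar path is an assumption and may vary between Spark builds) would be:

# Illustrative spark-submit for the bundled SparkPi example.
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100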
Apart from some verbose logging displayed on the console, it may appear that not much has happened here. However, since the Spark application has been deployed in cluster mode, you will need to dump the log for the associated application ID to see meaningful output. To get a list of Spark application logs (under YARN):
make yarn-apps
Then plug an Application-Id into:
make yarn-app-log YARN_APPLICATION_ID=<Application-Id>
You should see something similar to the following:
====================================================================
LogType:stdout
LogLastModifiedTime:Sat Apr 11 21:49:03 +0000 2020
LogLength:33
LogContents:
Pi is roughly 3.1398156990784956
End of LogType:stdout
***********************************************************************
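If you prefer the YARN CLI directly, the same log can most likely be pulled from inside the container with something along the lines of the command below (the container name is assumed from the quick-start command).

# Fetch aggregated logs for a finished application via the YARN CLI.
docker exec -it hive-on-spark yarn logs -applicationId <Application-Id>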
To start an interactive PySpark session:
make pyspark
To start a Scala Spark shell:
make spark-shell
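Both targets presumably exec into the running container. A rough manual equivalent, assuming the container is named hive-on-spark and the Spark binaries are on its PATH, would be:

# Illustrative only: interactive PySpark and Scala shells inside the container.
docker exec -it hive-on-spark pyspark
docker exec -it hive-on-spark spark-shell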
The following web interfaces are available to view configurations and logs and to track YARN/Spark job submissions:
- YARN NodeManager web UI: http://localhost:8042
- YARN ResourceManager web UI: http://localhost:8088
- Spark History Server web UI: http://localhost:18080
- HiveServer2 web UI: http://localhost:10002
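As a quick connectivity check (assuming the ports are published to the host), the ResourceManager also serves a REST API alongside its web UI:

# Query the YARN ResourceManager REST API for basic cluster info.
curl -s http://localhost:8088/ws/v1/cluster/info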