Skip to content

loum/hive-on-spark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Hive 3.1.2 on Spark 2.4.8 (on YARN) with Docker

Overview

Quick and easy way to get Hive on Spark (on YARN) with Docker. See Apache Hive on Spark docs for more information.

NOTE: Now with Livy support.

Lots happening here, but in short this repository will build you a Docker image that allows you to run Hive with Spark as the compute engine. Spark itself uses YARN as the resource manager which we leverage from the underlying Hadoop install. See documentation on the Hive base Docker image for details on how Hadoop/YARN has been configured.

Quick Links

Quick Start

Impatient and just want Hive on Spark quickly?

docker run --rm -d --name hive-on-spark loum/hive-on-spark:latest

Prerequisties

Getting Started

Get the code and change into the top level git project directory:

git clone https://github.com/loum/hive-on-spark.git && cd hive-on-spark

NOTE: Run all commands from the top-level directory of the git repository.

For first-time setup, get the Makester project:

git submodule update --init

Keep Makester project up-to-date with:

make submodule-update

Setup the environment:

make init

Getting Help

There should be a make target to get most things done. Check the help for more information:

make help

Docker Image Management

Image Build

The image build compiles Spark from scratch to ensure we get the correct version without the YARN libraries. More info can be found at the Spark build page.

To build the Docker image:

make build-image

Image Searches

Search for existing Docker image tags with command:

make search-image

Image Tagging

By default, makester will tag the new Docker image with the current branch hash. This provides a degree of uniqueness but is not very intuitive. That's where the tag-version Makefile target can help. To apply tag as per project tagging convention <hive-version>-<spark-version>-<image-release-number>

make tag-version

To tag the image as latest

make tag-latest

Interact with Hive on Spark

To start the container and wait for all Hadoop services to initiate:

make controlled-run

To stop the container:

make stop

Start a shell on the Container

make login

Using Beeline CLI (HiveServer2)

NOTE: Check the Beeline Command Reference for more information.

Login to beeline (!q to exit CLI):

make beeline

Create a Hive table named test:

make beeline-create

To show tables:

make beeline-show

To insert a row of data into Hive table test

NOTE: This will invoke the Spark execution engine through YARN.

make beeline-insert

To select all rows in Hive table test:

make beeline-select

To drop the Hive table test:

make beeline-drop

Alternatively, port 10000 is exposed to allow connectivity to clients with JDBC.

Only Need Spark?

The Spark computing system_ is available and can be invoked as per normal. More information on submitting applications to Spark can be found here.

Sample SparkPi Application

The sample SparkPi application can be launched with:

make pi

Apart from some verbose logging displayed on the console it may appear that not much has happened here. However, since the Spark application has been deployed in cluster mode you will need to dump the associated application ID's log to see meaningful output. To get a list of Spark application logs (under YARN):

make yarn-apps

Then plug in an Application-Id into:

make yarn-app-log YARN_APPLICATION_ID=<Application-Id>

To see something similar to the following::

====================================================================
LogType:stdout
LogLastModifiedTime:Sat Apr 11 21:49:03 +0000 2020
LogLength:33
LogContents:
Pi is roughly 3.1398156990784956

End of LogType:stdout
***********************************************************************

pyspark

make pyspark

spark-shell

make spark-shell

Web Interfaces

The following web interfaces are available to view configurations and logs and to track YARN/Spark job submissions:

About

Hive on Spark (using YARN) with Docker

Resources

Stars

Watchers

Forks

Packages

No packages published