Cloudera Quickstart 5.12 Virtual Machine provisioned with Machine Learning and streaming tools.
This project shows how to provision the CDH virtual machine with any other tools that may be required for Machine Learning purposes.
My aim is to practice on a single machine running a pseudo-distributed cluster, the first step Cloudera proposes, purely for educational use.
The Cloudera Quickstart 5.12 Virtual Machine includes:
- Apache HBase
- Apache Hive & Hive on Spark
- Apache Impala
- Apache Oozie
- Apache Solr
- Apache Spark
- Apache Sqoop 1 & 2
- Apache YARN
- Apache ZooKeeper
- HDFS
- Hue 4
- ...
Other available open source applications are:
- Cloudera Manager
- Cloudera Navigator
- Cloudera Search
Based on CDH 5.12.0, I have added these tools:
- Anaconda 4.3.1 distribution
- Spark 2
- Integration of IPython Notebook with Apache Spark
- JupyterHub
- Java 1.8
- Kafka 3.0.0 (this is likely the Cloudera Kafka parcel, CDK, version rather than the Apache Kafka version)
- Apache Flink 1.0.3
- Apache Flume
- ...
Anaconda empowers the entire data science team - data engineers, data scientists, and business analysts - to analyze data in Hadoop and deliver high value, high impact predictive and machine learning solutions with Python.
Anaconda can be installed on a CDH cluster as a parcel.
With JupyterHub you can create a multi-user Hub which spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server.
Project Jupyter created JupyterHub to support many users. The Hub can offer notebook servers to a class of students, a corporate data science workgroup, a scientific research project, or a high performance computing group.
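For illustration, a minimal jupyterhub_config.py for a single-node hub on this VM could look like the sketch below; the user name, port, and notebook directory are my assumptions, not part of any official setup:

# jupyterhub_config.py -- a minimal sketch for a single-node hub.
# The user names, port and notebook directory below are assumptions;
# adapt them to your environment.
c.JupyterHub.ip = '0.0.0.0'                # listen on all interfaces of the VM
c.JupyterHub.port = 8000                   # default hub port
c.Authenticator.admin_users = {'cloudera'} # quickstart VM default user
c.Authenticator.whitelist = {'cloudera'}   # users allowed to log in
c.Spawner.notebook_dir = '~/notebooks'     # per-user notebook directory

You would then start the hub with jupyterhub -f jupyterhub_config.py.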
There are two methods of using Anaconda on an existing cluster with Cloudera CDH, Cloudera’s distribution including Apache Hadoop:
- Use the Anaconda parcel for Cloudera CDH. The following procedure describes how to install the Anaconda parcel on a CDH cluster using Cloudera Manager. The Anaconda parcel provides a static installation of Anaconda, based on Python 2.7, that can be used with Python and PySpark jobs on the cluster.
- Use Anaconda Scale, which provides additional functionality, including the ability to manage multiple conda environments and packages, including Python and R, alongside an existing CDH cluster. For more information, see Using Anaconda with Cloudera CDH.
I am going to use the first method, basically because it is open source, although it has several limitations that I will try to work around.
To install the Anaconda parcel:
- In the Cloudera Manager Admin Console, in the top navigation bar, click the Parcels icon.
- At the top right of the parcels page, click the Edit Settings button.
- In the Remote Parcel Repository URLs section, click the plus symbol, and then add the following repository URL for the Anaconda parcel:
- At the top of the page, click the Save Changes button.
- In the top navigation bar, click the Parcels icon to return to the list of available parcels, where you should see the latest version of the Anaconda parcel that is available.
- To the right of the Anaconda parcel listing, click the Download button.
- After the parcel is downloaded, click the Distribute button to distribute the parcel to all of the cluster nodes.
- After the parcel is distributed, click the Activate button to activate the parcel on all of the cluster nodes.
- When prompted, confirm the activation.
After the parcel is activated, Anaconda is available on all of the cluster nodes.
You can submit Spark jobs with the PYSPARK_PYTHON environment variable pointing to the location of the Anaconda interpreter. For example, enter the following command all on one line:
PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit pyspark_script.py
NOTE: The line break in the example above is for readability only. Enter the command all on one line.
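For reference, pyspark_script.py in the command above could be something as small as the following sketch (the script is hypothetical, not part of the Anaconda docs); it prints the interpreter used on the driver and on the executors so you can confirm the job really picked up the Anaconda parcel:

# pyspark_script.py -- a hypothetical example of the script referenced above.
# It prints the Python interpreter used on the driver and on the executors,
# to confirm the job really picked up the Anaconda parcel interpreter.
import sys
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("anaconda-interpreter-check")
sc = SparkContext(conf=conf)

# Driver-side interpreter; should be /opt/cloudera/parcels/Anaconda/bin/python
print("driver python: %s" % sys.executable)

# Executor-side interpreters, collected through a trivial job
paths = sc.parallelize(range(4)).map(lambda _: sys.executable).distinct().collect()
print("executor python: %s" % paths)

sc.stop()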
NOTE: The repository URL shown above installs the most recent version of the Anaconda parcel. To install an older version of the Anaconda parcel, add the following repository URL to the Remote Parcel Repository URLs in Cloudera Manager, and then follow the above steps with your desired version of the Anaconda parcel.
Anaconda builds new Cloudera parcels at least once a year each spring and also offers custom parcel creation for its enterprise customers. The Anaconda parcel provided at the repository URL shown above is based on Python 2.7.
I followed the steps to install Spark 2:
- See Spark 2 Requirements and check them.
- Install the Spark 2 CSD into Cloudera Manager.
Log on to the Cloudera Manager Server host and place the Spark 2 CSD file in the location configured for CSD files. Set the ownership of the CSD file to cloudera-scm:cloudera-scm with permission 644. Or, spelling the same thing out a little more slowly:
a. Download the Spark 2 CSD. I chose the latest version (version 2.2, release 2).
b. Upload the CSD to /opt/cloudera/csd on the Cloudera Manager server.
c. Change the owner and group for the JAR:
sudo chown cloudera-scm:cloudera-scm /opt/cloudera/csd/SPARK2_ON_YARN-2.2.0.cloudera2.jar
d. Update the permissions on the file:
sudo chmod 644 /opt/cloudera/csd/SPARK2_ON_YARN-2.2.0.cloudera2.jar
e. Restart the Cloudera Manager Server. As the root user on the Cloudera Manager Server host, run:
sudo service cloudera-scm-server restart
Then log in to the Cloudera Manager Admin Console and restart the Cloudera Management Service. If the agent on the host also needs a restart, run:
sudo service cloudera-scm-agent restart
f. Check whether the CSD installed successfully at http://quickstart.cloudera:7180/cmf/csd/refresh. Search for the following entry:
{
"csdName":"SPARK2_ON_YARN-2.2.0.cloudera2",
"serviceType":"SPARK2_ON_YARN",
"source":"/opt/cloudera/csd/SPARK2_ON_YARN-2.2.0.cloudera2.jar",
"isInstalled":true
}
g. Restart the Cloudera Manager Server with the following command:
sudo service cloudera-scm-server restart
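As a complement to step f, here is a small sanity check of my own (not from the Cloudera docs) that verifies the jar from steps c and d has the expected owner and permissions:

# check_csd.py -- my own sanity check, not part of the Cloudera docs.
# Verifies the Spark 2 CSD jar exists with the expected owner and mode.
import grp
import os
import pwd
import stat

CSD = "/opt/cloudera/csd/SPARK2_ON_YARN-2.2.0.cloudera2.jar"

st = os.stat(CSD)
owner = pwd.getpwuid(st.st_uid).pw_name
group = grp.getgrgid(st.st_gid).gr_name
mode = stat.S_IMODE(st.st_mode)

assert (owner, group) == ("cloudera-scm", "cloudera-scm"), "unexpected owner %s:%s" % (owner, group)
assert mode == 0o644, "unexpected mode %o" % mode
print("CSD jar in place: %s (%s:%s, %o)" % (CSD, owner, group, mode))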
- In the Cloudera Manager Admin Console, add the Spark 2 parcel repository to the Remote Parcel Repository URLs in Parcel Settings, as described in remote repository URLs.
- Download the Spark 2 parcel, distribute the parcel to the hosts in your cluster, and activate the parcel. See Managing Parcels.
- Add the Spark 2 service to your cluster.
a. In step #1, select a dependency option:
- HDFS, YARN, ZooKeeper: Choose this option if you do not need access to a Hive service.
- HDFS, Hive, YARN, ZooKeeper: Hive is an optional dependency for the Spark service. If you have a Hive service and want to access Hive tables from your Spark applications, choose this option to include Hive as a dependency and have the Hive client configurations always available to Spark applications.
I chose the first option, without the Hive dependency.
b. In step #2, when customizing the role assignments for Spark 2, DO NOT ADD the gateway role to every host. In fact, in our case it is not needed: we do not have an edge (gateway) node, because we do not have a "real" cluster. That is why the only host that gets any role assignment is the single quickstart node.
c. Note that the Spark 2 History Server port is 18089 instead of the usual 18088 (verify this on your installation; see the quick probe after this list).
d. Complete the steps to add the Spark 2 service.
e. Return to the Home page by clicking the Cloudera Manager logo.
f. Click to restart the cluster.
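Once the cluster is back up, you can confirm the port from step c. Below is a quick sketch that probes the usual History Server port and the Spark 2 one; the host name and ports are taken from the steps above, and it should be run from inside the VM:

# history_server_probe.py -- a quick sketch, run from inside the VM.
# Probes the usual Spark History Server port (18088) and the Spark 2
# one (18089) to see which service answers where.
import urllib2  # Python 2.7; use urllib.request on Python 3

for port in (18088, 18089):
    url = "http://quickstart.cloudera:%d" % port
    try:
        urllib2.urlopen(url, timeout=5)
        print("a history server answers on %s" % url)
    except Exception as exc:
        print("no answer on %s (%s)" % (url, exc))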
The Oracle JDK installer is available both as an RPM-based installer for RPM-based systems, and as a binary installer for other systems.
- Download the .tar.gz file for one of the supported versions of the Oracle JDK from Java SE 8 Downloads.
- Extract the JDK to /usr/java/jdk.1.8.0_nn.
- Set JAVA_HOME to the directory where the JDK is installed. Add the following line to the files specified below:
export JAVA_HOME=/usr/java/jdk.1.8.0_nn
- Cloudera Manager Server host: /etc/default/cloudera-scm-server. This affects only the Cloudera Manager Server process, and does not affect the Cloudera Management Service roles.
- All hosts in an unmanaged deployment (!!): /etc/default/bigtop-utils. You do not need to do this for clusters managed by Cloudera Manager.
- I followed the instructions in Configuring a Custom Java Home Location. This change affects all CDH processes and Cloudera Management Service roles in the cluster.
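To verify that processes will pick up the new JDK, a small sketch of mine (not from the Cloudera docs) compares JAVA_HOME with the java binary actually resolved on the PATH:

# java_home_check.py -- a small sketch, not from the Cloudera docs.
# Compares JAVA_HOME with the java binary actually resolved on the PATH.
import os
import subprocess

print("JAVA_HOME = %s" % os.environ.get("JAVA_HOME", "<unset>"))

# `java -version` prints its banner to stderr, so redirect it to stdout
banner = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
print(banner.strip())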
- Running Spark Applications Using IPython and Jupyter Notebooks
- Using Anaconda with Pyspark for Distributed Language Processing on a Hadoop Cluster (April 12th 2016)
- Custom Anaconda Parcels for Cloudera CDH (October 31st 2016)
Pay attention to the versions, because these last three posts may be somewhat outdated.
- Installing RecordService on Your CDH Cluster
- Managing Parcels.
- How To Install Java on CentOS and Fedora
- [Installing Zeppelin 0.6.1 with CDH 5.7.0 on Ubuntu 14.04 LTS](https://gist.github.com/ciprian12/f4333b92f3b8ae5373f5b3ef91bed158)
The main point to take care of is a matter of versions, especially with Python. To sum up, a partial solution could be to install Python 3 manually, one installation per node. We aim to work with and visualize notebooks in a multi-user environment with JupyterHub.
To check the OS release:
$ rpm --query centos-release
or...
$ lsb_release -d
To check the versions of the main tools:
$ java -version
$ spark-shell --version
$ python --version
There is nothing like kafka --version at this point, so check the version from the /usr/lib/kafka/cloudera folder instead:
$ grep "version" /usr/lib/kafka/cloudera/*
The main problem I am trying to solve is that several versions of these tools coexist inside the VM, and we must point to one or the other depending on the use case. The sketch below shows one way to report what a session actually resolves.
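This is my own diagnostic sketch, not part of any of the tools above; the command names are the ones used in this document, and any of them may be absent depending on what you installed:

# versions.py -- a sketch that reports which interpreter, JDK and Spark a
# session actually resolves when several versions coexist on the VM.
import os
import subprocess
import sys

print("python          : %s (%s)" % (sys.executable, sys.version.split()[0]))
print("JAVA_HOME       : %s" % os.environ.get("JAVA_HOME", "<unset>"))
print("SPARK_HOME      : %s" % os.environ.get("SPARK_HOME", "<unset>"))
print("PYSPARK_PYTHON  : %s" % os.environ.get("PYSPARK_PYTHON", "<unset>"))

for cmd in (["java", "-version"], ["spark-submit", "--version"], ["spark2-submit", "--version"]):
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        print("%s -> %s" % (cmd[0], out.strip().splitlines()[0]))
    except Exception as exc:
        print("%s -> not available (%s)" % (cmd[0], exc))

Running it once with the plain system environment and once with PYSPARK_PYTHON set to the Anaconda parcel makes the version mix visible at a glance.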