Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/Yarn. The `play-hadoop.yml` playbook deploys the base system. Addons, such as Pig, Spark, etc., are deployed using the playbooks in the `addons` directory.
Requirements:

- git
- GitHub account with uploaded SSH keys (due to use of submodules)
- Python, pip, virtualenv, libffi-dev, pkg-config
- Nodes accessible by SSH to an admin-privileged account
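On a Debian/Ubuntu control machine, the tooling above can typically be installed with something like the following (the package names are the usual Debian ones and are only a suggestion; they may differ on your distribution):

    $ sudo apt-get install git python-dev python-pip python-virtualenv libffi-dev pkg-config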
- Clone this repository (you must have a GitHub account and have uploaded your SSH key):

    $ git clone --recursive git://github.com/futuresystems/big-data-stack.git
    $ cd big-data-stack
- Create a virtualenv:

    $ virtualenv venv && source venv/bin/activate
- Install the dependencies:

    (venv) $ pip install -r requirements.txt
- Generate the inventory file:

    (venv) $ python mk-inventory -n bds- 10.0.0.10 10.0.0.11 > inventory.txt
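  The generated file maps node names (the `bds-` prefix plus an index) to the addresses you passed in, grouped by role. The exact group names and host variables are whatever `mk-inventory` emits; the sketch below is only illustrative, with `[frontends]` being one group that is referenced later when logging into the cluster:

    $ head inventory.txt
    [frontends]
    bds-0 ansible_ssh_host=10.0.0.10

    [datanodes]
    bds-1 ansible_ssh_host=10.0.0.11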
- Sanity check:

    (venv) $ ansible all -m ping

  If this fails, ensure that the nodes are SSH-accessible and that the user is correct in `ansible.cfg` (alternatively, override it with the `-u $REMOTE_USERNAME` flag). You can pass `-v` to increase verbosity (add more `v`s for more detail, e.g. `-vvvv`).
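  For example, to retry the check as a different remote user with maximum verbosity:

    (venv) $ ansible all -m ping -u $REMOTE_USERNAME -vvvv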
- Deploy:

    (venv) $ ansible-playbook play-hadoop.yml addons/spark.yml # ... etc
- Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on `india`, Ansible may be unable to access the node and complain with something like:

    master0 | UNREACHABLE! => {
        "changed": false,
        "msg": "ssh [email protected]:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
        "unreachable": true
    }

  To start the agent:

    badi@i136 ~$ eval $(ssh-agent)
    badi@i136 ~$ ssh-add
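  You can confirm the agent now holds your key with:

    badi@i136 ~$ ssh-add -l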
- Make sure your public key is added to github.com. IMPORTANT: check the fingerprint with

    ssh-keygen -lf ~/.ssh/id_rsa

  and make sure it is in your list of keys!
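  If you want to double-check that GitHub accepts the key, a common test is the following (the exact greeting text varies):

    $ ssh -T git@github.com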
- Download this repository using `git clone --recursive`. IMPORTANT: make sure you specify the `--recursive` option, otherwise you will get errors.

    git clone --recursive https://github.com/futuresystems/big-data-stack.git
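  If you did clone without `--recursive`, you can usually recover by fetching the submodules afterwards:

    $ git submodule update --init --recursive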
- Install the requirements using

    pip install -r requirements.txt
- Launch a virtual cluster and obtain the SSH-able IP addresses.
- Generate the inventory and variable files using `./mk-inventory`. For example:

    ./mk-inventory -n $USER-mycluster 192.168.10{1,2,3,4} > inventory.txt

  will define the inventory for a four-node cluster with nodes named `$USER-myclusterN` (with `N` from `0..3`).
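  Note that `192.168.10{1,2,3,4}` is ordinary shell brace expansion, so the command above receives four addresses:

    $ echo 192.168.10{1,2,3,4}
    192.168.101 192.168.102 192.168.103 192.168.104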
- Make sure that `ansible.cfg` reflects your environment. Look especially at `remote_user` if you are not using Ubuntu. You can alternatively override the user by passing `-u $NODE_USERNAME` to the ansible commands.
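  As a rough sketch (the file shipped with the repository may set more options, and the values here are only illustrative), the relevant part of `ansible.cfg` looks something like:

    [defaults]
    inventory   = inventory.txt
    remote_user = ubuntu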
- Ensure `ssh_config` is to your liking.
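  For throwaway clusters, a commonly used (if permissive) pattern is to skip host-key checking; this is only a suggestion, so adjust it to your own security requirements:

    Host *
        StrictHostKeyChecking no
        UserKnownHostsFile /dev/null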
- Run `ansible all -m ping` to make sure all nodes can be managed.
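  A healthy node should answer roughly like this (host names come from your inventory):

    master0 | SUCCESS => {
        "changed": false,
        "ping": "pong"
    }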
- Run `ansible-playbook play-hadoop.yml` to install the base system.
- Run `ansible-playbook addons/{pig,spark}.yml # etc` to install the Pig and Spark addons.
- Log into the frontend node (see the `[frontends]` group in the inventory) and use the `hadoop` user (`sudo su - hadoop`) to run jobs on the cluster.
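  A minimal session might look like the following (the frontend address and the host name in the prompt are placeholders that depend on your inventory):

    $ ssh <frontend-address>
    $ sudo su - hadoop
    hadoop@bds-0 $ hdfs dfs -ls /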
Sidenote: you may want to pass the `-f <N>` flag to `ansible-playbook` to use `N` parallel connections; this will make the deployment go faster. For example:

    $ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...
The `hadoop` user is present on all the nodes and is the Hadoop administrator. If you need to change anything on HDFS, it must be done as `hadoop`.
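For instance, creating an HDFS home directory for a regular user (the user name `alice` here is just a placeholder) would be done as `hadoop`:

    $ sudo su - hadoop
    $ hdfs dfs -mkdir -p /user/alice
    $ hdfs dfs -chown alice /user/alice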
Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:
$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt
See the `examples` directory:

- `nist_fingerprint`: fingerprint analysis using Spark with results pushed to HBase
Please see the `LICENSE` file in the root directory of the repository.
- Fork the repository
- Add yourself to the `CONTRIBUTORS.yml` file
- Submit a pull request to the `unstable` branch
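A typical workflow might look like the following (the fork URL, `<your-username>`, and the branch name `my-change` are placeholders):

    $ git clone --recursive git@github.com:<your-username>/big-data-stack.git
    $ cd big-data-stack
    $ git checkout -b my-change origin/unstable
    # edit, add yourself to CONTRIBUTORS.yml, commit
    $ git push origin my-change

then open the pull request against the `unstable` branch on GitHub.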