Big Data Analytics Stack

Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/Yarn.

The play-hadoop.yml playbook deploys the base system. Add-ons such as Pig, Spark, etc. are deployed using the playbooks in the addons directory.

Stack

Legend: available / planned

The stack is organized into the following layers:

  • Analytics Layer
  • Data Processing Layer
  • Database Layer
  • Scheduling
  • Storage
  • Monitoring

Requirements

  • git
  • GitHub account with uploaded SSH keys (due to use of submodules)
  • Python, pip, virtualenv, libffi-dev, pkg-config (an example install command follows this list)
  • Nodes accessible via SSH from an account with admin privileges
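
On an Ubuntu/Debian control machine, the packages above can typically be installed along the following lines (a sketch only; package names vary across distributions and Python versions):

$ sudo apt-get update
$ sudo apt-get install -y git python-dev python-pip python-virtualenv libffi-dev pkg-config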

Quickstart

  • Clone this repository (you must have a GitHub account and have uploaded your SSH key)

    $ git clone --recursive git://github.com/futuresystems/big-data-stack.git
    $ cd big-data-stack
    
  • Create a virtualenv

    $ virtualenv venv && source venv/bin/activate
    
  • Install the dependencies

    (venv) $ pip install -r requirements.txt
    
  • Generate the inventory file (a sketch of the generated file follows this list)

    (venv) $ python mk-inventory -n bds- 10.0.0.10 10.0.0.11 >inventory.txt
    
  • Sanity check

    (venv) $ ansible all -m ping
    

    If this fails, ensure that the nodes are SSH-accessible and that the user configured in ansible.cfg is correct (alternatively, override it with the -u $REMOTE_USERNAME flag). You can pass -v to increase verbosity (repeat for more detail, e.g. -vvvv).

  • Deploy

    (venv) $ ansible-playbook play-hadoop.yml addons/spark.yml    # ... etc
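
The generated inventory.txt is a plain Ansible INI inventory. A rough, hypothetical sketch of its shape for the two-node example above (apart from [frontends], which the usage notes below rely on, the actual group names and host variables are whatever mk-inventory emits):

bds-0 ansible_ssh_host=10.0.0.10
bds-1 ansible_ssh_host=10.0.0.11

[frontends]
bds-0

# ...additional Hadoop role groups generated by mk-inventory...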
    

Usage

  1. Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on india, Ansible may be unable to access the nodes and complain with something like:

    master0 | UNREACHABLE! => {
        "changed": false,
        "msg": "ssh [email protected]:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
        "unreachable": true
    }
    

    To start the agent:

    badi@i136 ~$ eval $(ssh-agent)
    badi@i136 ~$ ssh-add
    
  2. Make sure your public key is added to github.com. IMPORTANT: check the fingerprint with ssh-keygen -lf ~/.ssh/id_rsa and make sure it is in your list of keys!

  3. Download this repository using git clone --recursive. IMPORTANT: make sure you specify the --recursive option, otherwise you will get errors.

     git clone --recursive https://github.com/futuresystems/big-data-stack.git
    
  4. Install the requirements using pip install -r requirements.txt

  5. Launch a virtual cluster and obtain the SSH-able IP addresses

  6. Generate the inventory and variable files using ./mk-inventory. For example:

    ./mk-inventory -n $USER-mycluster 192.168.10{1,2,3,4} >inventory.txt
    

    This defines the inventory for a four-node cluster whose nodes are named $USER-myclusterN (with N from 0 to 3).

  7. Make sure that ansible.cfg reflects your environment. Look especially at remote_user if you are not using Ubuntu. You can alternatively override the user by passing -u $NODE_USERNAME to the ansible commands. (A sketch of the relevant setting follows this list.)

  8. Ensure ssh_config is to your liking.

  9. Run ansible all -m ping to make sure all nodes can be managed.

  10. Run ansible-playbook play-hadoop.yml to install the base system

  11. Run ansible-playbook addons/{pig,spark}.yml # etc to install the Pig and Spark addons.

  12. Log into the frontend node (see the [frontends] group in the inventory) and use the hadoop user (sudo su - hadoop) to run jobs on the cluster.
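
For step 7, the setting in question lives in the [defaults] section of ansible.cfg. A minimal sketch, assuming an Ubuntu image whose admin account is ubuntu (adjust to your environment):

[defaults]
# user Ansible logs in as; override per run with -u $NODE_USERNAME
remote_user = ubuntu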

Sidenote: you may want to pass the -f <N> flag to ansible-playbook to use N parallel connections. This will make the deployment go faster. For example:

$ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...

The hadoop user is present on all the nodes and is the hadoop administrator. If you need to change anything on HDFS, it must be done as hadoop.
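
For example, once logged into the frontend node, a quick smoke test as the hadoop user might look like this (assuming the Hadoop binaries are on the hadoop user's PATH; the HDFS path is arbitrary):

$ sudo su - hadoop
$ hdfs dfs -ls /
$ hdfs dfs -mkdir -p /tmp/smoke-test
$ yarn node -list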

Upgrading

Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:

$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt

Examples

See the examples directory:

  • nist_fingerprint: fingerprint analysis using Spark with results pushed to HBase

License

Please see the LICENSE file in the root directory of the repository.

Contributing

  1. Fork the repository
  2. Add yourself to the CONTRIBUTORS.yml file
  3. Submit a pull request to the unstable branch