- Check out the original map-reduce paper from Google.
Optional:
- Install VirtualBox so you can run virtual machines on your local computer. Then download an image with Hadoop set up for you to play with, such as Cloudera's QuickStart VM. Alternatively, you could install Hadoop on your local machine directly. There's a walk-through for installing it on a Mac with `brew`.
- What takes more time: reading data from disk, or processing the data? (Make up an example or two if you like.) What is more likely to be a bottleneck?
- Which algorithms can be applied in a streaming fashion? How could you extend streaming approaches to work with multiple streams at the same time?
- What other thoughts, comments, concerns, and questions do you have? What's on your mind?
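As one way into the streaming question above: any computation that can be updated one element at a time with constant memory works in a streaming fashion. A minimal sketch (a running mean; the function name is just for illustration):

```python
def streaming_mean(stream):
    """Compute the mean of a stream one element at a time,
    keeping only a count and a running mean in memory."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update
    return mean

print(streaming_mean(iter([1, 2, 3, 4])))  # 2.5
```

The same update works per-stream if you keep one (count, mean) pair per incoming stream, which is one answer to the multiple-streams question.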
Application presentation.
Question review.
Check out Latency numbers every programmer should know. (Disk is slow!)
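A rough back-of-the-envelope from those latency tables makes the point concrete. The figures below are order-of-magnitude assumptions, not measurements; exact values vary by hardware and era:

```python
# Rough, order-of-magnitude figures in the spirit of the "latency
# numbers" tables (assumed values; your hardware will differ).
DISK_READ_1MB_S = 20e-3      # ~20 ms to read 1 MB sequentially from disk
MEMORY_READ_1MB_S = 0.25e-3  # ~0.25 ms to read 1 MB from main memory

gb = 1024  # MB in a GB
print(f"1 GB from disk:   {DISK_READ_1MB_S * gb:.2f} s")
print(f"1 GB from memory: {MEMORY_READ_1MB_S * gb:.2f} s")
```

With these numbers disk is about 80x slower than memory, which is why I/O, not computation, is usually the bottleneck.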
Slides on map-reduce.
Walk-through for doing map-reduce on Amazon Elastic MapReduce (EMR):
Amazon provides an AWS CLI for interacting with many of their services, including S3. It installs easily with pip. You'll need an AWS account and an access key to configure it.

```bash
pip install awscli
aws configure
```
Now you can easily move files into and out of S3 buckets:

```bash
aws s3 cp myfile s3://mybucket
aws s3 sync s3://mybucket .
```
And so on. (See `aws s3 help` etc.)
This example uses tweets as the data. The tweets were loaded into Python and then written to disk as stringified dicts. There are about 37 gigs of them in the gadsdc-twitter S3 bucket. A manageable chunk containing just 11 tweets is available: https://s3.amazonaws.com/gadsdc-twitter/out03.txt
Here are simple map and reduce scripts. You can run them locally:

```bash
cat input | ./map.py | sort | ./reduce.py > output
```
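The scripts themselves aren't reproduced here, so below is a sketch of what a streaming-style `map.py` and `reduce.py` could look like, assuming a word-count task (the task and filenames are assumptions; real scripts would read `sys.stdin` rather than chain functions directly):

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit a tab-separated (word, 1) pair for each word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: sum counts per word. Input must be sorted by key,
    which is exactly what `sort` provides in the streaming pipeline."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(kv[1]) for kv in group)}"

if __name__ == "__main__":
    # Chain mapper -> sort -> reducer in-process to demonstrate:
    mapped = sorted(map_lines(["the cat", "the hat"]))
    print(list(reduce_lines(mapped)))  # ['cat\t1', 'hat\t1', 'the\t2']
```

Hadoop Streaming runs the same two scripts across the cluster; the framework plays the role of `sort` between the map and reduce phases.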
You can run cluster streaming jobs on Amazon EMR through the AWS console.
More things to try implementing this way:
- What were the most popular hashtags?
- How many tweets came in per hour?
- Did the stream get rate-limited?
- What tweets / which people were most re-tweeted?
- Can you induce a graph from "conversations"?
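As a sketch of the first item above: since the tweets are stored as stringified dicts, each line can be parsed with `ast.literal_eval` and hashtags pulled out of the text. The `'text'` field name and the regex are assumptions about the data; run locally, one function plays the role of the whole map-sort-reduce pipeline:

```python
import ast
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def hashtags_from_line(line):
    """Parse one stringified-dict tweet and pull hashtags from its
    'text' field (field name is an assumption about the data)."""
    tweet = ast.literal_eval(line)
    return HASHTAG.findall(tweet.get("text", ""))

def top_hashtags(lines, n=10):
    """Count hashtags across tweet lines and return the most common."""
    counts = Counter(tag for line in lines
                     for tag in hashtags_from_line(line))
    return counts.most_common(n)

sample = ["{'text': 'big data! #hadoop #mapreduce'}",
          "{'text': 'more #hadoop'}"]
print(top_hashtags(sample))  # [('#hadoop', 2), ('#mapreduce', 1)]
```

In the streaming version, `hashtags_from_line` would be the mapper (emitting each tag with a count of 1) and the reducer would sum counts, as in the word-count pattern.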
There is a command line interface for Elastic Map Reduce as well, but it's a bit old, and depends on Ruby 1.8.7.
Pig lets you write Pig Latin scripts for doing complex map-reduce tasks more easily. Hortonworks has an introductory tutorial. Mortar has a tutorial as well. You can also run Pig on Amazon EMR.
PigPen "is map-reduce for Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it."
Hive adds some more structure to data and lets you write HiveQL, which is very close to SQL. You can also run Hive on Amazon EMR.
- mrjob is a Python library from Yelp that wraps map-reduce and can run jobs on EMR.
- Luigi is a Python library from Spotify that lets you write map-reduce workflows more easily.
- Cascading is a layer on top of Hadoop that has further layers such as Scalding (Scala) from Twitter - yet another way to simplify working with map-reduce.
- RHadoop provides an interface for running R on Hadoop.
There's also big graph processing as in Giraph, which is inspired by Google's Pregel.
Totally separate from Hadoop, MongoDB has an internal implementation of map-reduce.
Cloudera's Impala is inspired by Google's Dremel. Of course there's also Drill. And if you want to get Dremel straight from the source, you can buy it as a service from Google as BigQuery.
Spark keeps data in memory, which makes it much faster; this is especially useful for iterative processes. See their examples, which feature a nice Python API. There's also Shark, which gives much faster HiveQL query performance. You can run Spark/Shark on EMR too.
There's also distributed stream processing as in Storm.
Not exactly. But there are some projects that step in that direction:
Mahout is a project for doing large scale machine learning. It was originally mostly map-reduce oriented, but in April 2014 announced a move toward Spark.
MLlib is the machine learning functionality directly on Spark, which is actively growing.
Optional:
- UC Berkeley's AMP Camp provides great resources for learning a range of technologies including Spark. (Berkeley's AMP Lab is responsible for a lot of these cool technologies.)
- HUE "Hadoop User Experience" "is a Web interface for analyzing data with Apache Hadoop" that you might find inside various vendors' platforms, or separately.
- This paper describes large-scale machine learning in a very real-world advertising setting.
- See also Ad Click Prediction: a View from the Trenches at Google.
- Writing an Hadoop MapReduce Program in Python is a streaming walk-through that runs Hadoop directly.
- An elastic-mapreduce streaming example with python and ngrams on AWS is another walk-through that uses the EMR CLI.
- Check out an overview of algorithms over map-reduce.
- For more on doing joins with map-reduce, see this thesis.
- Read about doing ML faster by using more cores, using map-reduce.
- Go through an old Twitter deck on why Pig is good.
- Read about why Spark is a Crossover Hit For Data Scientists.