hadoop_record: A Python record reader for Hadoop

This library parses records written in Hadoop's CSV serialization format so you can use them easily in Python streaming programs.

tl;dr

git clone git://github.com/ptarjan/hadoop_record.git
cd hadoop_record/example/
cat sample.txt | ./mapper.py | sort | ./reducer.py
 en      2
 ru      1
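The pipeline above follows the usual streaming pattern: the mapper emits one key/value line per record, `sort` brings equal keys together, and the reducer aggregates each run of equal keys. The sketch below illustrates only that last step. It assumes tab-separated `key<TAB>count` lines and is written for Python 3; `reduce_counts` is a hypothetical helper, and the `reducer.py` shipped in the example directory may work differently.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum the counts per key for sorted, tab-separated key/count lines."""
    # Drop the trailing newline, then split each line into [key, count].
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges adjacent equal keys, so the input must be sorted
    # by key first (that is what the `sort` stage of the pipeline does).
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(count) for _, count in group)


if __name__ == "__main__":
    # Hypothetical mapper output, already sorted by key.
    sample = ["en\t1\n", "en\t1\n", "ru\t1\n"]
    for key, total in reduce_counts(sample):
        print("%s\t%d" % (key, total))
```

Because the reducer sees keys in sorted order, it only needs to hold one running total in memory at a time.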

Features

Decodes Hadoop jute records
Doesn't decode strings until they are used (good for large data sets when you only care about part of the data)
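To illustrate the lazy-decoding idea, here is a minimal sketch of a wrapper that keeps the raw field and decodes it only when the value is first requested. `LazyText` and `counting_decoder` are hypothetical names invented for this example (Python 3); this is not the library's actual `LazyString` implementation.

```python
class LazyText:
    """Hypothetical stand-in for a lazily decoded string field."""

    def __init__(self, raw, decoder):
        self._raw = raw          # undecoded field, exactly as read
        self._decoder = decoder  # callable that turns the raw field into a str
        self._value = None       # cached result, filled in on first use
        self._decoded = False

    def __str__(self):
        # Decode at most once; later calls return the cached value.
        if not self._decoded:
            self._value = self._decoder(self._raw)
            self._decoded = True
        return self._value


calls = []  # records every time the decoder actually runs


def counting_decoder(raw):
    calls.append(raw)
    return raw.upper()  # placeholder for real decoding work


field = LazyText("en", counting_decoder)
first = str(field)   # triggers the decoder
second = str(field)  # served from the cache
```

Fields that are never converted to a string never pay the decoding cost, which is where the savings on large records would come from.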

Command line Example

>>> from hadoop_record import csv
>>> csv("T")
True
>>> csv(";-1234")
-1234
>>> csv("1.0E-10")
1e-10
>>> csv("s{T,F}")
[True, False]
>>> csv("v{T,F}")
[True, False]
>>> csv("v{s{T,F}}")
[[True, False]]
>>> csv("m{'don't,#73746f70}")
{LazyString("don't"): LazyString('stop')}
>>> csv("'\xe2\x98\x83")
LazyString('\xe2\x98\x83')
>>> str(csv("'\xe2\x98\x83"))
'\xe2\x98\x83'
>>> unicode(csv("'\xe2\x98\x83"))
u'\u2603'
>>> csv("'%00%0a%25%2c")
LazyString('\x00\n%,')
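The string and buffer examples above suggest two encodings: a field starting with `'` is text with `%XX` hex escapes, and a field starting with `#` is a hex-encoded buffer. The sketch below reproduces just those two cases in Python 3. `decode_string` and `decode_buffer` are hypothetical helpers inferred from the examples, not functions exported by hadoop_record, and the real parser returns `LazyString` objects instead.

```python
import re

# Matches a percent sign followed by exactly two hex digits, e.g. %0a or %2C.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_string(field):
    """Decode a quote-prefixed string field, expanding %XX escapes."""
    if not field.startswith("'"):
        raise ValueError("string fields are expected to start with a quote")
    # A single pass over the text, so a decoded '%' is never re-expanded.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), field[1:])


def decode_buffer(field):
    """Decode a '#'-prefixed buffer field from hex digits to bytes."""
    if not field.startswith("#"):
        raise ValueError("buffer fields are expected to start with '#'")
    return bytes.fromhex(field[1:])


if __name__ == "__main__":
    print(repr(decode_string("'%00%0a%25%2c")))
    print(repr(decode_buffer("#73746f70")))
```

Escaping the comma and the newline is what lets these values sit safely inside comma-separated, line-oriented records.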

Hadoop

git clone git://github.com/ptarjan/hadoop_record.git
cd hadoop_record/example/
hadoop fs -put sample.txt .
hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input sample.txt -output sample_output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -file yahoo.py -file hadoop_record.mod
hadoop fs -cat sample_output/*
 en      2
 ru      1

If your input consists of binary records, you also need:

-inputformat SequenceFileAsTextInputFormat -file JuteRecordClasses.jar

and you're good to go. For example:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input /data/data_in_jute_format/part-0* -inputformat SequenceFileAsTextInputFormat -output output_dir -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -file yahoo.py -file JuteRecordClasses.jar -file hadoop_record.mod

This uses mapper.py, reducer.py, and yahoo.py from the example directory.
