Hadoop_playground
By default, the EOS system discards the Map-stage outputs automatically after Map-Reduce completes. Unfortunately, this makes it difficult to debug Hadoop Streaming programs. An easy way around this is to run the MR method 3 times, once with map and no reduce, once with reduce and no map, and once with both map and reduce.
For example, to run streaming with no reduce:
hadoop jar $HADOOP_STREAMING \
-mapper "./map.out" \
-reducer NONE \
-file map.out \
-verbose \
-input $TAMUSC_USER_WORK_DIR/inputs/ \
-output $TAMUSC_USER_WORK_DIR/outputs/maponly
Note By running this code with a C-based word count program, I found that Map generally outputs 8 files labeled Part-0000 - Part-0007. However, Reduce generally only outputs a single file. I wonder how this holds for larger files.
I found that using hadoop -fs get does not overwrite files if they exist in the target directory nor does it tell us whether or not this condition occurs. I ran into the situation where I would run my code but no matter what changes I would make to my reduce program, the reduce-output files were empty. Verifying successful mapping was simple enough; I just used the method described in Debugging Map and Reduce Separately and at the map outputs directly.
The Quick Fix for this issue with Hadoop fs -get/put is to always target a unique directory. For example, I used target folders with some form of $PBS_JOBID written in them (since the EOS PBS system gives us unique Job ID's).
http://sc.tamu.edu/help/eos/batch/
According to the Hadoop Wiki, it should be possible to run Clean-up scripts if either map and/or reduce fails. More information can be found at http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms.
http://srinathsview.blogspot.com/2012/05/debugging-hadoop-task-tracker-job.html
I will play around with some of these scripts and post my findings soon.
Something interesting I found (but have yet to try) is that the EOS PBS system lets us choose our std streams. I might be able to use this to create a named stderr stream for my C programs to write logs to.