Computational Cluster Programs

Running hadoop on the Hoffman2 Cluster

Testing on a single compute node as well as submitting to cluster nodes using SGE

First find out if Apache Hadoop will benefit you by reading the wiki . Download and save hadoop tar file from hadoop common. We also have a downloaded tar file at the location /u/local/apps/hadoop.

Secure a compute node for your testing :
Each user can get an interactive node for running their application (Eg:- For a single node with 8 cores use the command below.)
qrsh -pe shared 8 -l num_proc=8

If you need the node for more than 2 hours use this command (Eg:- 8 hours):

qrsh -pe shared 8 -l num_proc=8,h_rt=8:00:00

STEP I: Installation and interactive testing on a single compute node

Before you submit a batch job, first make sure everything is working interactively. Follow these instructions one by one
  • cd
  • tar -zxvf hadoop-*.tar.gz (Eg:- tar -zxvf hadoop-1.0.2.tar.gz)
  • ln -s hadoop-1.0.2 hadoop
Update $HOME/.bashrc (once per installation for a new version of hadoop). csh/tcsh users use similar commands for $HOME/.cshrc or $HOME/.tcshrc
  • export HADOOP_HOME=$HOME/hadoop
  • export JAVA_HOME=/u/local/apps/java/jdk1.6.0_31
  • export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
Source $HOME/.bashrc in order to activate your updates
  • source $HOME/.bashrc
Modify masters, slaves, core-site.xml, mapred-site.xml, hdfs-site.xml files in $HADOOP_HOME/conf directory. Please remember that we are using port 9000 and 9001 for hdfs and Mapreduce respectively. We could make that as a variable as well.
  • echo `hostname` > masters
  • echo `hostname` > slaves
  • Modify core-site.xml for hadoop.tmp.dir and fs.default.name. Replace 'hostname' with actual hostname
    <property>
    <name> hadoop.tmp.dir <⁄name>
    <value>/work/hadoop-${user.name} <⁄value>
    <⁄property>
    <property>
    <name> fs.default.name <⁄name>
    <value> hdfs://hostname:9000 <⁄value>
    <⁄property>
  • Modify hdfs-site.xml for dfs.replication
    <property>
    <name> dfs.replication <⁄name>
    <value> 1 <⁄value>
    <⁄property>
  • Modify mapred-site.xml for mapred.job.tracker. Replace 'hostname' with actual hostname
    <property>
    <name> mapred.job.tracker <⁄name>
    <value> hostname:9001 <⁄value>
    <⁄property>
Following commands are issued from $HADOOP_HOME
  • Modify $HADOOP_HOME/conf/hadoop-env.sh to add $JAVA_HOME
  • cd $HADOOP_HOME
  • mkdir input
  • cp conf/*.xml input
  • $HADOOP_HOME/bin/hadoop namenode -format
  • $HADOOP_HOME/bin/start-all.sh
  • netstat -plten | grep java
  • jps
  • $HADOOP_HOME/bin/hadoop fs -put conf input
  • $HADOOP_HOME/bin/hadoop jar hadoop-examples-1.0.2.jar grep input output 'dfs[a-z.]+'
  • $HADOOP_HOME/bin/hadoop fs -get output output
  • cat output/*
  • $HADOOP_HOME/bin/stop-all.sh

STEP II: Batch job submission setup (cluster of nodes) and submitting to SGE scheduler on Hoffman2 cluster

Hoffman2 has a special parallel queue called hadoop.q which has a prolog script that will automatically parse the PE_HOSTFILE and modify $HADOOP_HOME/conf files (masters, slaves, hdfs-site.xml, mapred-site.xml, core-site.xml) at run time. Prepare a SGE command file with following lines and save it as hdfs-sge.cmd.
             #!/bin/sh
             #$ -o out.$JOB_ID
             #$ -j y
             #  change the number from 16 to 32 (4 nodes) or 64 (8 nodes)
             #$ -pe 8threads 16
             #$ -cwd
             # same as above line but use different set of nodes. Use either one and not both.
             ## SGE may be able to find nodes with free 4 cores faster than free 8 cores. So, you may use this option
             ##$ -pe 4threads 16 
             #  No need to change this line, this line tells SGE to run jobs in hadoop.q
             #$ -l hadoop
             
             #
             #  The three lines below are optional, instead pass it through the qsub command as described below
             #
             export JAVA_HOME=/u/local/apps/java/jdk1.6.0_31
             #export HADOOP_HOME=$HOME/hadoop
             export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

             cd $HADOOP_HOME
             export HADOOP_CONF_DIR=${SGE_O_WORKDIR}/conf.${JOB_ID}

             echo $HADOOP_CONF_DIR

             $HADOOP_HOME/bin/hadoop namenode -format
             sleep 10
             $HADOOP_HOME/bin/start-dfs.sh
             cntdfs=0
	     count=0
	     if [ -z "${JAVA_HOME}" ]
	     then
	         echo "JAVA_HOME not defined"
	         $HADOOP_HOME/bin/stop-dfs.sh
	         exit
	     else
	        while [[ $cntdfs -lt 3 || $count -le 20 ]]
	        do
	           sleep 2
	           cntdfs=`cat $HADOOP_CONF_DIR/slaves $HADOOP_CONF_DIR/masters | xargs -i ssh {} $JAVA_HOME/bin/jps | grep -v Jps | wc -l`
	           count=`expr ${count} + 1`
	        done
	        if [ $count -eq 20 ]
	        then
	            echo "Hadoop dfs did not start in 20 tries"
	            $HADOOP_HOME/bin/stop-dfs.sh
	            exit
	         fi
	     fi
             $HADOOP_HOME/bin/start-mapred.sh
             sleep 10
             $HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/conf input
             sleep 10
             $HADOOP_HOME/bin/hadoop jar hadoop-examples-1.0.2.jar  grep input output 'dfs[a-z.]+'
             $HADOOP_HOME/bin/hadoop fs -get output $HADOOP_HOME/output
             $HADOOP_HOME/bin/stop-mapred.sh
             $HADOOP_HOME/bin/stop-dfs.sh
          
Submit the job to the cluster using the command below. (Please change HADOOP_HOME according to your installation)
  • qsub -v HADOOP_HOME=$HOME/hadoop-1.0.2 hdfs-sge.cmd
If you want to provide JAVA_HOME and HADOOP_HOME during the submission step use the command below (make appropriate changes for the real value of these two environment variables).
  • qsub -v JAVA_HOME=/u/local/apps/java/jdk1.6.0_20,HADOOP_HOME=$HOME/hadoop-1.0.2 hdfs-sge.cmd
If you see any errors, please resubmit the job. We noticed some timeout issues with hadoop or JAVA on some nodes. Only solution that we know at this time is to resubmit. Also, SGE will clean up all temporary files. So, remember to copy your output back to home directory inside the SGE command file.

Hadoop Shell Commands

In order to find usage try typing "hadoop dfs -shellcommand" (Eg:- hadoop dfs -put or hadoop dfs -help)
                cat
                chgrp
                chmod
                chown
                copyFromLocal
                copyToLocal
                cp
                du
                dus
                expunge
                get
                getmerge
                ls
                lsr
                mkdir
                movefromLocal
                mv
                put
                rm
                rmr
                setrep
                stat
                tail
                test
                text
                touchz 
          

Compiling and running JAVA code

Examples are available in /u/local/apps/hadoop/examples directory on Hoffman2
            mkdir wordcount_classes
            javac -classpath $HADOOP_HOME/hadoop-0.20.2-core.jar -d wordcount_classes WordCount.java
            jar -cvf wordcount.jar -C wordcount_classes/ .

            create few text files in a directory called input as your exmaple input files for testing

            hadoop dfs -copyFromLocal input input
            hadoop dfs -ls
            $HADOOP_HOME/bin/hadoop jar wordcount.jar  org.myorg.WordCount input output
            hadoop dfs -ls
              

Python Example

Examples are available in /u/local/apps/hadoop/examples directory on Hoffman2. The input files are in the directory gutenberg
             hadoop dfs -copyFromLocal gutenberg gutenberg
             hadoop dfs -ls
             hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.2.jar -file ./mapper.py -mapper 
                        ./mapper.py -file ./reducer.py -reducer ./reducer.py -input gutenberg/* 
                        -output gutenberg-output
             hadoop dfs -copyToLocal gutenberg-output gutenberg-output
             hadoop dfs -ls