First find out if Apache Hadoop will benefit you by reading the wiki . Download and save hadoop tar file from hadoop common. We also have a downloaded tar file at the location /u/local/apps/hadoop.
- Secure a compute node for your testing :
- Each user can get an interactive node for running their application (Eg:- For a single node with 8 cores use the command below.)
qrsh -pe shared 8 -l num_proc=8If you need the node for more than 2 hours use this command (Eg:- 8 hours):
qrsh -pe shared 8 -l num_proc=8,h_rt=8:00:00
- Before you submit a batch job, first make sure everything is working interactively. Follow these instructions one by one
- cd
- tar -zxvf hadoop-*.tar.gz (Eg:- tar -zxvf hadoop-1.0.2.tar.gz)
- ln -s hadoop-1.0.2 hadoop
- Update $HOME/.bashrc (once per installation for a new version of hadoop). csh/tcsh users use similar commands for $HOME/.cshrc or $HOME/.tcshrc
- export HADOOP_HOME=$HOME/hadoop
- export JAVA_HOME=/u/local/apps/java/jdk1.6.0_31
- export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
- Source $HOME/.bashrc in order to activate your updates
- source $HOME/.bashrc
- Modify masters, slaves, core-site.xml, mapred-site.xml, hdfs-site.xml files in $HADOOP_HOME/conf directory. Please remember that we are using port 9000 and 9001 for hdfs and Mapreduce respectively. We could make that as a variable as well.
- echo `hostname` > masters
- echo `hostname` > slaves
- Modify core-site.xml for hadoop.tmp.dir and fs.default.name. Replace 'hostname' with actual hostname
<property><name> hadoop.tmp.dir <⁄name><value>/work/hadoop-${user.name} <⁄value><⁄property><property><name> fs.default.name <⁄name><value> hdfs://hostname:9000 <⁄value><⁄property>- Modify hdfs-site.xml for dfs.replication
<property><name> dfs.replication <⁄name><value> 1 <⁄value><⁄property>- Modify mapred-site.xml for mapred.job.tracker. Replace 'hostname' with actual hostname
<property><name> mapred.job.tracker <⁄name><value> hostname:9001 <⁄value><⁄property>- Following commands are issued from $HADOOP_HOME
- Modify $HADOOP_HOME/conf/hadoop-env.sh to add $JAVA_HOME
- cd $HADOOP_HOME
- mkdir input
- cp conf/*.xml input
- $HADOOP_HOME/bin/hadoop namenode -format
- $HADOOP_HOME/bin/start-all.sh
- netstat -plten | grep java
- jps
- $HADOOP_HOME/bin/hadoop fs -put conf input
- $HADOOP_HOME/bin/hadoop jar hadoop-examples-1.0.2.jar grep input output 'dfs[a-z.]+'
- $HADOOP_HOME/bin/hadoop fs -get output output
- cat output/*
- $HADOOP_HOME/bin/stop-all.sh
- Hoffman2 has a special parallel queue called hadoop.q which has a prolog script that will automatically parse the PE_HOSTFILE and modify $HADOOP_HOME/conf files (masters, slaves, hdfs-site.xml, mapred-site.xml, core-site.xml) at run time. Prepare a SGE command file with following lines and save it as hdfs-sge.cmd.
#!/bin/sh #$ -o out.$JOB_ID #$ -j y # change the number from 16 to 32 (4 nodes) or 64 (8 nodes) #$ -pe 8threads 16 #$ -cwd # same as above line but use different set of nodes. Use either one and not both. ## SGE may be able to find nodes with free 4 cores faster than free 8 cores. So, you may use this option ##$ -pe 4threads 16 # No need to change this line, this line tells SGE to run jobs in hadoop.q #$ -l hadoop # # The three lines below are optional, instead pass it through the qsub command as described below # export JAVA_HOME=/u/local/apps/java/jdk1.6.0_31 #export HADOOP_HOME=$HOME/hadoop export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH cd $HADOOP_HOME export HADOOP_CONF_DIR=${SGE_O_WORKDIR}/conf.${JOB_ID} echo $HADOOP_CONF_DIR $HADOOP_HOME/bin/hadoop namenode -format sleep 10 $HADOOP_HOME/bin/start-dfs.sh cntdfs=0 count=0 if [ -z "${JAVA_HOME}" ] then echo "JAVA_HOME not defined" $HADOOP_HOME/bin/stop-dfs.sh exit else while [[ $cntdfs -lt 3 || $count -le 20 ]] do sleep 2 cntdfs=`cat $HADOOP_CONF_DIR/slaves $HADOOP_CONF_DIR/masters | xargs -i ssh {} $JAVA_HOME/bin/jps | grep -v Jps | wc -l` count=`expr ${count} + 1` done if [ $count -eq 20 ] then echo "Hadoop dfs did not start in 20 tries" $HADOOP_HOME/bin/stop-dfs.sh exit fi fi $HADOOP_HOME/bin/start-mapred.sh sleep 10 $HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/conf input sleep 10 $HADOOP_HOME/bin/hadoop jar hadoop-examples-1.0.2.jar grep input output 'dfs[a-z.]+' $HADOOP_HOME/bin/hadoop fs -get output $HADOOP_HOME/output $HADOOP_HOME/bin/stop-mapred.sh $HADOOP_HOME/bin/stop-dfs.sh- Submit the job to the cluster using the command below. (Please change HADOOP_HOME according to your installation)
- qsub -v HADOOP_HOME=$HOME/hadoop-1.0.2 hdfs-sge.cmd
- If you want to provide JAVA_HOME and HADOOP_HOME during the submission step use the command below (make appropriate changes for the real value of these two environment variables).
- qsub -v JAVA_HOME=/u/local/apps/java/jdk1.6.0_20,HADOOP_HOME=$HOME/hadoop-1.0.2 hdfs-sge.cmd
- If you see any errors, please resubmit the job. We noticed some timeout issues with hadoop or JAVA on some nodes. Only solution that we know at this time is to resubmit. Also, SGE will clean up all temporary files. So, remember to copy your output back to home directory inside the SGE command file.
- In order to find usage try typing "hadoop dfs -shellcommand" (Eg:- hadoop dfs -put or hadoop dfs -help)
cat chgrp chmod chown copyFromLocal copyToLocal cp du dus expunge get getmerge ls lsr mkdir movefromLocal mv put rm rmr setrep stat tail test text touchz
- Examples are available in /u/local/apps/hadoop/examples directory on Hoffman2
mkdir wordcount_classes javac -classpath $HADOOP_HOME/hadoop-0.20.2-core.jar -d wordcount_classes WordCount.java jar -cvf wordcount.jar -C wordcount_classes/ . create few text files in a directory called input as your exmaple input files for testing hadoop dfs -copyFromLocal input input hadoop dfs -ls $HADOOP_HOME/bin/hadoop jar wordcount.jar org.myorg.WordCount input output hadoop dfs -ls
- Examples are available in /u/local/apps/hadoop/examples directory on Hoffman2. The input files are in the directory gutenberg
hadoop dfs -copyFromLocal gutenberg gutenberg hadoop dfs -ls hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.2.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input gutenberg/* -output gutenberg-output hadoop dfs -copyToLocal gutenberg-output gutenberg-output hadoop dfs -ls