Friday, November 12, 2010

Building a Simple MapReduce Java Program

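MapReduce is a combination of two functions: map(), which turns each input record into intermediate key/value pairs, and reduce(), which merges all the values that share a key.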
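To make the two roles concrete, here is a minimal in-memory sketch in plain Java, with no Hadoop involved. It uses word counting purely as an illustration; the class name MiniMapReduce and its methods are hypothetical and do not correspond to any Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Hadoop.
final class MiniMapReduce {

    // map(): turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce(): merge all values collected for one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map() on every line, group the pairs by key, then reduce() each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

In a real Hadoop job the grouping step in run() is done by the framework between the map and reduce phases, so you only write the two functions.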

Main class for a simple MapReduce Java application:

public class Main {
    public static void main(String[] ap) {
        MyMapReduce my = new MyMapReduce();
        my.init();
    }
}

It just instantiates a class called 'MyMapReduce' and calls its init() method.

MapReduce Program for Factorial :

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class Factorial {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();
        private final static Text location = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizerLine = new StringTokenizer(line, "\n");
            Text t1 = new Text();   // never set, so every result is emitted under the same empty key
            Text t2 = new Text();
            int num;
            while (tokenizerLine.hasMoreTokens()) {
                String tokenAsLine = tokenizerLine.nextToken();
                StringTokenizer tokenizerWord = new StringTokenizer(tokenAsLine);
                // Collect every whitespace-separated number on the line.
                List<String> s1 = new ArrayList<String>();
                while (tokenizerWord.hasMoreTokens()) {
                    s1.add(tokenizerWord.nextToken());
                }
                for (int i = 0; i <= (s1.size() - 1); i++) {
                    num = Integer.parseInt(s1.get(i));
                    int fact = 1;   // note: an int overflows for inputs above 12
                    for (int j = 1; j <= num; j++) {
                        fact = fact * j;
                    }
                    t2.set(" " + fact);
                    output.collect(t1, t2);
                }
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            boolean first = true;
            StringBuilder toReturn = new StringBuilder();
            // Join all factorials for this key into one comma-separated string.
            while (values.hasNext()) {
                if (!first) {
                    toReturn.append(" , ");
                }
                first = false;
                toReturn.append(values.next().toString());
            }
            output.collect(key, new Text(toReturn.toString()));
        }
    }

    public static void main(String ap[]) {
        JobConf conf = new JobConf(Factorial.class);
        // The next five calls are assumed; the job setup was missing from the original listing.
        conf.setJobName("factorial");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(ap[0]));
        FileOutputFormat.setOutputPath(conf, new Path(ap[1]));
        try {
            conf.set("io.sort.mb", "10");
            JobClient.runJob(conf);   // assumed: submits the job and waits for completion
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
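The factorial step inside map() can be tried on its own, without a Hadoop installation. The following is a minimal sketch; the class FactorialSketch is a hypothetical helper that is not part of the job above, and it swaps the int accumulator for BigInteger, because an int can only hold factorials up to 12!.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of the Hadoop job; shown only to test the arithmetic.
final class FactorialSketch {

    // Iterative factorial. BigInteger avoids the overflow an int hits above 12!.
    static BigInteger factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative: " + n);
        }
        BigInteger fact = BigInteger.ONE;
        for (int j = 2; j <= n; j++) {
            fact = fact.multiply(BigInteger.valueOf(j));
        }
        return fact;
    }

    public static void main(String[] args) {
        // Same whitespace-separated input style the mapper reads from each line.
        for (String token : "0 5 13".split(" ")) {
            int n = Integer.parseInt(token);
            System.out.println(n + "! = " + factorial(n));
        }
    }
}
```

If you want larger inputs in the job itself, the same loop could replace the int arithmetic in map(), with t2 set from the BigInteger's toString().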

Tuesday, November 2, 2010

Building HADOOP CLUSTER [ Using 2 Linux Machines ]

STEP 1) Install Java 6 or above on the Linux machine ( jdk1.6.0_12 )
I have 'jdk-6u12-linux-i586.bin' on my REDHAT machine.
To install it, run the following commands:
# chmod 744 jdk-6u12-linux-i586.bin
# ./jdk-6u12-linux-i586.bin

STEP 2) Download ''
and extract it.
# cp -f jce/*.jar $JAVA_HOME/jre/lib/security/
# chmod 444 $JAVA_HOME/jre/lib/security/*.jar

STEP 3) Download hadoop-0.20.0.tar.gz or any later version,
extract it and copy the 'hadoop-0.20.0' folder to the '/usr/local/' directory.

# export JAVA_HOME=/java_installation_folder/jdk1.6.0_12
# export HADOOP_HOME=/usr/local/hadoop-0.20.0

Install the same on the second Linux machine.
The two machines are then described as follows:

Server IP      HostName       Role

1)             hostmaster     Master [ NameNode and JobTracker ]
2)             hostslave      Slave  [ DataNode and TaskTracker ]

STEP 6) Now make the following settings on the Master :

# vim /etc/hosts
Comment out all existing entries and add an entry for hostmaster at the end.
Save and exit.

Changes to be made on the Slave machine :

# vim /etc/hosts
Comment out all existing entries and add entries for hostslave and hostmaster at the end.
Save and exit.
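Since Hadoop runs on the JVM, one quick way to confirm the /etc/hosts entries took effect is to ask Java to resolve the names. This is only a sketch: the class HostCheck is hypothetical, and it assumes the hostnames hostmaster and hostslave used in this post unless other names are passed as arguments.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for checking name resolution; not part of Hadoop.
final class HostCheck {

    // Returns true if the JVM can resolve the given hostname to an address.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the two hostnames used in this post.
        String[] hosts = args.length > 0 ? args : new String[] {"hostmaster", "hostslave"};
        for (String host : hosts) {
            System.out.println(host + " -> " + (resolves(host) ? "resolves" : "NOT resolvable"));
        }
    }
}
```

Run it on both machines; every hostname listed in that machine's /etc/hosts should be reported as resolving.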

STEP 7) Set up SSH for communication :

Do these steps on the Master as well as on the Slave:
# ssh-keygen -t rsa
This generates the RSA public and private keys.
They are needed because the Hadoop Master Node communicates with the Slave Node using SSH.
The command generates the '' file under the '/root/.ssh' directory. Now rename the Master's to '' and copy it to the Slave Node (at the same path).
Then execute the following command on the Slave to add the Master's public key to the Slave's authorized keys.

# cat /root/.ssh/ >> /root/.ssh/authorized_keys

Now try to ssh to the Slave Node. It should connect without asking for a password.

# ssh

STEP 8) Setting up the MASTER NODE :
Set up Hadoop to work in fully distributed mode by editing the configuration files under the $HADOOP_HOME/conf/ directory.

Configuration properties :
Property                          Explanation
1)                                NameNode URI
2) mapred.job.tracker             JobTracker URI
3) dfs.replication                Number of block replicas
4) hadoop.tmp.dir (optional)      Temp directory

Let us start with the configuration files :

1) $HADOOP_HOME/conf/
Make the following change:
export JAVA_HOME=/java_installation_folder/jdk1.6.0_12

2) $HADOOP_HOME/conf/core-site.xml


3) $HADOOP_HOME/conf/hdfs-site.xml


4) $HADOOP_HOME/conf/mapred-site.xml


5) $HADOOP_HOME/conf/masters

6) $HADOOP_HOME/conf/slaves

Now copy all these files to the conf directory of the Slave machine.

STEP 9) Set up the Master and Slave Nodes (run on both machines) :

# hadoop namenode -format

Now your cluster is ready to run jobs.