<div dir="ltr" style="text-align: left;" trbidi="on"><br />
<strong><span style="font-size: large;"><span style="color: #38761d;"> Pig Latin</span></span></strong><br />
<br />
A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.<br />
Under the covers, Pig turns the transformations into a series of MapReduce jobs, but as a programmer you are mostly unaware of this, which allows you to focus on the data rather than on the nature of the execution.<br />
<br />
Pig runs in two modes :<br />
1) Local Mode<br />
2) Hadoop Mode<br />
<br />
<span style="color: #38761d;"><strong>1) Local Mode :</strong> </span>In local mode Pig runs in a single JVM & accesses the local file system. This mode is suitable only for small datasets & when trying out Pig. Local mode doesn't use Hadoop. Also it doesn't use Hadoop's local job runner, instead Pig translates queries into a physical plan that it executes itself. The execution type is set using the -x or -exectype option. To run in local mode, set the option to local:<br />
<strong>$ pig -x local</strong><br />
<br />
<strong><span style="color: #38761d;">2) Hadoop Mode : </span></strong>In Hadoop mode, Pig translates queries into MapReduce jobs & runs them on a Hadoop cluster. To use Hadoop mode you need to tell Pig which vesion of Hadoop you are using & where your cluster is running.<br />
The Environment variable PIG_HADOOP_VERSION is used to tell Pig the version of Hadoop it is connecting to. <br />
<strong>$ export PIG_HADOOP_VERSION = 20</strong><br />
<br />
Next, we need to point Pig at the cluster's namenode &amp; jobtracker. If you already have Hadoop site files that define fs.default.name &amp; mapred.job.tracker, you can simply add Hadoop's configuration directory to Pig's classpath :<br />
<strong>$ export PIG_CLASSPATH=$HADOOP_INSTALL/conf/</strong><br />
<strong> </strong><br />
Alternatively, you can create a <strong>pig.properties</strong> file in Pig's “conf” directory, which sets these two properties. Here is an example for a pseudo-distributed setup :<br />
<strong>fs.default.name=hdfs://localhost/<br />
mapred.job.tracker=localhost:8021</strong><br />
<strong> </strong><br />
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig, setting the -x option to mapreduce or omitting it entirely, as Hadoop mode is the default:<br />
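<br />
For example, either of the following starts the Grunt shell in Hadoop mode :<br />
<strong>$ pig<br />
$ pig -x mapreduce</strong><br />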
<br />
<strong> <span style="color: #38761d;"> /bin/pigscr file</span></strong><br />
<strong>#!/bin/sh<br />
PIG_PATH=$HADOOP_HOME/bin/pig-0.7.0<br />
PIG_CLASSPATH=$PIG_PATH/pig-0.7.0-core.jar:$HADOOP_HOME/conf \<br />
PIG_HADOOP_VERSION=0.20.2 \<br />
$PIG_PATH/bin/pig "$@"</strong></div><div dir="ltr" style="text-align: left;" trbidi="on"><br />
<strong><span style="font-size: large;"> <span style="color: #990000;"> </span></span></strong><strong><span style="font-size: large;"><span style="color: #38761d;">Installing PIG</span></span></strong><br />
<br />
To install Pig on Linux, we need the following packages :<br />
<br />
<strong> </strong><br />
<strong>1) Hadoop 0.20.2 (or later)<br />
2) Java 1.6 or later (set JAVA_HOME)<br />
3) Ant 1.7 (optional, for builds)<br />
4) JUnit 4.5 (optional, for unit tests)</strong><br />
<br />
Download Pig from one of the Apache download mirrors listed at<br />
<strong>http://pig.apache.org/releases.html</strong> <br />
<br />
Unpack the downloaded Pig distribution. The pig script is located in the bin directory. <br />
Add “<strong>/pig-n.n.n/bin</strong>” to your PATH. Use <strong>export (bash, sh, ksh)</strong> or<br />
<strong>setenv (tcsh, csh)</strong> :<br />
<br />
<strong><span style="color: #38761d;">export PATH=/usr/local/Hadoop-0.20.2/bin/pig-0.7.0/bin:$PATH</span></strong><br />
<br />
Then try the following to verify the installation :<br />
<strong># pig -help<br />
# pig ...(to start the Grunt shell)</strong><br />
<br />
<strong><span style="color: #38761d;"> Writing Scripts</span></strong><br />
<br />
Copy the “<strong>/etc/passwd</strong>” file to <strong>/root</strong> and write a script “<strong>id.pig</strong>” as follows :<br />
<strong># vim /root/id.pig</strong><br />
<br />
<strong>A = load 'passwd' using PigStorage(':');<br />
B = foreach A generate $0 as id;<br />
dump B;<br />
STORE B into '$out';</strong><br />
<br />
Save this file and exit.<br />
<br />
Copy the '<strong>passwd</strong>' &amp; '<strong>id.pig</strong>' files into a directory, say <strong>/root/inpig</strong>,<br />
then, <br />
<strong># hadoop dfs -put /root/inpig/passwd passwd</strong> (to insert your data into HDFS)<br />
<strong>grunt> run -param out=myoutput id.pig</strong> (from /root/inpig, to run the script)<br />
<br />
The output will now be saved in the '<strong>/user/root/myoutput/part-m-00000</strong>' file.<br />
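<br />
Equivalently, you can skip the Grunt shell and run the script in batch mode, then inspect the result (a minimal sketch using the same paths as above) :<br />
<strong># pig -param out=myoutput /root/inpig/id.pig<br />
# hadoop dfs -cat /user/root/myoutput/part-m-00000</strong><br />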
<br />
<strong><span style="color: #38761d;"> Pig sample commands and their results</span></strong><br />
Here is some sample data. The '<strong>/data/one</strong>' file contains :<br />
a A 1<br />
b B 2<br />
c C 3<br />
a AA 11<br />
a AAA 111<br />
b BB 22<br />
<br />
And '<strong>/data/two</strong>' file contains :<br />
x X a<br />
y Y b<br />
x XX b<br />
z Z c<br />
<br />
A sample script over these files is :<br />
<strong># vim test1.pig</strong><br />
<strong>one = load 'data/one' using PigStorage();<br />
two = load 'data/two' using PigStorage();<br />
<br />
generated = FOREACH one GENERATE $0, $2;</strong><br />
<br />
Save &amp; exit. Dumping <strong>generated</strong> gives :<br />
<br />
<strong><span style="color: #38761d;">RESULT :</span></strong><br />
(a, 1)<br />
(b, 2)<br />
(c, 3)<br />
(a, 11)<br />
(a, 111)<br />
(b, 22)<br />
<br />
<span style="color: #38761d;"><strong>Other Commands and their Results are as follows :</strong></span><br />
<br />
<strong>grouped = GROUP one BY $0;</strong><br />
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)})<br />
(b, {(b, B, 2), (b, BB, 22)})<br />
(c, {(c, C, 3)})<br />
<br />
<strong>grouped2 = GROUP one BY ($0, $1);</strong><br />
((a, A), {(a, A, 1)})<br />
((a, AA), {(a, AA, 11)})<br />
((a, AAA), {(a, AAA, 111)})<br />
((b, B), {(b, B, 2)})<br />
((b, BB), {(b, BB, 22)})<br />
((c, C), {(c, C, 3)})<br />
<br />
<strong>summed = FOREACH grouped GENERATE group, SUM(one.$2);</strong><br />
(a, 123.0)<br />
(b, 24.0)<br />
(c, 3.0)<br />
<br />
<strong>counted = FOREACH grouped GENERATE group, COUNT(one);</strong><br />
(a, 3)<br />
(b, 2)<br />
(c, 1)<br />
<br />
<strong>flat = FOREACH grouped GENERATE FLATTEN(one);</strong><br />
(a, A, 1)<br />
(a, AA, 11)<br />
(a, AAA, 111)<br />
(b, B, 2)<br />
(b, BB, 22)<br />
(c, C, 3)<br />
<br />
<strong>cogrouped = COGROUP one BY $0, two BY $2;</strong><br />
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)}, {(x, X, a)})<br />
(b, {(b, B, 2), (b, BB, 22)}, {(y, Y, b), (x, XX, b)})<br />
(c, {(c, C, 3)}, {(z, Z, c)})<br />
<br />
<strong>flatc = FOREACH cogrouped GENERATE FLATTEN(one.($0,$2)), FLATTEN(two.$1);</strong><br />
(a, 1, X)<br />
(a, 11, X)<br />
(a, 111, X)<br />
(b, 2, Y)<br />
(b, 22, Y)<br />
(b, 2, XX)<br />
(b, 22, XX)<br />
(c, 3, Z)<br />
<br />
<strong>joined = JOIN one BY $0, two BY $2;</strong><br />
(a, A, 1, x, X, a)<br />
(a, AA, 11, x, X, a)<br />
(a, AAA, 111, x, X, a)<br />
(b, B, 2, y, Y, b)<br />
(b, BB, 22, y, Y, b)<br />
(b, B, 2, x, XX, b)<br />
(b, BB, 22, x, XX, b)<br />
(c, C, 3, z, Z, c)<br />
<br />
<strong>crossed = CROSS one, two;</strong><br />
(a, AA, 11, z, Z, c)<br />
(a, AA, 11, x, XX, b)<br />
(a, AA, 11, y, Y, b)<br />
(a, AA, 11, x, X, a)<br />
(c, C, 3, z, Z, c)<br />
(c, C, 3, x, XX, b)<br />
(c, C, 3, y, Y, b)<br />
(c, C, 3, x, X, a)<br />
(b, BB, 22, z, Z, c)<br />
(b, BB, 22, x, XX, b)<br />
(b, BB, 22, y, Y, b)<br />
(b, BB, 22, x, X, a)<br />
(a, AAA, 111, x, XX, b)<br />
(b, B, 2, x, XX, b)<br />
(a, AAA, 111, z, Z, c)<br />
(b, B, 2, z, Z, c)<br />
(a, AAA, 111, y, Y, b)<br />
(b, B, 2, y, Y, b)<br />
(b, B, 2, x, X, a)<br />
(a, AAA, 111, x, X, a)<br />
(a, A, 1, z, Z, c)<br />
(a, A, 1, x, XX, b)<br />
(a, A, 1, y, Y, b)<br />
(a, A, 1, x, X, a)<br />
<br />
<strong>SPLIT one INTO one_under IF $2 < 10, one_over IF $2 >= 10;<br />
-- one_under:</strong><br />
(a, A, 1)<br />
(b, B, 2)<br />
(c, C, 3)<br />
<strong>-- one_over:</strong><br />
(a, AA, 11)<br />
(a, AAA, 111)<br />
(b, BB, 22)<br />
</div><div dir="ltr" style="text-align: left;" trbidi="on"><div align="left"><br />
<b><span style="color: #38761d;"><span style="font-size: large;"> PIG</span></span></b><br />
<br />
Pig scripts can be run in two modes – a) Local mode<br />
b) Hadoop mode<br />
1) <span style="color: #38761d;"><b>Local Mode :</b> </span>To run scripts in local mode, no Hadoop or HDFS installation is required. All files are installed &amp; run from your local host &amp; file system.<br />
2) <b><span style="color: #38761d;">Hadoop Mode :</span></b> To run scripts in Hadoop (MapReduce) mode, we need access to a Hadoop cluster &amp; an HDFS installation, such as the one available through the Hadoop virtual machine.<br />
Pig tutorial files are installed on the Hadoop Virtual machine under “/home/hadoop-user/pig” directory. <br />
<br />
<span style="color: #38761d;"><b>Getting Started :</b></span><br />
1) Install java<br />
2) Download Pig tutorial file & install Pig<br />
3) Run the Pig scripts – in local mode or on a Hadoop mode.<br />
<br />
Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.<br />
The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.<br />
Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist. Pig's language layer currently consists of a textual language called “Pig Latin”, which has the following key properties.<br />
</div><div align="left">1) <b><span style="color: #38761d;">Ease of Programming :</span></b><br />
It is trivial to achieve parallel execution of simple, 'embarrassingly parallel' data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand &amp; maintain.<br />
</div><div align="left">2) <b><span style="color: #38761d;">Optimization opportunities :</span></b><br />
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.<br />
</div><div align="left">3) <b><span style="color: #38761d;">Extensibility :</span></b><br />
Users can create their own functions to do special purpose processing.<br />
<br />
Pig is a system for processing large semistructured data sets using Hadoop MapReduce platform.<br />
<br />
<b><span style="color: #38761d;">Pig Latin :</span> </b>High level procedural language.<br />
<br />
<b><span style="color: #38761d;">Pig Engine : </span></b>Parser, optimizer & distributed query execution.<br />
<br />
<b><span style="color: #38761d;"><span style="font-size: large;"> Example WordCount using PIG</span></span></b><br />
<b>file name : wordcount.pig</b><br />
<br />
myinput = load '/user/wc.txt' USING TextLoader() as (text_line:chararray);<br />
<br />
words = FOREACH myinput GENERATE FLATTEN (TOKENIZE ($0));<br />
<br />
grouped = GROUP words BY $0;<br />
-- grouped schema : { (group, words) }<br />
counts = FOREACH grouped GENERATE group, COUNT (words);<br />
<br />
store counts into '/user/pigoutput' using PigStorage();<br />
<br />
<b>save file and quit</b><br />
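<br />
If the pig launcher script is on your PATH, you can also run the saved script directly, instead of the java invocation shown below :<br />
<b># pig wordcount.pig</b><br />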
<br />
<b><span style="color: #38761d;">Pig is a higher level Perl script.</span></b><br />
<br />
<br />
<b># java -Xmx1024M -cp pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main wordcount.pig</b><br />
<br />
[ Set the path of the “conf” directory in the $HADOOP_CONF_DIR variable. ]</div></div><strong><span style="font-size: large;"> <span style="color: #38761d;">Using Hadoop Cluster in Astronomy</span></span></strong><br />
<strong><span style="font-size: large;"><span style="color: #38761d;"> </span></span></strong><br />
Astrophysics is a branch of physics dealing with many fundamental questions about the nature of the universe, e.g. studying the properties of dark matter &amp; the nature of dark energy.<br />
Accomplishing this goal requires new methodologies for analysing &amp; understanding petascale data sets (i.e. data being collected at a rate 1000X greater than current surveys). This research focusses on exploring an emerging paradigm for data-intensive applications, MapReduce, &amp; how it scales to the analysis of astronomical images.<br />
<br />
<strong><span style="color: #38761d;">Map Reduce in Astronomy :</span></strong><br />
<strong><span style="color: #38761d;"> </span></strong><br />
In order to exploit the elastic nature of the computational cloud, where many computers can be used at the same time, one requires an efficient way of writing parallel programs. The High Performance Computing [HPC] community has been developing such programs for roughly 20 years. The MapReduce model allows programmers to write a map function, which takes a key / value pair (e.g. id and file-name), operates on it (performs object detection) and returns a new set of key / value pairs (a source list). The reduce function then aggregates / merges all the intermediate data (builds an object catalog on a stacked source list). Many problems in astronomy naturally fall into this model because of the inherent parallelizability of many astronomical tasks. The benefits are that MapReduce is easy to write &amp; the framework provides automatic load balancing.<br />
<br />
1] Image mosaicing is a general tool required by astronomers. The SLOAN DIGITAL SKY SURVEY [SDSS] alone generated 1.3 million astronomical images. Combining these files to form larger composite images, or stacking the individual images to detect faint sources, enables a broad range of science questions to be addressed (from the detection of moving asteroids that are too faint to be seen on a single image to the identification of very faint, high-redshift galaxies).<br />
Next-generation astronomical surveys such as the LSST will generate 30 TB of images per night, detecting transient sources, moving objects &amp; hundreds of millions of stars &amp; galaxies.<br />
The MapReduce model specifies a computation that takes a set of input key / value pairs &amp; produces a set of output values. We divide the computation into 3 distinct phases: Map, Reduce &amp; Shuffle.<br />
Shuffle is an optional capability of the Reduce mechanism, but since we make heavy use of it, we will consider it a distinct phase.<br />
Map takes an input pair &amp; 'emits' a set of intermediate key / value pairs. These intermediate pairs are then shuffled among processors by means of a user-supplied partitioning function. The reduce method then operates on the shuffled key / value pairs and returns a set of values. The user-supplied methods in each phase can only see local data; data is transmitted among compute nodes only between phases.<b> <span style="color: #38761d;"><span style="font-size: large;"> Building Simple MapReduce java Program</span></span></b><br />
<br />
MapReduce is a combination of two functions: map() and reduce().<br />
<br />
Main class for a simple MapReduce Java Application :<br />
<br />
public class Main<br />
{<br />
public static void main (String ap[])<br />
{ <br />
MyMapReduce my = new MyMapReduce();<br />
my.init ();<br />
}<br />
}<br />
<br />
It just instantiates a class called 'MyMapReduce'.<br />
<br />
<b><span style="color: #38761d;">MapReduce Program for Factorial :</span></b><br />
<br />
<br />
<span style="color: #0b5394;">import java.io.IOException;</span><br />
<span style="color: #0b5394;">import java.util.*;</span><br />
<span style="color: #0b5394;">import org.apache.hadoop.fs.Path;</span><br />
<span style="color: #0b5394;">import org.apache.hadoop.io.*;</span><br />
<span style="color: #0b5394;">import org.apache.hadoop.mapred.*;</span><br />
<span style="color: #0b5394;"><br />
</span><br />
<span style="color: #0b5394;">public static class Map extends MapReduceBase implements Mapper <LongWritable, Text, Text, Text></span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> private Text word = new Text();</span><br />
<span style="color: #0b5394;"> private final static Text location = new Text();</span><br />
<span style="color: #0b5394;"> public void map(LongWritable key, Text value, OutputCollector <Text, Text> output, Reporter reporter) throws IOException</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> String line = value.toString();</span><br />
<span style="color: #0b5394;"> StringTokenizer tokenizerLine = new StringTokenizer(Line, “\n”);</span><br />
<span style="color: #0b5394;"> Text T1 = new Text();</span><br />
<span style="color: #0b5394;"> Text t2 = new Text();</span><br />
<span style="color: #0b5394;"> int num;</span><br />
<span style="color: #0b5394;"> while (tokenizerLine.hasmoreTokens())</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> String tokenAsLine = tokenizerLine.nextToken();</span><br />
<span style="color: #0b5394;"> StringTokenizer tokenizerWord = new StringTokenizer (tokenAsLine);</span><br />
<span style="color: #0b5394;"> List s1 = new ArrayList();</span><br />
<span style="color: #0b5394;"> while (tokenizerLine.hasMoreTokens())</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> String tokenAsLine = tokenizerLine.nextToken();</span><br />
<span style="color: #0b5394;"> StringTokenizer tokenizerWord = new StringTokenizer (tokenAsList);</span><br />
<span style="color: #0b5394;"> List s1=new ArrayList();</span><br />
<span style="color: #0b5394;"> while (tokenizerWord.hasMoreTokens())</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> s1.add(tokenizerWord.nextToken());</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"> for(int i=0; i<=(s1.size()-1); i++)</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> num = Integer.parseInt((String)s1.get(i));</span><br />
<span style="color: #0b5394;"> int fact=1;</span><br />
<span style="color: #0b5394;"> for (int j=1 ; j>= num ; j++)</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> fact = fact * j;</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"> t1.set((String)s1.get(i));</span><br />
<span style="color: #0b5394;"> t2.set(“ ” + fact);</span><br />
<span style="color: #0b5394;"> output.collect(t1 , t2);</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"><br />
</span><br />
<span style="color: #0b5394;">public static class Reduce extends MapReduceBase implements Reducer <Text, Text, Text, Text></span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> public void reduce (Text key, Iterator <Text> values, outputCollector <Text, Text> output, Reporter reporter) throws IOException</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> boolean first = true;</span><br />
<span style="color: #0b5394;"> StringBuilder toReturn = new StringBuilder();</span><br />
<span style="color: #0b5394;"> while (values.hasNext())</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> if(!first)</span><br />
<span style="color: #0b5394;"> toReturn.append(“ , ”);</span><br />
<span style="color: #0b5394;"> first = false;</span><br />
<span style="color: #0b5394;"> toReturn.append(values.next().toString());</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"><br />
</span><br />
<span style="color: #0b5394;">public static void main(String ap[])</span><br />
<span style="color: #0b5394;">{</span><br />
<span style="color: #0b5394;"> JobConf conf= new JobConf (Factorial.class); </span><br />
<span style="color: #0b5394;"> conf.setJobName(“factorial”);</span><br />
<span style="color: #0b5394;"> conf.setOutputKeyClass(Text.class);</span><br />
<span style="color: #0b5394;"> conf.setMapperClass(map.class);</span><br />
<span style="color: #0b5394;"> conf.setReducerClass(TextInputFormat.class);</span><br />
<span style="color: #0b5394;"> conf.setOutputFormat(TextOutputFormat.class);</span><br />
<span style="color: #0b5394;"> FileInputFormat.setInputPaths(conf, new Path(ap[0]));</span><br />
<span style="color: #0b5394;"> FileOutputFormat.setOutputPath(conf, new Path (ap[1]));</span><br />
<span style="color: #0b5394;"> try</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> conf.set(“io.sort.mb”, “10”);</span><br />
<span style="color: #0b5394;"> JobClient.runJob(conf);</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;"> catch(IOException e)</span><br />
<span style="color: #0b5394;"> {</span><br />
<span style="color: #0b5394;"> System.err.println(e.getMessage());</span><br />
<span style="color: #0b5394;"> }</span><br />
<span style="color: #0b5394;">}</span>Pranay Malekarhttp://www.blogger.com/profile/14655422736467585126noreply@blogger.com0tag:blogger.com,1999:blog-149678830813188233.post-12669233175231559842010-11-02T05:36:00.000-07:002010-11-02T05:36:43.789-07:00Building HADOOP CLUSTER [ Using 2 Linux Machines ]<strong><span style="font-size: large;"><span style="color: #38761d;">Building HADOOP CLUSTER [Using 2 Linux Machines]</span></span></strong><br />
<br />
<strong><span style="color: #38761d;">STEP 1)</span></strong> Install Java 6 or above on Linux machine ( jdk1.6.0.12 )<br />
I am having 'jdk-6u12-linux-i586.bin' on my REDHAT machine.<br />
To Install follow commands :<br />
<strong># chmod 744 jdk-6u12-linux-i586.bin<br />
# ./ jdk-6u12-linux-i586.bin</strong><br />
<br />
<strong><span style="color: #38761d;">STEP 2) </span></strong>Download 'jce-policy-6.zip'<br />
extract it.<br />
<strong># cp -f jce/*.jar $JAVA_HOME/jre/lib/seciruty/<br />
# chmod 444 $JAVA_HOME/jre/lib/seciruty/*.jar</strong><br />
<br />
<strong><span style="color: #38761d;">STEP 3) </span></strong>Download hadoop-0.20.0.tar.gz or any latest version<br />
extract it and copy <strong>' hadoop-0.20.0'</strong> folder to '<strong>/usr/local/</strong>' directory.<br />
<br />
<strong><span style="color: #38761d;">STEP 4)</span></strong> Set JAVA PATH<br />
<strong># export JAVA_HOME=/java_installation_folder/jdk1.6.0_12</strong><br />
<strong><span style="color: #38761d;">STEP 5)</span></strong> Set HADOOP PATH<br />
<strong># export HADOOP_HOME=/usr/local/hadoop-0.20.2<br />
# export PATH=$PATH:$HADOOP_HOME/bin</strong><br />
Install the same on the second Linux machine.<br />
The two machines are then described as follows :<br />
<br />
<strong>Server IP HostName Role</strong><br />
<br />
1) 192.168.100.19 hostmaster Master [ NameNode and JobTracker ]<br />
2) 192.168.100.17 hostslave Slave [ Datanode and TaskTracker] <br />
<br />
<br />
<strong><span style="color: #38761d;">STEP 6)</span></strong> <strong>Now do following settings on Master :</strong><br />
<br />
<strong># vim /etc/hosts</strong><br />
make the following changes :<br />
comment out all existing entries and write at the end<br />
192.168.100.19 hostmaster<br />
save and exit<br />
<br />
<strong>Changes to be made on Slave Machine :</strong><br />
<br />
<strong># vim /etc/hosts</strong><br />
make the following changes :<br />
comment out all existing entries and write at the end<br />
192.168.100.17 hostslave<br />
192.168.100.19 hostmaster<br />
save and exit<br />
<br />
<strong><span style="color: #38761d;">STEP 7) </span></strong><strong>For Communication setup SSH :</strong><br />
<br />
Do the steps on master as well as on slave-<br />
<strong># ssh-keygen -t rsa</strong><br />
This generates the RSA public &amp; private keys.<br />
This is needed because the Hadoop master node communicates with the slave node using SSH.<br />
It will generate an 'id_rsa.pub' file under the '/root/.ssh' directory. Now rename the master's id_rsa.pub to '19_rsa.pub' and copy it to the slave node (at the same path).<br />
Then execute the following command to add the master's public key to the slave's authorized keys.<br />
<br />
<strong># cat /root/.ssh/19_rsa.pub >> /root/.ssh/authorized_keys</strong><br />
<br />
Now try to SSH to the slave node. It should connect without needing any password.<br />
<br />
<strong># ssh 192.168.100.17</strong><br />
<br />
<strong><span style="color: #38761d;">STEP 8)</span></strong> <strong>Setting up MASTER NODE :</strong><br />
Set up Hadoop to work in fully distributed mode by editing the configuration files under the <strong>$HADOOP_HOME/conf/ </strong>directory.<br />
<br />
<strong><span style="color: #38761d;">Configuration Property :</span></strong><br />
<strong>Property Explanation</strong><br />
1) fs.default.name NameNode URI<br />
2) mapred.job.tracker JobTracker URI<br />
3) dfs.replication Replication factor<br />
4) hadoop.tmp.dir (optional) Temp Directory<br />
<br />
<strong>Let us Start with Configuration files :</strong><br />
<br />
1) <strong>$HADOOP_HOME/conf/hadoop-env.sh</strong><br />
make change as...<br />
export JAVA_HOME=/java_installation_folder/jdk1.6.0_12<br />
<br />
2) <strong>$HADOOP_HOME/conf/core-site.xml</strong><br />
<br />
<configuration><br />
<property><br />
<name>fs.default.name</name><br />
<value>hdfs://hostmaster:9000</value><br />
</property><br />
</configuration><br />
<br />
3)<strong> $HADOOP_HOME/conf/hdfs-site.xml</strong><br />
<br />
<configuration><br />
<property><br />
<name>dfs.replication</name><br />
<value>1</value><br />
</property><br />
</configuration><br />
<br />
4)<strong> $HADOOP_HOME/conf/mapred-site.xml</strong><br />
<br />
<configuration><br />
<property><br />
<name>mapred.job.tracker</name><br />
<value>hostmaster:9001</value><br />
</property><br />
</configuration><br />
<br />
5) <strong>$HADOOP_HOME/conf/masters</strong><br />
192.168.100.19<br />
<br />
6) <strong>$HADOOP_HOME/conf/slaves</strong><br />
192.168.100.17<br />
<br />
<strong><span style="color: #38761d;">Now copy all these files to /conf directory of SLAVE Machine.</span></strong><br />
<br />
<strong><span style="color: #38761d;">STEP 9) </span></strong><strong>Setup Master and Slave Node : (run on both machines)</strong><br />
<br />
<strong># hadoop namenode -format<br />
# start-all.sh</strong> <br />
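<br />
To confirm that the daemons are up, run jps on each machine (a quick sanity check; the exact process lists depend on the setup) :<br />
<strong># jps</strong><br />
On the master you should see NameNode, SecondaryNameNode &amp; JobTracker; on the slave, DataNode &amp; TaskTracker.<br />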
<br />
Now your cluster is ready to run jobs.<div align="center"><b><span style="font-size: large;"><span style="color: #38761d;">Hadoop</span></span></b></div><br />
Hadoop is an open source Java platform. It lets one easily write and run distributed applications on large computer clusters to process vast amounts of data. It implements HDFS [Hadoop Distributed File System].<br />
<br />
<b><span style="color: #38761d;">HDFS:</span></b><br />
<br />
HDFS is a distributed filesystem that runs on large clusters of commodity machines. HDFS divides applications into many blocks of work and creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster, so that MapReduce can process the data where it is located. HDFS has a master / slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace &amp; regulates access to files by clients. In addition there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace &amp; allows user data to be stored in files. Internally a file is split into one or more blocks &amp; these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing &amp; renaming files &amp; directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read &amp; write requests from the file system's clients. The DataNodes also perform block creation, deletion &amp; replication upon instruction from the NameNode.<br />
<b><br />
</b><br />
<b><span style="color: #38761d;">MapReduce:</span></b><br />
<b> </b> <br />
MapReduce means breaking a problem into independent pieces to be worked on in parallel. MapReduce is an abstraction that allowed Google's engineers to perform simple computations while hiding the details of parallelisation, data distribution, load balancing &amp; fault tolerance.<br />
<br />
<b><span style="color: #38761d;">MapReduce as a Programming Model:</span></b><br />
<br />
<span style="color: #38761d;">MAP:</span> it is written by a user of the MapReduce library, takes on input pair & produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.<br />
<br />
<span style="color: #38761d;">REDUCE:</span> it is also written by the user, it accepts intermediate key & a set of values for that key. It merges together these values to form a possibly smaller set of values.<br />
The MapReduce framework consists of a single master Jobtracker and on slave Tasktracker per cluster node. The master is responsible for scheduling the job's component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.<br />
The Hadoop job client then submits the job (jar / executables) & configuraions to the Jobtracker which then assumes the responsibility of distributing the software / configuration to the slaves, scheduling tasks & monitoring them, providing status & diagnostic information to the job client.<br />
<br />
<b><span style="color: #38761d;">PIG :</span></b> A dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.<br />
<br />
<b><span style="color: #38761d;">HBASE : </span></b>A distributed column oriented database. Hbase uses HDFS for its underlying storage and supports batch-style computations using MapReduce & point queries.<br />
<br />
<b><span style="color: #38761d;">Zookeepers :</span></b> A distributed, Highly available coordination service. Zookeeper provides primitives such as distributed locks that can be used for building distributed applications.<br />
<br />
<b><span style="color: #38761d;">Hive :</span></b> A distributed data warehouse. Hive manages data stored in HDFS & provides a query language based on SQL for querying the data.<br />
<br />
<b><span style="color: #38761d;">Chukwa : </span></b>A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS and it uses MapReduce to produce reports.<br />
<br />
<br />
<b><span style="color: #38761d;">Hadoop Default Ports :</span></b><br />
<br />
1) HDFS :<br />
NameNode : 50070 (dfs.http.address)<br />
DataNode : 50075 (dfs.datanode.http.address)<br />
Secondary NameNode : 50090 (dfs.secondary.http.address)<br />
Backup / Checkpoint node : 50105<br />
2) MapReduce :<br />
JobTracker : 50030 (mapred.job.tracker.http.address)<br />
TaskTracker : 50060 (mapred.task.tracker.http.address)<br />
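<br />
These ports serve the daemons' built-in status web pages, so once the cluster is up you can check them from a browser, e.g. <b>http://localhost:50070/</b> for the NameNode &amp; <b>http://localhost:50030/</b> for the JobTracker (substitute your master's hostname on a multi-node cluster).<br />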
<br />
<div align="left"><br />
<br />
<b><span style="font-size: large;"><span style="color: #38761d;">Hadoop Installation On Linux:</span></span></b><br />
<br />
<b><span style="color: #38761d;">Installing Java:</span></b><br />
<br />
Before installing Hadoop on Linux, we need to install Java 6 or above.<br />
It is available at http://java.sun.com/products/archive/j2se/6u12<br />
I am using <b>jdk-6u12-linux-i586.bin</b>.<br />
You can choose the 32-bit or 64-bit version as per your configuration and OS.<br />
I am currently using Red Hat Linux 5 (32-bit).<br />
<br />
Download the Java package, then make it executable and run it, i.e.<br />
<b># chmod 744 jdk-6u12-linux-i586.bin<br />
# ./jdk-6u12-linux-i586.bin</b><br />
<br />
After that, set the $JAVA_HOME path, i.e.<br />
<b># export JAVA_HOME=<build tool directory>/jdk1.6.0_12</b><br />
<br />
We also need some necessary files :<br />
download jce_policy-6.zip and<br />
extract it using...<br />
<b>#unzip jce_policy-6.zip<br />
# cp -f jce/*.jar $JAVA_HOME/jre/lib/security/</b><br />
<br />
<b># chmod 744 $JAVA_HOME/jre/lib/security/*.jar</b><br />
<br />
Installing Hadoop:<br />
<br />
Download the hadoop-0.20.2.tar.gz file from<br />
<b>http://www.higherpass.com/linux/</b><br />
<br />
or give the command :<br />
<b># wget http://www.gtlib.gatech.edu/pub/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz</b><br />
<br />
After downloading the tar file, we need to extract it :<br />
<br />
<b># tar -xvzf hadoop-0.20.2.tar.gz</b><br />
<br />
Then copy the Hadoop installation folder to the '/usr/local' location :<br />
<b># cp -r hadoop-0.20.2/ /usr/local</b><br />
<br />
<b><span style="color: #38761d;">Hadoop Setup :</span></b></div><br />
Set the HADOOP_HOME environment variable to the install directory &amp; append $HADOOP_HOME/bin to the PATH environment variable :<br />
<br />
<b># export HADOOP_HOME=/usr/local/hadoop-0.20.2<br />
# export PATH=$PATH:$HADOOP_HOME/bin</b><br />
<br />
Now configure JAVA_HOME path at the '<b>/usr/local/hadoop-0.20.2/conf/hadoop-env.sh</b>' file.<br />
<br />
<b><span style="color: #38761d;">Hadoop Pseudo-Distributed Cluster :</span></b><br />
<br />
1) edit <b>/conf/core-site.xml</b> and make changes as...<br />
<br />
<configuration><br />
<property><br />
<name>fs.default.name</name><br />
<value>hdfs://localhost:9000</value><br />
</property><br />
</configuration><br />
<br />
2) edit <b>/conf/hdfs-site.xml</b> and make changes as...<br />
<br />
<configuration><br />
<property><br />
<name>dfs.replication</name><br />
<value>1</value><br />
</property><br />
</configuration><br />
<br />
3) edit <b>/conf/mapred-site.xml</b> and make changes as...<br />
<br />
<configuration><br />
<property><br />
<name>mapred.job.tracker</name><br />
<value>localhost:9001</value><br />
</property><br />
</configuration><br />
<br />
After that, set up passwordless SSH with keys. Hadoop communicates over SSH, so we need to set up SSH keys :<br />
<br />
<b># ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa<br />
<br />
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys</b><br />
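<br />
You can verify the passwordless setup before continuing; this should log in without prompting for a password :<br />
<b># ssh localhost</b><br />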
<br />
<br />
Format the HDFS file system to prepare it for use. This creates the needed directory structures for the HDFS filesystem.<br />
<br />
<b># hadoop namenode -format</b><br />
<br />
Now that everything is configured &amp; set up, start the daemons with the command :<br />
<br />
<b># start-all.sh</b> <br />
<br />
<br />
<span style="font-size: large;"><b><span style="color: #38761d;">Running A Job on Cluster Mode:</span></b></span><br />
<br />
1) Create a simple file, say “test.txt” :<br />
<br />
<b># vim test.txt</b><br />
hello this is a sample file for hadoop<br />
hello this file contains simple data.<br />
<br />
Save this file and exit<br />
<br />
2) Insert this file into HDFS :<br />
<b># hadoop dfs -put test.txt /</b><br />
<br />
The wordcount program is present at '<b>/usr/local/hadoop-0.20.2/src/examples/org/apache/hadoop/examples/WordCount.java</b>'<br />
<br />
The Program <b>WordCount.java</b> is:<br />
<br />
<span style="color: #0b5394;">package org.apache.hadoop.examples;<br />
<br />
import java.io.IOException;<br />
import java.util.StringTokenizer;<br />
<br />
import org.apache.hadoop.conf.Configuration;<br />
import org.apache.hadoop.fs.Path;<br />
import org.apache.hadoop.io.IntWritable;<br />
import org.apache.hadoop.io.Text;<br />
import org.apache.hadoop.mapreduce.Job;<br />
import org.apache.hadoop.mapreduce.Mapper;<br />
import org.apache.hadoop.mapreduce.Reducer;<br />
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;<br />
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;<br />
import org.apache.hadoop.util.GenericOptionsParser;<br />
<br />
public class WordCount {<br />
<br />
public static class TokenizerMapper <br />
extends Mapper<Object, Text, Text, IntWritable>{</span><br />
<span style="color: #0b5394;"><br />
private final static IntWritable one = new IntWritable(1);<br />
private Text word = new Text();<br />
<br />
public void map(Object key, Text value, Context context<br />
) throws IOException, InterruptedException {<br />
StringTokenizer itr = new StringTokenizer(value.toString());<br />
while (itr.hasMoreTokens()) {<br />
word.set(itr.nextToken());<br />
context.write(word, one);<br />
}<br />
}<br />
}<br />
<br />
public static class IntSumReducer <br />
extends Reducer<Text,IntWritable,Text,IntWritable> {<br />
private IntWritable result = new IntWritable();<br />
<br />
public void reduce(Text key, Iterable<IntWritable> values, <br />
Context context<br />
) throws IOException, InterruptedException {<br />
int sum = 0;<br />
for (IntWritable val : values) {<br />
sum += val.get();<br />
}<br />
result.set(sum);<br />
context.write(key, result);<br />
}<br />
}</span><br />
<br />
<span style="color: #0b5394;"><br />
public static void main(String[] args) throws Exception {<br />
Configuration conf = new Configuration();<br />
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();<br />
if (otherArgs.length != 2) {<br />
System.err.println("Usage: wordcount <in> <out>");<br />
System.exit(2);<br />
}<br />
Job job = new Job(conf, "word count");<br />
job.setJarByClass(WordCount.class);<br />
job.setMapperClass(TokenizerMapper.class);<br />
job.setCombinerClass(IntSumReducer.class);<br />
job.setReducerClass(IntSumReducer.class);<br />
job.setOutputKeyClass(Text.class);<br />
job.setOutputValueClass(IntWritable.class);<br />
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));<br />
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));<br />
System.exit(job.waitForCompletion(true) ? 0 : 1);<br />
}<br />
}</span><br />
<br />
<br />
And the jar file is present at '/usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar'<br />
<br />
<b>To run the program, give the command :</b><br />
<br />
#<b> hadoop jar hadoop-0.20.2-examples.jar wordcount /test.txt /out</b><br />
<br />
You will get the following output on the screen :<br />
<br />
<span style="color: #cc0000;">10/10/27 10:26:16 INFO input.FileInputFormat: Total input paths to process : 1<br />
10/10/27 10:26:16 INFO mapred.JobClient: Running job: job_201010270916_0005<br />
10/10/27 10:26:17 INFO mapred.JobClient: map 0% reduce 0%<br />
10/10/27 10:26:27 INFO mapred.JobClient: map 100% reduce 0%<br />
10/10/27 10:26:39 INFO mapred.JobClient: map 100% reduce 100%<br />
10/10/27 10:26:41 INFO mapred.JobClient: Job complete: job_201010270916_0005<br />
10/10/27 10:26:41 INFO mapred.JobClient: Counters: 17<br />
10/10/27 10:26:41 INFO mapred.JobClient: Job Counters <br />
10/10/27 10:26:41 INFO mapred.JobClient: Launched reduce tasks=1<br />
10/10/27 10:26:41 INFO mapred.JobClient: Launched map tasks=1<br />
10/10/27 10:26:41 INFO mapred.JobClient: Data-local map tasks=1<br />
10/10/27 10:26:41 INFO mapred.JobClient: FileSystemCounters<br />
10/10/27 10:26:41 INFO mapred.JobClient: FILE_BYTES_READ=50<br />
10/10/27 10:26:41 INFO mapred.JobClient: HDFS_BYTES_READ=20<br />
10/10/27 10:26:41 INFO mapred.JobClient: FILE_BYTES_WRITTEN=132<br />
10/10/27 10:26:41 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=28<br />
10/10/27 10:26:41 INFO mapred.JobClient: Map-Reduce Framework<br />
10/10/27 10:26:41 INFO mapred.JobClient: Reduce input groups=4<br />
10/10/27 10:26:41 INFO mapred.JobClient: Combine output records=4<br />
10/10/27 10:26:41 INFO mapred.JobClient: Map input records=1<br />
10/10/27 10:26:41 INFO mapred.JobClient: Reduce shuffle bytes=50<br />
10/10/27 10:26:41 INFO mapred.JobClient: Reduce output records=4<br />
10/10/27 10:26:41 INFO mapred.JobClient: Spilled Records=8<br />
10/10/27 10:26:41 INFO mapred.JobClient: Map output bytes=36<br />
10/10/27 10:26:41 INFO mapred.JobClient: Combine input records=4<br />
10/10/27 10:26:41 INFO mapred.JobClient: Map output records=4<br />
10/10/27 10:26:41 INFO mapred.JobClient: Reduce input records=4</span><br />
<br />
<br />
The output will be stored in the /out/part-r-00000 file, provided the output folder did not exist before running the program; otherwise you will get an error.<br />
<br />
To remove a file from HDFS, give the command :<br />
<b># hadoop dfs -rm /test.txt</b><br />
<br />
To remove the contents of an entire directory recursively, give the command :<br />
<b># hadoop dfs -rmr /</b><br />
<br />
To check the output, give the command :<br />
<b># hadoop dfs -cat /out/part-00000</b> or <b># hadoop dfs -cat /out/part-r-00000</b><br />
<br />
You get the following data stored in it :<br />
<span style="color: #cc0000;">a 1<br />
contains 1<br />
data. 1<br />
file 2<br />
for 1<br />
hadoop 1<br />
hello 2<br />
is 1<br />
sample 1<br />
simple 1<br />
this 2</span><br />
<br />
<br />
To list the contents of the HDFS file system, give the command :<br />
<br />
<b># hadoop dfs -ls</b><br />
<br />
<b><span style="color: #38761d;">Explaination of WordCount.java :</span></b><br />
<br />
The program needs 3 classes to run: a Mapper, a Reducer and a Driver. The Driver tells Hadoop how to run the MapReduce process. The Mapper and Reducer operate on data.<br />
<br />
1) <b><span style="color: #38761d;">Mapping List :</span></b> The first phase of MapReduce program is 'Mapping'. A list of data elements are provided, one at a time to a function called, ' Mapper', which transforms each element individually to an output data element.<br />
<br />
2)<b><span style="color: #38761d;"> Reducing List :</span></b> Reducing here means aggregating value together and returns a single output.<br />
<br />
3) <b><span style="color: #38761d;">Keys and Values :</span></b> the mapping & reducing functions receive not just values, but (key, value) pairs. The default input format used by Hadoop presents each line of an Input file as a separate input to the Mapper function and not the entire file at a time.<br />
<br />
4) <b><span style="color: #38761d;">StringTokenizer object :</span></b> used to break up the line into words.<br />
<br />
5) <b><span style="color: #38761d;">Output.collect() : </span></b>this method will copy the values it receives as input, so we are free to overwrite variables we use.<br />
<br />
6) <b><span style="color: #38761d;">Driver Method : </span></b>there is one final component of a Hadoop MapReduce program called, ' Driver'. The Driver initiates the job & instructs the Hadoop platform to execute your code on a set of input files & controls where the output files are placed.<br />
<br />
7) <b><span style="color: #38761d;">InputPath Argument :</span></b> given input directory.<br />
<br />
8) <b><span style="color: #38761d;">OutputPath Argument :</span></b> Directory in which output from reducers are written into files.<br />
<br />
9)<b><span style="color: #38761d;"> JobClient object :</span></b> it captures the configuration information to run job. Mapping and Reducing functions are identified by setMapperClass() & setreducerClass() methods.<br />
Data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). If this is not the case, the methods of the JobConf class will override these. The input types fed to the Mapper are controlled by the InputFormat used. The default Input Format, “TextInputFormat” will load data in as (LongWritable, Text) pairs. The Long value is byte offset of the line in file. The text object holds the string contents of the line of the file.<br />
The text object holds the string contents of the line of the file. The call to JobClient.runJob(conf) will submit the job to Mapreduce. This will block until the job completes. If the job fails, it will throw an IOException. JobClient also provides a non-blocking version called, 'submitJob()'.<br />
<br />
10)<b><span style="color: #38761d;"> InputFormat :</span></b> it is a class that provides following functionality:-<br />
a) Selects the files or other objects that should be used for input.<br />
b) Defines the InputSplits that break a file into tasks.<br />
c) Provides a factory for RecordReader Object that read the file.<br />
e.g. FileInputFormat : it is provided with a path containing files to read. FileInputFormat will read all the files in this directory and divide each of them into one or more InputSplits. We can choose which InputFormat to apply to our input files for a job by calling the setInputFormat() method of the JobConf object that defines the job.<br />
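<br />
For instance, a driver using the old-style mapred API (as in the factorial example earlier) could switch away from the default like this; the snippet is illustrative only :<br />
<span style="color: #0b5394;">JobConf conf = new JobConf(WordCount.class);<br />
conf.setInputFormat(KeyValueTextInputFormat.class);   // instead of the default TextInputFormat</span><br />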
<br />
11)<b><span style="color: #38761d;"> Input Split : </span></b>An InputSplit describes a unit of work that comprises a single map task in a MapReduce program.<br />
<br />
12)<b><span style="color: #38761d;"> RecordReader :</span></b> the InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source & converts it into (key,value) pairs, suitable for reading by the Mapper. The Recordreader instance is defined by InputFormat. The default InputFormat is TextInputFormat, provides a LineRecordReader, which treats each line of the Input file as a new value. The RecordReader is invoke repeatedly on the input until the entire InputSplit has been consumed. Each invocation of the RecordReader leads to another call to to the map() method of the Mapper.<br />
<br />
13) <b><span style="color: #38761d;">Mapper : </span></b>Given a key and value the map() method emits (key,value) pairs which are forwarded to the Reducers. The map() method receives 2 parameters in addition to the key & the value.<br />
a) the OutputCollector object has a method named collect() which will forward a (key,value) pair to the reduce phase of the job.<br />
b) the Reporter object provides information about the current task; its getInputSplit() method will return an object describing the current InputSplit. The setStatus() method allows you to emit a status message back to the user. The incrCounter() method allows you to increment shared performance counters. Each Mapper can increment the counters & JobTracker will collect the increments made by different processes & aggregate them for later retrieval when the job ends.<br />
<br />
14) <b><span style="color: #38761d;">Partition and Shuffle :</span></b> the process of moving map outputs to the reducers is known as, 'Shuffling'. The Partitioner class determines which partition a given (key, value) pair will go to.<br />
<br />
15)<b><span style="color: #38761d;"> Sort :</span></b> each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted to the Reducer.<br />
16) Reducer : a Reducer instance is created for each reduce task. For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. The Reducer also receives as parameters 'OutputCollector' and Reporter objects; they are used in the same manner as in the map() method.<br />
<br />
17) <b><span style="color: #38761d;">OutputFormat : </span></b>the (key,value) pairs provided to this OutputCollector are then written to output files. The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS. The output directory is set by the FileOutputFormat.setOutputPath() method. We can control which particular OutputFormat is used by calling the setOutputFormat() method of the JobConf object that defines your MapReduce job.<br />
<br />
18)<b><span style="color: #38761d;"> RecordWriter : </span></b>these are used to write the individual records to the files as directed by OutputFormat.<br />
<br />
19) <b><span style="color: #38761d;">Combiner :</span></b> it runs after Mapper & before Reducer. Its usage is optional.<br />
<br />
<strong> <span style="color: #38761d;"> WordCount.java Program Description</span></strong><br />
<br />
<strong><span style="color: #38761d;">a) Class TokenizerMapper :</span></strong> this class is having methods -<br />
I) map (object key, Text value, Mapper.Content content)<br />
It is called once per each key / value pair in the input split.<br />
II) StringTokenizer class : it is used to split the string<br />
<br />
<strong><span style="color: #38761d;">b) Class IntSumReducer : </span></strong>it is having method -<br />
reduce (Key key, Iterable <IntWritable> values, Reducer.Content content)<br />
This method is called once for each key. Most applications will define their reduce class by overriding this method.<br />
<br />
<strong><span style="color: #38761d;">c) Class GenericOptionParser : </span></strong>It is a utility to parse command line arguments generic to the Hadoop framework. This class recognizes several standard command line arguments, enabling applications to specify a namenode, a jobtracker, additional configuration resources etc.<br />
<br />
<strong><span style="color: #38761d;">d) GenericOptionsParser (Configuration conf, String [] args) :</span></strong> this create a GenericOptionsParser to parse only the generic Hadoop arguments.<br />
Method : getPemainingArgs () : It returns an array of strings containing only application specific arguments.<br />
<br />
<strong><span style="color: #38761d;">e) Job Class :</span></strong><br />
<span style="color: #38761d;"><strong>Methods :</strong> </span><strong> </strong><br />
<strong>I) FileInputFormat.addInputPath (Job job, Path path) :</strong> Add a path to the list of inputs for the map-reduce job.<br />
<strong>II) FileOutputFormat.setOutputPath ( Job job, Path outputDir) :</strong> set the path of the output directory for the map-reduce job. <br />
<strong>III) setMapperClass () : </strong>sets the applications mapper class.<br />
<strong>IV) setCombinerClass() : </strong>set the combiner class for the job.<br />
<strong>V) setJarByClass() : </strong>set the Jars by finding where a given class came from.<br />
<strong>VI) setReducerClass() :</strong> set the Reducer for the job.<br />
<strong>VII) setOutputKeyClass () : </strong>set the key class for the job output data.<br />
<strong>VIII) setOutputValueClass() :</strong> set the value class for the job outputs.<br />
<strong>IX) waitForCompletion() : </strong>Submit the job to the cluster and wait for it to finish.<br />
<br />
<strong><span style="color: #38761d;"> Simple Map Reduce Program</span></strong><br />
<br />
map (String key, String value) :<br />
// key : document name<br />
// value : document contents<br />
for each word x in value :<br />
EmitIntermediate (x, “1”);<br />
<br />
reduce (String key, Iterator values) :<br />
// key : a word<br />
// values : a list of counts<br />
int result = 0; <br />
for each v in values :<br />
result += ParseInt (v);<br />
Emit (AsString (result));<br />
<br />
i.e. the map() emits each word plus an associated count of occurrences.<br />
The reduce() sums together all the counts emitted for a particular word.<br />
In addition, we need to write code to fill a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the mapreduce function, passing it the specification object. The user's code is linked together with the MapReduce library.<br />
<br />
<strong><span style="color: #38761d;">Inverted Index : </span></strong>The map function parses each document and emits a sequence of (word, document Id) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document Ids and emits a (word, list (document Id)) pair. The set of all output pairs forms a single inverted index.<br />
<br />
<br />
<br />
<br />
<b><span style="color: #38761d;"> Designing HADOOP CLUSTER</span></b><br />
<br />
It needs 3 Linux machines to make a cluster. (You can even make it on 2 machines, but 3 are better.)<br />
The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster must have the same HADOOP_HOME path. Export HADOOP_HOME in the login script <b>/etc/bashrc</b> to make it persistent.<br />
Hadoop configuration is driven by 2 configuration files in the HADOOP_HOME/conf directory. The default configuration settings appear in the read-only '<b>hadoop-default.xml</b>' file. Node-specific configuration settings appear in '<b>hadoop-site.xml</b>'.<br />
Another important file is <b>conf/slaves</b>. On the JobTracker this file lists all the hosts on which the TaskTracker daemon has to be started. On the NameNode it lists all the hosts on which the DataNode daemon has to be started. You must maintain this file manually, even if you are scaling up to a large number of nodes.<br />
<br />
Finally, <b>conf/hadoop-env.sh</b> contains configuration options such as JAVA_HOME, the location of logs &amp; the directory where process IDs are stored. In this design the NameNode &amp; JobTracker run on 2 separate nodes, &amp; the DataNode &amp; TaskTracker on a third node. [ In case of 2 machines you can install the NameNode &amp; JobTracker on one machine and the DataNode &amp; TaskTracker on the second machine. ]<br />
The conf/slaves file on the first 2 nodes contains the IP address of the third machine, and all the daemons use the same conf/hadoop-site.xml file.<br />
<br />
<b>Setup PassPhrase SSH :</b><br />
<br />
If you want to use hostnames to refer to the nodes, you must also edit the /etc/hosts file on each node to reflect the proper mapping between the hostnames & Ip addresses.<br />
<br />
<b><span style="color: #38761d;">Hadoop Startup :</span></b><br />
To start cluster you need to start both the HDFS & MapReduce.<br />
<br />
1) First on NameNode :<br />
Navigate to HADOOP_HOME<br />
2) Format a new Distributed filesystem using,<br />
<b>$ hadoop namenode -format</b><br />
3) Start HDFS by running the following command on the NameNode : <br />
<b>$ start-dfs.sh</b><br />
This script also consults the conf/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.<br />
4) Start MapReduce with following command on the designated JobTracker.<br />
<b>$ start-mapred.sh</b><br />
This script also consults the conf/slaves file on the JobTracker &amp; starts the TaskTracker daemon on all the listed slaves.<br />
5) To cross-check whether the cluster is running properly, you can look at the processes running on each node using jps. On the NameNode you should see the Jps and NameNode processes and, if you have only a 3-node cluster, SecondaryNameNode. <br />
6) On JobTracker check for Jps and JobTracker.<br />
7) On TaskTracker / DataNode you should see Jps, DataNode and TaskTracker.<br />
<br />
<b><span style="color: #38761d;">Running MapReduce Jobs:</span></b><br />
<b><span style="color: #38761d;"> </span></b><br />
Once you have the Hadoop cluster running, you can see it in action by executing one of the example MapReduce Java class files bundled in hadoop-0.20.2-examples.jar. As an example, take Grep, which extracts matching strings from text files and counts how many times they occur. To begin, create an input set for Grep. In this case the input will be the set of files in the /conf directory.<br />
<br />
<b># hadoop dfs -copyFromLocal conf input<br />
# hadoop dfs -ls<br />
# hadoop jar hadoop-0.20.2-examples.jar grep input grep-output 'dfs[a-z]+'<br />
# hadoop dfs -get grep-output output<br />
# cd output<br />
# ls</b><br />
<br />
<span style="font-size: large;"><b><span style="color: #38761d;"> Some Important Hadoop Commands</span></b></span><b><br />
<br />
# start-all.sh<br />
</b>this command starts all the daemons (JobTracker, NameNode, TaskTracker, DataNode, etc.) on the Hadoop cluster.<b><br />
<br />
# hadoop dfs -put sample.txt input<br />
</b>this command inserts the 'sample.txt' file into HDFS (the /user/root/input folder).<b><br />
<br />
# hadoop dfs -rm input/sample.txt</b>Used to delete sample.txt from HDFS.<b><br />
</b>To remove all files in input directory recursively give command :<b><br />
# hadoop dfs -rmr input<br />
</b>To run java program on Hadoop give command :<b><br />
# hadoop jar hadoop-0.20.2-examples.jar wordcount input/sample.txt /out<br />
</b>The output will be stored in the <b>'/user/root/out/part-00000'</b> file or the <b>'/user/root/out/part-r-00000'</b> file.<b><br />
</b><br />
<b><span style="color: #38761d;"><span style="font-size: large;"><br />
</span></span></b>Pranay Malekarhttp://www.blogger.com/profile/14655422736467585126noreply@blogger.com0