Thursday, April 14, 2011

PIG LATIN


                             Pig Latin

A Pig Latin is made of a series of operations, or transformations, that are applied to the input data to produce output.
                          Under the cover pig turns the transformations into a series of MapReduce jobs, but as a programmer you are mostly unaware of this, which allows you to focus on the data rather than the nature of the execution.

Pig runs in 2 modes :
1) Local Mode
2) Hadoop Mode

1) Local Mode : In local mode Pig runs in a single JVM & accesses the local file system. This mode is suitable only for small datasets & when trying out Pig. Local mode doesn't use Hadoop. Also it doesn't use Hadoop's local job runner, instead Pig translates queries into a physical plan that it executes itself. The execution type is set using the -x or -exectype option. To run in local mode, set the option to local:
$ pig -x local

2) Hadoop Mode : In Hadoop mode, Pig translates queries into MapReduce jobs & runs them on a Hadoop cluster. To use Hadoop mode you need to tell Pig which vesion of Hadoop you are using & where your cluster is running.
The Environment variable PIG_HADOOP_VERSION is used to tell Pig the version of Hadoop it is connecting to.
$ export PIG_HADOOP_VERSION = 20

                      Next we need to point Pig at the cluster namenode & jobtracker. If you already have Hadoop site file that define fs.default.name & mapred.jobtracker you can simply add Hadoop's configuration directory to Pig's classpath :
$ export PIG_CLASSPATH = $HADOOP_INSTALL/conf/
 
                         Alternatively ou can create a pig.properties file in Pig's “conf” directory, which sets these two properties. Here is an example for a pseudo-distributed setup :
fs.default.name=hdfs://localhost/
mapred.jobtracker= localhost:8021

 
                    once you have configured Pig to connect to a Hadoop cluster, you can launch Pig, setting the -x option to MapReduce or omitting it entirely, as Hadoop mode is the default:

                                 /bin/pigscr file
#!/bin/sh
PIG_PATH = $HADOOP_HOME/bin/pig-0.7.0
PIG_CLASSPATH = $PIG_PATH/pig-0.3.0-core.jar:$HADOOP_HOME/conf \ PIG_HADOOP_VERSION = 0.20.2 \ $PIG_PATH/bin/pig $@

Tuesday, April 12, 2011

Installing PIG


                                   Installing PIG

To install Pig on Linux we Need to install following Packages :

 
1) Install Hadoop 0.20.2 ( or Later)
2) Java 1.6 or Later ( Set JAVA_HOME )
3) Ant 1.7 ( optional for builds )
4) Junit 4.5 ( optional for Unit tests)


Download PIG from one of the apache download mirror
http://pig.apache.org/releases.html

Unpack the downloaded PIG distribution. The pig script is located in the bin directory.
Add “/pig-n.n.n/bin” to your path. Use export (bash,sh,ksh) or
setenv (tcsh,csh)

export PATH=/usr/local/Hadoop-0.20.2/bin/pig-0.7.0/bin:$PATH

TRY this also #pig -help
#pig                              ...(to start grunt)

                                   Writing Scripts

Copy “/etc/passwd” file to /root write script “id.pig” as follows :
# vim /root/id.pig

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
STORE B into '$out';


save this file and exit

copy 'passwd' & 'id.pig' files in a directory suppose /root/inpig
then,
# Hadoop dfs -put /root/inpig                      (to insert your data in HDFS)
grunt > run -param out=myoutput id.pig          ... to run the script

Now output will b saved at '/user/root/myoutput/part-m-00000' file.

                    Pig sample commands and their results
here is sample data '/data/one' file contains :
a A 1
b B 2
c C 3
a AA 11
a AAA 111
b BB 22

And '/data/two' file contains :
x X a
y Y b
x XX b
z Z c

So the sample script is
# vim test1.pig
one = load 'data/one' using PigStorage();
two = load 'data/two' using PigStorage();

generated = FOREACH one GENERATE $0, $2;


save & exit

RESULT :
(a, 1)
(b, 2)
(c, 3)
(a, 11)
(a, 111)
(b, 22)

Other Commands and their Results are as follows :

grouped = GROUP one BY $0;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)})
(b, {(b, B, 2), (b, BB, 22)})
(c, {(c, C, 3)})

grouped2 = GROUP one BY ($0, $1);
((a, A), {(a, A, 1)})
((a, AA), {(a, AA, 11)})
((a, AAA), {(a, AAA, 111)})
((b, B), {(b, B, 2)})
((b, BB), {(b, BB, 22)})
((c, C), {(c, C, 3)})

summed = FOREACH grouped GENERATE group, SUM(one.$2);
(a, 123.0)
(b, 24.0)
(c, 3.0)

counted = FOREACH grouped GENERATE group, COUNT(one);
(a, 3)
(b, 2)
(c, 1)

flat = FOREACH grouped GENERATE FLATTEN(one);
(a, A, 1)
(a, AA, 11)
(a, AAA, 111)
(b, B, 2)
(b, BB, 22)
(c, C, 3)

cogrouped = COGROUP one BY $0, two BY $2;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)}, {(x, X, a)})
(b, {(b, B, 2), (b, BB, 22)}, {(y, Y, b), (x, XX, b)})
(c, {(c, C, 3)}, {(z, Z, c)})

flatc = FOREACH cogrouped GENERATE FLATTEN(one.($0,$2)), FLATTEN(two.$1);
(a, 1, X)
(a, 11, X)
(a, 111, X)
(b, 2, Y)
(b, 22, Y)
(b, 2, XX)
(b, 22, XX)
(c, 3, Z)

joined = JOIN one BY $0, two BY $2;
(a, A, 1, x, X, a)
(a, AA, 11, x, X, a)
(a, AAA, 111, x, X, a)
(b, B, 2, y, Y, b)
(b, BB, 22, y, Y, b)
(b, B, 2, x, XX, b)
(b, BB, 22, x, XX, b)
(c, C, 3, z, Z, c)

crossed = CROSS one, two;
(a, AA, 11, z, Z, c)
(a, AA, 11, x, XX, b)
(a, AA, 11, y, Y, b)
(a, AA, 11, x, X, a)
(c, C, 3, z, Z, c)
(c, C, 3, x, XX, b)
(c, C, 3, y, Y, b)
(c, C, 3, x, X, a)
(b, BB, 22, z, Z, c)
(b, BB, 22, x, XX, b)
(b, BB, 22, y, Y, b)
(b, BB, 22, x, X, a)
(a, AAA, 111, x, XX, b)
(b, B, 2, x, XX, b)
(a, AAA, 111, z, Z, c)
(b, B, 2, z, Z, c)
(a, AAA, 111, y, Y, b)
(b, B, 2, y, Y, b)
(b, B, 2, x, X, a)
(a, AAA, 111, x, X, a)
(a, A, 1, z, Z, c)
(a, A, 1, x, XX, b)
(a, A, 1, y, Y, b)
(a, A, 1, x, X, a)

SPLIT one INTO one_under IF $2 < 10, one_over IF $2 >= 10;
-- one_under:

(a, A, 1)
(b, B, 2)
(c, C, 3)