Tuesday, April 12, 2011

Installing PIG


                                   Installing PIG

To install Pig on Linux, the following packages are required:

 
1) Hadoop 0.20.2 (or later)
2) Java 1.6 or later (set JAVA_HOME)
3) Ant 1.7 (optional, for builds)
4) JUnit 4.5 (optional, for unit tests)
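
Before going further, it can help to confirm that the prerequisites above are actually visible from your shell. The following is only a rough sketch: the helper names check_tool and check_java_home are made up for this post, and it checks presence only, not version numbers.

```shell
#!/bin/sh
# Report whether a command is available on PATH (hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Report whether JAVA_HOME is set and points at a directory (hypothetical helper).
check_java_home() {
    if [ -n "${JAVA_HOME:-}" ] && [ -d "$JAVA_HOME" ]; then
        echo "JAVA_HOME=$JAVA_HOME"
    else
        echo "JAVA_HOME is not set or is not a directory"
        return 1
    fi
}

# '|| true' keeps the script going so every item gets reported.
check_tool hadoop || true
check_tool java || true
check_java_home || true
check_tool ant || true    # optional, only needed for builds
```

Anything reported as missing should be installed (or added to your PATH) before you unpack Pig.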


Download Pig from one of the Apache download mirrors:
http://pig.apache.org/releases.html

Unpack the downloaded Pig distribution. The pig script is located in the bin directory.
Add “/pig-n.n.n/bin” to your path, using export (bash, sh, ksh) or
setenv (tcsh, csh). For example:

export PATH=/usr/local/Hadoop-0.20.2/bin/pig-0.7.0/bin:$PATH
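
If you put this line in a startup file, the same directory can end up on PATH several times. A small sketch that avoids this is shown here; PIG_BIN simply reuses the example location from this post, and prepend_path is a made-up helper name, so adjust both to your setup.

```shell
# PIG_BIN is the example location used in this post; change it to
# wherever you unpacked Pig.
PIG_BIN=/usr/local/Hadoop-0.20.2/bin/pig-0.7.0/bin

# Prepend a directory to PATH unless it is already listed (hypothetical helper).
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$1:$PATH" ;;
    esac
}

prepend_path "$PIG_BIN"
export PATH
```

Calling prepend_path a second time with the same directory leaves PATH unchanged.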

Also try:
# pig -help
# pig                              ... (to start grunt)

                                   Writing Scripts

Copy the “/etc/passwd” file to /root, then write the script “id.pig” as follows:
# vim /root/id.pig

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
STORE B into '$out';


Save this file and exit.

Copy the 'passwd' and 'id.pig' files into a directory, say /root/inpig,
then:
# hadoop dfs -put /root/inpig/passwd passwd        (to insert your data into HDFS)
grunt> run -param out=myoutput id.pig           ... to run the script

The output will now be saved in the file '/user/root/myoutput/part-m-00000'.
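
To get a feel for what id.pig extracts, the same first field can be pulled out locally with cut, no cluster needed. This is only a sketch for comparison: first_field is a made-up helper name and the two passwd-style lines are invented sample entries. The stored part file should contain the same kind of values, one per line.

```shell
# Print the first ':'-separated field of each line, which is the field
# that id.pig selects as 'id' (hypothetical helper).
first_field() {
    cut -d: -f1 "$1"
}

# Invented sample entries in /etc/passwd format, for illustration only.
sample=$(mktemp)
printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
              'daemon:x:1:1:daemon:/usr/sbin:/bin/sh' > "$sample"

first_field "$sample"
rm -f "$sample"
```

On a real system you could run first_field /etc/passwd and compare a few names against the Pig output.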

                    Pig sample commands and their results
Here is the sample data. The '/data/one' file contains:
a A 1
b B 2
c C 3
a AA 11
a AAA 111
b BB 22

And the '/data/two' file contains:
x X a
y Y b
x XX b
z Z c

So the sample script is:
# vim test1.pig
one = load 'data/one' using PigStorage();
two = load 'data/two' using PigStorage();

generated = FOREACH one GENERATE $0, $2;
dump generated;

Save and exit.

Result:
(a, 1)
(b, 2)
(c, 3)
(a, 11)
(a, 111)
(b, 22)

Other commands and their results are as follows:

grouped = GROUP one BY $0;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)})
(b, {(b, B, 2), (b, BB, 22)})
(c, {(c, C, 3)})

grouped2 = GROUP one BY ($0, $1);
((a, A), {(a, A, 1)})
((a, AA), {(a, AA, 11)})
((a, AAA), {(a, AAA, 111)})
((b, B), {(b, B, 2)})
((b, BB), {(b, BB, 22)})
((c, C), {(c, C, 3)})

summed = FOREACH grouped GENERATE group, SUM(one.$2);
(a, 123.0)
(b, 24.0)
(c, 3.0)
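
If you want to double-check these sums without a cluster, a rough local equivalent of the GROUP plus SUM steps can be written with awk. This is a sketch for comparison only, not how Pig computes it; sum_by_key is a made-up helper name, and the data is the 'one' sample from above written to a temporary local file.

```shell
# Sum column 3 per key in column 1, mirroring GROUP one BY $0 followed by
# SUM(one.$2). Output is sorted because awk's key order is unspecified.
sum_by_key() {
    awk '{ sum[$1] += $3 }
         END { for (k in sum) printf "(%s, %.1f)\n", k, sum[k] }' "$1" | sort
}

# The 'one' sample data from this post, in a temporary local file.
one=$(mktemp)
printf '%s\n' 'a A 1' 'b B 2' 'c C 3' 'a AA 11' 'a AAA 111' 'b BB 22' > "$one"

sum_by_key "$one"
rm -f "$one"
```

The three lines it prints match the summed results listed above.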

counted = FOREACH grouped GENERATE group, COUNT(one);
(a, 3)
(b, 2)
(c, 1)

flat = FOREACH grouped GENERATE FLATTEN(one);
(a, A, 1)
(a, AA, 11)
(a, AAA, 111)
(b, B, 2)
(b, BB, 22)
(c, C, 3)

cogrouped = COGROUP one BY $0, two BY $2;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)}, {(x, X, a)})
(b, {(b, B, 2), (b, BB, 22)}, {(y, Y, b), (x, XX, b)})
(c, {(c, C, 3)}, {(z, Z, c)})

flatc = FOREACH cogrouped GENERATE FLATTEN(one.($0,$2)), FLATTEN(two.$1);
(a, 1, X)
(a, 11, X)
(a, 111, X)
(b, 2, Y)
(b, 22, Y)
(b, 2, XX)
(b, 22, XX)
(c, 3, Z)

joined = JOIN one BY $0, two BY $2;
(a, A, 1, x, X, a)
(a, AA, 11, x, X, a)
(a, AAA, 111, x, X, a)
(b, B, 2, y, Y, b)
(b, BB, 22, y, Y, b)
(b, B, 2, x, XX, b)
(b, BB, 22, x, XX, b)
(c, C, 3, z, Z, c)

crossed = CROSS one, two;
(a, AA, 11, z, Z, c)
(a, AA, 11, x, XX, b)
(a, AA, 11, y, Y, b)
(a, AA, 11, x, X, a)
(c, C, 3, z, Z, c)
(c, C, 3, x, XX, b)
(c, C, 3, y, Y, b)
(c, C, 3, x, X, a)
(b, BB, 22, z, Z, c)
(b, BB, 22, x, XX, b)
(b, BB, 22, y, Y, b)
(b, BB, 22, x, X, a)
(a, AAA, 111, x, XX, b)
(b, B, 2, x, XX, b)
(a, AAA, 111, z, Z, c)
(b, B, 2, z, Z, c)
(a, AAA, 111, y, Y, b)
(b, B, 2, y, Y, b)
(b, B, 2, x, X, a)
(a, AAA, 111, x, X, a)
(a, A, 1, z, Z, c)
(a, A, 1, x, XX, b)
(a, A, 1, y, Y, b)
(a, A, 1, x, X, a)

SPLIT one INTO one_under IF $2 < 10, one_over IF $2 >= 10;
-- one_under:

(a, A, 1)
(b, B, 2)
(c, C, 3)
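
The two SPLIT conditions can also be tried locally with awk to see which rows go where. Again this is just a sketch under the assumption that the data is whitespace-separated in a local file; rows_under and rows_over are made-up helper names, and the output is plain rows rather than Pig tuples.

```shell
# Rows whose third column is below 10 (the one_under condition).
rows_under() { awk '$3 < 10' "$1"; }

# Rows whose third column is 10 or more (the one_over condition).
rows_over() { awk '$3 >= 10' "$1"; }

# The 'one' sample data from this post, in a temporary local file.
one=$(mktemp)
printf '%s\n' 'a A 1' 'b B 2' 'c C 3' 'a AA 11' 'a AAA 111' 'b BB 22' > "$one"

echo "under:"; rows_under "$one"
echo "over:";  rows_over "$one"
rm -f "$one"
```

Every input row satisfies exactly one of the two conditions, so the two outputs together cover the whole file.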
