Planet Hadoop...!!!: 2011

Thursday, April 14, 2011

PIG LATIN

A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.
Under the covers, Pig turns the transformations into a series of MapReduce jobs, but as a programmer you are mostly unaware of this, which allows you to focus on the data rather than the nature of the execution.
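
To give a feel for what "turning a transformation into a MapReduce job" means, here is a rough Python sketch of a group-and-count expressed as one map step and one reduce step. The function names (map_phase, reduce_phase, run_job) are made up for this illustration, and this is not the plan Pig actually generates, only the general shape of the idea.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records):
    """Emit a (key, 1) pair for every record, keyed on the first field."""
    for record in records:
        yield record[0], 1


def reduce_phase(pairs):
    """Sort pairs by key (the 'shuffle'), then sum the values per key."""
    by_key = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(by_key, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_job(records):
    """Chain the map and reduce steps, as a single MapReduce job would."""
    return list(reduce_phase(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical input records; only the first field is used as the key.
    print(run_job([("a", "A"), ("b", "B"), ("a", "AA")]))
```

A longer Pig Latin program would correspond to several such jobs chained together, which is exactly the bookkeeping Pig hides from you.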

Pig runs in 2 modes :
1) Local Mode
2) Hadoop Mode

1) Local Mode : In local mode, Pig runs in a single JVM & accesses the local file system. This mode is suitable only for small datasets & for trying out Pig. Local mode doesn't use Hadoop, not even Hadoop's local job runner; instead, Pig translates queries into a physical plan that it executes itself. The execution type is set using the -x or -exectype option. To run in local mode, set the option to local:
$ pig -x local

2) Hadoop Mode : In Hadoop mode, Pig translates queries into MapReduce jobs & runs them on a Hadoop cluster. To use Hadoop mode you need to tell Pig which version of Hadoop you are using & where your cluster is running.
The environment variable PIG_HADOOP_VERSION is used to tell Pig the version of Hadoop it is connecting to.
$ export PIG_HADOOP_VERSION=20

Next we need to point Pig at the cluster's namenode & jobtracker. If you already have a Hadoop site file that defines fs.default.name & mapred.job.tracker, you can simply add Hadoop's configuration directory to Pig's classpath :
$ export PIG_CLASSPATH=$HADOOP_INSTALL/conf/

Alternatively, you can create a pig.properties file in Pig's “conf” directory, which sets these two properties. Here is an example for a pseudo-distributed setup :
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021

Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig, setting the -x option to mapreduce or omitting it entirely, as Hadoop mode is the default. The settings above can also be placed in a small wrapper script :

/bin/pigscr file
#!/bin/sh
PIG_PATH=$HADOOP_HOME/bin/pig-0.7.0
PIG_CLASSPATH=$PIG_PATH/pig-0.3.0-core.jar:$HADOOP_HOME/conf \
PIG_HADOOP_VERSION=0.20.2 \
$PIG_PATH/bin/pig "$@"

Tuesday, April 12, 2011

Installing PIG

To install Pig on Linux, the following packages are needed :

1) Hadoop 0.20.2 (or later)
2) Java 1.6 or later (set JAVA_HOME)
3) Ant 1.7 (optional, for builds)
4) JUnit 4.5 (optional, for unit tests)

Download Pig from one of the Apache download mirrors :
http://pig.apache.org/releases.html

Unpack the downloaded Pig distribution. The pig script is located in the bin directory.
Add “/pig-n.n.n/bin” to your path, using export (bash, sh, ksh) or setenv (tcsh, csh) :

export PATH=/usr/local/Hadoop-0.20.2/bin/pig-0.7.0/bin:$PATH

Try these as well :
# pig -help
# pig ... (to start the Grunt shell)

Writing Scripts

Copy the “/etc/passwd” file to /root, then write the script “id.pig” as follows :
# vim /root/id.pig

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
STORE B into '$out';

Save the file and exit.

Copy the 'passwd' & 'id.pig' files into a directory, say /root/inpig, then :
# hadoop dfs -put /root/inpig (to insert your data in HDFS)
grunt> run -param out=myoutput id.pig ... to run the script

The output will now be saved in the '/user/root/myoutput/part-m-00000' file.
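
If you want to check what id.pig should produce without starting Pig, the same logic can be sketched in a few lines of Python: split every line on ':' (the delimiter passed to PigStorage) and keep the first field ($0). The function name first_fields and the sample lines are hypothetical, and this only imitates the result; it says nothing about how Pig runs the script.

```python
def first_fields(lines, delimiter=":"):
    """Return the first delimited field of each line, like `generate $0`."""
    return [line.rstrip("\n").split(delimiter)[0] for line in lines]


if __name__ == "__main__":
    # Hypothetical passwd-style lines, not copied from a real system.
    sample = [
        "root:x:0:0:root:/root:/bin/bash",
        "daemon:x:1:1:daemon:/usr/sbin:/sbin/nologin",
    ]
    for user_id in first_fields(sample):
        print(user_id)
```

To try it on the real file, read the lines first, for example with open("/root/passwd") as f: first_fields(f).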

Pig sample commands and their results
Here is the sample data. The '/data/one' file contains :
a A 1
b B 2
c C 3
a AA 11
a AAA 111
b BB 22

And '/data/two' file contains :
x X a
y Y b
x XX b
z Z c

So the sample script is
# vim test1.pig
one = load 'data/one' using PigStorage();
two = load 'data/two' using PigStorage();

generated = FOREACH one GENERATE $0, $2;

save & exit

RESULT :
(a, 1)
(b, 2)
(c, 3)
(a, 11)
(a, 111)
(b, 22)

Other Commands and their Results are as follows :

grouped = GROUP one BY $0;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)})
(b, {(b, B, 2), (b, BB, 22)})
(c, {(c, C, 3)})
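
As a cross-check on the GROUP result above, here is a small Python sketch that builds the same (key, bag) pairs from the sample data. The names ONE and group_by are made up for this example. Pig bags are unordered, so the order inside each list here is just the input order.

```python
# The rows of the 'one' relation, taken from the sample data above.
ONE = [
    ("a", "A", 1), ("b", "B", 2), ("c", "C", 3),
    ("a", "AA", 11), ("a", "AAA", 111), ("b", "BB", 22),
]


def group_by(rows, key_pos):
    """Collect rows into one list ('bag') per distinct key value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key_pos], []).append(row)
    # Sort by key only so that the printed result is stable.
    return sorted(groups.items())


if __name__ == "__main__":
    for key, bag in group_by(ONE, 0):
        print(key, bag)
```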

grouped2 = GROUP one BY ($0, $1);
((a, A), {(a, A, 1)})
((a, AA), {(a, AA, 11)})
((a, AAA), {(a, AAA, 111)})
((b, B), {(b, B, 2)})
((b, BB), {(b, BB, 22)})
((c, C), {(c, C, 3)})

summed = FOREACH grouped GENERATE group, SUM(one.$2);
(a, 123.0)
(b, 24.0)
(c, 3.0)

counted = FOREACH grouped GENERATE group, COUNT(one);
(a, 3)
(b, 2)
(c, 1)
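
The SUM and COUNT results above can be checked with the following Python sketch, which computes both per key in one pass. The names ONE and sum_and_count are hypothetical. The values are converted to float to mirror the 123.0-style output in the listing; as far as I know Pig treats untyped fields as doubles when summing, but take that as an assumption.

```python
# The rows of the 'one' relation, taken from the sample data above.
ONE = [
    ("a", "A", 1), ("b", "B", 2), ("c", "C", 3),
    ("a", "AA", 11), ("a", "AAA", 111), ("b", "BB", 22),
]


def sum_and_count(rows, key_pos, value_pos):
    """Return sorted (key, sum of values, number of rows) triples."""
    totals = {}
    for row in rows:
        total, count = totals.get(row[key_pos], (0.0, 0))
        totals[row[key_pos]] = (total + float(row[value_pos]), count + 1)
    return sorted((key, t, c) for key, (t, c) in totals.items())


if __name__ == "__main__":
    for key, total, count in sum_and_count(ONE, 0, 2):
        print(key, total, count)
```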

flat = FOREACH grouped GENERATE FLATTEN(one);
(a, A, 1)
(a, AA, 11)
(a, AAA, 111)
(b, B, 2)
(b, BB, 22)
(c, C, 3)

cogrouped = COGROUP one BY $0, two BY $2;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)}, {(x, X, a)})
(b, {(b, B, 2), (b, BB, 22)}, {(y, Y, b), (x, XX, b)})
(c, {(c, C, 3)}, {(z, Z, c)})
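
COGROUP is easier to understand once you see that it is a GROUP over two relations at once: for each key you get one bag per input. A minimal Python sketch of that idea follows; ONE, TWO and cogroup are names invented for this example, and the simple list scans are for readability, not speed.

```python
# The rows of 'one' and 'two', taken from the sample data above.
ONE = [
    ("a", "A", 1), ("b", "B", 2), ("c", "C", 3),
    ("a", "AA", 11), ("a", "AAA", 111), ("b", "BB", 22),
]
TWO = [("x", "X", "a"), ("y", "Y", "b"), ("x", "XX", "b"), ("z", "Z", "c")]


def cogroup(left, left_pos, right, right_pos):
    """For every key in either input, return (key, left bag, right bag)."""
    keys = sorted({r[left_pos] for r in left} | {r[right_pos] for r in right})
    return [
        (
            key,
            [r for r in left if r[left_pos] == key],
            [r for r in right if r[right_pos] == key],
        )
        for key in keys
    ]


if __name__ == "__main__":
    for key, left_bag, right_bag in cogroup(ONE, 0, TWO, 2):
        print(key, left_bag, right_bag)
```

A key that appears in only one input would still show up, with an empty bag on the other side.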

flatc = FOREACH cogrouped GENERATE FLATTEN(one.($0,$2)), FLATTEN(two.$1);
(a, 1, X)
(a, 11, X)
(a, 111, X)
(b, 2, Y)
(b, 22, Y)
(b, 2, XX)
(b, 22, XX)
(c, 3, Z)
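
The flatc result shows that flattening two bags in the same GENERATE gives every combination of their rows within each group. Here is a Python sketch of that behaviour. COGROUPED is the cogroup result from above written out by hand, and flatten_pairs is a made-up name. The row order may differ from the Pig listing.

```python
from itertools import product

# The cogrouped relation from above: (key, bag from 'one', bag from 'two').
COGROUPED = [
    ("a", [("a", "A", 1), ("a", "AA", 11), ("a", "AAA", 111)], [("x", "X", "a")]),
    ("b", [("b", "B", 2), ("b", "BB", 22)], [("y", "Y", "b"), ("x", "XX", "b")]),
    ("c", [("c", "C", 3)], [("z", "Z", "c")]),
]


def flatten_pairs(cogrouped):
    """Pair every projected 'one' row with every projected 'two' row, per group."""
    out = []
    for _key, one_bag, two_bag in cogrouped:
        left = [(row[0], row[2]) for row in one_bag]   # one.($0,$2)
        right = [row[1] for row in two_bag]            # two.$1
        for (k, number), name in product(left, right):
            out.append((k, number, name))
    return out


if __name__ == "__main__":
    for row in flatten_pairs(COGROUPED):
        print(row)
```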

joined = JOIN one BY $0, two BY $2;
(a, A, 1, x, X, a)
(a, AA, 11, x, X, a)
(a, AAA, 111, x, X, a)
(b, B, 2, y, Y, b)
(b, BB, 22, y, Y, b)
(b, B, 2, x, XX, b)
(b, BB, 22, x, XX, b)
(c, C, 3, z, Z, c)
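
The JOIN above is an inner equi-join on one.$0 = two.$2. The Python sketch below does the same thing with a small hash index on the right-hand input. ONE, TWO and inner_join are names invented for this example, and this is not how Pig implements JOIN internally. The rows come out in a different order than in the listing.

```python
# The rows of 'one' and 'two', taken from the sample data above.
ONE = [
    ("a", "A", 1), ("b", "B", 2), ("c", "C", 3),
    ("a", "AA", 11), ("a", "AAA", 111), ("b", "BB", 22),
]
TWO = [("x", "X", "a"), ("y", "Y", "b"), ("x", "XX", "b"), ("z", "Z", "c")]


def inner_join(left, left_pos, right, right_pos):
    """Concatenate each left row with every right row that has the same key."""
    index = {}
    for row in right:
        index.setdefault(row[right_pos], []).append(row)
    return [l + r for l in left for r in index.get(l[left_pos], [])]


if __name__ == "__main__":
    for row in inner_join(ONE, 0, TWO, 2):
        print(row)
```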

crossed = CROSS one, two;
(a, AA, 11, z, Z, c)
(a, AA, 11, x, XX, b)
(a, AA, 11, y, Y, b)
(a, AA, 11, x, X, a)
(c, C, 3, z, Z, c)
(c, C, 3, x, XX, b)
(c, C, 3, y, Y, b)
(c, C, 3, x, X, a)
(b, BB, 22, z, Z, c)
(b, BB, 22, x, XX, b)
(b, BB, 22, y, Y, b)
(b, BB, 22, x, X, a)
(a, AAA, 111, x, XX, b)
(b, B, 2, x, XX, b)
(a, AAA, 111, z, Z, c)
(b, B, 2, z, Z, c)
(a, AAA, 111, y, Y, b)
(b, B, 2, y, Y, b)
(b, B, 2, x, X, a)
(a, AAA, 111, x, X, a)
(a, A, 1, z, Z, c)
(a, A, 1, x, XX, b)
(a, A, 1, y, Y, b)
(a, A, 1, x, X, a)
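
CROSS is simply the Cartesian product, which is why 6 rows times 4 rows gives the 24 rows listed above. A short Python sketch to confirm the count (ONE, TWO and cross are names made up for this example; the row order will not match the listing):

```python
from itertools import product

# The rows of 'one' and 'two', taken from the sample data above.
ONE = [
    ("a", "A", 1), ("b", "B", 2), ("c", "C", 3),
    ("a", "AA", 11), ("a", "AAA", 111), ("b", "BB", 22),
]
TWO = [("x", "X", "a"), ("y", "Y", "b"), ("x", "XX", "b"), ("z", "Z", "c")]


def cross(left, right):
    """Return every left row concatenated with every right row."""
    return [l + r for l, r in product(left, right)]


if __name__ == "__main__":
    print(len(cross(ONE, TWO)))
```

Because the output grows as the product of the input sizes, CROSS gets expensive quickly on large relations.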

SPLIT one INTO one_under IF $2 < 10, one_over IF $2 >= 10;
-- one_under:
(a, A, 1)
(b, B, 2)
(c, C, 3)
-- one_over:
(a, AA, 11)
(a, AAA, 111)
(b, BB, 22)
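
SPLIT sends each row to whichever output relation's condition it satisfies. For the two conditions used here ($2 < 10 and $2 >= 10) that amounts to a simple partition, sketched below in Python. ONE and split_rows are hypothetical names for this example.

```python
# The rows of the 'one' relation, taken from the sample data above.
ONE = [
    ("a", "A", 1), ("b", "B", 2), ("c", "C", 3),
    ("a", "AA", 11), ("a", "AAA", 111), ("b", "BB", 22),
]


def split_rows(rows, value_pos, threshold):
    """Partition rows into (value < threshold, value >= threshold)."""
    under = [row for row in rows if row[value_pos] < threshold]
    over = [row for row in rows if row[value_pos] >= threshold]
    return under, over


if __name__ == "__main__":
    one_under, one_over = split_rows(ONE, 2, 10)
    print(one_under)
    print(one_over)
```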

Thursday, March 17, 2011

PIG

Pig scripts can be run in two modes :
1) Local Mode : To run the scripts in local mode, no Hadoop or HDFS installation is required. All files are installed & run from your local host & file system.
2) Hadoop Mode : To run the scripts in Hadoop (MapReduce) mode, you need access to a Hadoop cluster & an HDFS installation, available through the Hadoop Virtual machine.
The Pig tutorial files are installed on the Hadoop Virtual machine under the “/home/hadoop-user/pig” directory.

Getting Started :
1) Install java
2) Download Pig tutorial file & install Pig
3) Run the Pig scripts, in local mode or in Hadoop mode.

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Pig's infrastructure layer consists of a compiler that produces sequences of map-reduce programs, for which large-scale parallel implementations already exist. Pig's language layer currently consists of a textual language called “Pig Latin”, which has the following key properties :

1) Ease of Programming :
It is trivial to achieve parallel execution of simple, 'embarrassingly parallel' data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand & maintain.

2) Optimization opportunities :
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

3) Extensibility :
Users can create their own functions to do special purpose processing.

Pig is a system for processing large semistructured data sets using the Hadoop MapReduce platform.

Pig Latin : A high-level procedural language.

Pig Engine : Parser, optimizer & distributed query execution.

Example WordCount using PIG
file name : wordcount.pig

myinput = load '/user/wc.txt' USING TextLoader() as (text_line:chararray);

words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));

grouped = GROUP words BY $0;
-- grouped schema : { (group, words) }
counts = FOREACH grouped GENERATE group, COUNT (words);

store counts into '/user/pigoutput' using PigStorage();

save file and quit
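
For comparison, here is the same word count as a minimal Python sketch. The function name word_count is made up, and the splitting is only an approximation: str.split() breaks on whitespace, whereas Pig's TOKENIZE uses its own set of delimiters, so counts can differ on punctuated text.

```python
from collections import Counter


def word_count(lines):
    """Count whitespace-separated words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by word so the result is stable.
    return sorted(counts.items())


if __name__ == "__main__":
    # Hypothetical input lines standing in for /user/wc.txt.
    for word, count in word_count(["pig latin pig", "hadoop pig"]):
        print(word, count)
```

The Pig version does the same four steps (load, tokenize, group, count), but runs them as MapReduce jobs over data in HDFS.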

A Pig script can be thought of as a higher-level Perl script for processing data.

# java -Xmx1024M -cp pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main wordcount.pig

[ Set the $HADOOP_CONF_DIR variable to the path of Hadoop's “conf” directory. ]