Planet Hadoop...!!!: March 2011

PIG

Pig scripts can be run in two modes: a) Local mode and b) Hadoop mode.
1) Local Mode : To run the scripts in local mode, no Hadoop or HDFS installation is required. All files are installed and run from your local host and file system.
2) Hadoop Mode : To run the scripts in Hadoop (MapReduce) mode, you need access to a Hadoop cluster and an HDFS installation; here both are available through the Hadoop virtual machine.
The Pig tutorial files are installed on the Hadoop virtual machine under the “/home/hadoop-user/pig” directory.

Getting Started :
1) Install Java.
2) Download the Pig tutorial file and install Pig.
3) Run the Pig scripts in local mode or in Hadoop mode.

Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist. Pig's language layer currently consists of a textual language called “Pig Latin”, which has the following key properties.

1) Ease of Programming :
It is trivial to achieve parallel execution of simple, 'embarrassingly parallel' data analysis tasks. Complex tasks made up of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand and maintain.

2) Optimization opportunities :
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

3) Extensibility :
Users can create their own functions to do special purpose processing.
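
As one illustration of extensibility: Pig releases from 0.8 onward can, as far as I know, register functions written in Python and run them through Jython. The following is only a minimal sketch under that assumption. The file name myudfs.py, the function name normalize_word and the alias myfuncs are all hypothetical. The outputSchema decorator is supplied by Pig when the script is registered; the fallback definition is there only so the file can also be tried with a plain Python interpreter.

```python
# myudfs.py (hypothetical file name)
# In a Pig script this could be registered with something like:
#   REGISTER 'myudfs.py' USING jython AS myfuncs;
#   cleaned = FOREACH words GENERATE myfuncs.normalize_word($0);

try:
    outputSchema  # provided by Pig's Jython script engine at registration time
except NameError:
    # Fallback for running outside Pig: a decorator that just records the schema
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("word:chararray")
def normalize_word(word):
    """Lower-case a word and strip surrounding punctuation; pass nulls through."""
    if word is None:
        return None
    return word.strip(".,;:!?\"'()").lower()


if __name__ == "__main__":
    print(normalize_word("Hadoop!"))
```

A function like this would let a word count treat "Hadoop!" and "hadoop" as the same word, which the built-in functions alone do not do.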

Pig is a system for processing large semi-structured data sets using the Hadoop MapReduce platform.

Pig Latin : High-level procedural language.

Pig Engine : Parser, optimizer & distributed query execution.

Example WordCount using PIG
file name : wordcount.pig

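-- Load each line of the input file as a single chararray field
myinput = LOAD '/user/wc.txt' USING TextLoader() AS (text_line:chararray);

-- Split each line into words and flatten the resulting bag into one word per record
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));

-- Group identical words together
grouped = GROUP words BY $0;
-- grouped schema : { (group, words) }

-- Count the records in each group
counts = FOREACH grouped GENERATE group, COUNT(words);

-- Write the (word, count) pairs to the output directory
STORE counts INTO '/user/pigoutput' USING PigStorage();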

Save the file and quit the editor.

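A Pig script plays a role similar to a higher-level Perl script: a short program that describes the processing steps, which Pig then runs as MapReduce jobs.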

# java -Xmx1024M -cp pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main wordcount.pig

[ Set the $HADOOP_CONF_DIR variable to the path of the Hadoop “conf” directory. ]
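
To trace what wordcount.pig computes without a cluster, the same data flow can be sketched in plain Python. This is only an illustration of the steps, not how Pig executes them. The function names tokenize and word_count and the sample lines are made up for the example, and tokenize splits on whitespace only, which is a simplification of Pig's TOKENIZE.

```python
from collections import defaultdict


def tokenize(line):
    """Rough stand-in for Pig's TOKENIZE: split a line on whitespace only."""
    return line.split()


def word_count(lines):
    """Mirror the steps of wordcount.pig on an in-memory list of lines."""
    # FOREACH myinput GENERATE FLATTEN(TOKENIZE($0)): one record per word
    words = [w for line in lines for w in tokenize(line)]

    # GROUP words BY $0: collect identical words under one key
    grouped = defaultdict(list)
    for w in words:
        grouped[w].append(w)

    # FOREACH grouped GENERATE group, COUNT(words)
    return {group: len(bag) for group, bag in grouped.items()}


if __name__ == "__main__":
    # Hypothetical sample input standing in for /user/wc.txt
    sample = ["pig runs on hadoop", "hadoop runs mapreduce"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Each intermediate variable corresponds to one relation in the Pig script, which is why the Pig version reads as a sequence of named steps rather than nested function calls.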

Thursday, March 17, 2011