Using Hadoop Cluster in Astronomy
Astrophysics is a branch of physics dealing with fundamental questions about the nature of the universe, e.g. studying the properties of dark matter and the nature of dark energy.
Accomplishing this goal requires new methodologies for analysing and understanding petascale data sets (i.e. data collected at a rate 1000X greater than in current surveys). This research focuses on exploring an emerging paradigm for data-intensive applications, MapReduce, and how it scales to the analysis of astronomical images.
MapReduce in Astronomy:
In order to exploit the elastic nature of the computational cloud, where many computers can be used at the same time, one requires an efficient way of writing parallel programs. The High Performance Computing (HPC) community has been developing such programs for roughly 20 years. The MapReduce model allows programmers to write a map function, which takes a key/value pair (e.g. an id and a file name), operates on it (e.g. performs object detection), and returns a new set of key/value pairs (a source list). The reduce function then aggregates/merges all the intermediate data (e.g. builds an object catalog from the collected source lists). Many problems in astronomy fall naturally into this model because many astronomical tasks are inherently parallelizable. The benefits are that MapReduce programs are easy to write and the framework provides automatic load balancing.
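The map and reduce functions described above can be sketched in a few lines. This is a minimal illustration of the pattern, not a real pipeline: the "detection" step is a toy stand-in for an actual object-detection routine, and the driver loop plays the role the framework would normally play.

```python
from collections import defaultdict

def map_fn(image_id, image_file):
    """Map: take an (id, file name) pair, 'detect' sources in the image,
    and emit intermediate (key, value) pairs -- one pair per source.
    The list comprehension below is a toy stand-in for real detection."""
    sources = [f"{image_file}:src{i}" for i in range(2)]
    return [("catalog", s) for s in sources]

def reduce_fn(key, values):
    """Reduce: merge all per-image source lists into one object catalog."""
    return (key, sorted(values))

# Drive the two phases over a toy set of images (the framework's job
# in a real MapReduce system).
intermediate = defaultdict(list)
for image_id, image_file in [(1, "imgA.fits"), (2, "imgB.fits")]:
    for k, v in map_fn(image_id, image_file):
        intermediate[k].append(v)

catalog = dict(reduce_fn(k, vs) for k, vs in intermediate.items())
print(catalog)
# {'catalog': ['imgA.fits:src0', 'imgA.fits:src1',
#              'imgB.fits:src0', 'imgB.fits:src1']}
```

In a real deployment the per-image map calls run in parallel on different nodes, and the framework handles grouping the intermediate pairs by key before the reduce step.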
1] Image mosaicing is a general tool required by astronomers. The Sloan Digital Sky Survey (SDSS) alone generated 1.3 million astronomical images. Combining these files to form larger composite images, or stacking the individual images to detect faint sources, enables a broad range of science questions to be addressed (from the detection of moving asteroids that are too faint to be seen on a single image to the identification of very faint, high-redshift galaxies).
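The stacking step can be sketched as a pixel-wise average of aligned images; averaging N exposures suppresses uncorrelated noise roughly by a factor of sqrt(N), which is why stacks reveal sources too faint to appear on any single image. This sketch assumes the images are already registered and represented as equal-sized 2-D arrays of pixel values.

```python
def stack_images(images):
    """Pixel-wise mean of a list of aligned, equal-sized 2-D images.
    Assumes registration has already been done upstream."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]

# Two toy 2x2 "images"; their noisy pixels average out in the stack.
stacked = stack_images([
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 2.0], [1.0, 0.0]],
])
print(stacked)  # [[2.0, 2.0], [2.0, 2.0]]
```

Each per-pixel (or per-tile) stack is independent of the others, which is exactly the kind of inherent parallelism that makes coaddition a natural fit for the map phase.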
Next-generation astronomical surveys such as the LSST will generate 30 TB of images per night, detecting transient sources, moving objects, and hundreds of millions of stars and galaxies.
The MapReduce model specifies a computation that takes a set of input key/value pairs and produces a set of output values. We divide the computation into three distinct phases: Map, Shuffle, and Reduce.
Shuffle is an optional capability of the reduce mechanism, but since we make heavy use of it, we consider it a distinct phase.
Map takes an input pair and 'emits' a set of intermediate key/value pairs. These intermediate pairs are then shuffled among processors by means of a user-supplied partitioning function. The reduce method then operates on the shuffled key/value pairs and returns a set of values. The user-supplied methods in each phase can see only local data; data is transmitted among compute nodes only between phases.
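The three phases, with the shuffle made explicit, can be sketched as follows. This is a single-process simulation under simplifying assumptions: each "worker" is just a dictionary (standing in for a compute node that sees only local data), and the hash-modulo partitioner is a placeholder for the user-supplied partitioning function.

```python
def partition(key, n_workers):
    """User-supplied partitioning: route a key to one worker
    (hash-modulo is just a simple default choice)."""
    return hash(key) % n_workers

def run(pairs, map_fn, reduce_fn, n_workers=2):
    # Map phase: each input pair emits intermediate key/value pairs.
    emitted = [kv for pair in pairs for kv in map_fn(*pair)]
    # Shuffle phase: the only step where data moves between workers.
    workers = [{} for _ in range(n_workers)]
    for k, v in emitted:
        workers[partition(k, n_workers)].setdefault(k, []).append(v)
    # Reduce phase: each worker reduces only its local keys.
    out = {}
    for local in workers:
        for k, vs in local.items():
            out[k] = reduce_fn(k, vs)
    return out

# Toy example: count how many images touch each sky tile.
pairs = [("img1", ["tileA", "tileB"]), ("img2", ["tileA"])]
counts = run(pairs,
             map_fn=lambda img, tiles: [(t, 1) for t in tiles],
             reduce_fn=lambda k, vs: sum(vs))
print(counts)  # tileA -> 2, tileB -> 1
```

Because the shuffle groups all values for a key onto one worker, each reduce call sees a complete value list while still operating purely on local data, matching the constraint described above.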