Pangool is a low-level Java MapReduce API that aims to be a replacement for the Hadoop Java MapReduce API. By implementing an intermediate Tuple-based schema and making Job configuration convenient, it removes many of the accidental complexities that arise from using the Hadoop Java MapReduce API: things like secondary sort and reduce-side joins become extremely easy to implement and understand. Pangool’s performance is comparable to that of the Hadoop Java MapReduce API. Pangool also augments Hadoop’s API by making multiple outputs and inputs first-class and by allowing instance-based configuration.
Pangool’s performance was tested on an Amazon AWS EC2 cluster comprised of 4 m1.large slave nodes and 1 master node; as noted above, the measured performance is comparable to that of the plain Hadoop Java MapReduce API. Pangool’s main features are:
- Tuple as the unit of information: Tuples give developers great flexibility to adapt the intermediate schema to the particular needs of each project. Pangool manages tuples efficiently, so this flexibility comes at little runtime cost.
- Grouping and sorting: Any given processing task in Pangool is governed mainly by two parameters: which fields are used for grouping and which are used for sorting. This simplification is one of Pangool’s strong points.
- Efficient and easy-to-implement joins: One of the basic patterns that comes up in any Big Data project is the need to join several data sets. Pangool makes reduce-side joins straightforward to express.
- Multiple inputs and outputs: Pangool’s API offers integrated support for multiple inputs and outputs so that each job can include several input data sets and several output data sets.
- Efficiency and flexibility: Pangool is a more convenient alternative to the Hadoop Java API. The same jobs can be implemented with both, but the effort required is not the same.
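To make the tuple idea above concrete, here is a minimal self-contained sketch of a schema-driven tuple. The class names (`Schema`, `Tuple`) and their shape are illustrative only, not Pangool’s actual API: the point is simply that a schema names the fields once, and every tuple is a set of values conforming to it, so no per-job `Writable` classes need to be written.

```java
import java.util.List;

// Illustrative sketch (not Pangool's real classes): a schema names and
// orders fields; a tuple holds one value per schema field.
public class TupleSketch {
    // A schema is an ordered list of field names.
    record Schema(List<String> fields) {
        int indexOf(String field) { return fields.indexOf(field); }
    }

    // A tuple stores values positionally, resolved by field name.
    record Tuple(Schema schema, Object[] values) {
        Object get(String field) { return values[schema.indexOf(field)]; }
    }

    public static void main(String[] args) {
        Schema userSchema = new Schema(List.of("userId", "country", "clicks"));
        Tuple t = new Tuple(userSchema, new Object[]{42, "ES", 7});
        System.out.println(t.get("country") + " " + t.get("clicks")); // ES 7
    }
}
```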
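The group-by/sort-by split described above is what makes secondary sort trivial. The following self-contained sketch (plain Java, not Pangool code) shows the effect: sorting by the full spec (dept ascending, salary descending) while grouping only by dept means each per-group “reduce” call already sees its values in salary-descending order.

```java
import java.util.*;

// Sketch of the group-by / sort-by split (illustrative, not Pangool's API).
public class SecondarySortSketch {
    record Row(String dept, int salary) {}

    public static Map<String, List<Integer>> groupSorted(List<Row> input) {
        List<Row> rows = new ArrayList<>(input);
        // Full sort-by spec: dept ascending, then salary descending.
        rows.sort(Comparator.comparing(Row::dept)
                .thenComparing(Comparator.comparingInt(Row::salary).reversed()));
        // Group only by the group-by field (dept); within each group the
        // sort order above is preserved -- the secondary sort.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Row r : rows)
            groups.computeIfAbsent(r.dept(), k -> new ArrayList<>()).add(r.salary());
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(groupSorted(List.of(
                new Row("eng", 50), new Row("ops", 40),
                new Row("eng", 90), new Row("ops", 60))));
        // prints {eng=[90, 50], ops=[60, 40]}
    }
}
```

In Pangool this ordering is declared once in the job configuration instead of being hand-built with composite keys, custom partitioners, and grouping comparators as in plain Hadoop.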
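The reduce-side join pattern mentioned above can be sketched in miniature. This is a conceptual simulation in plain Java, not Pangool code: records from both data sets are tagged with their source and grouped by the join key, and the per-key “reduce” step pairs every left record with every right record. The source names (`"users"`, `"clicks"`) are invented for the example.

```java
import java.util.*;

// Sketch of a reduce-side join (illustrative, not Pangool's API).
public class ReduceJoinSketch {
    record Tagged(String source, int key, String value) {}

    public static List<String> join(List<Tagged> records) {
        // Shuffle-phase stand-in: group all tagged records by join key.
        Map<Integer, List<Tagged>> byKey = new TreeMap<>();
        for (Tagged t : records)
            byKey.computeIfAbsent(t.key(), k -> new ArrayList<>()).add(t);

        // Reduce-phase stand-in: per key, pair left values with right values.
        List<String> joined = new ArrayList<>();
        for (List<Tagged> group : byKey.values())
            for (Tagged l : group) {
                if (!l.source().equals("users")) continue;
                for (Tagged r : group)
                    if (r.source().equals("clicks"))
                        joined.add(l.value() + ":" + r.value());
            }
        return joined;
    }

    public static void main(String[] args) {
        System.out.println(join(List.of(
                new Tagged("users", 1, "ana"),
                new Tagged("users", 2, "bob"),
                new Tagged("clicks", 1, "home"),
                new Tagged("clicks", 1, "cart"))));
        // prints [ana:home, ana:cart]
    }
}
```

With Pangool, the grouping by join key and the co-location of both data sets in the same reduce call come directly from its multiple-inputs and group-by support, rather than from hand-written tagging code.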