Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.Its simple syntax and advanced built-in functionality provide an abstraction that makes development of Hadoop jobs quicker and easier to write than traditional Java MapReduce jobs.
- Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
Pig Latin is a relatively simple language that executes statements. A statement is an operation that takes input (such as a bag, which represents a set of tuples) and emits another bag as its output. A bag is a relation, similar to a table that you’ll find in a relational database (where tuples represent the rows, and individual tuples are made up of fields).
Analyzing log files is a typical use case for Pig, where you don’t have to concern yourself with MapReduce programming. Pig comes with built in libraries that help load apache log files and perform cleanup operations on string values from crude data.
Apache Pig and Hadoop
Pig makes use of both the Hadoop Distributed File System, and Hadoop’s MapReduce processing framework. By default Pig reads input files from the HDFS and writes results and intermediate data back to it. Internally Pig compiles Pig Latin sentences to produce sequences of MapReduce programs.
- Pigs eat anything: Pig can process data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured.
- Pigs live anywhere: Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework.
- Pigs are domestic animals: Pig is designed to be easily controlled and modified by its users.
- Pigs fly: The Pig project attempts to consistently improve performance, and not implement features in ways that weigh.