Oozie is an open-source workflow and coordination service for managing data processing jobs on Apache Hadoop. It is an extensible, scalable and data-aware service that orchestrates dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce). Oozie is a lot of things, but it is not a workflow solution for off-Hadoop processing, nor is it another query processing API a la Cascading.
Oozie's main features:
- Complex workflow action dependencies: An Oozie workflow consists of actions and the dependencies among them.
- Reduced Time-To-Market (TTM): The DAG (directed acyclic graph) specification lets users define a workflow declaratively instead of writing orchestration code.
- Frequency execution: Users can specify an execution frequency and can make an action in the workflow wait for data arrival before it is triggered.
- Native Hadoop stack integration: Oozie supports all types of Hadoop jobs.
- Oozie is validated against the Hadoop stack.
- Oozie is integrated with the Yahoo! Distribution of Hadoop with security and is a primary mechanism for managing a variety of complex data analysis workloads.
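
The idea of a workflow as a DAG of actions and dependencies can be illustrated with a small conceptual sketch. This is not Oozie's API (Oozie workflows are declared in their own definition files, not in Python); the action names and the `execution_order` helper are hypothetical and only show how a dependency graph determines a valid run order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"clean", "aggregate"},
}


def execution_order(actions):
    """Return one valid order in which the actions can run.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why a workflow has to be acyclic.
    """
    return list(TopologicalSorter(actions).static_order())


print(execution_order(workflow))
```

In this sketch an action becomes runnable only after everything it depends on has finished, which is the property a workflow engine has to enforce on the user's behalf.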
Azkaban vs. Oozie:
What do Azkaban and Oozie do?
- Both can run a series of MapReduce, Pig, Java and script actions as a single workflow job
- Both allow regular scheduling of workflow jobs
On the Implementation Side
- Azkaban runs standalone (one workflow) or as a server (one user, multiple workflows)
- Oozie runs as a server (multiple users, multiple workflows)
On the Functional Side
- Azkaban interval job scheduling is time-based
- Oozie interval job scheduling is both time-based and input-data-dependent
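
The difference between the two scheduling models can be sketched as two trigger checks. This is a simplified illustration, not code from either project: the function names, the input paths and the way "available" inputs are passed in are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def time_based_ready(now, last_run, interval):
    """Time-only trigger: run once the interval has elapsed."""
    return now - last_run >= interval


def time_and_data_ready(now, last_run, interval, required_inputs, available_inputs):
    """Time-and-data trigger: the interval has elapsed AND every required input exists."""
    inputs_present = set(required_inputs) <= set(available_inputs)
    return time_based_ready(now, last_run, interval) and inputs_present


# Hypothetical daily job that needs one input directory to be present.
last_run = datetime(2024, 1, 1, 0, 0)
now = datetime(2024, 1, 2, 0, 0)
daily = timedelta(days=1)
required = ["/data/logs/2024-01-01"]

print(time_based_ready(now, last_run, daily))                          # True
print(time_and_data_ready(now, last_run, daily, required, []))         # False
print(time_and_data_ready(now, last_run, daily, required, required))   # True
```

With the time-only check the job fires as soon as the interval passes; with the combined check it keeps waiting until its input data has arrived.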