Azkaban
Overview
Azkaban is a simple batch scheduler for constructing and running Hadoop jobs or other offline processes.
A workflow scheduler allows you to string together a group of processes to run in an order that respects the dependencies between the jobs.
Batch jobs need to be scheduled to run periodically. They also typically have intricate dependency chains, for example dependencies on various data extraction processes or on previous steps. Larger processes might have 50 or 60 steps, some of which can run in parallel while others must wait for the output of earlier steps. Azkaban is a workflow scheduler that allows the independent pieces to be declaratively assembled into a single workflow, and allows that workflow to be scheduled to run periodically.
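To make the idea of dependency chains concrete, here is a minimal sketch in Python (not Azkaban code) that groups steps into "waves": every step in a wave has all of its dependencies satisfied, so the steps of one wave could run in parallel. The step names are hypothetical, and the sketch relies on the standard library's graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it must wait for.
dependencies = {
    "extract_users": set(),
    "extract_clicks": set(),
    "join": {"extract_users", "extract_clicks"},
    "report": {"join"},
}


def execution_waves(deps):
    """Group steps into waves; all steps in one wave have their dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(dependencies)
```

With the dependencies above, the two extraction steps form the first wave, followed by the join and then the report.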
What do Azkaban and Oozie do?
- Both let you run a series of MapReduce, Pig, Java and script actions as a single workflow job
- Both allow regular scheduling of workflow jobs
On the Functional Side
Writing workflows
- Azkaban uses a series of Properties files
- Oozie uses an XML file
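As a rough illustration of the properties-file approach, the sketch below parses a key=value job definition and reads its dependency list. The job text, the key names (type, pig.script, dependencies) and the helper functions are assumptions made for this example; check the Azkaban documentation for the exact keys your version expects.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def job_dependencies(props):
    """Return the comma-separated 'dependencies' value as a list of job names."""
    value = props.get("dependencies", "")
    return [name.strip() for name in value.split(",") if name.strip()]


# Hypothetical job definition, written the way a properties file would look.
JOB_TEXT = """
# aggregate.job (illustrative only)
type=pig
pig.script=aggregate.pig
dependencies=extract_users, extract_clicks
"""

props = parse_properties(JOB_TEXT)
deps = job_dependencies(props)
```

Each job lives in its own small file, and the workflow emerges from the dependency lists rather than from one central document.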
Expressing workflows
- Azkaban uses topological sort (similar to Make/Ant)
- Oozie uses a Directed Acyclic Graph (DAG) (PDL style)
Supported types of actions out of the box
- Azkaban supports: java, javaprocess and pig
- Oozie supports: mapreduce (java, streaming, pipes), pig, java, filesystem, ssh, sub-workflow
Parameterization of workflows
- Azkaban supports variables, e.g. ${input}
- Oozie supports variables and functions, e.g. ${fs:dirSize(myInputDir)}
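Plain variable substitution of the ${input} kind can be sketched in a few lines. This is an illustration of the idea only, not how either tool implements it; the function name and the sample command are made up for the example.

```python
import re

# Matches ${name} and captures the variable name.
_VAR = re.compile(r"\$\{([^}]+)\}")


def substitute(template, variables):
    """Replace every ${name} in template with variables[name]."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined workflow variable: {name}")
        return str(variables[name])
    return _VAR.sub(replace, template)


command = substitute("hadoop fs -ls ${input}", {"input": "/data/2011-05-27"})
```

Supporting functions such as fs:dirSize(...) would need an expression evaluator on top of this, which is the extra step Oozie takes.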
Alternate Execution Paths
- Azkaban fixes execution path at workflow start time
- Oozie supports decision nodes allowing the workflow to make decisions
Regular Scheduling
- Azkaban interval job scheduling is time based
- Oozie interval job scheduling is time based and input-data dependent
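The difference between the two scheduling styles can be sketched as two trigger checks. The function names and the injected exists callback are assumptions for this example; neither tool exposes this API.

```python
import os
from datetime import datetime, timedelta


def time_due(now, last_run, interval):
    """Time-based trigger: run once the interval has elapsed."""
    return now - last_run >= interval


def time_and_data_due(now, last_run, interval, input_paths, exists=os.path.exists):
    """Time- and data-dependent trigger: the interval has elapsed AND all inputs exist."""
    return time_due(now, last_run, interval) and all(exists(p) for p in input_paths)


last_run = datetime(2011, 5, 27, 0, 0)
now = last_run + timedelta(hours=1)
hourly = timedelta(hours=1)

runs_on_time_alone = time_due(now, last_run, hourly)
```

With a purely time-based trigger the job starts even if its input has not arrived yet, so the job itself has to cope with missing data.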
Resource Control
- Azkaban supports resource locks (read/write/counter)
- Oozie does not have explicit support for resource control
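A counter lock limits how many jobs may hold a resource at the same time. The sketch below shows the general idea with a semaphore; the CounterLock class is hypothetical and is not Azkaban's implementation.

```python
import threading
from contextlib import contextmanager


class CounterLock:
    """Allow at most `permits` concurrent holders of a shared resource."""

    def __init__(self, permits):
        self._sem = threading.BoundedSemaphore(permits)

    def try_acquire(self):
        """Take a permit without blocking; return False if none is free."""
        return self._sem.acquire(blocking=False)

    def release(self):
        """Give a permit back."""
        self._sem.release()

    @contextmanager
    def hold(self):
        """Block until a permit is free, and release it when the block exits."""
        self._sem.acquire()
        try:
            yield
        finally:
            self._sem.release()


# Hypothetical resource: at most two jobs may use the database at once.
db_slots = CounterLock(permits=2)
```

A job would wrap its database work in `with db_slots.hold():` so that a third job waits until one of the first two finishes.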
On the Implementation Side
Runtime
- Azkaban runs as standalone (one workflow) or server (one user, multiple workflows)
- Oozie runs as server (multiple users, multiple workflows)
Action Execution
- Azkaban, actions run in the Azkaban server as the user running Azkaban
- Oozie, actions run in the Hadoop cluster as the user that submitted the workflow
Workflow Submission, Management & Monitoring (server)
- Azkaban, browser/HTML only
- Oozie, command-line, HTTP REST, Java API, Browser/HTML (monitoring)
State of Running Workflows
- Azkaban keeps state of all running workflows in memory
- Oozie uses a SQL database; a workflow's state is in memory only during a state transition
Resource Consumption
- Azkaban holds at least one thread per running workflow
- Oozie only uses a thread while a workflow is doing a state transition
Failover
- Azkaban, on failure all running workflows are lost
- Oozie, running workflows continue running from their current state
Use Case for Azkaban
At LinkedIn, typical uses include predicting “People You May Know”, generating collaborative filtering matches for all the items on our social graph, computing data about users and their skills to populate our new Skill Pages feature, and generating beautiful graph layouts for visualizing members’ professional networks.
Azkaban was created as a workflow scheduler to support production offline processing. In addition to dependency scheduling, Azkaban handles various other tasks required to maintain production jobs, such as keeping historical logs for each job run, graphing runtime trends for jobs, locking resources, and sending email alerts for job failures or successes.
Why was it made?
Schedulers are readily available (both open source and commercial), but they tend to be extremely unfriendly to work with: they are basically bad graphical user interfaces grafted onto 20-year-old command-line clients. The idea was to create something that made it reasonably easy to visualize job hierarchies and run times without the pain.
State of the project
We have been using Azkaban internally at LinkedIn since early 2009, and have several hundred jobs running in it, mostly Hadoop jobs or ETL of some type.
Some interesting links:
AZKABAN
http://sna-projects.com/azkaban/
http://twit88.com/blog/2011/05/27/hadoop-batch-job-scheduler/
http://gwt.blogspot.com.ar/2011/11/azkaban-open-source-batch-job-scheduler.html
http://www.pomsets.org/FeatureComparisons/Azkaban
http://www.slideshare.net/DelhiHUG/hadoop-ecosystem-framework-n-hadoop-in-live-environment
http://www.quora.com/What-are-the-differences-advantages-disadvantages-of-Azkaban-vs-Oozie
http://groups.google.com/group/azkaban-dev/browse_thread/thread/7dfc5d72450001c4
http://stackoverflow.com/questions/9803515/how-to-use-hive-jobs-with-azkaban
http://www.slideshare.net/rjurney/azkaban-pig-5057793
The following is an improved version of AZKABAN





Leandro Iglesias
Thanks. I’d be glad to help if you need more information.