Overview

Azkaban is a simple batch scheduler for constructing and running Hadoop jobs or other offline processes.
A workflow scheduler allows you to string together a group of processes to run in an order that respects the dependencies between the jobs.

Batch jobs need to be scheduled to run periodically. They also typically have intricate dependency chains, for example dependencies on various data extraction processes or on previous steps. Larger processes might have 50 or 60 steps, some of which can run in parallel while others must wait for the output of earlier steps. Azkaban is a workflow scheduler that lets these independent pieces be declaratively assembled into a single workflow, and lets that workflow be scheduled to run periodically.

What do Azkaban and Oozie do?

 

  • Both let you run a series of MapReduce, Pig, Java and script actions as a single workflow job
  • Both allow regular scheduling of workflow jobs


On the Functional side

Writing workflows

  • Azkaban uses a series of Properties files
  • Oozie uses an XML file
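As a rough illustration of the properties-file style, the sketch below parses two job definitions written as key=value text and reads out each job's type and dependencies. The job names, the keys (type, dependencies, job.class, pig.script) and the helpers parse_job and dependencies_of are assumptions made for this example, not a confirmed Azkaban file format.

```python
# Hypothetical job definitions in key=value (properties) style.
# The keys and job names are illustrative assumptions.
JOB_FILES = {
    "extract": "type=java\njob.class=com.example.Extract\n",
    "report": "# runs after extract\ntype=pig\npig.script=report.pig\ndependencies=extract\n",
}


def parse_job(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def dependencies_of(props):
    """Return the job names listed in the comma-separated 'dependencies' key."""
    raw = props.get("dependencies", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


jobs = {name: parse_job(text) for name, text in JOB_FILES.items()}
```

Here each job is described by its own small file, and the workflow emerges from the dependencies the files declare rather than from one central document.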


Expressing workflows

  • Azkaban uses topological sort (similar to Make/Ant)
  • Oozie uses a Directed Acyclic Graph (DAG) (PDL style)
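To make the topological-sort idea concrete, here is a minimal sketch using Python's standard graphlib module (Python 3.9 or later). The workflow and its job names are hypothetical, and this is not how either tool is implemented; it only shows how an order that respects dependencies can be derived.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "extract_users": set(),
    "extract_clicks": set(),
    "join": {"extract_users", "extract_clicks"},
    "report": {"join"},
}


def run_order(deps):
    """Return one valid order in which every job follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(workflow)
```

The two extract jobs have no dependency on each other, so either may come first (or both could run in parallel); "join" always follows both, and "report" always comes last.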


Supported types of actions out of the box

  • Azkaban supports: java, javaprocess and pig
  • Oozie supports: mapreduce (java, streaming, pipes), pig, java, filesystem, ssh, sub-workflow


Parameterization of workflows

  • Azkaban supports variables, e.g. ${input}
  • Oozie supports variables and functions, e.g. ${fs:dirSize(myInputDir)}
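Plain variable substitution of the ${input} kind can be sketched in a few lines. The substitute helper, the command string and the parameter value below are assumptions for illustration. Function calls such as ${fs:dirSize(...)} are deliberately not handled and are left untouched.

```python
import re

# Matches ${name} where name is a simple identifier (letters, digits, _ and .).
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def substitute(template, params):
    """Replace each ${name} with params[name]; raise KeyError if a name is missing."""
    return _VAR.sub(lambda m: str(params[m.group(1)]), template)


command = substitute("hadoop fs -ls ${input}", {"input": "/data/2011-05-27"})
```

Supporting functions as well as variables would need an expression evaluator rather than a single regular expression, which is the extra capability the Oozie bullet describes.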


Alternate Execution Paths

  • Azkaban fixes execution path at workflow start time
  • Oozie supports decision nodes allowing the workflow to make decisions
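The difference between a fixed path and a decision node can be illustrated with a toy workflow runner. Everything here (the run function, the node names, the context dictionary) is hypothetical and unrelated to Oozie's actual XML syntax. The point is only that the next step is chosen from data produced while the workflow runs.

```python
def run(nodes, start, context):
    """Walk the workflow from `start`, returning the names of the nodes visited."""
    visited = []
    current = start
    while current is not None:
        visited.append(current)
        action, nxt = nodes[current]
        action(context)
        # A decision node supplies a function that picks the next node at run time;
        # an ordinary node just names its successor (or None to stop).
        current = nxt(context) if callable(nxt) else nxt
    return visited


def count_rows(ctx):
    ctx["rows"] = len(ctx["input"])


def noop(ctx):
    pass


nodes = {
    "count": (count_rows, lambda ctx: "process" if ctx["rows"] > 0 else "skip"),
    "process": (noop, None),
    "skip": (noop, None),
}

path_with_data = run(nodes, "count", {"input": [1, 2, 3]})
path_empty = run(nodes, "count", {"input": []})
```

With a path fixed at start time, the choice between "process" and "skip" would have to be made before "count" had produced its result.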


Regular Scheduling

  • Azkaban interval job scheduling is time based
  • Oozie interval job scheduling is based on both time and the availability of input data
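The two triggering styles can be contrasted with a small sketch. The function names, paths and timestamps are made up for the example, and real schedulers check the filesystem rather than an in-memory set.

```python
from datetime import datetime


def time_ready(now, next_run):
    """Time-based trigger: ready as soon as the scheduled time has passed."""
    return now >= next_run


def time_and_data_ready(now, next_run, required_inputs, available):
    """Time- and data-based trigger: also require every input to be present."""
    return now >= next_run and all(path in available for path in required_inputs)


now = datetime(2011, 5, 27, 2, 0)
next_run = datetime(2011, 5, 27, 1, 0)
inputs = ["/logs/2011-05-26/clicks", "/logs/2011-05-26/users"]
available = {"/logs/2011-05-26/clicks"}  # the users extract has not landed yet
```

In this situation time_ready(now, next_run) is true, while time_and_data_ready(now, next_run, inputs, available) stays false until the second input appears.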


Resource Control

  • Azkaban supports resource locks (read/write/counter)
  • Oozie does not have explicit support for resource control
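One plausible reading of a "counter" lock is a resource that admits a fixed number of concurrent holders. The sketch below assumes that meaning. CounterLock and its methods are hypothetical names, not Azkaban's API, and the sketch does not cover read/write locks.

```python
import threading


class CounterLock:
    """Allow at most `permits` holders at once (an assumed 'counter' lock)."""

    def __init__(self, permits):
        self._sem = threading.BoundedSemaphore(permits)

    def try_acquire(self):
        """Take a permit without blocking; return True on success."""
        return self._sem.acquire(blocking=False)

    def release(self):
        """Give a permit back so another job can proceed."""
        self._sem.release()


db_connections = CounterLock(2)  # e.g. at most two jobs may hit the database
first = db_connections.try_acquire()
second = db_connections.try_acquire()
third = db_connections.try_acquire()  # no permits left
```

A scheduler built this way would hold a job back while try_acquire returns False and start it once another job releases its permit.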


On the Implementation Side

Runtime

  • Azkaban runs as standalone (one workflow) or server (one user, multi workflows)
  • Oozie runs as server (multi user, multi workflows)


Actions Execute

  • Azkaban, actions run in the Azkaban server as the user running Azkaban
  • Oozie, actions run in the Hadoop cluster as the user that submitted the workflow


Workflows Submission, Management & Monitoring (server)

  • Azkaban, browser/HTML only
  • Oozie, command-line, HTTP REST, Java API, Browser/HTML (monitoring)


State of Running Workflows

  • Azkaban keeps state of all running workflows in memory
  • Oozie uses a SQL database; a workflow's state is in memory only while it is doing a state transition


Resource Consumption

  • Azkaban holds at least 1 thread per running workflow
  • Oozie only uses a thread when a workflow is doing a state transition


Failover

  • Azkaban, on failure all running workflows are lost
  • Oozie, running workflows continue running from their current state
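The link between where state is kept and what happens on failure can be shown with a small sketch that records each state transition in a SQL database. It uses Python's built-in sqlite3 module. The table layout, function names and workflow id are assumptions for illustration and do not reflect Oozie's actual schema.

```python
import os
import sqlite3
import tempfile


def open_store(path):
    """Open (or create) a small table holding one row per workflow."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflow_state (id TEXT PRIMARY KEY, node TEXT)"
    )
    return conn


def record_transition(conn, workflow_id, node):
    """Persist the node a workflow has just moved to."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO workflow_state (id, node) VALUES (?, ?)",
            (workflow_id, node),
        )


def current_node(conn, workflow_id):
    """Return the last recorded node for a workflow, or None if unknown."""
    row = conn.execute(
        "SELECT node FROM workflow_state WHERE id = ?", (workflow_id,)
    ).fetchone()
    return row[0] if row else None


db_path = os.path.join(tempfile.mkdtemp(), "state.db")
conn = open_store(db_path)
record_transition(conn, "wf-1", "extract")
record_transition(conn, "wf-1", "join")
conn.close()  # simulate the server stopping

conn = open_store(db_path)  # simulate a restart
resumed_at = current_node(conn, "wf-1")
conn.close()
```

Because the last transition was written to disk, a restarted process can pick the workflow up at "join". If the same information lived only in memory, it would be gone after the restart.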

 

Use Case for Azkaban

Typical offline jobs include predicting “People You May Know”, generating collaborative filtering matches for all the items on our social graph, computing data about users and their skills to populate our new Skill Pages feature, and generating beautiful graph layouts for visualizing members’ professional networks.
Azkaban was created as a workflow scheduler to support production offline processing. In addition to dependency scheduling, Azkaban handles various other tasks required to maintain production jobs, such as keeping historical logs for each job run, graphing runtime trends for jobs, locking resources, and sending email alerts for job failures or successes.

Why was it made?

Schedulers are readily available (both open source and commercial), but they tend to be extremely unfriendly to work with: they are basically bad graphical user interfaces grafted onto 20-year-old command-line clients. The idea was to create something that made it reasonably easy to visualize job hierarchies and run times without the pain.

State of the project

We have been using Azkaban internally at LinkedIn since early 2009, and have several hundred jobs running in it, mostly Hadoop jobs or ETL of some type.

Some interesting LINKS:

AZKABAN

http://sna-projects.com/azkaban/

http://twit88.com/blog/2011/05/27/hadoop-batch-job-scheduler/

http://gwt.blogspot.com.ar/2011/11/azkaban-open-source-batch-job-scheduler.html

http://www.pomsets.org/FeatureComparisons/Azkaban

http://www.slideshare.net/DelhiHUG/hadoop-ecosystem-framework-n-hadoop-in-live-environment

http://www.quora.com/What-are-the-differences-advantages-disadvantages-of-Azkaban-vs-Oozie

http://groups.google.com/group/azkaban-dev/browse_thread/thread/7dfc5d72450001c4

http://stackoverflow.com/questions/9803515/how-to-use-hive-jobs-with-azkaban

http://www.slideshare.net/rjurney/azkaban-pig-5057793

The following is an improved version of Azkaban:

http://www.pomsets.org/
