Cascading is a Data Processing API, Process Planner, and Process Scheduler for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster, without having to 'think' in MapReduce.

Cascading provides a data flow framework that leverages Hadoop's infrastructure while abstracting standard data processing operations (splits, joins, etc.). It combines the scalability of Hadoop with the right level of abstraction for deep data diving. The project is used in several large-scale production settings and, like Pig, abstracts away low-level map and reduce operations.
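To make the abstraction concrete, here is a sketch of the classic Cascading word-count assembly in the style of the Cascading 1.x user guide: lines are split into words with a function, grouped, and counted, and the planner turns the assembly into MapReduce jobs. The input and output paths, the pipe name, and the word-splitting regex are illustrative assumptions; it requires the Cascading and Hadoop jars on the classpath.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount
  {
  public static void main( String[] args )
    {
    // hypothetical HDFS paths passed on the command line
    String inputPath = args[ 0 ];
    String outputPath = args[ 1 ];

    // taps bind the assembly to concrete data: a text source and a text sink
    Tap source = new Hfs( new TextLine( new Fields( "line" ) ), inputPath );
    Tap sink = new Hfs( new TextLine(), outputPath, true );

    // the pipe assembly: split each line into words, group by word, count
    Pipe assembly = new Pipe( "wordcount" );
    assembly = new Each( assembly, new Fields( "line" ),
      new RegexGenerator( new Fields( "word" ), "\\b\\w+\\b" ) );
    assembly = new GroupBy( assembly, new Fields( "word" ) );
    assembly = new Every( assembly, new Count( new Fields( "count" ) ) );

    // the planner translates the assembly into one or more MapReduce jobs
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, WordCount.class );
    Flow flow = new FlowConnector( properties )
      .connect( "word-count", source, sink, assembly );
    flow.complete();
    }
  }
```

Note that nowhere does the code mention mappers or reducers: the `Each`/`GroupBy`/`Every` pipes describe the data flow, and Cascading's planner decides how to realize it on the cluster.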
The Cascading API approach dramatically simplifies development, regression and integration testing, and deployment of business-critical applications, both on Amazon Web Services (such as Elastic MapReduce) and on dedicated hardware.

Features of Cascading

  • Data Processing API
  • Topological Scheduler
  • Event Notification
  • MapReduce Job Planner
  • Stream Assertions
  • Failure Traps
  • Scriptable Interface
  • External Data Interfaces
  • Custom MapReduce Jobs

Use cases for Cascading
Security and compliance: Use Cascading to see how everything (hosts, applications, infrastructure) is communicating, and to apply policies. For example, Cascading can alert IT when an employee accesses resources they should not.

Hadoop and Cascading
Many developers have adopted Apache Hadoop as their base computing infrastructure, only to find that developing genuinely useful applications on Hadoop is not trivial. Cascading eases this burden by allowing developers to rapidly create, refactor, test, and execute complex applications that scale linearly across a cluster of computers.
