Hadoop HDFS

Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
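The replication idea can be sketched in a few lines of plain Python. This is a toy model only — real HDFS uses a rack-aware placement policy and a NameNode to track block locations; the function and node names here are invented for illustration.

```python
# Toy sketch of HDFS-style block replication (not the real placement policy):
# a file is split into fixed-size blocks, and each block is copied to
# `replication` distinct nodes, so losing any single node loses no data.

def place_blocks(file_size, block_size, nodes, replication=3):
    """Return {block_index: [node, ...]} with `replication` distinct nodes per block."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # simple round-robin over the cluster; real HDFS is rack-aware
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

# A 350 MB file with 128 MB blocks needs 3 blocks, each stored on 3 of 4 nodes.
plan = place_blocks(file_size=350, block_size=128, nodes=["n1", "n2", "n3", "n4"])
```

Because every block lives on several nodes, computations can also be scheduled close to a copy of their input, which is where the "rapid computations" claim comes from.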

Hadoop MapReduce

Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
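The programming model itself is simple enough to demonstrate without Hadoop at all. The classic example is word counting: map emits (word, 1) pairs, the framework groups pairs by key (the "shuffle"), and reduce sums each group. The sketch below is plain Python, not the Hadoop API.

```python
# Plain-Python sketch of the MapReduce model (no Hadoop involved):
# map emits (key, value) pairs, the shuffle groups them by key,
# and reduce folds each group into a final value.
from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
# counts["the"] == 2
```

On a real cluster, the map and reduce calls run in parallel across many nodes, and the shuffle moves data over the network — but the contract the programmer writes against is exactly this.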


Kafka

Kafka is a persistent, efficient, distributed message queue capable of offloading data into Hadoop for batch parallel processing.


ZooKeeper

ZooKeeper is a high-throughput, low-latency, highly available service used for data coordination across the nodes of distributed applications.


Pig

Pig is a high-level language used to express analytics software. It acts as an abstraction that makes developing Hadoop jobs quicker and easier than writing raw MapReduce.


Pangool

Pangool is a low-level Java MapReduce API. By implementing an intermediate Tuple-based schema and making Job configuration convenient, it removes many of the accidental complexities that arise from using the Hadoop Java MapReduce API.


Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL.


Storm

Storm is a real-time, in-memory, distributed computation system. It’s simple, fault tolerant, horizontally scalable, fast, and capable of working with any programming language (by implementing a simple Storm communication protocol).


Azkaban

Azkaban is a simple batch scheduler for constructing and running Hadoop jobs or other offline processes.


Voldemort

Voldemort is a distributed data store designed to support fast, scalable read/write loads.


Oozie

Oozie is an open-source workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop.


Cascading

Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster without having to think in MapReduce.


CouchDB

CouchDB is a semi-structured, document-oriented NoSQL database. Data is stored in JSON format and can change dynamically to accommodate evolving needs.


Sqoop

Sqoop is a tool designed to import data from relational databases into Hadoop. It uses JDBC to connect to a database, examine its tables, and generate the code needed to run a MapReduce job that reads data from the tables in parallel.


Avro

Avro is a remote procedure call and data serialization framework that uses JSON to define data types and protocols, and serializes data in a compact binary format.
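An Avro schema is itself just a JSON document, so the schema format can be shown with nothing but the standard library. The `User` record below is an invented example; actually serializing data against it would require the `avro` or `fastavro` package.

```python
# An Avro schema is itself JSON; this record describes a user with two fields.
# Parsing it with the stdlib only illustrates the schema format -- real
# serialization would use the avro (or fastavro) library.
import json

schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
""")

field_names = [f["name"] for f in schema["fields"]]
```

Because the schema travels with the data (or is agreed on out of band), the binary encoding can omit field names entirely, which is what keeps it compact.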


Cassandra

Cassandra is a distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure.
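The "no single point of failure" property comes from placing every key on several nodes arranged on a hash ring. The sketch below is a greatly simplified toy version of that idea in plain Python — the function names, node names, and use of MD5 are illustrative assumptions, not Cassandra's actual partitioner.

```python
# Toy sketch of ring-based replica placement (the idea behind Cassandra's
# consistent hashing, greatly simplified): each node owns a position on a
# hash ring, and a key's replicas are the next N distinct nodes clockwise,
# so no single node is a point of failure.
import hashlib

RING_SIZE = 2 ** 32

def ring_position(token):
    """Deterministically map a string to a position on the ring."""
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % RING_SIZE

def replicas(key, nodes, n=3):
    """Return the n nodes responsible for `key`, walking clockwise on the ring."""
    ring = sorted(nodes, key=ring_position)
    pos = ring_position(key)
    # first node at or past the key's position; wrap to 0 if none
    start = next((i for i, node in enumerate(ring)
                  if ring_position(node) >= pos), 0)
    return [ring[(start + i) % len(ring)] for i in range(n)]

owners = replicas("user:42", ["node-a", "node-b", "node-c", "node-d", "node-e"])
```

Adding or removing one node only shifts the keys adjacent to it on the ring, which is what makes this scheme scale smoothly across commodity servers.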


Mahout

The Mahout project aims at developing commercially friendly, scalable machine learning algorithms such as classification, clustering, regression and dimensionality reduction.


Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.


MongoDB

MongoDB is an open source, document-oriented (JSON-like) database designed with both scalability and developer agility in mind. MongoDB maintains many of the great features of a relational database, such as indexes and dynamic queries. But by changing the data model from relational to document-oriented, you gain agility through flexible schemas and easier horizontal scalability.
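What "flexible schemas" and "dynamic queries" mean is easy to show with plain dicts. The sketch below is a toy illustration of the document model, not the MongoDB/pymongo API; the collection contents and the `find` helper are invented for the example.

```python
# Toy illustration of the document model (plain dicts, not the MongoDB API):
# documents in one collection need not share a schema, and a query matches
# on whatever fields a document happens to have.
users = [
    {"name": "ada",   "age": 36, "languages": ["python", "c"]},
    {"name": "grace", "age": 45},                  # no "languages" field
    {"name": "alan",  "tags": ["logic"]},          # no "age" field at all
]

def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

matches = find(users, age=45)
# matches == [{"name": "grace", "age": 45}]
```

Documents that lack a queried field simply don't match, so the schema can evolve document by document without migrations.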


HBase

HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s BigTable. It aims at hosting very large tables (billions of rows × millions of columns) atop clusters of commodity hardware.
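The "versioned" part of the data model is worth spelling out: a cell is addressed by row key, column, and timestamp, and a read returns the newest version by default. The class below is a minimal in-memory sketch of that model under invented names, not the HBase client API.

```python
# Toy sketch of a BigTable-style data model (not the HBase API): values are
# addressed by (row key, column, timestamp), and reads return the newest
# version of a cell by default.
class VersionedStore:
    def __init__(self):
        self.cells = {}  # (row, column) -> list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), []).append((timestamp, value))

    def get(self, row, column):
        """Return the value with the highest timestamp for this cell."""
        versions = self.cells.get((row, column), [])
        # tuples compare by timestamp first, so max() picks the newest version
        return max(versions)[1] if versions else None

store = VersionedStore()
store.put("row1", "info:name", 1, "old name")
store.put("row1", "info:name", 2, "new name")
# store.get("row1", "info:name") == "new name"
```

Keeping old versions around means writes never overwrite data in place, which fits the append-friendly storage underneath (HDFS).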


Chukwa

Chukwa is an open source data collection system built on top of Hadoop for managing large distributed systems. It includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

