Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive provides a mechanism to project structure onto this data and organize it into tables analogous to those in a Relational Database Management System. These tables can then be queried using an SQL-like language called HiveQL which supports select, project, join, aggregate, union and sub-queries in the form of clause. At the same time this language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
HADOOP and HIVE
Hive is a first step in building an open-source warehouse over a web-scale MapReduce data processing system like Hadoop.
Hadoop Cluster: The cluster of inexpensive commodity computers on which the large data set is stored and all processing is performed.
Metadata Store: The location in which the description of the structure of the large data set is kept. Note that a standard database is used to store the metadata.
Warehouse Directory: This is a scratch-pad storage location that Hive uses to store/cache working files. It contains newly created tables and temporary results from user queries. For processing/communication efficiency, it is typically located on a Hadoop Distributed File System (HDFS) located on the Hadoop Cluster.
Hive Common Use Cases
- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence
- Predictive modeling, hypothesis testing