HBase is an open-source distributed database that runs on Hadoop. HBase is “distributed” because it operates on many computers at once. By harnessing the power of multiple computers, you can solve problems that are much bigger than any single computer can tackle.
In HBase a table is physically divided into many regions, which are in turn served by different RegionServers.
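To make the region idea concrete, here is a minimal sketch of how a row key could be mapped to the region (and RegionServer) responsible for it. It assumes each region owns a contiguous, sorted range of row keys that begins at its start key. The region boundaries, server names, and the `locate` helper are hypothetical and are not part of the HBase API.

```python
from bisect import bisect_right

# Hypothetical layout: each region covers the row keys from its start key up to
# (but not including) the next region's start key. The first region starts at
# the empty key, so every possible row key belongs to some region.
REGIONS = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("n", "regionserver-3"),
    ("t", "regionserver-1"),
]
START_KEYS = [start for start, _ in REGIONS]


def locate(row_key):
    """Return (region start key, server name) for the region holding row_key."""
    # bisect_right counts the start keys that are <= row_key;
    # the last of those start keys identifies the owning region.
    index = bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index]


if __name__ == "__main__":
    for key in ("apple", "grape", "zebra"):
        print(key, "->", locate(key))
```

Because the lookup only depends on sorted start keys, one RegionServer can host several regions of the same table, as `regionserver-1` does in this toy layout.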
HBase's easy sharding, fast writes and table scans, very fast bulk data loading, and natural integration with Hadoop provide the cornerstones for successful continuous index builds.
One of its biggest benefits comes from being able to combine real-time HBase queries with batch Hadoop MapReduce jobs, using HDFS as a shared storage platform.
For static data, native Hive tables are significantly more efficient than HBase for both storage and access, so we can periodically take snapshots from HBase into Hive tables for use by queries where data freshness is not paramount.
When to use HBase?
- If you have hundreds of millions or billions of rows, then HBase is a good candidate.
- If you can live without all the extra features that an RDBMS provides, such as typed columns, secondary indexes, transactions, and advanced query languages.
- If you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes.
Key features of HBase
- Strongly consistent reads/writes: This makes it very suitable for tasks such as high-speed counter aggregation.
- Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and redistributed as the data grows.
- Automatic RegionServer failover
- Hadoop/HDFS integration: HBase supports HDFS out of the box as its distributed file system.
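
The automatic sharding feature in the list above can be illustrated with a toy model. This is only a sketch of the idea, not how HBase implements it: the `ToyTable` and `Region` classes are hypothetical, and the split is triggered by a row count to keep the example small, whereas a real cluster decides based on how much data a region holds on disk.

```python
class Region:
    """A contiguous range of row keys that begins at start_key."""

    def __init__(self, start_key, rows=None):
        self.start_key = start_key
        self.rows = dict(rows or {})


class ToyTable:
    """Toy table that splits a region once it holds too many rows.

    Assumption: the threshold is a row count, chosen only for illustration.
    """

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        # Regions are kept sorted by start key; the first one covers everything.
        self.regions = [Region("")]

    def _region_for(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        candidates = [r for r in self.regions if r.start_key <= row_key]
        return candidates[-1]

    def put(self, row_key, value):
        region = self._region_for(row_key)
        region.rows[row_key] = value
        if len(region.rows) > self.max_rows:
            self._split(region)

    def _split(self, region):
        # Split at the middle key: the upper half moves to a new region.
        keys = sorted(region.rows)
        mid = keys[len(keys) // 2]
        upper = Region(mid, {k: region.rows.pop(k) for k in keys if k >= mid})
        self.regions.insert(self.regions.index(region) + 1, upper)


if __name__ == "__main__":
    table = ToyTable(max_rows_per_region=4)
    for i in range(10):
        table.put(f"row{i:02d}", i)
    for region in table.regions:
        print(repr(region.start_key), sorted(region.rows))
```

The caller only ever calls `put`; the table decides on its own when a region has grown too large, which is the property the "automatic sharding" bullet describes.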