Apache HBase is an open-source, scalable, consistent, low-latency, random-access database. It's designed to store record-oriented data across a scalable cluster of machines.
It provides ACID guarantees (atomicity, consistency, isolation, durability) at the row level, and its very low query latency makes it suitable for interactive workloads. Data can be accessed randomly by key, and the failure of individual machines does not block the system.
A distributed, column-oriented NoSQL data store that is part of the Hadoop ecosystem. It is an open-source implementation of Google's Bigtable and is used in production by a large number of enterprises, such as Facebook, Twitter, and Yahoo.
- distributed — rows are spread over many machines;
- consistent — it’s strongly consistent (at a per-row level);
- persistent — it stores data on disk with a log, so it sticks around;
- sorted — rows are stored in sorted order, making seeks very fast;
- sparse — if a row has no value in a column, it doesn’t use any space;
- multi-dimensional — data is addressable in several dimensions: tables, rows, columns, versions, etc.
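The properties above can be pictured as a sorted, sparse, multi-dimensional map: row key, then column, then timestamp. The following is an illustrative toy model in Python, not HBase's actual implementation; all class and method names here are hypothetical.

```python
from collections import defaultdict

class ToyTable:
    """Toy sketch of HBase's logical model: row -> column -> {timestamp: value}."""
    def __init__(self):
        # Sparse: cells exist only where written; absent cells cost nothing.
        self.rows = defaultdict(dict)

    def put(self, row, column, value, timestamp):
        # Multi-dimensional: a cell is addressed by row, column, and version.
        self.rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        # Reads return the newest version by default.
        versions = self.rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        # Sorted: rows come back in lexicographic row-key order.
        for row in sorted(self.rows):
            yield row, self.rows[row]

t = ToyTable()
t.put(b"row2", "cf:name", "alice", timestamp=1)
t.put(b"row2", "cf:name", "alicia", timestamp=2)   # newer version
t.put(b"row1", "cf:age", "30", timestamp=1)

print(t.get(b"row2", "cf:name"))       # newest version wins: alicia
print([r for r, _ in t.scan()])        # rows in sorted order
```

Note how a row with no value in a column simply has no entry in the map, which is exactly how HBase's sparseness avoids storing nulls.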
When do you need HBase?
- High write throughput.
- Low latency big data applications.
- Fast read and write operations.
- Operational applications — if you have a large-data operational application, HBase is a perfect choice; if you have an analytical application, Impala is a better fit.
HBase Data Model
- Data is stored in tables, modeled after Google's Bigtable.
- A table consists of rows, and each row can have an arbitrary number of columns.
- Every cell value is assigned a timestamp (version), which plays an important role during read and write operations.
- Row keys are stored as byte arrays and sorted lexicographically.
- Column names and values are also stored as byte arrays.
- Columns are grouped by their properties into column families.
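Because row keys are compared as raw bytes, numeric keys sort lexicographically rather than numerically, which is a common pitfall when designing keys. A small demonstration (plain Python, standing in for HBase's byte comparison):

```python
# Row keys are compared byte by byte, so "row10" sorts before "row2".
keys = [b"row1", b"row10", b"row2"]
print(sorted(keys))  # [b'row1', b'row10', b'row2']

# Common fix: zero-pad numeric components so byte order matches numeric order.
padded = [b"row%05d" % n for n in (1, 10, 2)]
print(sorted(padded))  # [b'row00001', b'row00002', b'row00010']
```

Key design matters in HBase precisely because scans and region splits follow this byte order.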
Google was among the first companies to move in this direction, because they were operating at the scale of the entire web. So, they long ago built their infrastructure on top of a new kind of system (Bigtable, which is the direct intellectual ancestor of HBase). It grew from there, with dozens of open-source variants on the theme emerging within a few years: Cassandra, MongoDB, Riak, Redis, CouchDB, etc.
HBase is a strongly consistent store. In terms of the CAP theorem, where C is consistency, A is availability, and P is partition tolerance, it is a CP store, not an AP store. Eventual consistency is great when used for the right purpose, but it tends to push challenges up to the application developer. We didn't think we'd be able to absorb that extra complexity for general use in a product with such a large surface area.
It's a high-quality project. It did well in our benchmarks and tests, and it is well respected in the community. Facebook built its entire Messaging infrastructure on HBase (as well as many other things), and the open-source community is active and friendly.
HBase can use Hadoop's distributed filesystem (HDFS) for persistence and offers first-class integration with MapReduce.
- Region servers: serve data for reads and writes; DDL statements are handled by the HMaster. Region servers are colocated with DataNodes for data locality.
- Hadoop DataNode: stores the data that the region servers manage as HDFS files. The crucial property here is data locality.
- NameNode: maintains the metadata for all the blocks that compose the stored files.
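These components cooperate to route each request: every region covers a contiguous range of row keys, and a client finds the right region server by locating the region whose start key precedes the row key. A minimal sketch of that key-range routing, with hypothetical region assignments (not the real HBase client code):

```python
import bisect

# Each region covers [start_key, next_start_key); hypothetical assignment.
region_start_keys = [b"", b"g", b"p"]        # three regions by key range
region_servers = ["rs1", "rs2", "rs3"]       # hosting region servers

def locate(row_key):
    # Find the last region whose start key is <= row_key.
    i = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[i]

print(locate(b"apple"))   # falls in ["", "g") -> rs1
print(locate(b"kiwi"))    # falls in ["g", "p") -> rs2
print(locate(b"zebra"))   # falls in ["p", end) -> rs3
```

The real client caches this region map (fetched from the `hbase:meta` table) so most requests go straight to the right server.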
In HBase, data is physically sharded into what are known as regions. A single region server hosts each region, and each region server is responsible for one or more regions. When data is added to HBase, it’s first written to a write-ahead log (WAL) known as the HLog. Once written to the HLog, the data is then stored in an in-memory MemStore. Once the data in memory exceeds a certain threshold, it’s flushed as an HFile to disk. As the number of HFiles increases with MemStore flushes, HBase will merge several smaller files into a few larger ones, to reduce the overhead of reads. This is known as compaction.
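The write path described above (WAL, then MemStore, then HFile flush, then compaction) can be sketched as a toy simulation. This is illustrative only; the class and threshold here are hypothetical, and real HBase flushes on memory size, not entry count.

```python
class ToyRegionStore:
    """Toy sketch of the HBase write path: WAL -> MemStore -> HFile -> compaction."""
    def __init__(self, flush_threshold=3):
        self.wal = []          # write-ahead log (HLog): appended to first
        self.memstore = {}     # in-memory buffer of recent writes
        self.hfiles = []       # immutable sorted "files" flushed to disk
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))     # 1. durably log the edit
        self.memstore[key] = value        # 2. buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the MemStore out as a sorted, immutable HFile
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def compact(self):
        # 4. merge many small HFiles into one; newer files win on conflicts
        merged = {}
        for hfile in self.hfiles:
            merged.update(dict(hfile))
        self.hfiles = [sorted(merged.items())]

store = ToyRegionStore(flush_threshold=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(k, v)
print(len(store.hfiles))   # 2 -- two flushes produced two small HFiles
store.compact()
print(store.hfiles)        # one merged file; "a" keeps its newest value
```

Note how compaction trades write amplification for cheaper reads: after the merge, a lookup touches one file instead of several.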
When a region server fails, all the regions hosted by that region server will migrate to another region server, providing automatic fail-over. Due to the nature of how failover is architected in HBase, this entails splitting and replaying the contents of the WAL associated with each region, which lengthens the failover time.
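The failover described above can be pictured as splitting the failed server's WAL by region and replaying each region's edits, in order, on its new hosting server. A simplified sketch under assumed data shapes (not the real split/replay code):

```python
# Suppose the failed server's WAL holds edits for several regions,
# each tagged with its region name (hypothetical format).
wal = [
    ("region-A", "row1", "v1"),
    ("region-B", "row9", "v2"),
    ("region-A", "row1", "v3"),   # later edit supersedes v1
    ("region-B", "row7", "v4"),
]

# Step 1: split the single WAL into per-region edit lists.
per_region = {}
for region, row, value in wal:
    per_region.setdefault(region, []).append((row, value))

# Step 2: each new hosting server replays its region's edits in order,
# rebuilding the in-memory state lost with the old server.
def replay(edits):
    memstore = {}
    for row, value in edits:
        memstore[row] = value
    return memstore

recovered = {region: replay(edits) for region, edits in per_region.items()}
print(recovered["region-A"])   # {'row1': 'v3'}
```

The split step is what lengthens failover: the single WAL must be partitioned and shipped before any region can replay and come back online.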