0055 - 085 8112 2187

Big Data Scalable Web Sites
using Hadoop and HBase

At Docner Software, we like to use Hadoop and HBase for our Big Data web application projects. This page explains why.. And it examines the technology stack we use at Docner Software to create scalable web applications, when we need to manage millions of documents and billions of indexed objects (see also our Scalable Web Application projects).

We like hadoop and hbase !

We find that using a massively scalable persistence layer simplifies the software stack enormously. No distributed queues and process managers, no expensive middleware for message passing.. HBase and Hadoop are nicely encapsulating the distribution problem for us. Stateless web ui nodes talking to an hbase cluster - it simply does the trick. When the load on the web UI nodes increases: add machines. When the load on the data cluster increases: add machines.

Scalable Web Technology Stack

The stack we like to use builds on a bare CentOS linux distribution. We add some hardware monitoring, the most recent stable Java, then Hadoop and Hbase on our data cluster nodes.

Our Web nodes run Tomcat or Jetty where we deploy normal WAR files, but we do not use any J2EE servlet session management: the web nodes are completely stateless to make load-balancing and restarts for individual web nodes easy. We rely on hbase- and network caching to keep our logins fast. We hook our authentication mechanism into a the regular JAAS infrastructure.

From that point on, any servlet based html generating technology can be put on top. In our last projects we used Jersey REST and a minimal JAXB-based html library to generate our pages; but JSP is also an option here.

A strict separation between UI, business model and persistence logic helps us to encapsulate access to HBase within our persistence layer - nothing really special. The HBase 'DAO' objects are looked up using standard java ServiceLoader calls, or injected using Google Guice. Actually, any injection framework will do.

Noteworthy remarks about this stack:

This is not only an advantage for developers in the maintenance phase, but also experienced java system operators will recognize most of the toolset instantly. We only change the game where it matters: the hadoop distributed filesystem and the hbase key-value storage.

HBase as a persistence layer for web applications

While the HBase stack is clearly not as mature as Hibernate on top of SQL, the simplicity of the HBase API and the schemaless design makes development fast and efficient. This flexibility does have a price: heavy unit testing on the persistence layer plus strict regression testing must make up for the lack of automatic object-to-record translations. However, as TDD becomes a second nature we actually prefer this 'controlled flexibility' over 'magic mechanisms'.

Working at this lower level provides us with much better control over our performance. For example, when programming the HBase API you are deciding yourself when to scan over a range of keys, and when to scan a complete table. Because you decide every step in the data retrieval process, awareness of and a feeling for performance issues are quite natural. Compare this with trying to get good performance out of SQL; where the query optimizer is hidden from developers by a generic language like SQL and then abstracted away even more when you put Hibernate or JPA on top of that. As we are developing for massive amounts of data, performance is critical. Using HBase we have all the control: we can easily tweak and redesign until we get it right.