Hadoop is maturing, with HDFS and ZooKeeper acting as the foundation for forms of bulk computation beyond MapReduce. This and other developments represent a significant shift away from the ‘plumbing’-focused discussions of the last few years.
Only a couple of years ago, nearly every Hadoop shop built its own ETL pipeline, high-level MapReduce language, and scheduler, wrote its own custom Java MapReduce jobs, and the development team often ran the cluster itself. Furthermore, many kinds of data processing were ineffective, inefficient, or outright impossible for Hadoop developers.
All of this has changed. Here’s how:
Getting Data onto HDFS is easy:
- Sqoop - If you’re playing around with Hadoop and want to throw a database table onto HDFS, Sqoop does so in a one-liner.
- Scribe - Facebook has put a lot of work into maturing Scribe into a solid log aggregator that moves log files from server farms onto HDFS.
- Flume - Flume can reliably move just about any stream of log or event data onto HDFS.
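All three of these tools ultimately do the same thing: write files into HDFS. As a minimal sketch of that underlying plumbing (the NameNode URI and paths below are hypothetical), copying a local log file onto HDFS from Java looks roughly like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; the URI here is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local log file into HDFS -- the same basic operation
        // that Sqoop, Scribe, and Flume automate at scale.
        Path local = new Path("/var/log/app/access.log");
        Path remote = new Path("/data/raw/access.log");
        fs.copyFromLocalFile(local, remote);

        fs.close();
    }
}
```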
At the storage level, HDFS has proven its generic utility.
ZooKeeper’s distributed coordination has enabled efficient implementations of bulk synchronous computation on top of HDFS, extending Hadoop’s capabilities into new areas and exposing new opportunities for research and commercialization. Hadoop now excels at processing graph data as well as log data.
- Apache Giraph enables rapid implementation of graph algorithms that are awkward and inefficient when written in raw MapReduce.
- GoldenOrb offers a Java-based Pregel clone for graph processing from the perspective of a graph’s vertices.
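To make the vertex-centric model concrete, here is a hedged, self-contained Java sketch of a Pregel-style superstep loop computing single-source shortest paths over a tiny in-memory graph. It is deliberately not written against the Giraph or GoldenOrb APIs (their class names differ); it only illustrates the compute/message/halt pattern those frameworks distribute across a cluster:

```java
import java.util.*;

// A toy, single-process illustration of the Pregel-style bulk synchronous
// model that Giraph and GoldenOrb distribute across a cluster: each vertex
// runs compute(), sends messages along its out-edges, and the run proceeds
// in supersteps until no messages remain.
public class PregelStyleShortestPaths {

    static class Vertex {
        final int id;
        final Map<Integer, Integer> edges = new HashMap<>(); // target id -> edge weight
        int distance = Integer.MAX_VALUE;                     // "infinity" until reached

        Vertex(int id) { this.id = id; }

        // One superstep's worth of work for this vertex: take the best incoming
        // distance, and if it improves on what we know, tell our neighbours.
        void compute(List<Integer> messages, Map<Integer, List<Integer>> outbox) {
            int best = distance;
            for (int m : messages) best = Math.min(best, m);
            if (best < distance) {
                distance = best;
                for (Map.Entry<Integer, Integer> e : edges.entrySet()) {
                    outbox.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                          .add(distance + e.getValue());
                }
            }
            // In a real framework the vertex would now vote to halt and be
            // reactivated by incoming messages; here, halting is implicit.
        }
    }

    public static void main(String[] args) {
        // Hypothetical four-vertex graph: 0 -> 1 -> 2 -> 3, plus a costlier shortcut 0 -> 2.
        Map<Integer, Vertex> graph = new HashMap<>();
        for (int i = 0; i < 4; i++) graph.put(i, new Vertex(i));
        graph.get(0).edges.put(1, 1);
        graph.get(0).edges.put(2, 5);
        graph.get(1).edges.put(2, 1);
        graph.get(2).edges.put(3, 1);

        // Superstep 0: send distance 0 to the source vertex to start things off.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        inbox.put(0, Collections.singletonList(0));

        // Run supersteps until no vertex received a message.
        while (!inbox.isEmpty()) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                graph.get(e.getKey()).compute(e.getValue(), outbox);
            }
            inbox = outbox;
        }

        for (Vertex v : graph.values()) {
            System.out.println("vertex " + v.id + " -> distance " + v.distance);
        }
    }
}
```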
High-level languages for MapReduce data processing have become the dominant way to express Hadoop computations.
- HIVE has served as a simple and familiar SQL-based introduction to data processing on Hadoop.
In the business intelligence sphere, HIVE has eliminated the steep increase in the cost of analyzing atomic metrics that used to come once data volume exceeded the capacity of one large commodity machine, forcing a move to costly proprietary network storage and the accompanying massive ‘Oracle tax’ on the enterprise. The development of tools for working with semi-structured and unstructured data alongside SQL tables has amplified HIVE’s value for the analyst (a minimal JDBC sketch follows after this list).
- Better integration with relational databases makes getting data off Hadoop simpler. This multiplies the value of Hadoop’s Java-based data-processing model: complex operations are coded more easily on the JVM in Hadoop, and their output can then be explored with rapid ad hoc SQL on an RDBMS.
- As Apache Pig approaches version 1.0, it has matured into a Turing-complete language well suited to harnessing the data-flow-oriented MapReduce model of computation. With UDFs in many languages, and the unique ability of the ILLUSTRATE command and associated tools to preview a pipeline on sample data, Pig is becoming the default tool for ETL as well as for more complex algorithmic work in data pipelines (a PigServer sketch also follows below). It is hoped that in the near future the development of tools for Pig will parallel those for HIVE, or that existing open source ETL tools will fill this gap.
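To give a flavor of HIVE from the analyst’s or application’s side, here is a minimal sketch of running a HiveQL aggregation over a table already sitting on HDFS via Hive’s JDBC driver. The driver class, connection URL, and the page_views table are assumptions; adjust them for your own installation:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load Hive's JDBC driver; the class name and port assume a classic
        // Hive server on its default port (adjust for your install).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");

        // An ordinary SQL aggregation that Hive compiles into MapReduce jobs.
        // The page_views table is hypothetical.
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM page_views " +
                "GROUP BY url ORDER BY hits DESC LIMIT 10");

        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```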
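Pig can likewise be driven from Java through its PigServer API. Below is a small, hedged sketch of an ETL-style data flow; the HDFS paths and field layout are made up, and the same script could just as well be typed into the Grunt shell:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPipelineExample {
    public static void main(String[] args) throws Exception {
        // Run against the cluster; use ExecType.LOCAL to iterate on a laptop.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // A small data flow: load raw logs, group by URL, count, store the result.
        // The HDFS path and the field layout are hypothetical.
        pig.registerQuery("logs = LOAD '/data/raw/access_logs' " +
                "AS (ip:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;");

        // store() triggers the MapReduce jobs and writes the output back to HDFS.
        pig.store("hits", "/data/reports/hits_by_url");
    }
}
```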
Tighter integration with key/value stores makes it easy to push data mined on HDFS into production, where it can be served to end users at scale.
- HBase (the Hadoop database) and Project Voldemort are both highly available NoSQL stores with tight Hadoop integration.
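As a rough sketch of that last mile, here is what pushing a mined result into HBase for online serving can look like, using the HBase 0.90-era client API; the table name, column family, and row layout are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseServingExample {
    public static void main(String[] args) throws Exception {
        // Reads the cluster location from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical table holding per-user recommendations computed on Hadoop.
        HTable table = new HTable(conf, "recommendations");

        // Write one mined result: row = user id, column family "d", qualifier "top".
        Put put = new Put(Bytes.toBytes("user42"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("top"),
                Bytes.toBytes("item-17,item-3,item-99"));
        table.put(put);

        // The serving layer reads it back with a cheap point lookup.
        Result result = table.get(new Get(Bytes.toBytes("user42")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("top"))));

        table.close();
    }
}
```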
In sum: the Hadoop Pantheon continues to branch and grow, creating opportunities for new kinds of applications and ventures as each tool is applied to different application domains.
- Update: The key development here is that Hadoop’s plumbing has matured, and that Hadoop now works well on graph and network data. However, a number of deserving projects not included here are:
- MongoDB - has added Hadoop integration
- Mahout - command-line machine learning on Hadoop, with working examples and example data sets. A damned good way to get started leveraging machine learning in your data processing.
Please continue to submit updates and corrections and I will continue to update the post.