Resource Management in HDFS and Parallel Databases

A recent survey from Duke University and Microsoft Research provides a fascinating overview of the evolution of massively parallel data processing systems. It traces traditional row-oriented parallel databases before covering columnar databases, MapReduce-based systems, and finally the latest Dataflow systems like Spark.

Two of the areas of analysis are resource management and system administration. An interesting tradeoff becomes apparent as you trace the progression of these systems.

Traditional row- and column-oriented databases have a rich set of resource management tools available. Experienced database administrators (DBAs) can tune individual nodes or the entire system based on hardware capacity, partitioning, typical workloads, and data distribution. Perhaps just as importantly, the DBA can draw on decades of experience and best practices during this tuning. Linear scalability, however, is a bit more challenging. In theory, many parallel database systems support adding more nodes to balance the workload, but in practice getting the best value out of new resources requires careful management of data partitions.
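To make the partitioning challenge concrete, here is a minimal sketch (illustrative only; real parallel databases implement this internally) of how hash partitioning on a join key co-locates matching rows, so a join can run on each node without cross-node data movement:

```python
# Toy hash partitioning: rows sharing a join key hash to the same
# node, so a join on that key needs no network shuffle.

def hash_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

orders = [{"cust_id": c, "total": 10 * c} for c in range(8)]
customers = [{"cust_id": c, "name": f"c{c}"} for c in range(8)]

order_parts = hash_partition(orders, "cust_id", 4)
customer_parts = hash_partition(customers, "cust_id", 4)

# The join on cust_id can now run independently on each node:
for node in range(4):
    keys_o = {r["cust_id"] for r in order_parts[node]}
    keys_c = {r["cust_id"] for r in customer_parts[node]}
    assert keys_o == keys_c  # matching keys are co-located
```

The catch the paragraph alludes to: if data grows skewed, or nodes are added, the DBA must repartition (or pre-plan with consistent hashing), which is exactly the kind of ongoing care that linear scaling demands.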

Similarly, DBAs have access to many high-quality system administration tools that provide performance monitoring, query diagnostics, and recovery assistance. Over the years, these tools have evolved to allow very granular tuning of query plans, indexes, partitions, and schemas.

Reading between the lines, you had better have a good team of DBAs on hand. Classic database systems are expensive to purchase and operate, and knowing how to turn all of those dials to get the best performance is a challenge. Query optimization, for example, can be quite complex. Knowing how to best partition the data for efficient joins across a massive data set is not a solved problem in all cases, especially when a columnar data layout is used.

There’s a very big contrast in these areas when you look at systems built on HDFS, from the original MapReduce designs to the latest Dataflow systems like Spark. The very first design of MapReduce opted for simplicity, with a static allocation of resources and the ability to easily add new nodes into the cluster. Later evolutions of Hadoop introduced improvements like YARN, which provides more flexible resource management schemes while still allowing easy cluster expansion, with the HDFS Balancer taking care of moving data onto new nodes. The newest Dataflow systems have the potential for much better resource management still, using in-memory techniques to cut processing time. Most notably, systems like Spark can optimize whole jobs represented as directed acyclic graphs (DAGs) of operations.
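The DAG idea is worth a quick illustration. The following is a toy lazy-evaluation model in the spirit of Spark's approach, not real Spark code: transformations only record nodes in a graph, and an action triggers execution, at which point adjacent narrow transformations can be fused into a single pass over the data.

```python
# Illustrative sketch of lazy DAG-style evaluation (not Spark's API).

class LazyDataset:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []        # the recorded DAG (a simple chain here)

    def map(self, fn):               # transformation: nothing runs yet
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, pred):          # transformation: nothing runs yet
        return LazyDataset(self._data, self._ops + [("filter", pred)])

    def collect(self):               # action: execute the recorded chain
        out = []
        for item in self._data:      # one fused pass over the data,
            keep = True              # instead of one pass per operation
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif kind == "filter" and not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

ds = LazyDataset(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(ds.collect())  # [0, 20, 40, 60, 80]
```

Because nothing executes until the action, the system sees the whole job up front and can reorder, fuse, or cache steps, which is the sense in which Spark recovers some of the query optimization that traditional databases do with a priori plans.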

System administration in Hadoop is an evolving field. Some expertise exists in cluster management (or you can delegate that chore to cloud systems), but a Hadoop administrator does not have the same set of tools available to a traditional DBA; indeed, a priori plan optimization is not even feasible, since many ‘Big Data’ analytics packages only interpret the structure of the data at query time.
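That "schema-on-read" point can be shown in a few lines. In this toy example (real engines such as Hive or Spark SQL apply the same idea at scale), the structure of each record is discovered only when the query runs, so no index or plan could have been built ahead of time:

```python
# Schema-on-read sketch: record structure emerges at query time.
import json

raw_lines = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "referrer": "ads"}',  # extra field is fine
]

# No predeclared schema; each record is parsed as the query runs:
total = sum(json.loads(line).get("clicks", 0) for line in raw_lines)
print(total)  # 10
```

The flexibility is the point: records with differing fields coexist happily, at the cost of giving up the upfront optimization a fixed schema makes possible.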

To sum this up, I think that the ‘Big Data’ solutions have made (and continue to make) an interesting design choice by sacrificing some of the advanced resource management and system administration tools available to DBAs. (Again, some of these simply aren’t available when you do not know the data schema in advance.) Instead they favor a simplified internal representation of data and jobs, which allows for easier expansion of the cluster.

To put it another way, a finely tuned traditional parallel database will probably outperform a Hadoop cluster given sufficient hardware, expertise, and advanced knowledge of the data. On the other hand, that Hadoop cluster can grow easily with commodity hardware (beyond the breaking point of traditional systems) and not much tuning expertise other than cluster administration, which is a cost that can be spread over a large pool of applications. Plus, you don’t need to make assumptions about your data in advance. Dataflow systems like Spark will go a long way towards closing the performance gap, but in essence Big Data solutions are performing a cost-benefit analysis and coming down on the side of simplicity and ease of expansion.

This may be old hat to Big Data veterans, but I found the paper to be a great refresher on how the Big Data field reached its current position and where it’s going in the future.
