Hortonworks, one of our partners in the Open Data Platform Initiative, recently released version 2.2.4 of the Hortonworks Data Platform (HDP). It bundles Apache Spark 1.2.1. That’s a clear indicator (if we needed another one) that Spark has entered the Hadoop mainstream. Are you ready for it?
Spark opens up a new realm of use cases for Hadoop because it offers very fast in-memory data processing. Spark has outperformed Hadoop MapReduce in several benchmarks and provides a unified framework for batch, SQL, and streaming workloads.
But Spark presents new challenges for Hadoop infrastructure architects. Spark nodes favor more memory and CPU, and fewer drives, than a typical Hadoop data node. The art of monitoring and tuning Spark is still in its early days.
Hortonworks is addressing many of these challenges by including Spark in HDP 2.2.4 and integrating it into Ambari. And now WANdisco is making it even easier to get started with Spark by giving you the flexibility to deploy Spark into a separate cluster while still using your production data.
WANdisco Fusion uses active-active data replication to make the same Hadoop data available and usable consistently from several Hadoop clusters. That means you can run Spark against your production data, but isolate it on a separate cluster (perhaps in the cloud) while you get up to speed on hardware sizing and performance monitoring. You can continue to run Spark this way indefinitely in order to isolate any potential performance impact, or eventually migrate Spark to your main cluster.
Sharing data while keeping compute resources separate gives you the extra flexibility you need to rapidly deploy new Hadoop technologies like Spark without impacting critical applications on your main cluster. Hortonworks and WANdisco make it easy to get started with Spark. Get in touch with our solution architects today.