Big Data Blog


Hortonworks and WANdisco make it easy to get started with Spark

Hortonworks, one of our partners in the Open Data Platform Initiative, recently released version 2.2.4 of the Hortonworks Data Platform (HDP).  It bundles Apache Spark 1.2.1.  That’s a clear indicator (if we needed another one) that Spark has entered the Hadoop mainstream.  Are you ready for it?

Spark opens up a new realm of use cases for Hadoop since it offers very fast in-memory data processing.  Spark has blown through several Hadoop benchmarks and offers a unified batch, SQL, and streaming framework.

But Spark presents new challenges for Hadoop infrastructure architects.  It favors memory and CPU, with fewer drives than a typical Hadoop data node.  The art of monitoring and tuning Spark is still in its early days.

Hortonworks is addressing many of these challenges by including Spark in HDP 2.2.4 and integrating it into Ambari.  And now WANdisco is making it even easier to get started with Spark by giving you the flexibility to deploy Spark into a separate cluster while still using your production data.

WANdisco Fusion uses active-active data replication to make the same Hadoop data available and usable consistently from several Hadoop clusters.  That means you can run Spark against your production data, but isolate it on a separate cluster (perhaps in the cloud) while you get up to speed on hardware sizing and performance monitoring.  You can continue to run Spark this way indefinitely in order to isolate any potential performance impact, or eventually migrate Spark to your main cluster.
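
If you want a feel for what that looks like in practice, here is a minimal PySpark sketch (Spark 1.2-era API) running on the isolated Spark cluster; the namenode host and HDFS path are hypothetical placeholders for wherever your replicated production data lands.

    # Minimal Spark job on an isolated cluster, reading production data
    # that active-active replication has made available locally.
    # The namenode host and path below are hypothetical placeholders.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("spark-on-replicated-data")
    sc = SparkContext(conf=conf)

    # No manual copy step: the replicated data is readable from this cluster.
    events = sc.textFile("hdfs://spark-nn.example.com:8020/data/prod/events")
    print(events.count())  # quick sanity check against production data

    sc.stop()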

Sharing data while keeping compute resources separate gives you the extra flexibility you need to rapidly deploy new Hadoop technologies like Spark without impacting critical applications on your main cluster.  Hortonworks and WANdisco make it easy to get started with Spark.  Get in touch with our solution architects today.

WANdisco Fusion Q&A with Jagane Sundar, CTO

Tuesday we unveiled our new product: WANdisco Fusion. Ahead of the launch, we caught up with WANdisco CTO Jagane Sundar, who was one of the driving forces behind Fusion.

Jagane joined WANdisco in November 2012 after the firm’s acquisition of AltoStor and has since played a key role in the company’s product development and rollout. Prior to founding AltoStor along with Konstantin Shvachko, Jagane was part of the original team that developed Apache Hadoop at Yahoo!.

Jagane, put simply, what is WANdisco Fusion?

JS: WANdisco Fusion is a wonderful piece of technology that’s built around a strongly consistent transactional replication engine, allowing for the seamless integration of different types of storage for Hadoop applications.

It was designed to help organizations get more out of their Big Data initiatives, answering a number of very real problems facing the business and IT worlds.

And the best part? All of your data centers are active simultaneously: You can read and write in any data center. The result is you don’t have hardware that’s lying idle in your backup or standby data center.

What sort of business problems does it solve?

JS: It provides two new important capabilities for customers. First, it keeps data consistent across different data centers no matter where they are in the world.

Second, it gives customers the ability to integrate different storage types into a single Hadoop ecosystem. With WANdisco Fusion, it doesn’t matter if you are using Pivotal in one data center, Hortonworks in another and EMC Isilon in a third – you can bring everything into the same environment.

Why would you need to replicate data across different storage systems?

JS: The answer is very simple. Anyone familiar with storage environments knows how diverse they can be. Different types of storage have different strengths depending on the individual application you are running.

However, keeping data synchronized across systems is very difficult to get right. Fusion removes this challenge while maintaining data consistency.

How does it help future proof a Hadoop deployment?

JS: We believe Fusion will form a critical component of companies’ workflow update procedures. You can update your Hadoop infrastructure one data center at a time, without impacting application availability or having to copy massive amounts of data once the update is done.

This helps you deal with updates from both Hadoop and application vendors in a carefully orchestrated manner.

Doesn’t storage-level replication work as effectively as Fusion?

JS: The short answer is no. Storage-level replication is subject to latency limitations that are imposed by file systems. The result is you cannot really run storage-level replication over long distances, such as a WAN.

Storage-level replication is nowhere near as functional as Fusion: It has to happen at the LAN level and not over a true Wide Area Network.

With Fusion, you have the ability to integrate diverse systems such as NFS with Hadoop, allowing you to exploit the full strengths and capabilities of each individual storage system – I’ve never worked on a project as exciting and as revolutionary as this one.

How did WANdisco Fusion come about?

JS: By getting inside our customers’ data centers and witnessing the challenges they faced. It didn’t take long to notice the diversity of storage environments.

Our customers found that different storage types worked well for different applications – and they liked it that way. They didn’t want strict uniformity across their data centers, but to be able to leverage the strengths of each individual storage type.

At that point we had the idea for a product that would help keep data consistent across different systems.

The result was WANdisco Fusion: a fully replicated transactional engine that makes the work of keeping data consistent trivial. You only have to set it up once and never have to bother with checking if your data is consistent.

This vision of a fully utilized, strongly consistent, diverse storage environment for Hadoop is what we had in mind when we came up with the Fusion product.

You’ve been working with Hadoop for the last 10 years. Just how disruptive is WANdisco Fusion going to be?

JS: I’ve actually been in the storage industry for more than 15 years now. Over that period I’ve worked with shared storage systems, and I’ve worked with Hadoop storage systems. WANdisco Fusion has the potential to completely revolutionize the way people use their storage infrastructure. Frankly, this is the most exciting project I’ve ever been part of.

As the Hadoop ecosystem evolved I saw the need for this virtual storage system that integrates different types of storage.

Efforts to make Hadoop run across different data centers have been mostly unsuccessful. For the first time, we at WANdisco have a way to keep your data in Hadoop systems consistent across different data centers.

The reason this is so exciting is that it transforms Hadoop into something that runs in multiple data centers across the world.

Suddenly you have capabilities that even the original inventors of Hadoop didn’t really consider when it was conceived. That’s what makes WANdisco Fusion exciting.

The inspiration for WANdisco Fusion


Roughly two years ago, we sat down to start work on a project that finally came to fruition this week.

At that meeting, we set ourselves the challenge of redefining the storage landscape.  We wanted to map out a world where there was complete shared storage, but where the landscape remained entirely heterogeneous.

Why? Because we’d witnessed the beginnings of a trend that has only grown more pronounced with the passage of time.

From the moment we started engaging with customers, we were struck by the extreme diversity of their storage environments. Regardless of whether we were dealing with a bank, a hospital or utility provider, different types of storage had been introduced across every organization for a variety of use cases.

In time, however, these same companies wanted to start integrating their different silos of data, whether to run real-time analytics or to gain a full 360-degree perspective of performance. Yet preserving diversity across data centers was critical, given that each storage type has its own strengths.

They didn’t care about uniformity. They cared about performance and this meant being able to have the best of both worlds. Being able to deliver this became the Holy Grail – at least in the world of data centers.

This isn’t quite the Gordian Knot, but it’s certainly a very difficult, complex problem, and possibly one that could only be solved with our core patented IP, DConE.

Then we had a breakthrough.

Months later, I’m proud to formally release WANdisco Fusion (WD Fusion), the only product that enables WAN-scope active-active synchronization across different storage systems.

What does this mean in practice? Well, it means that you can use Hadoop distributions like Hortonworks, Cloudera or Pivotal for compute, Oracle BDA for fast compute, and EMC Isilon for dense storage. You could even use a complete variety of Hadoop distros and versions. Whatever your set-up, with WD Fusion you can leverage new and existing storage assets immediately.

With it, Hadoop is transformed from something that runs within a data center into an elastic platform that runs across multiple data centers throughout the world. WD Fusion allows you to update your storage infrastructure one data center at a time, without impacting application availability or having to copy vast swathes of data once the update is done.

When we were developing WD Fusion we agreed upon two things. First, we couldn’t produce anything that made changes to the underlying storage system – this had to behave like a client application. Second, anything we created had to enable a complete single global namespace across an entire storage infrastructure.

With WD Fusion, we allow businesses to bring together different storage systems by leveraging our existing intellectual property – the same Paxos-powered algorithm behind Non-Stop Hadoop, Subversion Multisite and Git Multisite – without making any changes to the platform you’re using.

Another way of putting it is we’ve managed to spread our secret sauce even further.

We have some of the best computer scientists in the world working at WANdisco, but I’m confident that this is the most revolutionary project any of us have ever worked on.

I’m delighted to be unveiling WD Fusion. It’s a testament to the talent and character of our firm, the result of looking at an impossible scenario and saying: “Challenge accepted.”


About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world’s most promising technology companies. Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of AltoStor, which accelerated the development of WANdisco’s first products for the Big Data market. The majority of WANdisco’s core technology is now produced out of the company’s flourishing software development base in David’s hometown of Sheffield, England and in Belfast, Northern Ireland.

David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on the London Stock Exchange (WAND:LSE), raising over £24m to drive business growth. With over 15 years' executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in Enterprise Software and is recognised as an industry leader in Enterprise Application Integration and its standards. David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail.

Active-active strategies for data protection

A new report preview from 451 Research highlights some of the challenges facing data center operators. Two of the conclusions stood out in particular.  First, disaster recovery (DR) strategies are top of mind as IT operations become increasingly centralized, which raises the cost of an outage. 42% of data center operators are evaluating DR strategies, and a majority (62%) are using active-active strategies for data protection. Second, data center operators are playing in a more complicated world now. The ability to operate applications and data centers in a hybrid cloud environment is called out as a particular area of interest.

These findings echo what we’re hearing from our own customers. For many enterprise IT architects, active-active data replication is a checklist item when deploying a vital service like a Hadoop cluster. Many WANdisco Fusion customers buy our products for precisely that reason. And we’re also seeing strong interest in WANdisco Fusion’s unique ability to provide that replication between Hadoop clusters that use different distributions and storage systems, on-premise or in the cloud.

Visit 451 Research to obtain the full report. In the meantime, our solution architects can help you evaluate your own DR and hybrid deployment strategies.

Improving HBase Scalability for Real-time Applications

When we introduced Non-Stop for Apache HBase, we explained how it would improve HBase reliability for critical applications.  But Non-Stop for Apache HBase also uniquely improves HBase scalability and performance.

By enabling multiple active-active region servers, Non-Stop for Apache HBase alleviates some common HBase performance woes.  Clients are load balanced across several region servers for any particular region, and spreading the load among them reduces the impact of problems like region ‘hot spots’.

[Diagram: Non-Stop for Apache HBase region servers replicated across a WAN]

So far so good, but you might be thinking that you could get the same benefit by using HBase read-HA.  However, HBase read-HA is limited to read operations in a single data center.  Non-Stop for Apache HBase lets you put region servers in several data centers, and any of them can handle write operations.  That gives you a few nice benefits (see the sketch after this list):

  • Writes can be directed to any region server, reducing the chance that a single region server becomes a bottleneck due to hot spots or garbage collection.
  • Applications at other data centers now have fast access to a ‘local’ region server.
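
As a toy illustration of that first benefit, here is a sketch of the general round-robin idea of spreading writes across several active endpoints.  This is not WANdisco’s client code; the server names and the put_row stub are hypothetical.

    import itertools

    # Hypothetical active-active region servers serving the same region;
    # with Non-Stop for Apache HBase, any of them can accept writes.
    REGION_SERVERS = ["rs1.dc1:16020", "rs2.dc1:16020", "rs3.dc2:16020"]

    # Round-robin rotation: each write goes to the next server in turn,
    # so no single region server absorbs all the hot-spot traffic.
    _rotation = itertools.cycle(REGION_SERVERS)

    def put_row(row_key, value):
        server = next(_rotation)
        # A real client would issue an HBase RPC here; this stub just
        # shows how writes fan out across the active servers.
        print("PUT %s=%s -> %s" % (row_key, value, server))

    for i in range(6):
        put_row("row-%d" % i, "v%d" % i)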

Although the HBase community continues to try to improve HBase performance, there are some bottlenecks that just can’t be eliminated without active-active replication.  No other solution lets you use several active region servers per region, and put those region servers at any location without regard to WAN latency.

If you’ve ever struggled with HBase performance, you should give Non-Stop for Apache HBase a close look.

Reducing Hadoop network vulnerabilities for backup and replication

Managing network connections between Hadoop clusters in different data centers is a significant challenge for Hadoop and network administrators. WANdisco Fusion reduces the number of connections required for any cross-cluster data flow, thereby reducing Hadoop network vulnerabilities for backup and replication.

DistCP is the tool used for data transfer in almost every Hadoop backup and workflow system.  DistCP requires connectivity from each data node in the source cluster to each data node in the target cluster.

[Diagram: DistCP requires connections from each data node to each data node]

Typically each data node-to-data node connection requires configuring two connections for inbound and outbound traffic to cross the firewall and navigate any intermediate proxies.  With 16 data nodes in each cluster, that means 16 × 16 × 2 = 512 connections to configure, secure, and monitor!  That’s a nightmare for Hadoop administrators and network operators.  Just ask the people responsible for planning and running your Hadoop cluster.

WANdisco Fusion solves this problem by routing cross-cluster communication through a handful of servers.  As the diagram below illustrates, in the simplest case you’ll have one server in the source cluster talking to one server in the target cluster, requiring a grand total of 2 connections to configure.

[Diagram: WANdisco Fusion requires only a handful of network connections]

In a realistic deployment, you’d require additional connections for the redundant WANdisco Fusion servers – this is an active-active configuration after all.  Still, in a large deployment you’d see a few tens of connections, rather than many hundreds.
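
As a back-of-the-envelope check on those numbers, here is a small sketch of the connection math.  The two-firewall-rules-per-node-pair assumption comes straight from the example above, and the Fusion count assumes each Fusion server at one site talks to each Fusion server at the other.

    def distcp_connections(src_nodes, dst_nodes, rules_per_pair=2):
        # Full mesh: every source data node talks to every target data
        # node, and each pair needs inbound + outbound firewall rules.
        return src_nodes * dst_nodes * rules_per_pair

    def fusion_connections(servers_per_site, rules_per_pair=2):
        # Cross-cluster traffic is routed through the Fusion servers
        # only, so just those server pairs need firewall rules.
        return servers_per_site * servers_per_site * rules_per_pair

    print(distcp_connections(16, 16))  # 512, as in the example above
    print(fusion_connections(1))       # 2, the simplest case
    print(fusion_connections(3))       # 18, with redundant Fusion servers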

The most recent spate of data breaches is driving a 10% annual increase in cybersecurity spending.  Why make yourself more vulnerable by exposing your entire Hadoop cluster to the WAN?  Our solution architects can help you reduce your Hadoop network exposure.

Improving HBase Resilience for Real-time Applications

HBase is the NoSQL database of choice for Hadoop, and now supports critical real-time workloads in financial services and other industries. As HBase has grown more important for these workloads, the Hadoop community has focused on reducing potential down time in the event of region server failure. Rebuilding a region server can take 15 minutes or more, and even the latest improvements only provide timeline-consistent read access using standby region servers. In many critical applications, losing write access for more than a few seconds is simply unacceptable.

[Diagram: native HBase region server architecture]

Enter Non-Stop for Apache HBase. Built on WANdisco’s patented active-active replication engine, it provides fully consistent active-active access to a set of replicated region servers. That means that your HBase data is always safe and accessible for read and write activity.


By providing fully consistent active-active replication for region servers, Non-Stop for Apache HBase gives applications always-on read/write access for HBase.

Putting a replica in a remote location also provides geographic redundancy. Unlike native HBase replication, region servers at other data centers are fully writable and guaranteed to be consistent. Non-Stop for Apache HBase includes active-active HBase masters, so full use of HBase can continue even if an entire cluster is lost.

Non-Stop for Apache HBase also simplifies the management of HBase, as you no longer need complicated asynchronous master-slave setups for backup and high availability.

Take a look at what Non-Stop for Apache HBase can do for your low-latency and real-time analysis applications.

Monitoring active-active replication in WANdisco Fusion

WANdisco Fusion provides a unique capability: active-active data replication between Hadoop clusters that may be in different locations and run very different types of Hadoop.

From an operational perspective, that capability poses some new and interesting questions about cross-cluster data flow.  Which cluster does data most often originate at?  How fast is it moving between clusters?  And how much data is flowing back and forth?

WANdisco Fusion captures a lot of detailed information about the replication of data that can help to answer those questions, and it’s exposed through a set of REST endpoints (see the sketch after the list below).  The captured information includes:

  • The origin of replicated data (which cluster it came from)
  • The size of the files
  • Transfer rate
  • Transfer start, stop, and elapsed time
  • Transfer status
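
As a sketch of pulling those records yourself, the snippet below queries an endpoint and summarizes transfer rates.  The endpoint path and JSON field names here are hypothetical placeholders, not the documented Fusion API; check your Fusion server’s REST documentation for the real ones.

    import requests

    # Hypothetical endpoint exposing per-file replication transfer records.
    FUSION_URL = "http://fusion-server.example.com:8082/fusion/transfers"

    resp = requests.get(FUSION_URL, timeout=10)
    resp.raise_for_status()
    transfers = resp.json()  # assumed: a list of per-transfer JSON records

    for t in transfers:
        size_kb = t["bytes"] / 1024.0
        elapsed_s = t["elapsed_ms"] / 1000.0
        rate = size_kb / elapsed_s if elapsed_s else 0.0
        print("%s -> %s: %.0f kb at %.0f kb/s (%s)" % (
            t["origin"], t["destination"], size_kb, rate, t["status"]))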

A subset of this information is visible in the WANdisco Fusion user interface, but I decided it would be a good chance to dust off my R scripts and do some visualization on my own.

For example, I can see that the replication data transfer rate between my two EC2 clusters is roughly between 800 and 1200 kb/s.

[Plot: file transfer rate]

And the size of the data is pretty small, between 600 and 900 kb.

[Plot: file transfer size]

Those are just a couple of quick examples that I captured while running a data ingest process.  But over time it will be very helpful to keep an eye on the flow of data between your Hadoop clusters.  You could see, for instance, whether there are any peak ingest times for clusters in different geographic regions.
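
To go one step further, spotting peak ingest hours takes only a few lines.  Again a sketch: the start_ms field is a hypothetical epoch-millisecond timestamp on each transfer record, not a documented field name.

    from collections import Counter
    from datetime import datetime, timezone

    def peak_ingest_hours(transfers):
        # Tally transfers by hour of day to reveal peak ingest windows.
        # Assumes each record carries a hypothetical 'start_ms' epoch field.
        hours = Counter()
        for t in transfers:
            ts = datetime.fromtimestamp(t["start_ms"] / 1000.0, tz=timezone.utc)
            hours[ts.hour] += 1
        return hours.most_common(3)  # the three busiest hours of the day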

Beyond all the other benefits of WANdisco Fusion, it provides this wealth of operational data to help you manage your Hadoop deployment. Be sure to contact one of our solution architects if this information could be of use to you.

Transforming health care with Big Data

There’s a lot of hype around Big Data these days, so it’s refreshing to hear a real success story directly from one of the practitioners.  I was lucky a couple of weeks ago to attend a talk given by Charles Boicey, an Enterprise Analytics Architect, at an event sponsored by WANdisco, Hortonworks, and Slalom Consulting.  Charles helped put a Big Data strategy in place at the University of California – Irvine (UCI) Medical Center, and is now working on a similar project at Stony Brook.

If you’ve ever read any of Atul Gawande’s publications, you’ll know that the U.S. health care system is challenged by a rising cost curve.  Thoughtful researchers are trying to address costs and improve quality of care by reducing error rates, focusing on root causes of recurring problems, and making sure that health care practitioners have the right data at the right time to make good decisions.

Mr. Boicey is in the middle of these transformational projects.  You can read about this work on his Twitter feed and elsewhere, and WANdisco has a case study available.  One thing that caught my attention in his latest talk is the drive to incorporate data from social media and wearable devices to improve medical care.  Mr. Boicey mentioned that sometimes patients will complain on Facebook while they’re still in the hospital – and that’s probably a good thing for the doctors and nurses to know.

And of course, all of the wearable devices that track daily activity and fitness would be a boon to medical providers if they could get a handle on that data easily.  The Wall Street Journal has a good write-up on the opportunities and challenges in this area.

It’s nice to see Big Data in concrete applications that will truly benefit society.  It’s not just a tool for making the web work better anymore.

Benefits of WANdisco Fusion

In my last post I described WANdisco Fusion’s cluster-spanning file system. Now think of what that offers you:

  • Ingest data to any cluster and share it quickly and reliably with other clusters. That’ll remove fragile data transfer bottlenecks while still letting you process data at multiple places to improve performance and get more utilization out of backup clusters.
  • Support a bimodal or multimodal architecture to enable innovation without jeopardizing SLAs. Perform different stages of the processing pipeline on the best cluster. Need a dedicated high-memory cluster for in-memory analytics? Or want to take advantage of an elastic scale-out on a cheaper cloud environment? Got a legacy application that’s locked to a specific version of Hadoop? WANdisco Fusion has the connections to make it happen. And unlike batch data transfer tools, WANdisco Fusion provides fully consistent data that can be read and written from any site.


  • Put away the emergency pager. If you lose data on one cluster, or even an entire cluster, WANdisco Fusion has made sure that you have consistent copies of the data at other locations.


  • Set up security tiers to isolate sensitive data on secure clusters, or keep data local to its country of origin.


  • Perform risk-free migrations. Stand up a new cluster and seamlessly share data using WANdisco Fusion. Then migrate applications and users at your leisure, and retire the old cluster whenever you’re ready.

Read more

Interested? Check out the WANdisco Fusion page or call us for details.