Monthly Archive for June, 2014

Running Hadoop trial clusters

How not to blow your EC2 budget

Lately I’ve been spinning up a lot of Hadoop clusters for various demos and just trying new features. I’m pretty sensitive to exploding my company’s EC2 budget, so I dutifully shut down all my EC2 instances when I’m done.

Doing so has proven to be a bit painful, however. Restarting a cluster takes time, and there's usually a service or two that doesn't start properly and needs to be restarted. I don't have a lot of dedicated hardware available — just my trusty MBP. I realize I could run a single-node cluster using a sandbox from one of the distribution vendors, but I really need multiple nodes for some of my testing.

I want to run multi-node clusters as VMs on my laptop and be able to leave them up and running over a weekend if necessary. I've found a couple of approaches that look promising:

  • Setting up traditional VMs using Vagrant. I use Vagrant for almost all of my VMs and this would be comfortable for me.
  • Trying to run a cluster using Docker. I have not used Docker but have heard a lot about it, and I’m hoping it will help with memory requirements. I have 16 GB on my laptop but am not sure how many VMs I can run in practice.

Here’s my review after quickly trying both approaches.

VMs using Vagrant

The Vagrant file referenced in the original article works well. After installing a couple of Vagrant plugins I ran ‘vagrant up’ and had a 4-node cluster ready for CDH installation. Installation using Cloudera Manager started smoothly.

Unfortunately, it looks like running 4 VMs consuming 2 GB of RAM each is just asking too much of this machine. (The issue might be CPU more than RAM — the fan was in overdrive during installation.) I could only get 3 of the VMs to complete installation, and then during cluster configuration only two of them were available to run services.

Docker

Running Docker on the Mac isn't ideal, as you need to fire up a VM to actually manage the containers. (I've yet to find the perfect development laptop: Windows isn't wholly compatible with all of the tools I use, the Mac isn't always just like Linux, and Linux doesn't play well with the Office tools I have to use.) On the Mac there's definitely a learning curve. The Docker containers that actually run the cluster are only accessible from the VM that hosts them, at least in the default configuration I used, which means I had to forward ports from that VM to my host OS.

Briefly, my installation steps were:

  • Launch the boot2docker app
  • Import the docker images referenced in the instructions (took about 10 minutes)
  • Run a one line command to deploy the cluster (this took about 3 minutes)
  • Grab the IP addresses for the containers and set up port forwarding on the VM for the important ports (8080, 8020, 50070, 50075); see the sketch after this list
  • Log in to Ambari and verify configuration
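
For step 4, here's a minimal sketch of one way to handle the host-to-VM hop, assuming boot2docker's default VirtualBox VM name (boot2docker-vm) and that VBoxManage is on your PATH; treat the port list and rule names as placeholders for whatever your cluster actually exposes.

```python
# Sketch: forward the interesting ports from the boot2docker VM to the host OS.
# Assumes the default VM name "boot2docker-vm" and VBoxManage on the PATH.
import subprocess

VM_NAME = "boot2docker-vm"          # boot2docker's default VirtualBox VM (assumption)
PORTS = [8080, 8020, 50070, 50075]  # Ambari UI, HDFS RPC, NameNode UI, DataNode UI

for port in PORTS:
    # Equivalent to:
    #   VBoxManage controlvm boot2docker-vm natpf1 "hadoop-8080,tcp,127.0.0.1,8080,,8080"
    rule = "hadoop-{0},tcp,127.0.0.1,{0},,{0}".format(port)
    subprocess.check_call(["VBoxManage", "controlvm", VM_NAME, "natpf1", rule])
```

That only covers the VM-to-host side; the forwarding from the VM to the container IPs still has to be set up inside the VM, as the instructions describe.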

At this point I was able to execute a few simple HDFS operations using webhdfs to confirm that I had an operational cluster.
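
The checks were nothing fancy. A minimal sketch of that kind of WebHDFS smoke test might look like the following (the host, port, and user name are assumptions based on my forwarded NameNode port):

```python
# Sketch: simple WebHDFS smoke test against the forwarded NameNode port.
# Host, port, and user.name are assumptions; adjust for your own cluster.
import requests

BASE = "http://127.0.0.1:50070/webhdfs/v1"
USER = "hdfs"  # assumed user for a throwaway test cluster

# Create a scratch directory.
r = requests.put(BASE + "/tmp/smoketest", params={"op": "MKDIRS", "user.name": USER})
r.raise_for_status()
print(r.json())  # expect {"boolean": true}

# List /tmp to confirm the directory is really there.
r = requests.get(BASE + "/tmp", params={"op": "LISTSTATUS", "user.name": USER})
r.raise_for_status()
for status in r.json()["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["type"])
```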

The verdict

The verdict is simple — I just don’t have enough horsepower to run even a 4-node cluster using traditional VMs. With Docker, I was running the managing VM and three containers in a few minutes, and I didn’t get the sense that my machine was struggling at all. I did take a quick look at resource usage but I hesitate to report the numbers as I had a lot of other stuff running on my machine at the same time.

Docker takes some getting used to, but once I had a sense of how the containers and the VM were interacting, I figured out how to manage the ports and configuration. I think learning Docker will be like learning Vim after using a full-blown IDE — if you invest the time, you'll be able to do some things very quickly without putting a lot of stress on your laptop.

Application Specific Data? It’s So 2013

Looking back at the past 10 years of software, the word 'boring' comes to mind. The buzzwords were things like 'web services' and 'SOA'. CIOs loved the promise of these things, but they could not deliver. The idea of build once, reuse everywhere really was the 'nirvana'.

Well it now seems like we can do all of that stuff.

As I've said before, Big Data is not a great name because it implies that all we are talking about is a big database with tons of data. Actually, that's only part of the story. Hadoop is the new enterprise applications platform, and the key word there is platform. If you could have a single general-purpose data store that could service 'n' applications, then the whole notion of database design is over. Think about the new breed of apps on a cell phone, the social media platforms and the web search engines: most of these already work this way, storing data in a general-purpose, non-specific data store that is then used by a wide variety of applications. The new phrase for this data store is a 'data lake', implying a large quantum of ever-growing and changing data stored without any specific structure.

Talking to a variety of CIOs recently, I've found they are very excited by the prospect of both amalgamating data so it can be used and bringing into play data that previously could not be used: unstructured data in a wide variety of formats like Word documents and PDF files. This also means the barriers to entry are low. Many people believe that adopting Hadoop requires a massive re-skilling of the workforce. It does, but not in the way most people think. Actually, getting the data into Hadoop is the easy bit ('data ingestion' is the new buzzword). It's not like the old relational database days, where you first had to model the data using normalization techniques and then use ETL to get it into a usable format. With a data lake you simply set up a server cluster and load the data; creating a data model and using ETL is simply not required.

The real transformation and re-skilling is in application development. Applications are moving to the data; in today's client-server world it's the other way around. We have seen this type of re-skilling before, like the move from COBOL to object-oriented programming.

In the same way that client-server technology disrupted mainframe computer systems, big data will disrupt client-server. We're already seeing this in the market today. It's no surprise that the most successful companies in the world today (Google, Amazon, Facebook, etc.) are all actually big data companies. This isn't a 'might be'; it's already happened.


About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world’s most promising technology companies.

Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of Altostor, which accelerated the development of WANdisco’s first products for the Big Data market. The majority of WANdisco’s core technology is now produced out of the company’s flourishing software development base in David’s hometown of Sheffield, England and in Belfast, Northern Ireland.

David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on London Stock Exchange (WAND:LSE), raising over £24m to drive business growth.

With over 15 years’ executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in Enterprise Software and is recognised as an industry leader in Enterprise Application Integration and its standards.

David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail.


The rise of real-time Hadoop?

Over the past few weeks I’ve been reviewing a number of case studies of real-world Hadoop use, including stories from name-brand companies in almost every major industry. One thing that impressed me is the number of applications that are providing operational data in near-real time, with Hadoop applications providing analysis that’s no more than an hour out of date. These aren’t just toy applications either – one case study discussed a major retailer that is analyzing pricing for more than 73 million items in response to marketing campaign effectiveness, web site trends, and even in-store customer behavior.

That’s quite a significant achievement. As recently as last year I often heard Hadoop described as an interesting technology for batch processing large volumes of data, but one for which the practical applications weren’t quite clear. It was still seen as a Silicon Valley technology in some circles.

This observation is backed up by two other trends in the Hadoop community right now. First, companies like Revolution Analytics are making great strides in making analytical tools more familiar to data scientists, while Spark is making those tools run faster. Second, vendors (including WANdisco) are focusing on Hadoop operational robustness: high availability, better cluster utilization, security, and so on. A couple of years ago you might have planned on a few hours of cluster downtime if something went wrong, but now the expectation is clearly that Hadoop clusters will get closer to five nines of reliability.

If you haven’t figured out your Hadoop strategy yet, or have concerns about operational reliability, be sure to give us a call. We’ve got some serious Hadoop expertise on staff.

 

SmartSVN 8.5.5 Available Now

We’re pleased to announce the release of SmartSVN 8.5.5, the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.5.5 is available immediately for download from our website.

This release contains an improvement to the conflict solver along with a few bugfixes – for full details please see the changelog.

Contribute to further enhancements

Many of the issues resolved in this release were raised by members of our dedicated SmartSVN forum, so if you've got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Could Google have happened on WindowsNT?

Last week I had the chance to deliver a keynote presentation at the recent BigDataCamp LA (video). The topic of my talk was how free and open-source software (FOSS) is changing the face of the modern technology landscape. This peaceful, non-violent, and yet very disruptive progress is forcing us to review and adjust our daily technological choices and preferences. From how we browse the Internet, to how we communicate with other people, shop, bank, and guard our privacy – everything is affected by open source platforms, applications, and services.

Open-source-devoted businesses enabled things like Hadoop, NewSQL platforms like VoltDB, and in-memory systems like GridGain. Here at WANdisco, we are working on an open-source implementation of consensus-based replication for HDFS and HBase. This has nothing to do with charity or altruism: delivering enterprise value on top of open-source software is a very viable business model that has been proven by many companies in the last 20 years.

I argue that in the absence of an openly-available Linux OS under the GNU General Public License, Internet search as we know it wouldn’t have happened. Can you imagine Google data centers running on WindowsNT? Judge for yourself. Even better, start contributing to the revolution!


Consensus-based replication in HBase

Following up on the recent blog about Hadoop Summit 2014, I wanted to share an update on the state of consensus-based replication (CBR) for HBase. As some of our readers might know, we are working on this technology directly in the Apache HBase project. As you may also know, we are big fans and proponents of strong consistency in distributed systems, although I think the phrase “strong consistency” is a made-up tautology, since anything else should not be called “consistency” at all.

When we first looked into availability in HBase we noticed that it relies heavily on the Zookeeper layer. There’s nothing wrong with ZK per se, but the way the implementation is done at the moment makes ZK an integral part of the HBase source code. This makes sense from a historical perspective, since ZK has been virtually the only technology to provide shared memory storage and distributed coordination capabilities for most of HBase’s lifespan. JINI, developed back in the day by Bill Joy, is worth mentioning in this regard, but I digress and will leave that discussion for another time.

The idea behind CBR is pretty simple: instead of trying to guarantee that all replicas in the system are synced after an operation has already happened, such a system coordinates the intent of the operation. If consensus on the feasibility of an operation is reached, it is applied by each node independently; if consensus is not reached, the operation simply won't happen. That's pretty much the whole philosophy.

Now, the details are more intricate, of course. We think that CBR is beneficial for any distributed system that requires strong consistency (learn more on the topic from the recent Michael Stonebraker interview [6] on Software Engineering Radio). In the Hadoop ecosystem it means that HDFS, HBase, and possibly other components can benefit from a common API to express the coordination semantics. Such an approach will help accommodate a variety of coordination engine (CE) implementations specifically tuned for network throughput, performance, or low-latency. Introducing this concept to HBase is somewhat more challenging, however, because unlike HDFS it doesn’t have a single HA architecture: the HMaster fail-over process relies solely on ZK, whereas HRegionServer recovery additionally depends on write-ahead log (WAL) splitting. Hence, before any meaningful progress on CBR can be made, we need to abstract most, if not all, concrete implementations of ZK-based functionality behind a well-defined set of interfaces. This will provide the ability to plug in alternative concrete CEs as the community sees fit.
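
To make the idea of a common coordination API concrete, here is a purely hypothetical sketch (not the actual HBase interfaces, nor WANdisco's implementation) of what a pluggable coordination engine abstraction might look like: components submit the intent of an operation as a proposal and apply it locally only once the engine reports agreement.

```python
# Hypothetical sketch of a pluggable coordination engine (CE) abstraction.
# Illustrative only; these are not the actual HBase/HDFS interfaces under discussion.
from abc import ABC, abstractmethod


class Proposal:
    """The intent of an operation (e.g. 'assign region', 'split WAL')."""
    def __init__(self, op, payload):
        self.op = op
        self.payload = payload


class AgreementListener(ABC):
    """Implemented by the component that applies agreed operations."""

    @abstractmethod
    def on_agreement(self, proposal):
        """Apply the agreed operation locally, in the globally agreed order."""


class CoordinationEngine(ABC):
    """A concrete CE might be ZooKeeper-based, Paxos-based, and so on."""

    @abstractmethod
    def register(self, listener: AgreementListener):
        """Register the listener that will apply agreed operations."""

    @abstractmethod
    def submit(self, proposal: Proposal) -> bool:
        """Propose an operation; return True only if consensus was reached."""
```

The point of such an abstraction is that HMaster fail-over, region assignment, and WAL splitting would all go through submit() and on_agreement() rather than talking to ZooKeeper directly, so the engine behind them can be swapped without touching the rest of HBase.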

Below you can find the slides from my recent talk at the HBase Birds of a Feather session during Hadoop Summit [1], which cover the current state of development. References [2-5] will lead you directly to the ASF JIRA tickets that track the project's progress.

References:

  1. HBase Consensus BOF 2014
  2. https://issues.apache.org/jira/browse/HBASE-10909
  3. https://issues.apache.org/jira/browse/HBASE-11241
  4. https://issues.apache.org/jira/browse/HADOOP-10641
  5. https://issues.apache.org/jira/browse/HDFS-6469
  6. Michael Stonebraker on distributed and parallel DBs

Hadoop Summit 2014

I was lucky to be at the Hadoop Summit in San Jose last week, splitting my time between the WANdisco booth and attending some of the meetups and sessions. I didn’t keep a conference diary, but here are a few quick impressions:

1. Hadoop seems to be maturing operationally. Between WANdisco’s solutions for making key parts of Hadoop failure-proof and the core improvements coming in Hadoop 2.4.0 and beyond, the community is focusing a lot of effort on uptime and resiliency.

2. Security is still an open question. Although technologies like Knox and Kerberos integration provide good gateway and authentication support, there is no standard approach for more granular authorization. This was a consistent theme in several presentations including a case study from Booz Allen.

3. Making analytics faster and more accessible will receive a lot of attention this year. Hive 0.13 is showing dramatic performance improvements; Microsoft demonstrated accessing Hadoop data through Excel Power Query; there are several initiatives to make R run better in the Hadoop ecosystem; the list goes on and on.

4. The power of active-active replication continues to surprise people. Almost everyone I talked to at our booth kept asking the same questions: “So this is like a standby NameNode? You have a hot backup? It’s kind of like distcp?” No, no, and no – WANdisco’s NonStop NameNode lets you run several active (fully writable) NameNodes at once, even in different data centers, as part of a unified HDFS namespace. (If you haven’t read our product briefs, that gives you a full HA/DR solution with better performance and much better utilization of that secondary data center.  Better yet, skip the product briefs and just ask us for a demo.)

5. Beach balls are a fantastic giveaway. Kudos to our marketing team. 🙂

See you at the next one!


HBase Sponsorship and Adoption

A recent infographic from the Data Science Association showed that MongoDB is leading the pack of NoSQL and NewSQL databases in 2014:

[Infographic: NoSQL/NewSQL Database Adoption 2014. Source: http://www.datascienceassn.org/content/nosql-newsql-database-adoption-2014]

I’m not sure exactly where this data comes from, but it matches what I’ve heard anecdotally in the community. MongoDB seems to have a head start for many reasons, including the ease of standing up a new cluster.

Will this trend continue? To begin to answer this question, it’s worth considering the commercial interests behind these databases. This article shows a few metrics on current and projected market share. Of the databases with direct vendor sponsorship, MongoDB leads the pack at 7% compared with 3% each for Cassandra and Riak.

So where does that leave HBase? It’s running a respectable fourth place in the infographic, but that doesn’t tell the whole story. HBase is backed by the entire Hadoop community including contributors from Cloudera, Hortonworks, Facebook, and Intel. Community sponsors for HBase include all of the above plus MapR and Salesforce.com.

Now go back to the market share report and notice that HBase doesn't have a primary commercial sponsor, even though a significant portion of the market share (and funding) going to companies like Cloudera, Hortonworks, and MapR backs HBase as well. In the end, HBase may well benefit from being a primarily Apache-backed project that the whole community sponsors and supports, rather than being driven by a single vendor.

This fits into the trend of Apache-backed projects being significantly larger than vendor-backed projects. Apache’s namesake web server is a useful (if inexact) parallel. There are web servers out there that are certainly easier to install and configure, but a huge portion of the world’s websites run on Apache HTTPD. It’s robust, ubiquitous, and has a deep pool of community expertise. The same may be said of HBase in the future – it’s well-supported by every major Hadoop distribution, and it runs on top of Hadoop, yielding some infrastructural savings.

The next couple of years should be very interesting. I’m quite curious if HBase’s Apache heritage will give it the boost it needs to increase adoption in the NoSQL community.

GridGain goes open source

I’ve been catching up on some old RSS feed links recently and came across this gem: http://www.gridgain.com/blog/gridgains-open-source-in-memory-computing-strategy-sets-stage-for-unprecedented-advances/.  I’m not too familiar with GridGain but I’m happy to see it being open sourced and joining the wave of other solutions like Spark.

I’m a little curious if anyone has hands-on performance data comparing GridGain and Spark.  I dug up a link to a thesis that compared Spark to another technology and found some limitations in Spark when the data set size started to exceed the RAM in the cluster.  But a quick search doesn’t yield anything similar for GridGain and Spark.