Application Specific Data? It’s So 2013

Looking back at the past 10 years of software, the word ‘boring’ comes to mind.  The buzzwords were things like ‘web services’ and ‘SOA’.  CIOs loved the promise of these things, but they could not deliver.  The idea of building once and reusing everywhere really was the nirvana.

Well it now seems like we can do all of that stuff.

As I’ve said before, Big Data is not a great name because it implies that all we are talking about is a big database with tons of data.  Actually, that’s only part of the story.  Hadoop is the new enterprise application platform, and the key word there is platform.  If you could have a single general-purpose data store that could service ‘n’ applications, then the whole notion of database design is over.  Think about the new breed of apps on a cell phone, the social media platforms, and the web search engines.  Most of these already work this way: data is stored in a general-purpose, non-specific data store and then used by a wide variety of applications.  The new phrase for this data store is a ‘data lake’, implying a large and ever-growing body of changing data stored without any specific structure.

Talking to a variety of CIOs recently, I’ve found they are very excited by the prospect of amalgamating data so it can be used, and of bringing into play data that previously could not be used: unstructured data in a wide variety of formats like Word documents and PDF files.  This also means the barriers to entry are low.  Many people believe that adopting Hadoop requires a massive re-skilling of the workforce.  It does, but not in the way most people think.  Actually, getting the data into Hadoop is the easy bit (‘data ingestion’ is the new buzzword).  It’s not like the old relational database days, where you first had to model the data using normalization techniques and then use ETL to get the data into a usable format.  With a data lake you simply set up a server cluster and load the data; creating a data model and running ETL is simply not required.
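
To make that concrete, here is a minimal sketch of what ‘just load the data’ can look like in practice. It assumes a running HDFS cluster with the standard hdfs command-line tools installed; the landing directory and local file locations are hypothetical.

    import subprocess
    from pathlib import Path

    # Hypothetical landing area in the data lake; no schema or ETL is defined up front.
    RAW_ZONE = "/data/lake/raw/crm-exports"

    def ingest(local_dir: str) -> None:
        """Copy every file in local_dir into HDFS exactly as it is."""
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", RAW_ZONE], check=True)
        for f in Path(local_dir).iterdir():
            if f.is_file():
                # Word documents, PDFs, CSV dumps -- all land unchanged in the same place.
                subprocess.run(["hdfs", "dfs", "-put", "-f", str(f), RAW_ZONE], check=True)

    if __name__ == "__main__":
        ingest("./exports")

Structure is applied later, by whichever application reads the data, not at load time.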

The real transformation and re-skilling is in application development.  Applications are moving to the data – in today’s client-server world it’s the other way around.  We have seen this type of re-skilling before, such as the move from COBOL to object-oriented programming.

In the same way that client-server technology disrupted mainframe computer systems, big data will disrupt client-server.  We’re already seeing this in the market today.  It’s no surprise that the most successful companies in the world (Google, Amazon, Facebook, etc.) are all actually big data companies.  This isn’t a ‘might be’ – it’s already happened.

Distributed Code Review

As I’ve written about previously, one of the compelling reasons to look at Git as an enterprise SCM system is the great workflow innovation in the Git community. Workflows like Git Flow have pulled in best practices like short-lived task branches and made them not only palatable but downright convenient. Likewise, the role of workflow tools like Gerrit should not be discounted. They’ve turned mandatory code review from an annoyance into a feature that developers can’t live without (although we call it social coding now).

But as any tool skeptic will tell you, you should hesitate before building your development process too heavily on these tools. You risk locking in to the way the tool works – and the extra data stored in these tools is not very portable.

The data stored in Git is very portable, of course. A developer can clone a repository, maintain a fork, and still reasonably exchange data with other developers. Git has truly broken the bond between code and a central SCM service.

As fans of social coding will tell you, however, the conversation is often just as important as the code. The code review data holds a rich history of why a change was rejected, accepted, or resubmitted. In addition, these tools often serve as the gatekeeper: if your pull request is rejected, your code isn’t merged.

Consider what happens if you decide you need to switch from one code review tool to another. All of your code review metadata is likely stored in a custom schema in a relational database. Moving, say, from Gerrit to GitLab would be a significant data migration effort – or you just accept the fact that you’ll lose all of the code review information you’ve stored in Gerrit.

For this reason, I was really happy to hear about the distributed code review system now offered in SmartGit. Essentially SmartGit is using Git to store all of the code review metadata, making it as portable as the code itself. When you clone the repository, you get all of the code review information too. They charge a very modest fee for the GUI tools they’ve layered on top, but you can always take the code review metadata with you, and they’ve published the schema so you can make sense of it. Although I’ve only used it lightly myself, this system breaks the chain between my Git repo and the particular tool that my company uses for repository management and access control.
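
I haven’t dug into SmartGit’s actual storage format, but the underlying idea – review conversations stored as ordinary Git objects so they clone and push like everything else – can be illustrated with plain Git notes. A rough sketch, using a hypothetical refs/notes/review namespace rather than anything SmartGit-specific:

    import subprocess

    def git(*args, cwd="."):
        """Run a git command and return its trimmed output."""
        return subprocess.run(["git", *args], cwd=cwd, check=True,
                              capture_output=True, text=True).stdout.strip()

    repo = "/path/to/clone"                      # hypothetical local clone
    commit = git("rev-parse", "HEAD", cwd=repo)  # the change under review

    # Attach review metadata to the commit itself, in a dedicated notes namespace.
    git("notes", "--ref=review", "add", "-f", "-m",
        "status: rejected\nreviewer: alice\nreason: missing unit tests", commit, cwd=repo)

    # The notes ref is just another ref, so it can be pushed to any remote...
    git("push", "origin", "refs/notes/review", cwd=repo)

    # ...and anyone who fetches that ref gets the conversation along with the code.
    print(git("notes", "--ref=review", "show", commit, cwd=repo))

The point is not this particular mechanism (SmartGit publishes its own schema) but the property: because the review data lives in the repository, switching GUI tools doesn’t strand it in someone else’s database.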

I know distributed bug trackers fizzled out a couple of years ago, but I’m very happy to see Syntevo keep the social coding conversation in the same place as the code.

Git MultiSite Cluster Performance

A common misconception about Git is that having a distributed version control system automatically immunizes you from performance problems. The reality isn’t quite so rosy. As you’ll hear quite often if you read about tools like Gerrit, busy development sites make a heavy investment to cope with the concurrent demands on a Git server posed by developers and build automation.

Here’s where Git MultiSite comes into the picture. Git MultiSite is known for providing a seamless HA/DR solution and excellent performance at remote sites, but it’s also a great way to increase elastic scalability within a single data center by adding more Git MultiSite nodes to cope with increased load. Since read operations (clones and pulls) are local to a single node and write operations (pushes) are coordinated, with the bulk of the data transfer happening asynchronously, Git MultiSite lets you scale out horizontally. You don’t have to invest in extremely high-end hardware or worry about managing and securing Git mirrors.

So how much does Git MultiSite help? Ultimately that depends on your particular environment and usage patterns, but I ran a little test to illustrate some of the benefits even when running in a fairly undemanding environment.

I set up two test environments in Amazon EC2. Both environments used a single instance to run the Git client operations. The first environment used a regular Git server with a new empty repository accessed over SSH. The second environment instead used three Git MultiSite nodes.  All servers were m1.large instances.

The test ran a series of concurrent clone, pull, and push operations for an hour. The split between read and write operations was roughly 7:1, a pretty typical ratio in an environment where developers are pulling regularly and pushing periodically, and automated processes are cloning and pulling frequently. I used both small (1k) and large (10MB) commits while pushing.
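
For the curious, the driver was conceptually similar to the sketch below. This is not the actual test harness – the repository URL, worker count, and branch handling are placeholders, and it assumes the target repository already has a master branch – but it shows the shape of the workload:

    import os, random, subprocess, tempfile, threading, time

    REPO = "ssh://git@git-server/load/test.git"  # placeholder URL
    DURATION_S = 3600                            # one hour
    WORKERS = 8

    def run(*cmd, cwd=None):
        subprocess.run(cmd, cwd=cwd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    def push_commit(workdir, size):
        """Commit a random payload of the given size and push it."""
        with open(os.path.join(workdir, "payload.bin"), "wb") as f:
            f.write(os.urandom(size))
        run("git", "add", "payload.bin", cwd=workdir)
        run("git", "commit", "-m", "load test commit", cwd=workdir)
        run("git", "push", "origin", "HEAD", cwd=workdir)

    def worker(idx):
        workdir = tempfile.mkdtemp()
        run("git", "clone", REPO, workdir)
        run("git", "config", "user.name", "load-test", cwd=workdir)
        run("git", "config", "user.email", "load@test.local", cwd=workdir)
        run("git", "checkout", "-b", f"load-{idx}", cwd=workdir)  # private branch per worker
        deadline = time.time() + DURATION_S
        while time.time() < deadline:
            # Roughly 7:1 reads to writes, mixing small (1k) and large (10MB) commits.
            if random.random() < 7 / 8:
                run("git", "pull", "origin", "master", cwd=workdir)
            else:
                push_commit(workdir, random.choice([1024, 10 * 1024 * 1024]))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()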

What did I find?

Git MultiSite gives you more throughput

Git MultiSite processed more operations in an hour. There were no dropped operations, so the servers were not under unusual stress.

[Chart: throughput]

Better Performance

Git MultiSite provided significantly better performance, particularly for reads. That makes a big difference for developer productivity.

[Chart: performance]

More Consistent Performance

Git MultiSite provides a more consistent processing rate.

[Chart: processing rate]

You won’t hit any performance cliffs as the load increases.

[Chart: pull performance under load]

Try it yourself

We run performance tests like this regularly during Git MultiSite evaluations. How much speed do you need?

Big Data ETL Across Multiple Data Centers

Scientific applications, weather forecasting, click-stream analysis, web crawling, and social networking applications often have several distributed data sources, i.e., big data is collected in separate data center locations or even across the Internet.

In these cases, determining the most efficient architecture for running extract, transform, load (ETL) jobs over the entire data set becomes nontrivial.

Hadoop provides the Hadoop Distributed File System (HDFS) for storage and, as of Hadoop 2.0, YARN (Yet Another Resource Negotiator) for resource management. ETL jobs typically use the MapReduce programming model running on top of the YARN framework.

Though these are adequate for a single data center, there is a clear need to enhance them for multi-data center environments. In these instances, it is important to provide active-active redundancy for YARN and HDFS across data centers. Here’s why:

1. Bringing compute to data

Hadoop’s architectural advantage lies in bringing compute to data. Providing active-active (global) YARN accomplishes that on top of global HDFS across data centers.

2. Minimizing traffic on a WAN link

There are three types of data analytics schemes:

a) High-throughput analytics where the output data of a MapReduce job is small compared to the input.

Examples include weblog analysis, word count, and similar aggregations (a minimal word-count sketch follows this list).

b) Zero-throughput analytics where the output data of a MapReduce job is equal to the input. A sort operation is a good example of a job of this type.

c) Balloon-throughput analytics where the output is much larger than the input.
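
To make the high-throughput case concrete, here is a minimal word-count job for Hadoop Streaming – deliberately simplistic, with hypothetical input and output paths rather than a real ETL pipeline. The output (one line per distinct word) is tiny compared to the input logs, which is exactly the property that makes shipping results between data centers cheap:

    #!/usr/bin/env python
    # wordcount.py -- used as both mapper and reducer under Hadoop Streaming, e.g.:
    #
    #   hadoop jar /path/to/hadoop-streaming.jar \
    #       -input /logs/raw -output /logs/wordcounts \
    #       -mapper "python wordcount.py map" \
    #       -reducer "python wordcount.py reduce" \
    #       -file wordcount.py
    import sys

    def mapper():
        # Emit <word, 1> for every word in the input split.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Streaming sorts by key, so counts for each word arrive contiguously.
        current, count = None, 0
        for line in sys.stdin:
            word, _, n = line.rstrip("\n").partition("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, 0
            count += int(n)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()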

For high-throughput analytics, local YARN can crunch the data in each data center and use global HDFS to redistribute the much smaller results. Keep in mind that this might require another MapReduce job to run over those output results, however, which can add traffic to the WAN link. Global YARN mitigates this even further by distributing the computational load itself.

3. Providing fault tolerance at every level

Last but not least, fault tolerance is required at the server, rack, and data center levels. Passive redundancy solutions can mean days of downtime before jobs resume. Active-active redundant YARN and HDFS provide a zero-downtime solution for both MapReduce jobs and data.

To summarize, it is imperative for mission-critical applications to have active-active redundancy for HDFS and YARN. Not only does this protect data and prevent downtime, it also allows big data to be processed at an accelerated rate by taking advantage of the aggregated CPU, network, and storage of all servers across data centers.

- Gurumurthy Yeleswarapu, Director of Engineering, WANdisco

More efficient cluster utilization with Non-Stop Hadoop

Perhaps the most overlooked capability of WANdisco’s Non-Stop Hadoop is its efficient cluster utilization in secondary data centers. These secondary clusters are often used only for backup purposes, which is a waste of valuable computing resources. Non-Stop Hadoop allows you to take full advantage of the CPU and storage resources that you’ve paid for.

Of course anyone who adopts Hadoop needs a backup strategy, and the typical solution is to put a backup cluster in a remote data center. distcp, a part of the core Hadoop distribution, is used to periodically transfer data from the primary cluster to the backup cluster. You can also run some read-only jobs on the backup cluster, as long as you don’t need immediate access to the latest data.
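
That periodic transfer is typically just a scheduled DistCp run; a minimal sketch, with hypothetical NameNode addresses and paths:

    import subprocess

    # Hypothetical primary and backup clusters.
    SRC = "hdfs://primary-nn:8020/data/events"
    DST = "hdfs://backup-nn:8020/data/events"

    # -update copies only files that are new or have changed since the last run,
    # so this can be dropped into cron or an Oozie coordinator.
    subprocess.run(["hadoop", "distcp", "-update", SRC, DST], check=True)

The catch, noted above, is that this is batch-oriented: the backup cluster is only as fresh as the last run.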

Still, that backup cluster is a big investment that isn’t being used fully. What if you could treat that backup cluster as a part of your unified Hadoop environment, and use it fully for any processing?  That would give you a better return on that backup cluster investment, and let you shift some load off of the primary cluster, perhaps reducing the need for additional primary cluster capacity.

That’s exactly what Non-Stop Hadoop provides: you can treat several Hadoop clusters as part of a single unified Hadoop file system.  All of the important data is replicated efficiently by Non-Stop Hadoop, including the NameNode metadata and the actual data blocks.  You can write data into any of the clusters, knowing that the metadata will be kept in sync by Non-Stop Hadoop and that the actual data will be transferred seamlessly (and much faster than with a tool like distcp).

As a simple example, recently I was ingesting two streams of data into a Hadoop cluster. Each ingest job handled roughly the same amount of data. The two jobs combined took up about 28 seconds of cluster CPU time during each run and consumed roughly 500MB of cluster RAM during operation.

Then I split the work across two clusters that are part of a single Non-Stop Hadoop deployment, running one ingest job on each.  Again running both jobs at the same time, I used 15 seconds of CPU time on the first cluster and 18 seconds on the second, and about 250MB of RAM on each.

The exact numbers will vary depending on the job and what else is running on the cluster, but in this simple example I’ve accomplished three very useful things:

  • I’ve gotten useful work out of a second cluster that would otherwise be idle.
  • I’ve shifted half of the processing burden off of the first cluster. (It also helps to have six NameNodes instead of two to handle the concurrent writes.)
  • I don’t have to run the distcp job to transfer this data to a backup site – it’s already on both clusters. Not only am I getting more useful work out of my second cluster, I’m avoiding unnecessary overhead work.

So there you have it – Non-Stop Hadoop is the perfect way to get more bang for your backup cluster buck. Want to know more? We’ll be happy to discuss in more detail.

Talking R in an Excel World

I just finished reading a good summary of 10 powerful and free or low-cost analytics tools.  Of the 10 items mentioned, 2 are probably common in corporate environments (Excel and Tableau) while the other 8 are more specialized.  I wonder how successful anyone is at introducing a new specialized analytics tool into an environment where Excel is the lingua franca?

I’ve personally run into this situation a few times.  Recently I ran a Monte Carlo simulation in R to generate a simple P&L forecast for a business proposal.  I followed good practice by putting my code into R Markdown and generating a PDF with the code, assumptions, variables, and output.  I then had to share some of the conclusions with a colleague who was building pricing models in a spreadsheet, however, and knowing that introducing R on short notice wouldn’t go over well, I just copied some of the key results into Excel and sent it off.

Similarly, a few months ago I was working on a group project to analyze some public data.  I used R to understand the data and perform the analysis, but then had to reproduce the final analysis in Excel to share it with the team.  That seems wasteful, but I couldn’t see another way to do it.  It would have taken me quite a long time to do all the exploratory work in Excel (I’ve not yet figured out how to create several charts in a loop in Excel) and Excel just doesn’t have the same type of tools (like principal components for dimension reduction).

R does have some capability to talk to Excel, but the packages don’t seem particularly easy to use, and the usual advice is to fall back on CSV as an interchange format, which has obvious limitations, including the loss of formulas and formatting.

So I’m stumped.  As long as Excel remains a standard interchange format I guess I’ll just have to do some manual data translation.  Has anyone solved this problem in a more elegant way?

 

Running Hadoop trial clusters

How not to blow your EC2 budget

Lately I’ve been spinning up a lot of Hadoop clusters for various demos and just trying new features. I’m pretty sensitive to exploding my company’s EC2 budget, so I dutifully shut down all my EC2 instances when I’m done.

Doing so has proven to be a bit painful, however. Restarting a cluster takes time, and there’s usually a service or two that doesn’t start properly and needs to be restarted. I don’t have a lot of dedicated hardware available — just my trusty MBP. I realize I could run a single-node cluster using a sandbox from one of the distribution vendors, but I really need multiple nodes for some of my testing.

I want to run multi-node clusters as VMs on my laptop and be able to leave them up and running over a weekend if necessary, and I’ve found a couple of approaches that look promising:

  • Setting up traditional VMs using Vagrant. I use Vagrant for almost all of my VMs and this would be comfortable for me.
  • Trying to run a cluster using Docker. I have not used Docker but have heard a lot about it, and I’m hoping it will help with memory requirements. I have 16 GB on my laptop but am not sure how many VMs I can run in practice.

Here’s my review after quickly trying both approaches.

VMs using Vagrant

The Vagrant file referenced in the original article works well. After installing a couple of Vagrant plugins I ran ‘vagrant up’ and had a 4-node cluster ready for CDH installation. Installation using Cloudera Manager started smoothly.

Unfortunately, it looks like running 4 VMs consuming 2 GB of RAM each is just asking too much of this machine. (The issue might be CPU more than RAM — the fan was in overdrive during installation.) I could only get 3 of the VMs to complete installation, and then during cluster configuration only two of them were available to run services.

Docker

Running Docker on the Mac isn’t ideal, as you need to fire up a VM to actually manage the containers. (I’ve yet to find the perfect development laptop. Windows isn’t wholly compatible with all of the tools I use, Mac isn’t always just like Linux, and Linux doesn’t play well with the Office tools I have to use.) On the Mac there’s definitely a learning curve. The Docker containers that actually run the cluster are only accessible from the VM that hosts them, at least in the default configuration I used. That means I had to forward ports from that VM to my host OS.

Briefly, my installation steps were:

  • Launch the boot2docker app
  • Import the docker images referenced in the instructions (took about 10 minutes)
  • Run a one line command to deploy the cluster (this took about 3 minutes)
  • Grab the IP addresses for the containers and set up port forwarding on the VM for the important ports (8080, 8020, 50070, 50075)
  • Log in to Ambari and verify configuration

At this point I was able to execute a few simple HDFS operations using webhdfs to confirm that I had an operational cluster.
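
The smoke test itself was nothing fancy – something along these lines, assuming the NameNode’s WebHDFS port (50070) has been forwarded to localhost as described above:

    import requests

    WEBHDFS = "http://localhost:50070/webhdfs/v1"

    # List the root directory -- a quick check that the NameNode is answering.
    listing = requests.get(f"{WEBHDFS}/?op=LISTSTATUS")
    listing.raise_for_status()
    for entry in listing.json()["FileStatuses"]["FileStatus"]:
        print(entry["type"], entry["pathSuffix"])

    # Create a directory and confirm it shows up.
    requests.put(f"{WEBHDFS}/tmp/smoketest?op=MKDIRS&user.name=hdfs").raise_for_status()
    status = requests.get(f"{WEBHDFS}/tmp/smoketest?op=GETFILESTATUS&user.name=hdfs")
    print(status.json()["FileStatus"]["type"])  # expect "DIRECTORY"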

The verdict

The verdict is simple — I just don’t have enough horsepower to run even a 4-node cluster using traditional VMs. With Docker, I was running the managing VM and three containers in a few minutes, and I didn’t get the sense that my machine was struggling at all. I did take a quick look at resource usage but I hesitate to report the numbers as I had a lot of other stuff running on my machine at the same time.

Docker takes some getting used to, but once I had a sense of how the containers and the VM were interacting, I figured out how to manage the ports and configuration. I think learning Docker will be like learning Vim after using a full-blown IDE — if you invest the time, you’ll be able to do some things very quickly without putting a lot of stress on your laptop.

The rise of real-time Hadoop?

Over the past few weeks I’ve been reviewing a number of case studies of real-world Hadoop use, including stories from name-brand companies in almost every major industry. One thing that impressed me is the number of applications that are providing operational data in near-real time, with Hadoop applications providing analysis that’s no more than an hour out of date. These aren’t just toy applications either – one case study discussed a major retailer that is analyzing pricing for more than 73 million items in response to marketing campaign effectiveness, web site trends, and even in-store customer behavior.

That’s quite a significant achievement. As recently as last year I often heard Hadoop described as an interesting technology for batch processing large volumes of data, but one for which the practical applications weren’t quite clear. It was still seen as a Silicon Valley technology in some circles.

This observation is backed up by two other trends in the Hadoop community right now. First, companies like Revolution Analytics are making great strides in making the analytical tools more familiar to data scientists, while Spark is making those tools run faster. Second, vendors (including WANdisco) are focusing on Hadoop operational robustness – high availability, better cluster utilization, security, and so on. A couple of years ago you might have planned on a few hours of cluster downtime if something went wrong, but now the expectation is clearly that Hadoop clusters will get closer to nine nines of reliability.

If you haven’t figured out your Hadoop strategy yet, or have concerns about operational reliability, be sure to give us a call. We’ve got some serious Hadoop expertise on staff.

 

SmartSVN 8.5.5 Available Now

We’re pleased to announce the release of SmartSVN 8.5.5, the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.5.5 is available immediately for download from our website.

This release contains an improvement to the conflict solver along with a few bugfixes – for full details please see the changelog.

Contribute to further enhancements

Many of the issues resolved in this release were raised by members of our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Could Google have happened on Windows NT?

Last week I had the chance to deliver a keynote presentation at the recent BigDataCamp LA (video). The topic of my talk was how free and open-source software (FOSS) is changing the face of the modern technology landscape. This peaceful, non-violent, and yet very disruptive progress is forcing us to review and adjust our daily technological choices and preferences. From how we browse the Internet, to how we communicate with other people, shop, bank, and guard our privacy – everything is affected by open source platforms, applications, and services.

Open-source-devoted businesses enabled things like Hadoop, NewSQL platforms like VoltDB, and in-memory systems like GridGain. Here at WANdisco, we are working on an open-source implementation of consensus-based replication for HDFS and HBase. This has nothing to do with charity or altruism: delivering enterprise value on top of open-source software is a very viable business model that has been proven by many companies in the last 20 years.

I argue that in the absence of an openly available Linux OS under the GNU General Public License, Internet search as we know it wouldn’t have happened. Can you imagine Google data centers running on Windows NT? Judge for yourself. Even better, start contributing to the revolution!