Application Specific Data? It’s So 2013

Looking back at the past 10 years of software, the word ‘boring’ comes to mind. The buzzwords were things like ‘web services’ and ‘SOA’. CIOs loved the promise of these things, but they could not deliver. The idea of build once and reuse everywhere really was the ‘nirvana’.

Well it now seems like we can do all of that stuff.

As I’ve said before, Big Data is not a great name because it implies that all we are talking about is a big database with tons of data. Actually, that’s only part of the story. Hadoop is the new enterprise application platform, and the key word there is platform. If you could have a single general-purpose data store that could service ‘n’ applications, then the whole notion of database design is over. Think about the new breed of apps on a cell phone, the social media platforms and the web search engines: most of these already store data in a general-purpose, non-specific data store that is then used by a wide variety of applications. The new phrase for this data store is a ‘data lake’, implying a large quantum of ever-growing and changing data stored without any specific structure.

Talking to a variety of CIOs recently, I’ve found they are very excited by the prospect of amalgamating data so it can be used, and of bringing into play data that previously could not be used: unstructured data in a wide variety of formats, like Word documents and PDF files. This also means the barriers to entry are low. Many people believe that adopting Hadoop requires a massive re-skilling of the workforce. It does, but not in the way most people think. Actually, getting the data into Hadoop is the easy bit (‘data ingestion’ is the new buzzword). It’s not like the old relational database days, where you first had to model the data using normalization techniques and then use ETL to get the data into a usable format. With a data lake you simply set up a server cluster and load the data; creating a data model and using ETL is simply not required.
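
To make that concrete, here is a minimal sketch of loading raw files into a data lake with the RHadoop rhdfs package, assuming rhdfs is installed and configured against a running cluster; the paths are hypothetical:

library(rhdfs)

hdfs.init()                                          # connect using the local Hadoop configuration
hdfs.mkdir("/data/lake/contracts")                   # hypothetical landing area, no schema required
hdfs.put("/tmp/contracts", "/data/lake/contracts")   # copy the raw files in as-is
hdfs.ls("/data/lake/contracts")                      # confirm the files arrived

No data modelling or ETL happens here; the files land in HDFS in their original form, and any structure is applied later by the applications that read them.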

The real transformation and re-skilling is in application development. Applications are moving to the data – in today’s client-server world it’s the other way around. We have seen this type of re-skilling before, such as the move from COBOL to object-oriented programming.

In the same way that client-server technology disrupted mainframe computer systems, big data will disrupt client-server. We’re already seeing this in the market today. It’s no surprise that the most successful companies in the world today (Google, Amazon, Facebook, etc.) are all actually big data companies. This isn’t a ‘might be’; it’s already happened.

Sample datasets for Big Data experimentation

Another week, another gem from the Data Science Association. If you’re trying to prototype a data analysis algorithm, benchmark performance on a new platform like Spark, or just play around with a new tool, you’re going to need reliable sample data.

As anyone familiar with testing knows, good data can be tough to find. Although there’s plenty of data in the public domain, most of it is not ready to use. A few months ago, I downloaded some data sets from a US government site and it took a few hours of cleaning before I had the data in shape for analysis.

Behold: https://github.com/datasets. Here the Frictionless Data Project has compiled a set of easily accessible and well documented data sets. The specific data may not be of much interest, but these are great for trials and experimentation. For example, if you want to analyze time series financial data, there’s a CSV file with updated S&P 500 data.
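
For example, pulling the S&P 500 series into R for a quick look can be as simple as the sketch below; the exact file URL and column names are assumptions on my part, so check the repository’s README for the current layout:

# Hypothetical location of the S&P 500 CSV within the datasets collection
url <- "https://raw.githubusercontent.com/datasets/s-and-p-500/master/data/data.csv"
sp500 <- read.csv(url, stringsAsFactors = FALSE)

sp500$Date <- as.Date(sp500$Date)            # assumes a 'Date' column
plot(sp500$Date, sp500$SP500, type = "l",    # assumes an 'SP500' price column
     xlab = "Date", ylab = "S&P 500")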

Well worth a look!

SmartSVN 8.6 Available Now

We’re pleased to announce the release of SmartSVN 8.6, the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6 is available immediately for download from our website.

New Features include:

  • Bug reporting now optionally allows uploading bug reports directly to WANdisco from within SmartSVN
  • Improved handling of svn:global-ignores inherited property
  • Windows SASL authentication support added and required DLLs provided

Fixes include:

  • Internal error when selecting a file in the file table
  • Possible internal error in repository browser related to file externals
  • Potentially incorrect rendering of directory tree in Linux

For a full list of all improvements and features please see the changelog.

Contribute to further enhancements

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Experiences with R and Big Data

The next releases of Subversion MultiSite Plus and Git MultiSite will embed Apache Flume for audit event collection and transmission. We’re taking an incremental approach to audit event collection and analysis, as the throughput at a busy site could generate a lot of data.

In the meantime, I’ve been experimenting with some more advanced and customized analysis. I’ve got a test system instrumented with a custom Flume configuration that pipes data into HBase instead of our Access Control Plus product. The question then is how to get useful answers out of HBase to questions like: What’s the distribution of SCM activity between the nodes in the system?

It’s actually not too bad to get that information directly from an HBase scan, but I also wanted to see some pretty charts. Naturally I turned to R, which led me again to the topic of how to use R to analyze Big Data.

A quick survey showed three possible approaches:

  • The RHadoop packages provided by Revolution Analytics, which include RHBase and Rmr (R MapReduce)
  • The SparkR package
  • The Pivotal package that lets you analyze data in Hawq

I’m not using Pivotal’s distribution and I didn’t want to invest time in looking at a MapReduce-style analysis, so that left me with RHBase and SparkR.

Both packages were reasonably easy to install as these things go, and RHBase let me directly perform a table scan and crunch the output data set. I was a bit worried about what would happen once a table scan started returning millions of rows instead of thousands, so I wanted to try SparkR as well.
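
For comparison, the RHBase scan-and-crunch pattern looks roughly like the sketch below. The table name, column family, and the structure of each returned row are my own assumptions, and argument details vary between rhbase versions:

library(rhbase)

hb.init()                                            # connect to the HBase Thrift gateway

# Scan a hypothetical audit table and tally events per node
iter <- hb.scan("scm_audit", startrow = "0", colspec = "event:node")
counts <- list()
while (length(batch <- iter$get(1000)) > 0) {
  for (row in batch) {
    # each row comes back as a list of (row key, column names, cell values);
    # adjust the indices to match your rhbase version
    node <- as.character(unlist(row[[3]])[1])
    counts[[node]] <- if (is.null(counts[[node]])) 1 else counts[[node]] + 1
  }
}
iter$close()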

SparkR let me define a data source (in this case an export from HBase) and then run a functional reduce on it. In the first step I would produce some metric of interest (AuthZ success/failure for some combination of repository and node location) for each input line, and then reduce by key to get aggregate statistics. Nothing fancy, but Spark can handle a lot more data than R on a single workstation. The Spark programming paradigm fits nicely into R; it didn’t feel nearly as foreign as writing MapReduce or HBase scans. Of course, Spark is also considerably faster than normal MapReduce.

Here’s a small code snippet for illustration:

# Parse each audit event line into a (key, metric) pair, then aggregate.
# The CSV field positions below are assumptions; adjust them to match the export format.
lines <- textFile(sc, "/home/vagrant/hbase-out/authz_event.csv")
mlines <- lapply(lines, function(line) {
  fields <- strsplit(line, ",")[[1]]
  key <- paste(fields[1], fields[2], sep = "|")    # e.g. repository|node
  metric <- as.integer(fields[3] == "SUCCESS")     # 1 for an AuthZ success, 0 for a failure
  return(list(key, metric))
})
parts <- reduceByKey(mlines, "+", 2L)              # sum the metric per key (2 output partitions)
reduced <- collect(parts)

In reality, I might use SparkR in a lambda architecture as part of my serving layer and RHBase as part of the speed layer.

It already feels like these extra packages are making Big Data very accessible to the tools that data scientists use, and given that data analysis is driving a lot of the business use cases for Hadoop, I’m sure we’ll see more innovation in this area soon.

Subversion Vulnerability in Serf

The Apache Subversion team have recently published details of two vulnerabilities in the Serf RA layer.

Firstly, vulnerable versions of the Serf RA layer will accept certificates that should not be accepted as matching the hostname the client is using to make the request. This is deemed a Medium risk vulnerability.

Additionally, affected versions of the Serf RA layer do not properly handle certificates with embedded NUL bytes in their Common Names or Subject Alternate Names. This is deemed a Low risk vulnerability.

Either of these issues, or a combination of both, could lead to a man-in-the-middle attack and allow viewing of encrypted data and unauthorised repository access.

A further vulnerability has also been identified in the way that Subversion indexes cached authentication credentials. An MD5 hash collision can be engineered such that cached credentials are leaked to a third party. This is deemed a Low risk vulnerability.

For more information on these issues please see the following links:
http://subversion.apache.org/security/CVE-2014-3522-advisory.txt
http://subversion.apache.org/security/CVE-2014-3528-advisory.txt
https://groups.google.com/forum/#!msg/serf-dev/NvgPoK6sFsc/_TR7Buxtba0J

The ra_serf vulnerability affects Subversion versions 1.4.0-1.7.17 and 1.8.0-1.8.9. The Serf library vulnerability affects Serf versions 0.2.0 through 1.3.6 inclusive. Finally, the credentials vulnerability affects Subversion versions 1.0.0-1.7.17 and 1.8.0-1.8.9.

If you are using any of the vulnerable versions mentioned above we would urge you to upgrade to the latest release, either 1.8.10 or 1.7.18. Both are available on our website at https://www.wandisco.com/subversion/download

Spark and Hadoop infrastructure

I just read another article about how Spark stands a good chance of supplanting MapReduce for many use cases. As an in-memory platform, Spark provides answers much faster than MapReduce, which must perform an awful lot of disk I/O to process data.

Yet MapReduce isn’t going away. Besides all of the legacy applications built on MapReduce, MapReduce can still handle much larger data sets: Spark’s practical limit is in the terabytes, while MapReduce can handle petabytes.

There’s one interesting question I haven’t seen discussed, however. How do you manage the different hardware profiles for Spark and other execution engines? A general-purpose MapReduce cluster will likely balance I/O, CPU, and RAM, while a cluster tailored for Spark will emphasize RAM and CPU much more heavily than I/O throughput. (As one simple example, switching a Spark MLlib job to a cluster that allowed allocation of 12GB of RAM per executor cut the run time from 370 seconds to 14 seconds.)
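
To make the hardware point concrete, here is one way the extra executor memory might be requested when initializing a SparkR context; the application name and core count are hypothetical, and configuration handling differs between SparkR and Spark versions:

library(SparkR)

# Ask YARN for large executors suited to an in-memory MLlib-style job.
sc <- sparkR.init(
  master     = "yarn-client",                # assumes a YARN-managed cluster
  appName    = "mllib-experiment",           # hypothetical application name
  sparkEnvir = list(
    spark.executor.memory = "12g",           # the 12GB per executor from the example above
    spark.executor.cores  = "4"              # hypothetical core count
  )
)

A request like this only helps if the cluster actually has nodes with that much free memory, which is exactly the scheduling problem discussed next.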

From what I’ve heard, YARN’s resource manager doesn’t handle hybrid hardware profiles in the same cluster very well yet. It will tend to pick the ‘best’ data node available for a task, but that means it will tend to pick your big-memory, big-CPU nodes for everything, not just Spark jobs.

So what’s the answer? One possibility is to set up multiple clusters, each tailored for a different type of processing. They’ll have to share data, of course, which is where it gets complicated. The usual techniques for moving data between clusters (i.e. the tools built on distcp) are meant for scheduled synchronization – in other words, backups. Unless you’re willing to accept a delay before data is available to both clusters, and unless you’re extremely careful about which parts of the namespace each cluster effectively ‘owns’, you’re out of luck…unless you use Non-Stop Hadoop, that is.

Non-Stop Hadoop lets you treat two or more clusters as a unified HDFS namespace, even when the clusters are separated by a WAN. Each cluster can read and write simultaneously, using WANdisco’s active-active replication technology to keep the HDFS metadata in sync. In addition, Non-Stop Hadoop’s efficient block-level replication between clusters means data transfers much more quickly.

This means you can set up two clusters with different hardware profiles, running Spark jobs on one and traditional MapReduce jobs on the other, without any additional administrative overhead. Same data, different jobs, better results.

Interested? We’ve got a great demo ready and waiting.

SmartSVN 8.6 RC1 Available Now

We’re proud to announce the release of SmartSVN 8.6 RC1. SmartSVN is the cross-platform graphical client for Apache Subversion.

New features include:

  • File protocol authentication implemented, using system login as default
  • “Manage as Project” menu item added for unmanaged working copies
  • Navigation buttons added to notifications to show previous/next notification

Fixes include:

  • Opening a missing project could end up in an endless loop
  • Linux: Refresh might not have picked up changes if inotify limit was reached
  • Merge dialog “Merge From” label was cut off
  • Illegal character error while editing svn:mergeinfo
  • Context menu wasn’t available in commit message text area
  • Progress window for explorer integration context menu actions sometimes appeared behind shell window

For a full list of all improvements and features please see the changelog.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download Release Candidate 1 for SmartSVN 8.6 from our website.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Health care and Big Data

What’s the general impression of the public sector and health care? Stodgy. Buried in paperwork. Dull. Bureaucrats in cubicles and harried nurses walking around with reams of paper charts.

Behind the scenes, however, the health care industry and their counterparts in the public sector have been quietly launching a wave of technological innovation and gaining valuable insight along the way. Look no further than some of their ‘lessons learned’ articles, including this summary of a recent study in Health Affairs. The summary is well worth a read, as the lessons are broadly applicable no matter what industry you’re in:

  • Focus on acquiring the right data
  • Answer questions of general interest, not questions that show off the technology
  • Understand the data
  • Let the people who understand the data have access to it in any form

Accomplishing these goals, they found, required a broader and more sophisticated Hadoop infrastructure than they had anticipated.

Of course, this realization isn’t too much of a surprise here at WANdisco. One of the early adopters of Non-Stop Hadoop was UC Irvine Health, a leading research and clinical center in California. UC Irvine Health has been recognized as an innovation center, and is currently exploring new ways to use Big Data to improve the quality and indeed the entire philosophy of its care.

You may be thinking you’ve still got time to figure out a Big Data strategy. Real time analytics on Hadoop? Wiring multiple data centers into a logical data lake? Not quite ready for prime time? Before you prognosticate any further, give us a call. Even ‘stodgy’ industries are seeing Big Data’s disruption up close and personal.

Distributed Code Review

As I’ve written about previously, one of the compelling reasons to look at Git as an enterprise SCM system is the great workflow innovation in the Git community. Workflows like Git Flow have pulled in best practices like short-lived task branches and made them not only palatable but downright convenient. Likewise, the role of workflow tools like Gerrit should not be discounted. They’ve turned mandatory code review from an annoyance into a feature that developers can’t live without (although we call it social coding now).

But as any tool skeptic will tell you, you should hesitate before building your development process too heavily on these tools. You’ll risk locking in to the way the tool works – and the extra data that is stored in these tools is not very portable.

The data stored in Git is very portable, of course. A developer can clone a repository, maintain a fork, and still reasonably exchange data with other developers. Git has truly broken the bond between code and a central SCM service.

As fans of social coding will tell you, however, the conversation is often just as important as the code. The code review data holds a rich history of why a change was rejected, accepted, or resubmitted. In addition, these tools often serve as the gatekeeper’s tools: if your pull request is rejected, your code isn’t merged.

Consider what happens if you decide you need to switch from one code review tool to another. All of your code review metadata is likely stored in a custom schema in a relational database. Moving, say, from Gerrit to GitLab would be a significant data migration effort – or you just accept the fact that you’ll lose all of the code review information you’ve stored in Gerrit.

For this reason, I was really happy to hear about the distributed code review system now offered in SmartGit. Essentially SmartGit is using Git to store all of the code review metadata, making it as portable as the code itself. When you clone the repository, you get all of the code review information too. They charge a very modest fee for the GUI tools they’ve layered on top, but you can always take the code review metadata with you, and they’ve published the schema so you can make sense of it. Although I’ve only used it lightly myself, this system breaks the chain between my Git repo and the particular tool that my company uses for repository management and access control.

I know distributed bug trackers fizzled out a couple of years ago, but I’m very happy to see Syntevo keep the social coding conversation in the same place as the code.

Git MultiSite Cluster Performance

A common misconception about Git is that having a distributed version control system automatically immunizes you from performance problems. The reality isn’t quite so rosy. As you’ll hear quite often if you read about tools like Gerrit, busy development sites make a heavy investment to cope with the concurrent demands on a Git server posed by developers and build automation.

Here’s where Git MultiSite comes into the picture. Git MultiSite is known for providing a seamless HA/DR solution and excellent performance at remote sites, but it’s also a great way to increase elastic scalability within a single data center by adding more Git MultiSite nodes to cope with increased load. Since read operations (clones and pulls) are local to a single node and write operations (pushes) are coordinated, with the bulk of the data transfer happening asynchronously, Git MultiSite lets you scale out horizontally. You don’t have to invest in extremely high-end hardware or worry about managing and securing Git mirrors.

So how much does Git MultiSite help? Ultimately that depends on your particular environment and usage patterns, but I ran a little test to illustrate some of the benefits even when running in a fairly undemanding environment.

I set up two test environments in Amazon EC2. Both environments used a single instance to run the Git client operations. The first environment used a regular Git server with a new empty repository accessed over SSH. The second environment instead used three Git MultiSite nodes.  All servers were m1.large instances.

The test ran a series of concurrent clone, pull, and push operations for an hour. The split between read and write operations was roughly 7:1, a pretty typical ratio in an environment where developers are pulling regularly and pushing periodically, and automated processes are cloning and pulling frequently. I used both small (1k) and large (10MB) commits while pushing.
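
For a sense of the load mix, here is a highly simplified sketch of one iteration in R; the real harness, the clone operations, the repository URLs, and the timing logic are all omitted, and the working-copy path is hypothetical:

# One pass of the roughly 7:1 read/write mix against an existing working copy.
run_mix_once <- function(repo_dir) {
  for (i in 1:7) {                                         # seven reads: pulls
    system2("git", c("-C", repo_dir, "pull", "--quiet"))
  }
  # one write: commit a small change and push it
  writeLines(as.character(Sys.time()), file.path(repo_dir, "load.txt"))
  system2("git", c("-C", repo_dir, "add", "load.txt"))
  system2("git", c("-C", repo_dir, "commit", "-q", "-m", "load test commit"))
  system2("git", c("-C", repo_dir, "push", "--quiet"))
}

The real test also mixed in clones and used both 1k and 10MB commits, as described above.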

What did I find?

Git MultiSite gives you more throughput

Git MultiSite processed more operations in an hour. There were no dropped operations, so the servers were not under unusual stress.

[Chart: operations processed per hour]

Better Performance

Git MultiSite provided significantly better performance, particularly for reads. That makes a big difference for developer productivity.

[Chart: operation performance]

More Consistent Performance

Git MultiSite provides a more consistent processing rate.

[Chart: processing rate]

You won’t hit any performance cliffs as the load increases.

[Chart: pull performance under increasing load]

Try it yourself

We perform regular performance testing during evaluations of Git MultiSite. How much speed do you need?

Big Data ETL Across Multiple Data Centers

Scientific applications, weather forecasting, click-stream analysis, web crawling, and social networking applications often have several distributed data sources, i.e., big data is collected in separate data center locations or even across the Internet.

In these cases, the most efficient architecture for running extract, transform, load (ETL) jobs over the entire data set becomes nontrivial.

Hadoop provides the Hadoop Distributed File System (HDFS) for storage and, in Hadoop 2.0, YARN (Yet Another Resource Negotiator) for resource management. ETL jobs use the MapReduce programming model and run on the YARN framework.

Though these are adequate for a single data center, there is a clear need to enhance them for multi-data center environments. In these instances, it is important to provide active-active redundancy for YARN and HDFS across data centers. Here’s why:

1. Bringing compute to data

Hadoop’s architectural advantage lies in bringing compute to data. Providing active-active (global) YARN accomplishes that on top of global HDFS across data centers.

2. Minimizing traffic on a WAN link

There are three types of data analytics schemes:

a) High-throughput analytics where the output data of a MapReduce job is small compared to the input.

Examples include weblog analysis, word count, etc.; a minimal word-count sketch follows this list.

b) Zero-throughput analytics where the output data of a MapReduce job is equal to the input. A sort operation is a good example of a job of this type.

c) Balloon-throughput analytics where the output is much larger than the input.
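
Returning to category (a), here is a word count written with the rmr2 package from the RHadoop collection; the output (one count per distinct word) is tiny compared with the input. The HDFS input path is hypothetical, and rmr2 is assumed to be configured against the cluster:

library(rmr2)

# Count words across a (hypothetical) directory of web logs in HDFS.
wordcount <- mapreduce(
  input        = "/data/weblogs",
  input.format = "text",
  map = function(k, lines) {
    words <- unlist(strsplit(lines, "\\s+"))
    keyval(words, 1)
  },
  reduce = function(word, counts) keyval(word, sum(counts))
)

# Pull the (small) aggregated result back to the R session for inspection.
result <- from.dfs(wordcount)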

Local YARN can crunch the data and use global HDFS to redistribute for high throughput analytics. Keep in mind that this might require another MapReduce job running on the output results, however, which can add traffic to the WAN link. Global YARN mitigates this even further by distributing the computational load.

Last but not least, fault tolerance is required at the server, rack, and data center levels. Passive redundancy solutions can mean days of downtime before operations resume. Active-active redundant YARN and HDFS provide a zero-downtime solution for MapReduce jobs and data.

To summarize, it is imperative for mission-critical applications to have active-active redundancy for HDFS and YARN. Not only does this protect data and prevent downtime, but it also allows big data to be processed at an accelerated rate by taking advantage of the aggregated CPU, network and storage of all servers across datacenters.

- Gurumurthy Yeleswarapu, Director of Engineering, WANdisco