Experiences with R and Big Data

The next releases of Subversion MultiSite Plus and Git MultiSite will embed Apache Flume for audit event collection and transmission. We’re taking an incremental approach to audit event collection and analysis, as a busy site can generate a large volume of audit data.

In the meantime, I’ve been experimenting with some more advanced and customized analysis. I’ve got a test system instrumented with a custom Flume configuration that pipes data into HBase instead of our Access Control Plus product. The question then is how to pull useful answers out of HBase for questions like: what’s the distribution of SCM activity across the nodes in the system?

It’s actually not too bad to get that information directly from an HBase scan, but I also wanted to see some pretty charts. Naturally I turned to R, which brought me back to the topic of how to use R to analyze Big Data.

A quick survey showed three possible approaches:

  • The RHadoop packages provided by Revolution Analytics, which include RHBase and Rmr (R MapReduce)
  • The SparkR package
  • The Pivotal package that lets you analyze data in HAWQ

I’m not using Pivotal’s distribution and I didn’t want to invest time in looking at a MapReduce-style analysis, so that left me with RHBase and SparkR.

Both packages were reasonably easy to install as these things go, and RHBase let me directly perform a table scan and crunch the output data set (there’s a sketch of this below). I was a bit worried about what would happen once a table scan started returning millions of rows instead of thousands, so I wanted to try SparkR as well.
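
Here’s roughly what that scan looks like. This is a minimal sketch rather than production code: it assumes an HBase Thrift server on the default port, the table name, column family, and column here are illustrative, and the exact shape of the rows the iterator returns depends on how hb.init() is configured.

# Connect to a (hypothetical) "authz_event" table via the HBase Thrift server
library(rhbase)
hb.init(host = "127.0.0.1", port = 9090)

# Scan the audit table in batches, pulling the node column for each event
iter <- hb.scan("authz_event", startrow = "0", colspec = "event:node")
nodes <- character(0)
while (length(rows <- iter$get(1000)) > 0) {
  # each row is roughly list(rowkey, list(cell values)); the precise
  # structure depends on the serializer chosen in hb.init()
  nodes <- c(nodes, vapply(rows, function(r) as.character(r[[2]][[1]]), character(1)))
}
iter$close()

table(nodes)   # distribution of SCM activity across nodes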

SparkR let me define a data source (in this case an export from HBase) and then run a functional reduce on it. In the first step I would produce some metric of interest (AuthZ success/failure for some combination of repository and node location) for each input line, and then reduce by key to get aggregate statistics. Nothing fancy, but Spark can handle a lot more data than R on a single workstation. The Spark programming paradigm fits nicely into R; it didn’t feel nearly as foreign as writing MapReduce or HBase scans. Of course, Spark is also considerably faster than normal MapReduce.

Here’s a small code snippet for illustration:

# sc is the SparkR context from sparkR.init(); the CSV column layout shown
# here (node, repository, outcome) is illustrative, not the real export.
lines <- textFile(sc, "/home/vagrant/hbase-out/authz_event.csv")
mlines <- lapply(lines, function(line) {
  fields <- strsplit(line, ",")[[1]]
  key <- paste(fields[2], fields[1], fields[3], sep = ":")  # repository:node:outcome
  metric <- 1                                               # count one event per line
  list(key, metric)
})
parts <- reduceByKey(mlines, "+", 2L)
reduced <- collect(parts)

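As for the pretty charts, the collected pairs drop straight into a data frame for plotting. Another minimal sketch: it assumes reduced has the shape collect() returns for the pipeline above (a list of two-element key/value lists), and ggplot2 is just my plotting preference, not something the pipeline requires.

# Turn the collected (key, value) pairs into a data frame and chart
# the number of AuthZ events per repository:node:outcome key
df <- data.frame(
  key   = vapply(reduced, function(p) as.character(p[[1]]), character(1)),
  count = vapply(reduced, function(p) as.numeric(p[[2]]), numeric(1))
)

library(ggplot2)
ggplot(df, aes(x = key, y = count)) +
  geom_bar(stat = "identity") +
  xlab("repository : node : outcome") +
  ylab("AuthZ events")
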
In reality, I might use SparkR in a lambda architecture as part of my serving layer and RHBase as part of the speed layer.

These packages already make Big Data feel very accessible from the tools that data scientists use every day, and given that data analysis is driving a lot of the business use cases for Hadoop, I’m sure we’ll see more innovation in this area soon.
