Monthly Archive for November, 2014

Another Top 5 List on Hadoop

Top 5 lists are always fun, and here’s another one on Hadoop.  It’s fairly familiar territory for anyone who follows the space, but it does highlight a few important trends.  A few comments and quibbles:

  • The fact that open source is the foundation of Big Data software shouldn’t be surprising even to the government anymore.  After all, even the secretive NSA has publicly acknowledged use of Hadoop.
  • The only controversial claim is that Hadoop is set to replace Enterprise Data Warehouses (EDWs).  I’ve heard a lot of arguments for and against that point over the last year.  It seems Hadoop will at least complement EDWs and allow them to be used more efficiently, but complete replacement will depend on Hadoop maturing in a couple of key areas.  First, it will have to handle low-latency queries more efficiently.  Second, it will have to be as reliable and flexible as mature EDWs.  Keep an eye on projects like Apache Spark and, of course, Non-Stop Hadoop in this area.
  • I agree that the Internet of Things (IoT) will be a new and important source of data for Hadoop in the future.  However, just a point of terminology: no one will “embed Hadoop” into small devices.  Rather, data from these devices will be streamed into Hadoop.
  • Siri and the other smart assistants like Cortana are making waves, but IBM’s Watson seems to be years ahead in terms of analyzing complex unstructured situations.  Watson does use Hadoop for distributed processing but it has a much different paradigm than traditional MapReduce processing, and it needs to store a good chunk of its data in RAM.  That’s another sign that the brightest future for Hadoop will require new and exciting analytics frameworks.


Binary artifact management in Git

Paul Hammant has an interesting post on whether to check binary artifacts into source control.  Binary artifact management in Git is a question worth revisiting from time to time.

First, a bit of background.  Centralized SCM systems like Subversion and ClearCase are a bit more capable than Git when it comes to handling binary files.  One reason is sheer performance: since a Git repository has a full copy of the entire history, you just don’t want your clone (working copy) to be too big.  Another reason is assembling your working views.  ClearCase and to a lesser extent Subversion give you some nice tools to pick and choose pieces of a really big central repository and assemble the right working copy.  For example in a ClearCase config spec you can specify that you want a certain version of a third party library dependency.  Git on the other hand is pretty much all or nothing; it’s not easy to do a partial clone of a really big master repository.

Meanwhile, there has been a trend in development toward more formal build and artifact management systems.  You can define a dependency graph in a tool like Maven and use Maven, Artifactory, or even Jenkins to manage artifacts.  Along with offering benefits like not storing derived objects in source control, this trend covers Git’s weak spot in handling binaries.
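As a sketch of what that looks like, a binary dependency pinned in a Maven POM might read as follows (the group, artifact, and version shown are placeholders, not a real project):

```xml
<!-- Hypothetical third-party binary dependency pinned to an exact version.
     Maven resolves it from a repository manager such as Artifactory,
     so the JAR itself never lives in source control. -->
<dependency>
  <groupId>com.example</groupId>
  <artifactId>imaging-lib</artifactId>
  <version>2.4.1</version>
</dependency>
```

The nice part is that this declaration lives in `pom.xml`, an ordinary text file that gets versioned, reviewed, and diffed like any other source file.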

Now I’m not entirely sure about Paul’s reasons for recommending a switch back to managing binaries in Git.  Personally I prefer to capture dependencies properly in a configuration file like Maven’s POM, since I can exercise proper change control over that file.  The odd thing about SCM working-view definitions like config specs is that they aren’t strongly versioned the way source code files are.

But that being said, you may prefer to store binaries in source control, or you may have binaries that are actually source artifacts (like graphics or multimedia for game development).  So is it hopeless with Git?

Not quite.  There are a couple of options worth looking at.  First, you could try out one of the Git extensions like git-annex or git-media.  These have been around a long time and work well in some use cases.  However they do require extra configuration and changes to the way you work.

Another interesting option is the use of shared back-end storage for cloned repositories.  Most Git repository management solutions that offer forks use these options for efficient use of back-end storage space.  If you can accept working on shared development infrastructure rather than your own workstation, then you can clone a Git repository over the file protocol with the --shared (-s) option to share the object folder.  There’s also the --reference option to point a new clone at an existing object store.  These options make cloning relatively fast as you don’t have to create copies of large objects.  They don’t alleviate the pain of having the checked-out files in your clone directory, but if you’re working on a powerful server that may be acceptable.  The bigger drawback to the file protocol is the lack of access control.
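As a quick sketch of the two options (the repository paths here are made up for the demo):

```shell
# Set up a throwaway repository to act as the shared central copy
rm -rf /tmp/clone-demo && mkdir -p /tmp/clone-demo && cd /tmp/clone-demo
git init -q master
git -C master -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# --shared (-s): the clone borrows the source repository's object store
# instead of copying it
git clone -q -s master shared-clone

# --reference: borrow objects from an existing local repository while
# cloning (here we reference the same master repo for illustration)
git clone -q --reference /tmp/clone-demo/master master ref-clone

# Both clones record the borrowed store in objects/info/alternates
cat shared-clone/.git/objects/info/alternates
cat ref-clone/.git/objects/info/alternates
```

Note that neither clone now stands alone: if the borrowed object store disappears, the clones break, which is exactly why this works best on managed shared infrastructure.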

Management of large binaries is still an unsolved problem in the Git community.  There are effective alternatives and work-arounds but it’ll be interesting to see if anyone tries to solve the problem more systematically.

SmartSVN 8.6.2 General Access Now Available

We’re pleased to announce the latest release of SmartSVN, 8.6.2. SmartSVN is the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6.2 is available immediately for download from our website.

New Features include:

– Support for Mac OS X 10.10 Yosemite

Fixes include:

– Issue with log and graphing when no cache is created

For a full list of all improvements and features please see the changelog.

Contribute to further enhancements

Many of the issues resolved in SmartSVN were raised on our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.


A report on the recent Hadoop-related conference scene. At the end of October, Strata + Hadoop World was held in NYC, drawing about 6,000 attendees. Five years ago the crowd was all T-shirts and jeans (which has its own value for gathering technical feedback), but this year there were noticeably more businesspeople in suits. I take that as a sign that Hadoop is starting to be used in production systems. According to a Wikibon survey, 87% of users run Hadoop across multiple data centers, and 72% require 24×7 operation. Our booth also had many visitors who wanted to understand Non-Stop Hadoop. You can watch a healthcare case study below (with Japanese subtitles).

Meanwhile, Cloudera World Tokyo 2014, held in Japan in early November, drew around 650 attendees. We gave a talk there as well. Since large-scale production deployments are still rare in Japan, I worried that a session on Non-Stop Hadoop would draw a small crowd, but about 60 people attended. It seems that Japan, too, is reaching the point where Hadoop is needed in production systems. Non-Stop Hadoop enables 24×7 operation, and because it can maintain replicas of the data, it also makes zero-downtime migrations and version upgrades possible. Although still at the alpha-release stage, it will even make it possible to share data across different distributions, letting you use the Hadoop that suits you best without vendor lock-in. I believe it can address many of the problems organizations encounter as Hadoop moves into production systems.


About Kenji Ogawa (小川 研之)

Leading WANdisco’s business in Japan since November 2013. Previously worked at NEC on the development of domestic mainframes, Unix, and middleware, and later on scouting Silicon Valley startups, partner management, and offshore development in India.

Starting at WANdisco: Gordon Vaughan, SDM

Hello world.

So, I’ve been asked to write a blog about my experience of starting at WANdisco. It was only 5 weeks ago, but it still feels a bit weird to write about it because it simultaneously feels like yesterday and a year ago, in equally positive measures. I’ll try to give an idea of why that is, and why I’m happy that I chose WANdisco as the next step in my career.

With my previous employer, I’d had a brilliant time for around 3-4 years; working my way up, gaining experience, pushing myself to go above and beyond every day. It was fantastic. Then, after a great run, things started to slow. The business got quite staid, opportunities to learn dried up and instead of progressing we were living in a perpetual ‘firefighting’ limbo. At the same time, my employer was owned by a larger organisation that was gradually, but perceptibly, making changes that impacted on the way our business performed. I’m sure many readers will have seen their employer go through similar absorption, and felt the tremors themselves first hand.

After a couple of years of stagnant career progress, albeit in a comfortable and fairly happy setting, an opportunity was pointed out to me at WANdisco.

It’s important at this point that I make something clear: I am not a technical expert. I’m one of those people that complete novices think are magical because I know how to use Google. On an initial Googling of WANdisco, that could seemingly rule you out, because they talk in confident terms about their MultiSite products, enabling active-active replication of development environments across the globe at LAN speed with… Nope, I’m lost again… When I stepped away from Google and thought in isolation about what it was they were saying, it made a lot more sense. A change management system that runs globally as fast as locally, and that’s the same wherever you access it from. We forget sometimes that massive files take ages to download across large geographic distances, and if that’s happening all the time, how much time is lost waiting for updates? That, plus the fact that MultiSite by its very nature means having multiple copies, so you also have effective disaster recovery. I suddenly found myself interested.

I have to admit that Big Data was the product that made me really excited. Some of the stats around the production of data are mind-blowing. By the time you have read this far down the page, it’s likely that the amount of data recorded globally outstrips anything from the early 90s back to the beginning of time. All that data needs to go somewhere and it’s probably all usable, but how? I mean, physically, how? I saw a video by David Richards, the man who started WANdisco, explaining that Big Data had been used in the automotive industry to accurately predict the failure rate of components on cars, making pro-active repair possible. The video went on to mention how that could apply to healthcare, and then that wave of realisation hit. Big Data could well be the biggest thing to happen to this world since the Internet itself. How *amazing* would it be to help our customers build and shape that product to their own specification? Notice the ‘our’ in that sentence – I was already on board in my mind 🙂

After polishing my CV, having a shave and a haircut and all the other prep you would normally do for an interview, that ‘our’ became a reality 6 weeks later.

The role I fulfil is that of Service Delivery Manager. In title, that meant doing exactly the same thing as I did in my old workplace. In reality, it was everything that role should have been, and more besides. We perform quarterly service reviews with our customers, whether they have needed our support team or not, to talk to them about how we’re doing from a global support perspective, how the product is working for them, if there are any challenges or changes coming up, etc. That’s a mandated part of the service and not a nice-to-have – unless of course the customer chooses not to have them! What’s key is that we’re always talking to our customers, always looking for the next hurdle before it hits us, always being open and honest about our performance. It’s that approach that we believe will provide us the valuable intelligence we need to keep evolving, and showing our customers that we’re listening and adapting constantly to their needs.

The thought of having these kinds of conversations with customers without product knowledge was, frankly, terrifying. Thankfully, WANdisco had a full induction plan in place to ensure I had a full day’s worth of training across Subversion, MultiSite and Big Data to get the basics, and since then it’s been topped up by more in-depth sessions, particularly on Big Data. What I think is brilliant about the industry we’re in is that a lot of the software and processes we work with are open source, and there’s a wealth of information available on them. It’s not like the textbook models of old; it’s seminars, product demonstrations, lectures and other learning tools presented in engaging formats across the internet. YouTube has been a fantastic resource for learning; where previously I’d used it solely for watching Nanners and Sips playing various games, now I find myself lost in hours of concepts and theories that are still sinking in. It’s the diversity, yet relevance, of the information available to you that simply boggles the mind, and it’s all so new and rapidly changing that it’s compelling. WANdisco provide a good proportion of that content, either themselves or via expos and conferences, which really makes you feel like you’re part of an important player in the community.

Of course, it’s very early days for me in learning, and there’s a strong chance that I’ll never have the knowledge that some of the people around the business hold. I wouldn’t have it any other way though; I love that we have so many brilliant minds across multiple sites. The culture within WANdisco is very similar to that of the open source community as a whole, in that we share, we collaborate, we discuss, and everyone learns. Everyone is approachable, and you can bet if the first person you speak to doesn’t have the answer, they will be able to walk you over to someone who does. In my role it’s vital that I have access to that knowledge quickly and easily, so it’s fantastic to have that ‘resource’ so accessible.

At this point I need to confess something: it’s now 13 weeks since I started, and it’s taken me 8 weeks to write this because I’ve been so busy. I’ve loved every second of it, and I love the fact that when I see a clock say 4pm I now think ‘where has the day gone?’ instead of ‘oh no, there’s still 2 hours left…’ There aren’t enough hours in the day, genuinely.

I’ll sign off there, but if you’re looking at WANdisco as a potential employer, or even if you think you’re happy where you are but find yourself reading this for some bizarre reason, do take a look at our careers site. It’s a great place to work, a great place to learn, and simply a great place to be.

WANdisco Engineering Offsite 2014

Hello from Belfast!

I’ve been enjoying a quick visit to Belfast this week to participate in WANdisco’s engineering offsite meeting. WANdisco has engineering offices in California, England, Northern Ireland, and India, and it’s really a pleasure to work with great people around the world. Belfast is also a terrific city to visit, with an amazing local food scene and a fun downtown area.

WANdisco is a fast-paced company and it’s always interesting to take a breath and catch up with colleagues that you normally only see on video conferencing. We’ve achieved an amazing amount in the past year, launching two new products (Access Control Plus and Gerrit integration for Git MultiSite) with a few more in various stages of work. Every WANdisco office has people with different viewpoints and skill sets, but keeping the communication channels open requires a deliberate investment in staying in touch. Of course from one perspective it’s really easy: we use our MultiSite products internally, so sharing source code is dead simple…

Anyway, our batteries are recharged, we’ve got a plan for the rest of this year going into 2015, and we’re going to continue to deliver products that solve tough problems and delight our customers. That’s all for now – someone said there were pubs in Ireland, so I’m off to explore!


SmartSVN 7.6.4 With SSL Fix Available Now

We’re pleased to announce the release of SmartSVN 7.6.4, available for Mac, Windows and Linux. SmartSVN is available for immediate download from our website.

This is an update to the older version of SmartSVN, based on SVNKit rather than JavaHL. The update includes a fix for the POODLE vulnerability affecting SSLv3.

For a full list of changes please see the changelog.

If you’ve any requests or feedback please drop a post into our dedicated SVN Forum.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Thoughts on Hadoop architecture

Gartner just released a new research note comparing Hadoop distributions.  Although the note itself is behind a paywall, some of the key findings are posted openly.  And I find it very interesting that when Gartner shares its thoughts on Hadoop architecture and distributions, it tends to focus much more on the big picture of how to design the best Hadoop deployment for your business.

The item that stood out most was the finding that Hadoop is becoming the default cluster management solution.  YARN really changed the focus of Hadoop from a batch processing system to a general purpose platform for large scale data management and computation.  The Hadoop ecosystem is evolving so quickly that it can be frightening, but you do get some ‘future proofing’ as well – whenever the next big thing comes along, chances are it will run on Hadoop, just like Spark does.

On a related note, Gartner also recommends focusing on your ideal architecture rather than on the nuts-and-bolts of any particular distribution.  That’s just good sense; if you know what you want to do with your data, chances are Hadoop is now mature enough to accommodate you.  And of course, WANdisco provides some clever solutions to help all of those Hadoop clusters work better together.

Anyway, the research note is a nice read, particularly if you’re feeling overwhelmed by how complicated Hadoop is getting.