Application-Specific Data? It’s So 2013

Looking back at the past 10 years of software, the word ‘boring’ comes to mind.  The buzzwords were things like ‘web services’ and ‘SOA’.  CIOs loved the promise of these things, but they could not deliver.  The idea of build once, reuse everywhere really was the ‘nirvana’.

Well, it now seems like we can do all of that stuff.

As I’ve said before, Big Data is not a great name because it implies that all we are talking about is a big database with tons of data.  Actually that’s only part of the story. Hadoop is the new enterprise application platform.  The key word there is platform.  If you could have a single general-purpose data store that could service ‘n’ applications, then the whole notion of database design is over.  Think about the new breed of apps on a cell phone, the social media platforms and the web search engines.  Most of these already work this way: data is stored in a general-purpose, non-specific data store and then used by a wide variety of applications.  The new phrase for this data store is a ‘data lake’, implying a large, ever-growing body of changing data stored without any specific structure.

Talking to a variety of CIOs recently, I’ve found they are very excited by the prospect of both amalgamating data so it can be used and bringing into play data that previously could not be used: unstructured data in a wide variety of formats, like Word documents and PDF files.  This also means the barriers to entry are low.  Many people believe that adopting Hadoop requires a massive re-skilling of the workforce.  It does, but not in the way most people think.  Actually getting the data into Hadoop is the easy bit (‘data ingestion‘ is the new buzzword).  It’s not like the old relational database days, where you first had to model the data using data normalization techniques and then use ETL to get the data into a usable format.  With a data lake you simply set up a server cluster and load the data; creating a data model and using ETL is simply not required.
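
To make that concrete, here’s a minimal sketch of what the ingestion step can look like, assuming the Hadoop ‘hdfs’ command-line tool is installed and pointed at a running cluster; the file names and lake layout are hypothetical.  Notice what’s missing: no schema, no normalization, no ETL.

```python
import subprocess

def hdfs(*args):
    # Invoke the Hadoop filesystem shell; assumes the 'hdfs' CLI is on the PATH.
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Land the raw files in the lake exactly as they are:
# no data model, no normalization, no ETL step.
hdfs("-mkdir", "-p", "/datalake/raw/contracts")
hdfs("-put", "quarterly-report.pdf", "/datalake/raw/contracts/")
hdfs("-put", "vendor-agreement.docx", "/datalake/raw/contracts/")
```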

The real transformation and re-skilling is in application development.  Applications are moving to the data; in today’s client-server world it’s the other way around.  We have seen this type of re-skilling before, like the move from COBOL to object-oriented programming.

In the same way that client-server technology disrupted mainframe computer systems, big data will disrupt client-server.  We’re already seeing this in the market today.  It’s no surprise that the most successful companies in the world today (Google, Amazon, Facebook, etc.) are all actually big data companies.  This isn’t a ‘might be’; it’s already happened.

Binary artifact management in Git

Paul Hammant has an interesting post on whether to check binary artifacts into source control.  It’s a question worth revisiting from time to time.

First, a bit of background.  Centralized SCM systems like Subversion and ClearCase are a bit more capable than Git when it comes to handling binary files.  One reason is sheer performance: since a Git repository has a full copy of the entire history, you just don’t want your clone (working copy) to be too big.  Another reason is assembling your working views.  ClearCase, and to a lesser extent Subversion, give you some nice tools to pick and choose pieces of a really big central repository and assemble the right working copy.  For example, in a ClearCase config spec you can specify that you want a certain version of a third-party library dependency.  Git, on the other hand, is pretty much all or nothing; it’s not easy to do a partial clone of a really big master repository.

Meanwhile, there has been a trend in development toward more formal build and artifact management systems.  You can define a dependency graph in a tool like Maven and use Maven, Artifactory, or even Jenkins to manage artifacts.  Along with offering benefits like not storing derived objects in source control, this trend covered Git’s weak spot in handling binaries.

Now I’m not entirely sure about Paul’s reasons for recommending a switch back to managing binaries in Git.  Personally, I prefer to capture dependencies properly in a configuration file like Maven’s POM, as I can exercise proper change control over that file.  The odd thing about SCM working view definitions like config specs is that they aren’t strongly versioned the way source code files are.

That being said, you may prefer to store binaries in source control, or you may have binaries that are actually source artifacts (like graphics or multimedia for game development).  So is it hopeless with Git?

Not quite.  There are a couple of options worth looking at.  First, you could try one of the Git extensions like git-annex or git-media.  These have been around a long time and work well in some use cases.  However, they do require extra configuration and changes to the way you work.
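
As a rough illustration of the workflow change, here’s a sketch of checking a large binary into a repository with git-annex.  It assumes git-annex is installed and that you’re inside an existing clone; the file path is made up.

```python
import subprocess

def git(*args):
    # Run a git subcommand in the current repository; raise on failure.
    subprocess.run(["git", *args], check=True)

# git-annex keeps the big file's content outside the normal object
# database and commits only a small pointer (a symlink) in its place.
git("annex", "init", "my-workstation")        # one-time setup in each clone
git("annex", "add", "assets/skybox.psd")      # checksums the file, swaps in a pointer
git("commit", "-m", "Add skybox source art")
# Collaborators later run 'git annex get assets/skybox.psd' to fetch
# the actual content on demand; that's the change in how you work.
```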

Another interesting option is the use of shared back-end storage for cloned repositories.  Most Git repository management solutions that offer forks use this approach to make efficient use of back-end storage space.  If you can accept working on shared development infrastructure rather than your own workstation, then you can clone a Git repository over the file protocol with the -s option to share the object folder.  There’s also the --reference option, which points a new Git clone at an existing object store.  These options make cloning relatively fast, as you don’t have to create copies of large objects.  They don’t alleviate the pain of having the checked-out files in your clone directory, but if you’re working on a powerful server that may be acceptable.  The bigger drawback of the file protocol is the lack of access control.
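
Here’s a small sketch of both options, assuming git is installed and using a hypothetical repository path on a shared development server.

```python
import subprocess

def run(*cmd):
    # Echo the command, then run it; raise on failure.
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# A local repository that already holds the large objects
# (hypothetical path on shared development infrastructure).
MASTER = "/srv/git/big-project.git"

# Option 1: -s/--shared. The new clone's .git/objects/info/alternates
# file points back at MASTER, so no objects are copied at all. This
# only works for local (file protocol) clones, hence the
# access-control caveat above.
run("git", "clone", "--shared", MASTER, "clone-shared")

# Option 2: --reference. Clone from anywhere, but borrow any objects
# that already exist in the reference repository on local disk.
run("git", "clone", "--reference", MASTER, MASTER, "clone-referenced")
```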

Management of large binaries is still an unsolved problem in the Git community.  There are effective alternatives and workarounds, but it’ll be interesting to see if anyone tries to solve the problem more systematically.

SmartSVN 8.6.2 General Access Now Available

We’re pleased to announce the latest release of SmartSVN, 8.6.2. SmartSVN is the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6.2 is available immediately for download from our website.

New Features include:

– Support for Mac OS X 10.10 Yosemite

Fixes include:

– Issue with log and graphing when no cache is created

For a full list of all improvements and features please see the changelog.

Contribute to further enhancements

Many of the issues resolved in SmartSVN were raised on our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Hadoop Trends Seen at Recent Trade Shows

This is a report from recent Hadoop-related trade shows. At the end of October, Strata+Hadoop World was held in NYC, with 6,000 attendees this time. Five years ago the crowd was all T-shirts and jeans (which, in its own way, is valuable for gathering technical feedback), but this year there were many more businesspeople in suits. I take this as a sign that Hadoop is starting to be used in production systems. According to Wikibon research, 87% of users run Hadoop across multiple data centers, and 72% require 24x7 operation. Our booth had many visitors wanting to understand Non-Stop Hadoop. You can see a healthcare case study below (Japanese subtitles available).

Meanwhile, Cloudera World Tokyo 2014, held in Japan in early November, drew around 650 attendees. We gave a talk there as well. Given how few large-scale production systems there are in Japan today, I worried that a session on the Non-Stop Hadoop theme would see low turnout, but about 60 people attended. It seems that in Japan, too, Hadoop is starting to be needed for production systems. Non-Stop Hadoop enables 24x7 operation, and because it can also maintain replicas of the data, it makes zero-downtime migrations and version upgrades possible. Although still at the alpha-release stage, it will also make it possible to share data across different distributions, so you can use the best Hadoop for the job without vendor lock-in. We believe it can address the many problems that come up as Hadoop moves into production systems.

Starting at WANdisco: Gordon Vaughan, SDM

Hello world.

So, I’ve been asked to write a blog about my experience of starting at WANdisco. It was only 5 weeks ago, but it still feels a bit weird to write about it because it simultaneously feels like yesterday and a year ago, both in a good way. I’ll try to give an idea of why that is, and why I’m happy that I chose WANdisco as the next step in my career.

With my previous employer, I’d had a brilliant time for around 3-4 years: working my way up, gaining experience, pushing myself to go above and beyond every day. It was fantastic. Then, after a great run, things started to slow. The business got quite staid, opportunities to learn dried up, and instead of progressing we were living in a perpetual ‘firefighting’ limbo. At the same time, my employer was owned by a larger organisation that was gradually, but perceptibly, making changes that impacted the way our business performed. I’m sure many readers will have seen their employer go through a similar absorption, and felt the tremors first hand.

After a couple of years of stagnant career progress, albeit in a comfortable and fairly happy setting, an opportunity was pointed out to me at WANdisco.

It’s important at this point that I make something clear: I am not a technical expert. I’m one of those people that complete novices think are magical because I know how to use Google. On your initial Googling of WANdisco, that could seemingly rule you out, because they talk in confident terms about their MultiSite products enabling active-active replication of development environments across the globe at LAN speed with… Nope, I’m lost again… When I stepped away from Google and thought in isolation about what it was they were saying, it made a lot more sense. A change management system that runs globally as fast as locally, and that’s the same wherever you access it from. We forget sometimes that massive files take ages to download over large geographic distances, and if that’s happening all the time, how much time is lost waiting for updates? Add to that the fact that MultiSite, by its very nature, means having multiple copies, so you also get effective disaster recovery. I suddenly found myself interested.

I have to admit that Big Data was the product that made me really excited. Some of the stats around the production of data are mind-blowing. By the time you have read this far down the page, it’s likely that the amount of data recorded globally outstrips anything from the early 90s back to the beginning of time. All that data needs to go somewhere, and it’s probably all usable, but how? I mean, physically, how? I saw a video by David Richards, the man who started WANdisco, explaining that Big Data had been used in the automotive industry to accurately predict the failure rate of components on cars, making pro-active repair possible. The video went on to mention how that could apply to healthcare, and then the wave of realisation hit. Big Data could well be the biggest thing to happen to this world since the Internet itself. How *amazing* would it be to help our customers build and shape that product to their own specification? Notice the ‘our’ in that sentence – I was already on board in my mind :)

After polishing my CV, having a shave and a haircut and all the other prep you would normally do for an interview, that ‘our’ became a reality 6 weeks later.

The role I fulfil is that of Service Delivery Manager. In title, that meant doing exactly the same thing as I did in my old workplace. In reality, it was everything that role should have been, and more besides. We perform quarterly service reviews with our customers, whether they have needed our support team or not, to talk to them about how we’re doing from a global support perspective, how the product is working for them, whether there are any challenges or changes coming up, and so on. That’s a mandated part of the service and not a nice-to-have – unless of course the customer chooses not to have them! What’s key is that we’re always talking to our customers, always looking for the next hurdle before it hits us, always being open and honest about our performance. It’s that approach that we believe will provide us with the valuable intelligence we need to keep evolving, and show our customers that we’re listening and adapting constantly to their needs.

The thought of having these kinds of conversations with customers without product knowledge was, frankly, terrifying. Thankfully, WANdisco had a full induction plan in place to ensure I had a full day’s worth of training across Subversion, MultiSite and Big Data to get the basics, and since then it’s been topped up by more in-depth sessions, particularly on Big Data. What I think is brilliant about the industry we’re in is that a lot of the software and processes we work with are open source, and there’s a wealth of information available on them. It’s not like the textbook models of old; it’s seminars, product demonstrations, lectures and other learning tools presented in engaging formats across the internet. YouTube has been a fantastic resource for learning; where previously I’d used it solely for watching Nanners and Sips playing various games, now I find myself lost in hours of concepts and theories that are still sinking in. It’s the diversity, yet relevance, of the information available to you that simply boggles the mind, and it’s all so new and rapidly changing that it’s compelling. WANdisco provide a good proportion of that content, either themselves or via expos and conferences, which really makes you feel like you’re part of an important player in the community.

Of course, it’s very early days for me in learning, and there’s a strong chance that I’ll never have the knowledge that some of the people around the business hold. I wouldn’t have it any other way though; I love that we have so many brilliant minds across multiple sites. The culture within WANdisco is very similar to that of the open source community as a whole, in that we share, we collaborate, we discuss, and everyone learns. Everyone is approachable, and you can bet if the first person you speak to doesn’t have the answer, they will be able to walk you over to someone who does. In my role it’s vital that I have access to that knowledge quickly and easily, so it’s fantastic to have that ‘resource’ so accessible.

At this point I need to confess something: it’s now 13 weeks since I started, and it’s taken me 8 weeks to write this because I’ve been so busy. I’ve loved every second of it, and I love the fact that when I see a clock say 4pm I now think ‘where has the day gone?’ instead of ‘oh no, there’s still 2 hours left…’ There aren’t enough hours in the day, genuinely.

I’ll sign off there, but if you’re looking at WANdisco as a potential employer, or even if you think you’re happy where you are but find yourself reading this for some bizarre reason, do take a look at our careers site. It’s a great place to work, a great place to learn, and simply a great place to be.

WANdisco Engineering Offsite 2014

Hello from Belfast!

I’ve been enjoying a quick visit to Belfast this week to participate in WANdisco’s engineering offsite meeting. WANdisco has engineering offices in California, England, Northern Ireland, and India, and it’s really a pleasure to work with great people around the world. Belfast is also a terrific city to visit, with an amazing local food scene and a fun downtown area.

WANdisco is a fast-paced company and it’s always interesting to take a breath and catch up with colleagues that you normally only see on video conferencing. We’ve achieved an amazing amount in the past year, launching two new products (Access Control Plus and Gerrit integration for Git MultiSite) with a few more in various stages of work. Every WANdisco office has people with different viewpoints and skill sets, but keeping the communication channels open requires an investment in keeping in touch. Of course from one perspective it’s really easy: we use our MultiSite products internally, so sharing source code is dead simple…

Anyway, our batteries are recharged, we’ve got a plan for the rest of this year going into 2015, and we’re going to continue to deliver products that solve tough problems and delight our customers. That’s all for now – someone said there were pubs in Ireland, so I’m off to explore!


SmartSVN 7.6.4 With SSL Fix Available Now

We’re pleased to announce the release of SmartSVN 7.6.4, available for Mac, Windows and Linux. SmartSVN is available for immediate download from our website.

This is an update to the older version of SmartSVN, based on SVNKit rather than JavaHL. The update includes a fix for the POODLE vulnerability that affects SSLv3.

For a full list of changes please see the changelog.

If you’ve any requests or feedback please drop a post into our dedicated SVN Forum.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Thoughts on Hadoop architecture

Gartner just released a new research note comparing Hadoop distributions.  Although the note itself is behind a paywall, some of the key findings are posted openly.  I find it very interesting that when Gartner shares its thoughts on Hadoop architecture and distributions, it tends to focus much more on the big picture of how to design the best Hadoop for your business.

The item that stood out most was the finding that Hadoop is becoming the default cluster management solution.  YARN really changed the focus of Hadoop from a batch processing system to a general-purpose platform for large-scale data management and computation.  The Hadoop ecosystem is evolving so quickly that it can be frightening, but you do get some ‘future proofing’ as well – whenever the next big thing comes along, chances are it will run on Hadoop, just like Spark does.

On a related note, Gartner also recommends focusing on your ideal architecture rather than on the nuts-and-bolts of any particular distribution.  That’s just good sense; if you know what you want to do with your data, chances are Hadoop is now mature enough to accommodate you.  And of course, WANdisco provides some clever solutions to help all of those Hadoop clusters work better together.

Anyway, the research note is a nice read, particularly if you’re feeling overwhelmed by how complicated Hadoop is getting.

Solving the 3 biggest Hadoop challenges

A colleague recently pointed me to this great article on the 3 biggest Hadoop challenges. The article is written by Sean Suchter, the CEO of Pepperdata, and offers a practical perspective on how these challenges are seen and managed through workarounds.

Ultimately, none of those workarounds is very satisfactory. Fortunately, Non-Stop Hadoop offers a compelling way to solve these challenges, in whole or in part.

Resource contention due to mixed workloads and multi-tenancy environments

This problem seems to be the biggest driver of Hadoop challenges. Of the many workarounds Suchter discusses, all seem either manually intensive (tweaking Hadoop parameters for better performance) or limiting from a business perspective (gating production jobs or designing workflows to avoid bottlenecks).

As I’ve written before, the concept of a logical data lake with a unified HDFS namespace largely overcomes this challenge. Non-Stop Hadoop lets you set up multiple clusters at one or several locations, all sharing the same data – unless you choose to restrict the sharing through selective replication. Now you can run jobs on the most appropriate cluster (e.g. using high-memory nodes for in-memory processing) and avoid the worst of the resource contention.

Difficult troubleshooting

We all know the feeling of being under the gun while an important production system is offline. While the Hadoop ecosystem will surely mature in the coming years, Non-Stop Hadoop gives you built-in redundancy. Lose a NameNode? You’ve got 8 more. The whole cluster is shot? You’ve got two others that can fill in the gap…immediately.

Inefficient use of hardware

It’s really a tough problem: you need enough hardware to handle peak bursts of activity, but then a lot of it will sit idle during non-peak times. Non-Stop Hadoop gives you a clever solution: put your backup cluster to work. The backup cluster is effectively just an extension of the primary cluster when you use Non-Stop Hadoop. Point some jobs at the second cluster during periods of peak workload and you’ll have easy load balancing.

To borrow an analogy from the electric power industry, do you want to maintain expensive and inefficient peaker units for the two hours when the air-conditioning load is straining the grid? Or do you want to invest in distributed power setups like solar, wind, and neighborhood generation?

A better Hadoop

Non-Stop Hadoop is Hadoop…just better. Let’s solve your problems together.

GitLab and Git MultiSite: Architecture

The architecture of GitLab running with Git MultiSite is worth exploring.  In the interest of saving a thousand words, here’s the picture.

[Diagram: GitLab deployment with Git MultiSite]

As you can see, the topology is quite a bit more complex with a Git repository management system that relies on multiple data stores.  Git MultiSite coordinates with GitLab to replicate all repository activity, including wiki repositories.  Git MultiSite also replicates some important files, like the GitLab authorization files used for access control.

As for the other data stores, we’re relying on GitLab’s ability to run with multiple web apps connected to a single logical relational database and a single logical Redis database.  They can be connected directly or via pass-through mirrors.  Kudos to the GitLab team for a clean architecture that facilitates this multi-master setup; they’ve avoided some of the nasty caching issues that other applications encounter.  This topology is in fact similar to what you can do with GitLab when you use shared storage for the repositories.  Git MultiSite provides the missing link: full repository replication with robust performance in a WAN environment and a shared-nothing architecture.

Short of relying completely on Git as a data store for code reviews and other metadata, this architecture is about as clean as it gets.

Now for some nuts and bolts…

We are making some simplifying assumptions for the first release of GitLab integration.  The biggest assumption is that all nodes run all the software, and that all repositories originate in GitLab and exist on all nodes.  We plan to relax some of these constraints in the future.

And what about performance?  Well, I’m happy to relate that you’ll see very good performance in all cases and much improved performance in some cases.  Balancing repository activity across several nodes gives better throughput when the system is under practical load.

[Chart: performance results]

Well, that picture saved a few words, but nothing speaks better than a demo or a proof-of-concept deployment.  Contact us for details!


Scalable Social Coding

I’m very pleased to announce that Git MultiSite now formally supports GitLab, a leading on-premise Git collaboration and management suite.  With this and future integrations, Git MultiSite’s promise of a truly distributed Git solution is coming to fruition.

WANdisco first announced Git MultiSite in 2013.  Git MultiSite provides our patented active-active replication for Git, giving you a deployment of fully writable peer nodes instead of a single ‘master’ Git server.  The next step came with Access Control Plus in 2014, which brought Git repositories under a unified security and management umbrella.

And now we’re tackling the final piece of the puzzle.  Those of you active in the Git ecosystem know that most companies deploy Git as part of an integrated repository management solution that also provides social coding and collaboration tools — code review, wikis, and sometimes lightweight issue tracking.

In one sense, Git MultiSite is still a foundational technology that can replicate Git repositories managed by almost any system.  And indeed we do have customers who deployed Git MultiSite with GitLab long before we did any extra work in this area.

The devil is in the details though.  For one thing, some code review systems actually modify a Git repository using non-standard techniques in response to code review activity like approving a merge request.  So we had to make a few under-the-hood modifications to support that workflow.

Perhaps more importantly, Git MultiSite and Access Control Plus provide consistent (and writable) access to repository and access control data at all sites.  But if the collaboration tool is a key part of the workflow, you really need that portal to be available at every node as well.  And we’ve worked hard with the GitLab crew to make that possible.

So what does that all mean?  You get it all:

  • LAN speed access to repositories at every site
  • A built-in HA/DR strategy for zero down time
  • Easy scalability for build automation or a larger user base
  • Fast access to the GitLab UI for code reviews and more at every site
  • Consistent access control at every site
  • All backed by WANdisco’s premier support options

Interested?  I’ll be publishing more details on the integration in the near future.  In the meantime, give us a call and we’ll give you a full briefing.