Tag Archive for 'HDFS'

The inspiration for WANdisco Fusion


Roughly two years ago, we sat down to start work on a project that finally came to fruition this week.

At that meeting, we had set ourselves the challenge of redefining the storage landscape. We wanted to map out a world where there was complete shared storage, but where the landscape remained entirely heterogeneous.

Why? Because we’d witnessed the beginnings of a trend that has only grown more pronounced with the passage of time.

From the moment we started engaging with customers, we were struck by the extreme diversity of their storage environments. Regardless of whether we were dealing with a bank, a hospital or utility provider, different types of storage had been introduced across every organization for a variety of use cases.

In time, however, these same companies wanted to start integrating their different silos of data, whether to run real-time analytics or to gain a full 360-degree perspective of performance. Yet preserving diversity across data centers was critical, given that each storage type has its own strengths.

They didn’t care about uniformity. They cared about performance, and this meant being able to have the best of both worlds. Being able to deliver this became the Holy Grail – at least in the world of data centers.

This isn’t quite the Gordian Knot, but it is certainly a very difficult, complex problem, and possibly one that could only be solved with our core patented IP, DConE.

Then we had a breakthrough.

Months later, I’m proud to formally release WANdisco Fusion (WD Fusion), the only product that enables WAN-scope active-active synchronization of different storage systems into one place.

What does this mean in practice? Well, it means that you can use Hadoop distributions like Hortonworks, Cloudera or Pivotal for compute, Oracle BDA for fast compute, and EMC Isilon for dense storage. You could even use a complete variety of Hadoop distros and versions. Whatever your set-up, with WD Fusion you can leverage new and existing storage assets immediately.

With it, Hadoop is transformed from being something that runs within a data center into an elastic platform that runs across multiple data centers throughout the world. WD Fusion allows you to update your storage infrastructure one data center at a time, without impacting application availability or having to copy vast swathes of data once the update is done.

When we were developing WD Fusion we agreed upon two things. First, we couldn’t produce anything that made changes to the underlying storage system – this had to behave like a client application. Second, anything we created had to enable a complete single global name-space across an entire storage infrastructure.

With WD Fusion, we allow businesses to bring together different storage systems by leveraging our existing intellectual property – the same Paxos-powered algorithm behind Non-Stop Hadoop, Subversion Multisite and Git Multisite – without making any changes to the platform you’re using.

Another way of putting it is we’ve managed to spread our secret sauce even further.

We have some of the best computer scientists in the world working at WANdisco, but I’m confident that this is the most revolutionary project any of us have ever worked on.

I’m delighted to be unveiling WD Fusion. It’s a testament to the talent and character of our firm, the result of looking at an impossible scenario and saying: “Challenge accepted.”


About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world’s most promising technology companies. Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of AltoStor, which accelerated the development of WANdisco’s first products for the Big Data market. The majority of WANdisco’s core technology is now produced out of the company’s flourishing software development bases in David’s hometown of Sheffield, England and in Belfast, Northern Ireland. David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on the London Stock Exchange (WAND:LSE), raising over £24m to drive business growth. With over 15 years’ executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in enterprise software and is recognised as an industry leader in Enterprise Application Integration and its standards. David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail. Specialties: IPOs, start-ups, entrepreneurship, board membership, advisory roles, venture capital, offshore development, financing, M&A.

Big Data ETL Across Multiple Data Centers

Scientific applications, weather forecasting, click-stream analysis, web crawling, and social networking applications often have several distributed data sources, i.e., big data is collected in separate data center locations or even across the Internet.

In these cases, determining the most efficient architecture for running extract, transform, load (ETL) jobs over the entire data set becomes nontrivial.

Hadoop provides the Hadoop Distributed File System (HDFS) for storage and, in Hadoop 2.0, YARN (Yet Another Resource Negotiator) as the resource management framework. ETL jobs use the MapReduce programming model and run on the YARN framework.

Though these are adequate for a single data center, there is a clear need to enhance them for multi-data center environments. In these instances, it is important to provide active-active redundancy for YARN and HDFS across data centers. Here’s why:

1. Bringing compute to data

Hadoop’s architectural advantage lies in bringing compute to data. Providing active-active (global) YARN accomplishes that on top of global HDFS across data centers.

2. Minimizing traffic on a WAN link

There are three types of data analytics schemes:

a) High-throughput analytics where the output data of a MapReduce job is small compared to the input.

Examples include weblogs, word count, etc.

b) Zero-throughput analytics where the output data of a MapReduce job is equal to the input. A sort operation is a good example of a job of this type.

c) Balloon-throughput analytics where the output is much larger than the input.

For high-throughput analytics, local YARN can crunch the data and use global HDFS to redistribute the results. Keep in mind that this might require another MapReduce job to run on the combined output, however, which can add traffic to the WAN link. Global YARN mitigates this even further by distributing the computational load.
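
To make scheme (a) concrete, here is a minimal word-count job written against the standard Hadoop 2 MapReduce API. It is the textbook example rather than anything WANdisco-specific, and the input and output paths are placeholders passed on the command line; the combiner is what keeps the job output (and hence any cross-data center transfer) small relative to its input.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation shrinks shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would be submitted to YARN with something like: hadoop jar wordcount.jar WordCount /input /output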

Last but not least, fault tolerance is required at the server, rack, and data center levels. Passive redundancy solutions can cause days of downtime before operations resume. Active-active redundant YARN and HDFS provide a zero-downtime solution for both MapReduce jobs and data.

To summarize, it is imperative for mission-critical applications to have active-active redundancy for HDFS and YARN. Not only does this protect data and prevent downtime, but it also allows big data to be processed at an accelerated rate by taking advantage of the aggregated CPU, network and storage of all servers across data centers.

– Gurumurthy Yeleswarapu, Director of Engineering, WANdisco

Continuous Availability versus High Availability

Wikipedia’s page on Continuous Availability is available here:

http://en.wikipedia.org/wiki/Continuous_availability

A quick perusal tells us that High Availability can be ‘accomplished by providing redundancy or quickly restarting failed components’. This is very different from ‘Continuously Available’ systems that enable continuous operation through planned and unplanned outages of one or more components.

As large global organizations move from using Hadoop for batch storage and retrieval to mission critical real-time applications where the cost of even one minute of downtime is unacceptable, mere high availability will not be enough.

Solutions such as HDFS NameNode High Availability (NameNode HA) that come with Apache Hadoop 2.0 and Hadoop distributions based on it are subject to downtimes of 5 to 15 minutes. In addition, NameNode HA is limited to a single data center, and only one NameNode can be active at a time, creating both a performance and an availability bottleneck. Deployments that incorporate WANdisco Non-Stop Hadoop are not subject to any downtime, regardless of whether a single NameNode server or an entire data center goes offline. There is no need for maintenance windows with Non-Stop Hadoop, since you can simply bring down the NameNode servers one at a time and perform your maintenance operations. The remaining active NameNodes continue to support real-time client applications as well as batch jobs.

The business advantages of Continuously Available, multi-data center aware systems are well known to IT decision makers. Here are some examples that illustrate how both real-time and batch applications can benefit and how new use cases can be supported:

  • A Batch Big Data DAG is a chain of applications wherein the output of a preceding job is used as the input to a subsequent job. At companies such as Yahoo, these DAGs take six to eight hours to run, and they are run every day. Fifteen minutes of NameNode downtime may cause one of these jobs to fail. As a result of this single failure, the entire DAG may not run to completion, creating delays that can last many hours.
  • Global clickstream analysis applications that enable businesses to see and respond to customer behavior or detect potentially fraudulent activity in real-time.
  • A web site or service built to use HBase as a backing store will be down if the HDFS underlying HBase goes down when the NameNode fails. This is likely to result in lost revenue and erode customer goodwill.  Non-Stop Hadoop eliminates this risk.
  • Continuous Availability systems such as WANdisco Non-Stop Hadoop are administered with fewer staff. This is because failure of one out of five NameNodes is not an emergency event. It can be dealt with by staff during regular business hours. Significant cost savings in staffing can be achieved since Continuously Available systems do not require 24×7 sysadmin staff. In addition, in a distributed multi-data center environment, Non-Stop Hadoop can be managed from one location.
  • There are no passive or standby servers or data centers that essentially sit idle until disaster strikes.  All servers are active and provide full read and write access to the same data at every location.

See a demo of Non-Stop Hadoop for Cloudera and Non-Stop Hadoop for Hortonworks in action and read what leading industry analysts like Gartner’s Merv Adrian have to say about the need for continuous Hadoop availability.

 


WANdisco’s February Roundup

This month, we launched a trio of innovative Hadoop products: the world’s first production-ready distro; a wizard-driven management dashboard; and the first and only 100% uptime solution for Apache Hadoop.


We started this string of Big Data announcements with WANdisco Distro (WDD), a fully tested, free-to-download version of Apache Hadoop 2. WDD is based on the most recent Hadoop release, includes all the latest fixes and undergoes the same rigorous quality assurance process as our enterprise software solutions.

This release paved the way for our enterprise Hadoop solutions, and we announced the WANdisco Hadoop Console (WHC) shortly after. WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

The final product in this month’s Big Data announcements was WANdisco Non-Stop NameNode. Our patented technology makes Non-Stop NameNode the first and only 100% uptime solution for Hadoop, offering a string of benefits for enterprise users:

  • Automatic failover and recovery
  • Automatic continuous hot backup
  • Removes single point of failure
  • Eliminates downtime and data loss
  • Every NameNode server is active and supports simultaneous read and write requests
  • Full support for HBase

To support the needs of the Apache Hadoop community, we’ve also launched a dedicated Hadoop forum. At this forum, users can get advice on their Hadoop installation and connect with fellow users, including WANdisco’s core Apache Hadoop developers Dr. Konstantin V. Shvachko, Dr. Konstantin Boudnik, and Jagane Sundar.


For Apache Subversion users, we announced the next webinars in our free training series:

  • Subversion Administration – everything you need to administer a Subversion development environment
  • Introduction to SmartSVN – a short introduction to how Subversion works with the SmartSVN graphical client
  • Checkout Command – how to get the most out of the checkout command, and the meaning of the various error messages you may encounter
  • Commit Command – learn more about this command, including diff usage, working with unversioned files and changelists
  • Introduction to Git – everything a new user needs to get started with Git
  • Hook Scripts – how to use hook scripts to automate tasks such as email notifications, backups and access control
  • Advanced Hook Scripts – an advanced look at hook scripts, including using a config file with hook scripts and passing data to hook scripts

We’ve also announced an ongoing series of free webinars that demonstrate how you can overcome the challenges of scaling Subversion from an administrative, business and IT perspective, and get the most out of deploying Subversion in an enterprise environment. These ‘Scaling Subversion for the Enterprise’ webinars will be conducted by our expert Solution Architect three times a week (Tuesday, Wednesday and Thursday) at 10.00am PST/1.00pm EST, and will cover:

  • The latest technology that can help you overcome the limitations and risks associated with globally distributed deployments
  • Answers to your business-specific questions
  • How to solve critical issues
  • The free resources and offers that can help solve your business challenges

WANdisco Announces Free Online Hadoop Training Webinars

We’re excited to announce a series of free one-hour online Hadoop training webinars, starting with four sessions in March and April. Time will be allowed for audience Q&A at the end of each session.

Wednesday, March 13 at 10:00 AM Pacific, 1:00 PM Eastern

“A Hadoop Overview” will cover Hadoop, from its history to its architecture, as well as:

  • HDFS, MapReduce, and HBase
  • Public and private cloud deployment options
  • Highlights of common business use cases and more

March 27, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: A Deep Dive” covers Hadoop misconceptions (not all clusters include thousands of machines) and:

  • Real world Hadoop deployments
  • Review of major Hadoop ecosystem components including: Oozie, Flume, Nutch, Sqoop and others
  • In-depth look at HDFS and more

April 10, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: A MapReduce Tutorial” will cover MapReduce at a deep technical level and will highlight:

  • The history of MapReduce
  • Logical flow of MapReduce
  • Rules and types of MapReduce jobs
  • De-bugging and testing
  • How to write foolproof MapReduce jobs

April 24, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: HBase In-Depth” will provide a deep technical review of HBase and cover:

  • Its flexibility, scalability and components
  • Schema samples
  • Hardware requirements and more

Space is limited so click here to register right away!

Hadoop Console: Simplified Hadoop for the Enterprise

We are pleased to announce the latest release in our string of Big Data announcements: the WANdisco Hadoop Console (WHC). WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

This innovative Big Data solution offers enterprise users:

  • An S3-enabled HDFS option for securely migrating from Amazon’s public cloud to a private in-house cloud
  • An intuitive UI that makes it easy to install, monitor and manage Hadoop clusters
  • Full support for Amazon S3 features (metadata tagging, data object versioning, snapshots, etc.)
  • The option to implement WHC in either a virtual or physical server environment
  • Improved server efficiency
  • Full support for HBase

“WANdisco is addressing important issues with this product including the need to simplify Hadoop implementation and management as well as public to private cloud migration,” said John Webster, senior partner at storage research firm Evaluator Group. “Enterprises that may have been on the fence about bringing their cloud applications private can now do so in a way that addresses concerns about both data security and costs.”

More information about WHC is available from the WANdisco Hadoop Console product page. Interested parties can also download our Big Data whitepapers and datasheets, or request a free trial of WHC. Professional support for our Big Data solutions is also available.

This latest Big Data announcement follows the launch of our WANdisco Distro, the world’s first production-ready version of Apache Hadoop 2.

Introduction to Hadoop 2, with a simple tool for generating Hadoop 2 config files

Introduction to Hadoop 2
Core Hadoop 2 consists of the distributed filesystem HDFS and the compute framework YARN.

HDFS is a distributed filesystem that can be used to store anywhere from a few gigabytes to many petabytes of data. It is distributed in the sense that it utilizes a number of slave servers, ranging from three to a few thousand, to store and serve files.
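
As a small, hedged illustration of what storing and serving a file looks like from a client’s point of view, here is a sketch using the Hadoop FileSystem API. The fs.defaultFS value reuses the example NameNode hostname from the configuration example later in this post; substitute your own cluster’s address.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example NameNode address only; point this at your own cluster.
        conf.set("fs.defaultFS", "hdfs://hdfs.hadoop.wandisco:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hello.txt");

        // Write a small file: the NameNode records the metadata, the block contents go to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}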

YARN is the compute framework for Hadoop 2. It manages the distribution of compute jobs to the very same slave servers that store the HDFS data. This keeps compute jobs from having to reach out over the network to access the data stored in HDFS.

Naturally, traditional software written to run on a single server will not work on Hadoop. New software needs to be developed using a special programming paradigm called MapReduce. Hadoop’s native MapReduce framework uses Java; however, Hadoop MR programs can be written in almost any language. Also, higher-level languages such as Pig may be used to write scripts that compile into Hadoop MR jobs.

Hadoop 2 Server Daemons
HDFS: HDFS consists of a master metadata server called the NameNode and a number of slave servers called DataNodes.

The NameNode function is provided by a single Java daemon – the NameNode. The NameNode daemon runs on just one machine – the master. DataNode functionality is provided by a Java daemon called DataNode that runs on each slave server.

Since the NameNode function is provided by a single Java daemon, it turns out to be a Single Point of Failure (SPOF). Open source Hadoop 2 has a number of ways to keep a standby server waiting to take over the function of the NameNode daemon, should the single NameNode fail. All of these standby solutions take 5 to 15 minutes to fail over. While this failover is underway, batch MR jobs on the cluster will fail. Further, an active HBase cluster with a high write load may not necessarily survive the NameNode failover. Our company, WANdisco, has a commercial product called the Non-Stop NameNode that solves this SPOF problem.

Configuration parameters for the HDFS daemons, the NameNode and DataNode, are all stored in a file called hdfs-site.xml. At the end of this blog entry, I have included a simple Java program that generates all the config files necessary for running a Hadoop 2 cluster, including hdfs-site.xml.

YARN:
The YARN framework has a single master daemon called the YARN ResourceManager that runs on a master node, and a YARN NodeManager on each of the slave nodes. Additionally, YARN has a single Proxy server and a single MapReduce job history server. As indicated by the name, the function of the MapReduce job history server is to store and serve a history of the MapReduce jobs that were run on the cluster.

The configuration for all of these daemons is stored in yarn-site.xml.

Daemons that comprise core Hadoop (HDFS and YARN)

Daemon Name | Number | Description | Web Port (if any) | RPC Port(s) (if any)
NameNode | 1 | HDFS metadata server, usually run on the master | 50070 | 8020
DataNode | 1 per slave | HDFS data server, one per slave server | 50075 | 50010 (data transfer), 50020 (block metadata)
ResourceManager | 1 | YARN ResourceManager, usually on a master server | 8088 | 8030 (Scheduler), 8031 (Resource Tracker), 8032 (Resource Manager), 8033 (Admin)
NodeManager | 1 per slave | YARN NodeManager, one per slave | 8042 | 8040 (Localizer), 8041 (NodeManager)
ProxyServer | 1 | YARN Proxy Server | 8034 | –
JobHistory | 1 | MapReduce Job History Server | 19888 | 10020

A Simple Program for generating Hadoop 2 Config files
This is a simple program that generates core-site.xml, hdfs-site.xml, yarn-site.xml and capacity-scheduler.xml. You need to supply this program with the following information:

  1. nnHost: HDFS Name Node Server hostname
  2. nnMetadataDir: The directory on the NameNode server’s local filesystem where the NameNode metadata will be stored – nominally /var/hadoop/name. If you are using our WDD RPMs, note that all of our documentation refers to this directory.
  3. dnDataDir: The directory on each DataNode or slave machine where HDFS data blocks are stored. This location should be on the biggest disk on the DataNode. Nominally /var/hadoop/data. If you are using our WDD RPMs, note that this is the location all of our documentation refers to.
  4. yarnRmHost: YARN Resource Manager hostname
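
For illustration, here is a rough sketch of how such a generator could be written using Hadoop’s own Configuration class. It is not the source of makeconf.jar (that is linked below), it hard-codes the four values above rather than reading them from the command line, and it omits capacity-scheduler.xml for brevity; the property names and the NameNode RPC port are the stock Hadoop 2 ones.

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;

public class MakeConfSketch {
    public static void main(String[] args) throws Exception {
        String nnHost = "hdfs.hadoop.wandisco";     // 1. NameNode hostname
        String nnMetadataDir = "/var/hadoop/name";  // 2. NameNode metadata directory
        String dnDataDir = "/var/hadoop/data";      // 3. DataNode block directory
        String yarnRmHost = "yarn.hadoop.wandisco"; // 4. ResourceManager hostname

        // core-site.xml: where clients find the filesystem.
        Configuration core = new Configuration(false);
        core.set("fs.defaultFS", "hdfs://" + nnHost + ":8020");
        write(core, "core-site.xml");

        // hdfs-site.xml: NameNode and DataNode storage locations.
        Configuration hdfs = new Configuration(false);
        hdfs.set("dfs.namenode.name.dir", nnMetadataDir);
        hdfs.set("dfs.datanode.data.dir", dnDataDir);
        write(hdfs, "hdfs-site.xml");

        // yarn-site.xml: where NodeManagers and clients find the ResourceManager.
        Configuration yarn = new Configuration(false);
        yarn.set("yarn.resourcemanager.hostname", yarnRmHost);
        write(yarn, "yarn-site.xml");
    }

    private static void write(Configuration conf, String fileName) throws Exception {
        try (OutputStream out = new FileOutputStream(fileName)) {
            conf.writeXml(out); // emits the standard <configuration><property>... XML layout
        }
    }
}

The real program linked below takes these four values on the command line instead of hard-coding them, as shown in the example run.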

Download the following jar: makeconf
Here is an example run:

$ java -classpath makeconf.jar com.wandisco.hadoop.makeconf.MakeHadoopConf nnHost=hdfs.hadoop.wandisco nnMetadataDir=/var/hadoop/name dnDataDir=/var/hadoop/data yarnRmHost=yarn.hadoop.wandisco

Download the source code for this simple program here: makeconf-source

Incidentally, if you are looking for an easy way to get Hadoop 2, try our free Hadoop distro – the WANdisco Distro (WDD). It is a packaged, tested and certified version of the latest Hadoop 2.

 


Answers to questions from the Webinar of Dec 11, 2012

 Download the webinar slides here.

Question 1: Are there any special considerations or support of Spring technologies for this (i.e. Spring-Data, Spring-Integration, Spring-Batch)?

Answer: We are continuously looking at technologies that make Hadoop easier to use and program. Spring-Data, Spring-Integration and Spring-Batch show promise. When sufficient momentum is gathered by these projects, we will work with the Spring community to include a tested version of these technologies in the AltoStor Appliance.

Question 2: What is the hadoop version underneath the appliance?

Answer: Hadoop 2. We intend to remain close to the latest version of Hadoop at any given point in time, modulo changes needed for bug fixes.

Question 3: Will the pricing model be based on the number of NameNodes or the size of the cluster?

Answer: Pricing decisions have not been made yet. We will announce pricing in the first quarter of 2013.

Question 4: Can you comment on how load balancing is resolved across active nodes? Is there a load balance router concept?

Answer: We do not require/depend on any specialized hardware such as load balancers or NFS filers. By “load balancing” we simply mean that application requests (read or write) can be directed to any NameNode based on its proximity to the client or available resources. Thus NameNodes can share the workload and provide higher overall cluster performance compared to active-standby architecture.

Question 5: How does Active-Active replication impact processing time relative to current Hadoop architectures?

Answer: Active-Active replication will result in load balancing of clients across many NameNodes, i.e. fewer clients will be serviced by each NameNode. Since NameNodes share the workload on a busy cluster, you should expect faster response time for clients. Generally, more active NameNodes can perform a proportionally larger amount of work.



Design of the Hadoop HDFS NameNode: Part 1 – Request processing

NameNode Design Part 1 – Client Request Processing (diagram)

HDFS is Hadoop’s File System. It is a distributed file system in that it uses a multitude of machines to implement its functionality. Contrast that with NTFS, FAT32, ext3, etc., which are all single-machine filesystems.

HDFS is architected such that the metadata, i.e. the information about file names, directories, permissions, etc. is separated from the user data. HDFS consists of the NameNode, which is HDFS’s metadata server, and DataNodes, where user data is stored. There can be only one active instance of the NameNode. A number of DataNodes (a handful to several thousand) can be part of this HDFS served by the single NameNode.

Here is how a client RPC request flows through the Hadoop HDFS NameNode. This pertains to the Hadoop trunk code base as of Dec 2, 2012, i.e. a few months after Hadoop 2.0.2-alpha was released.

The Hadoop NameNode receives requests from HDFS clients in the form of Hadoop RPC requests over a TCP connection. Typical client requests include mkdir, getBlockLocations, create file, etc. Remember – HDFS separates metadata from actual file data, and the NameNode is the metadata server. Hence, these requests are pure metadata requests – no data transfer is involved. The following diagram traces the path of an HDFS client request through the NameNode. The various thread pools used by the system, the locks taken and released by these threads, the queues used, etc. are described in detail in this post.

 

  • As shown in the diagram, a Listener object listens to the TCP port serving RPC requests from the client. It accepts new connections from clients, and adds them to the Server object’s connectionList
  • Next, a number of RPC Reader threads read requests from the connections in connectionList, decode the RPC requests, and add them to the rpc call queue – Server.callQueue.
  • Now, the actual worker threads kick in – these are the Handler threads. They pick up RPC calls and process them; a simplified model of this handler loop is sketched after this list. The processing involves the following:
    • First grab the write lock for the namespace
    • Change the in-memory namespace
    • Write to the in-memory FSEdits log (journal)
  • Now, release the write lock on the namespace. Note that the journal has not been sync’d yet – this means we cannot return success to the RPC client yet
  • Next, each Handler thread calls logSync. Upon returning from this call, it is guaranteed that the logfile modifications have been sync’d to disk. Exactly how this is guaranteed is messy. Here are the details:
    • Every time an edit entry is written to the edits log, a unique txid is assigned to that specific edit. The Handler retrieves this txid and saves it; it is later used to verify whether this specific edit log entry has been sync’d to disk
    • When logSync is called by a Handler, it first checks whether the last sync’d txid is greater than or equal to the txid of the edit just written by the Handler. If the Handler’s edit log txid is less than or equal to the last sync’d txid, then the Handler can mark the RPC call as complete. If the Handler’s edit log txid is greater than the last sync’d txid, then the Handler has to do one of the following things:
      • It has to grab the sync lock and sync all transactions
      • If it cannot grab the sync lock, then it waits 1000ms and tries again in a loop
      • At this point, the log entry for the transaction made by this Handler has been persisted. The Handler can now mark the RPC as complete.
    • Now, the single Responder thread picks up completed RPCs and returns the result of the RPC call to the RPC client. Note that the Responder thread uses NIO to asynchronously send responses back to waiting clients. Hence one thread is sufficient.
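
To summarize the Handler path in code form, here is a deliberately simplified model of the loop described above. The class and member names (Call, callQueue, fsLock, EditLog) are illustrative stand-ins, not the actual NameNode source, and the disk sync is a placeholder.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HandlerModel {

    static class Call {
        final Runnable namespaceMutation; // e.g. mkdir, create
        volatile long editTxId;           // txid of the edit this call produced
        Call(Runnable mutation) { this.namespaceMutation = mutation; }
    }

    private final BlockingQueue<Call> callQueue = new LinkedBlockingQueue<>();     // filled by the Reader threads
    private final BlockingQueue<Call> responseQueue = new LinkedBlockingQueue<>(); // drained by the Responder thread
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();    // namespace lock
    private final EditLog editLog = new EditLog();

    // Each Handler thread runs this loop.
    void handlerLoop() throws InterruptedException {
        while (true) {
            Call call = callQueue.take();          // 1. pick up a decoded RPC call
            fsLock.writeLock().lock();             // 2. grab the namespace write lock
            try {
                call.namespaceMutation.run();      // 3. change the in-memory namespace
                call.editTxId = editLog.logEdit(); // 4. append to the in-memory edit log
            } finally {
                fsLock.writeLock().unlock();       // 5. release the lock (edit is not yet durable)
            }
            editLog.logSync(call.editTxId);        // 6. block until the edit is on disk
            responseQueue.put(call);               // 7. hand the completed call to the Responder
        }
    }

    static class EditLog {
        private long nextTxId = 1;
        private long lastSyncedTxId = 0;

        synchronized long logEdit() { return nextTxId++; }

        // Block until everything up to and including txid has been flushed.
        synchronized void logSync(long txid) {
            while (lastSyncedTxId < txid) {
                flushToDisk(); // in the real NameNode only one thread syncs while the others wait
            }
        }

        private void flushToDisk() { lastSyncedTxId = nextTxId - 1; /* placeholder for a real fsync */ }
    }
}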

There is one thing about this design that bothers me:

  • The threads that wait for their txid to sync sleep 1000 ms, check, sleep 1000 ms, check, and continue polling. It may make sense to remove this polling mechanism and replace it with a wait/notify mechanism, as sketched below.
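
Here is a sketch of what that alternative could look like, using a hypothetical SyncWaiter helper rather than the actual FSEditLog code: the thread that performs the disk sync calls markSynced after each flush, and Handler threads block in awaitSynced until their own txid is durable, instead of sleeping for 1000 ms and re-checking.

public class SyncWaiter {
    private final Object syncLock = new Object();
    private long lastSyncedTxId = 0;

    // Called by the thread that actually performed the disk sync.
    public void markSynced(long txid) {
        synchronized (syncLock) {
            if (txid > lastSyncedTxId) {
                lastSyncedTxId = txid;
            }
            syncLock.notifyAll(); // wake every Handler waiting for its txid
        }
    }

    // Called by a Handler after writing edit `txid`; returns once it is durable.
    public void awaitSynced(long txid) throws InterruptedException {
        synchronized (syncLock) {
            while (lastSyncedTxId < txid) {
                syncLock.wait(); // woken as soon as a sync completes, no fixed-interval polling
            }
        }
    }
}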

That’s all in this writeup, folks.


WANdisco’s November Roundup

November has seen several big WANdisco announcements: an acquisition, a patent approval, and two prestigious industry awards. Not content with that, we also brought you our usual mix of software updates, community polls and training programs.

First up, we announced our acquisition of AltoStor, a pioneering firm with deep expertise in the Big Data market.

“The AltoStor acquisition will enable WANdisco to launch products quickly into the highly-lucrative Big Data market,” said WANdisco Chairman and CEO David Richards. “Combining our patented technology with AltoStor’s Big Data platform will provide us with a significant competitive advantage and provide organizations with a more reliable approach for managing big data.”

AltoStor founders Dr. Konstantin Shvachko and Jagane Sundar, who were among the core Apache Hadoop creators, will also be joining the WANdisco team as Chief Architect of Big Data and as Chief Technology Officer and Vice President of Engineering, respectively. Welcome to the team!

If you’d like to find out more about Apache Hadoop, Big Data, and how we’ll be applying our own active-active replication technology to these areas, we’ll be holding a webinar on Tuesday, December 11th at 10am PST (1pm EST).

‘WANdisco and Hadoop: The Future of Big Data for the Enterprise’ will be conducted by David Richards, Dr. Konstantin Shvachko and Jagane Sundar.

This 30 minute webinar will cover:

  • The staggering cross-industry growth of Hadoop in the enterprise.
  • How Hadoop’s limitations, including HDFS’s single point of failure, are impacting productivity.
  • How WANdisco’s active-active replication technology will alleviate these issues by adding high-availability to Hadoop, taking a fundamentally different approach to Big Data.

It’s free to attend, but places are limited so register now to avoid disappointment.

If you’ve ever taken a look at our enterprise Subversion products (such as MultiSite, the unique replication, mirroring and clustering solution) you’ll already be familiar with our core replication technology. We have now been awarded a US ‘Distributed computing systems and system components thereof’ patent that covers this technology. The patent states that “unlike conventional solutions, the multi-site computing system architecture does not rely on a central transaction coordinator that is known to be a single-point-of-failure.”

“I am delighted the USPTO has allowed WANdisco’s patent, which we believe to be a breakthrough in the field of data replication,” said David Richards. “This patent validates our leadership in distributed computing and our belief that no other company can achieve active-active replication over a Wide Area Network.”

Remember that if you want to try this unique, patented technology for yourself, you can currently claim a free 15-day trial of Subversion MultiSite.

Continuing with the enterprise announcements, we launched a new enterprise-class, on-demand Subversion training program. We know it’s not always easy to set aside time for enterprise training, so we’ve made it even easier to get the training you need to get the most out of Subversion.

Designed with the enterprise in mind, our step-by-step video modules cover all the crucial SVN topics, including:

  • Basic Operations and Command Line
  • Handling Merge Conflicts
  • Advanced Repository Management
  • SVN Changelists
  • Hook Scripts

Sign up before December 15th to claim one month of Subversion eTraining for free with a 6-month agreement, or 3 months free with a 12-month agreement. You can view the complete course list online, or request a quote.

The Apache Bloodhound team recently announced the first release of this new project, and now we’re pleased to see that the second release of Apache Bloodhound has arrived. Apache Bloodhound 0.2 (Incubating) upgrades to Bootstrap version 2.1 and fixes various issues from the 0.1 release, including bug fixes relating to the new user interface.

Although WANdisco are sponsoring some of the initial committers, one of the Apache Bloodhound project’s core goals is to create a strong developer community around the Trac code base in a vendor-neutral location. If you’re interested in participating in the Apache Bloodhound project, you can learn more at the ‘Getting Involved With Apache Bloodhound’ page.

We were also pleased to announce the release of SmartSVN 7.5. SmartSVN 7.5 is a major step forward for the client, bringing the following benefits to the SmartSVN community:

Streamlined and Simplified, A Better UX

  • New GUI library (SWT) to provide a native look and improved responsiveness
  • A clean and compact branching structure view with an Enhanced Revision Graph
  • See file statuses at a glance
  • No longer forced to create a repository profile (profiles can still be used if preferred)

Little Things, Made Easier

  • Edit properties directly in the Repository Browser
  • Export smaller HTML graphic files with the Export option of Revision Graph
  • Remove, Move, and Copy now operate on multiple selected directories
  • By default, unchanged files are not shown (to find them, use the File Filter)

More Secure

  • The freedom to work offline. When offline, the remote state and transactions are not refreshed automatically – Log and Revision Graph operate from data already stored in the log cache
  • Support for safe password storage with the Plugin-API

Not yet tried SmartSVN? You can claim your 30 day free trial of SmartSVN Professional now.

If you’re a SmartSVN user – or someone interested in getting started with this popular, cross-platform client – we need your input to design our new SmartSVN eTraining modules.

What topics would you like to see in these sessions? Please take a moment to complete our quick poll on what you’d like us to cover in SmartSVN eTraining. Or, if you’d like to see a subject that’s not listed in the poll, please feel free to Contact Us directly.

Finally, we’d like to thank both techMark and Shares Magazine for the two awards we were presented with this month. The first award was presented at the techMark 2012 awards dinner in London in the Emerging Star category, and the second award was at the Shares Awards 2012, in the Best IPO category.

“Our readers who vote for the Shares Award in the Best Initial Public Offering/Fund Raising of the Year category clearly know a good market newcomer when they see one. Prior winners such as Wellstream, Kentz and Waterlogic all have a formidable pedigree and I am delighted to welcome WANdisco to this elite club of winners,” says Russ Mould, editor of Shares Magazine.