Tag Archive for 'Big Data'

Big Data ETL Across Multiple Data Centers

Scientific applications, weather forecasting, click-stream analysis, web crawling, and social networking applications often have several distributed data sources, i.e., big data is collected in separate data center locations or even across the Internet.

In these cases, the most efficient architecture for running extract, transform, load (ETL) jobs over the entire data set becomes nontrivial.

Hadoop provides the Hadoop Distributed File System (HDFS) for storage and YARN (Yet Another Resource Negotiator) for cluster resource management in Hadoop 2.0. ETL jobs written in the MapReduce programming model run on the YARN framework.

Though these are adequate for a single data center, there is a clear need to enhance them for multi-data center environments. In these instances, it is important to provide active-active redundancy for YARN and HDFS across data centers. Here’s why:

1. Bringing compute to data

Hadoop’s architectural advantage lies in bringing compute to data. Providing active-active (global) YARN accomplishes that on top of global HDFS across data centers.

2. Minimizing traffic on a WAN link

There are three types of data analytics schemes:

a) High-throughput analytics where the output data of a MapReduce job is small compared to the input. Examples include weblog analysis and word count.

b) Zero-throughput analytics where the output data of a MapReduce job is equal to the input. A sort operation is a good example of a job of this type.

c) Balloon-throughput analytics where the output is much larger than the input.

Local YARN can crunch the data and use global HDFS to redistribute the results for high-throughput analytics. Keep in mind, however, that this might require another MapReduce job to run on the output, which can add traffic to the WAN link. Global YARN mitigates this further by distributing the computational load.
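The three categories above can be told apart by the ratio of output bytes to input bytes. The following is a minimal sketch and not part of any Hadoop API: the function names, the 10% tolerance band and the "one copy per remote data center" estimate are assumptions chosen purely for illustration.

```python
def classify_job(input_bytes, output_bytes, tolerance=0.1):
    """Classify a MapReduce job by its output-to-input size ratio.

    The category names follow the article; the tolerance band around
    a ratio of 1.0 is a hypothetical choice.
    """
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    ratio = output_bytes / input_bytes
    if ratio < 1 - tolerance:
        return "high-throughput"
    if ratio > 1 + tolerance:
        return "balloon-throughput"
    return "zero-throughput"


def wan_bytes_for_redistribution(output_bytes, remote_data_centers):
    """Rough estimate: the output is copied once to each remote data center."""
    return output_bytes * remote_data_centers
```

Under these assumptions, a word count that reduces 1,000 bytes of input to 10 bytes of output is classified as high-throughput, and redistributing that output to two remote data centers costs roughly 20 bytes of WAN traffic.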

Last but not least, fault tolerance is required at the server, rack, and data center levels. Passive redundancy solutions can cause days of downtime before resuming. Active-active redundant YARN and HDFS provide zero-downtime solutions for MapReduce jobs and data.

To summarize, it is imperative for mission-critical applications to have active-active redundancy for HDFS and YARN. Not only does this protect data and prevent downtime, but it also allows big data to be processed at an accelerated rate by taking advantage of the aggregated CPU, network and storage of all servers across data centers.

– Gurumurthy Yeleswarapu, Director of Engineering, WANdisco

Application Specific Data? It’s So 2013

Looking back at the past 10 years of software, the word ‘boring’ comes to mind. The buzzwords were things like ‘web services’ and ‘SOA’. CIOs loved the promise of these things, but they could not deliver. The idea of build once and reuse everywhere really was the ‘nirvana’.

Well, it now seems like we can do all of that stuff.

As I’ve said before, Big Data is not a great name because it implies that all we are talking about is a big database with tons of data. Actually that’s only part of the story. Hadoop is the new enterprise applications platform. The key word there is platform. If you could have a single general-purpose data store that could service ‘n’ applications, then the whole notion of database design is over. Think about the new breed of apps on a cell phone, the social media platforms and web search engines. Most of these do this today: they store data in a general-purpose, non-specific data store that is then used by a wide variety of applications. The new phrase for this data store is a ‘data lake’, implying a large quantum of ever-growing and changing data stored without any specific structure.

Talking to a variety of CIOs recently, I found they are very excited by the prospect of both amalgamating data so it can be used and bringing into play data that previously could not be used, such as unstructured data in a wide variety of formats like Word documents and PDF files. This also means the barriers to entry are low. Many people believe that adopting Hadoop requires a massive re-skilling of the workforce. It does, but not in the way most people think. Actually getting the data into Hadoop is the easy bit (‘data ingestion’ is the new buzzword). It’s not like the old relational database days, where you first had to model the data using data normalization techniques and then use ETL to get the data into a usable format. With a data lake you simply set up a server cluster and load the data. Creating a data model and using ETL is simply not required.
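As a rough illustration of loading data first and interpreting it later (often called schema-on-read), here is a small sketch. The in-memory list standing in for the data lake, the record contents and the function names are all hypothetical; a real deployment would store files in HDFS rather than in a Python list.

```python
import json

# A stand-in for the data lake: raw records stored as-is, with no schema.
data_lake = []


def ingest(raw_record):
    """Store the record exactly as received; no modelling or ETL step."""
    data_lake.append(raw_record)


def read_field(field):
    """Each application applies its own structure when it reads."""
    values = []
    for raw in data_lake:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON; another application may still use it
        if isinstance(record, dict) and field in record:
            values.append(record[field])
    return values


ingest('{"user": "alice", "page": "/home"}')
ingest('free-form text from a PDF')
ingest('{"user": "bob", "amount": 42}')
print(read_field("user"))  # prints ['alice', 'bob']
```

Nothing is rejected at load time; the free-form record simply stays in the store until some application knows how to interpret it.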

The real transformation and re-skilling is in application development. Applications are moving to data – today, in a client-server world, it’s the other way around. We have seen this type of re-skilling before, like moving from COBOL to object-oriented programming.

In the same way that client-server technology disrupted mainframe computer systems, big data will disrupt client-server. We’re already seeing this in the market today. It’s no surprise that the most successful companies in the world today (Google, Amazon, Facebook, etc.) are all actually big data companies. This isn’t a ‘might be’; it’s already happened.

About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world’s most promising technology companies. Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of Altostor, which accelerated the development of WANdisco’s first products for the Big Data market. The majority of WANdisco’s core technology is now produced out of the company’s flourishing software development base in David’s hometown of Sheffield, England and in Belfast, Northern Ireland. David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on the London Stock Exchange (WAND:LSE), raising over £24m to drive business growth. With over 15 years' executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in Enterprise Software and is recognised as an industry leader in Enterprise Application Integration and its standards. David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail. Specialties: IPOs, startups, entrepreneurship, CEO, visionary, investor, board member, advisor, venture capital, offshore development, financing, M&A

Free Webinar: Enterprise-Enabling Hadoop for the Data Center

We’re pleased to announce that WANdisco will be co-hosting a free Apache Hadoop webinar with Tony Baer, Ovum’s lead Big Data analyst. Ovum is an independent analyst and consultancy firm specializing in the IT and telecommunications industries.

This webinar, ‘Big Data – Enterprise-Enabling Hadoop for the Data Center’, will cover the key issues of availability, performance and scalability and how Apache Hadoop is evolving to meet these requirements.

“This webinar will discuss the importance of availability, performance and scalability,” said Ovum’s Tony Baer. “Ovum believes that for Hadoop to become successfully adopted in the enterprise, that it must become a first class citizen with IT and the data center. Availability, performance and scalability are key issues, and also where there is significant innovation occurring. We’ll discuss how the Hadoop platform is evolving to meet these requirements.”

Topics include:

  • How Hadoop is becoming a first class, enterprise-hardened technology for the data center
  • Hadoop components and the role of reliability and performance in those components
  • Disaster recovery challenges faced by globally distributed organizations and how replication technology is crucial to business continuity
  • The importance of seamless Hadoop migration from the public cloud to private clouds, especially for organizations that require secure 24/7 access with real-time performance

‘Big Data – Enterprise-Enabling Hadoop for the Data Center’ will be held on Tuesday, April 30th at 10:00 am Pacific / 1:00 pm Eastern. Register for this free webinar here.

WANdisco’s March Roundup

Following the recent issuance of our “Distributed computing systems and system components thereof” patent, which covers the fundamentals of active-active replication over a Wide Area Network, we’re excited to announce the filing of three more patents. These patents involve methods, devices and systems that enhance security, reliability, flexibility and efficiency in the field of distributed computing and will have significant benefits for users of our Hadoop Big Data product line.

“Our team continues to break new ground in the field of distributed computing technology,” said David Richards, CEO for WANdisco. “We are proud to have some of the world’s most talented engineers in this field working for us and look forward to the eventual approval of these most recent patent applications. We are particularly excited about their application in our new Big Data product line.”

Our Big Data product line includes Non-Stop NameNode, WANdisco Hadoop Console and WANdisco Distro (WDD).

This month, we also welcomed Bas Nijjer, who built CollabNet UK from startup to multimillion dollar recurring revenue, to the WANdisco team. Bas Nijjer has a proven track record of increasing customer wins, accelerating revenue and providing customer satisfaction, and he takes on the role of WANdisco Sales Director, EMEA.

“Bas is an excellent addition to our team, with great insight on developing and strengthening sales teams and customer relationships as well as enterprise software,” said David Richards. “His expertise and familiarity with EMEA and his results-oriented attitude will help strengthen the WANdisco team and increase sales and renewals. We are pleased to have him join us.”

If joining the WANdisco team interests you, visit our Careers page for all the latest employment opportunities.

We’ve also posted lots of new content at the WANdisco blog. Users of SmartSVN, our cross-platform graphical Subversion client, can find out how to get even more out of their installation with our ‘Performing a Reverse Merge in SmartSVN’ and ‘Backing Up Your SmartSVN Data’ tutorials. For users running the latest 7.5.4 release of SmartSVN, we’ve put together a deep dive into the fixes and new functionality in this release with our ‘What’s New in SmartSVN 7.5.4?’ post. If you haven’t tried SmartSVN yet, you can claim your free trial of this release by visiting http://smartsvn.com/download

We also have a new post from James Creasy, WANdisco’s Senior Director of Product Management, where he takes a closer look at the “WAN” in “WANdisco”:

“We’ve all heard about the globalization of the world economy. Every globally relevant company is now highly dependent on highly available software, and that software needs to be equally global. However, most systems that these companies rely on were architected with a single machine in mind. These machines were accessed over a LAN (local area network) by mostly co-located teams.

All that changed, starting in the 1990’s with widespread adoption of outsourcing. The WAN computing revolution had begun in earnest.”

You can read “What’s in a name, WANdisco?” in full now.

Also at the blog we address the hot topic of ‘Is Subversion Ready for the Enterprise?’ And, if you need more information on the challenges and available solutions for deploying Subversion in an enterprise environment, be sure to sign up for our free-to-attend ‘Scaling Subversion for the Enterprise’ sessions. Taking place a few times a week, these webinars cover limitations and risks related to globally distributed SVN deployments, as well as free resources and live demos to help you overcome them. Take advantage of the opportunity to get answers to your business-specific questions and live demos of enterprise-class SVN products.

WANdisco Files Three New Patents with USPTO

We are pleased to announce the filing of three new patents with the United States Patent and Trademark Office (USPTO) related to distributed computing.

These three innovations involve methods, devices and systems that enhance security, reliability, flexibility and efficiency in the field of distributed computing. The patents are expected to have significant benefits for users of our new Hadoop Big Data product line.

“Our team continues to break new ground in the field of distributed computing technology,” said David Richards, CEO for WANdisco. “We are proud to have some of the world’s most talented engineers in this field working for us and look forward to the eventual approval of these most recent patent applications. We are particularly excited about their application in our new Big Data product line.”

Our Big Data product line includes Non-Stop NameNode, which turns the NameNode into an active-active shared-nothing cluster, and the comprehensive wizard-driven management dashboard ‘WANdisco Hadoop Console.’ We also offer a free-to-download, fully-tested and production-ready version of Apache Hadoop 2. Visit the WANdisco Distro (WDD) to learn more.

This news comes after we announced the issuance of our “Distributed computing systems and system components thereof” patent, which covers the fundamentals of active-active replication over a Wide Area Network.

WANdisco’s February Roundup

This month, we launched a trio of innovative Hadoop products: the world’s first production-ready distro; a wizard-driven management dashboard; and the first and only 100% uptime solution for Apache Hadoop.

We started this string of Big Data announcements with WANdisco Distro (WDD), a fully tested, free-to-download version of Apache Hadoop 2. WDD is based on the most recent Hadoop release, includes all the latest fixes and undergoes the same rigorous quality assurance process as our enterprise software solutions.

This release paved the way for our enterprise Hadoop solutions, and we announced the WANdisco Hadoop Console (WHC) shortly after. WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

The final product in this month’s Big Data announcements was WANdisco Non-Stop NameNode. Our patented technology makes WANdisco Non-Stop NameNode the first and only 100% uptime solution for Hadoop, and offers a string of benefits for enterprise users:

  • Automatic failover and recovery
  • Automatic continuous hot backup
  • Removes single point of failure
  • Eliminates downtime and data loss
  • Every NameNode server is active and supports simultaneous read and write requests
  • Full support for HBase

To support the needs of the Apache Hadoop community, we’ve also launched a dedicated Hadoop forum. At this forum, users can get advice on their Hadoop installation and connect with fellow users, including WANdisco’s core Apache Hadoop developers Dr. Konstantin V. Shvachko, Dr. Konstantin Boudnik, and Jagane Sundar.

For Apache Subversion users, we announced the next webinars in our free training series:

  • Subversion Administration – everything you need to administer a Subversion development environment
  • Introduction to SmartSVN – a short introduction to how Subversion works with the SmartSVN graphical client
  • Checkout Command – how to get the most out of the checkout command, and the meaning of the various error messages you may encounter
  • Commit Command – learn more about this command, including diff usage, working with unversioned files and changelists
  • Introduction to Git – everything a new user needs to get started with Git
  • Hook Scripts – how to use hook scripts to automate tasks such as email notifications, backups and access control
  • Advanced Hook Scripts – an advanced look at hook scripts, including using a config file with hook scripts and passing data to hook scripts

We’ve also announced an ongoing series of free webinars, which demonstrate how you can overcome the challenges of deploying Subversion in an enterprise environment from an administrative, business and IT perspective, and get the most out of your deployment. These ‘Scaling Subversion for the Enterprise’ webinars will be conducted by our expert Solution Architect three times a week (Tuesday, Wednesday and Thursday) at 10:00 am PST / 1:00 pm EST, and will cover:

  • The latest technology that can help you overcome the limitations and risks associated with globally distributed deployments
  • Answers to your business-specific questions
  • How to solve critical issues
  • The free resources and offers that can help solve your business challenges

WANdisco Announces Free Online Hadoop Training Webinars

We’re excited to announce a series of free one-hour online Hadoop training webinars, starting with four sessions in March and April. Time will be allowed for audience Q&A at the end of each session.

Wednesday, March 13 at 10:00 AM Pacific, 1:00 PM Eastern

“A Hadoop Overview” will cover Hadoop, from its history to its architecture as well as:

  • HDFS, MapReduce, and HBase
  • Public and private cloud deployment options
  • Highlights of common business use cases and more

March 27, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: A Deep Dive” covers Hadoop misconceptions (not all clusters include thousands of machines) and:

  • Real world Hadoop deployments
  • Review of major Hadoop ecosystem components including: Oozie, Flume, Nutch, Sqoop and others
  • In-depth look at HDFS and more

April 10, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: A MapReduce Tutorial” will cover MapReduce at a deep technical level and will highlight:

  • The history of MapReduce
  • Logical flow of MapReduce
  • Rules and types of MapReduce jobs
  • Debugging and testing
  • How to write foolproof MapReduce jobs

April 24, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: HBase In-Depth” will provide a deep technical review of HBase and cover:

  • Its flexibility, scalability and components
  • Schema samples
  • Hardware requirements and more

Space is limited so click here to register right away!

WANdisco Non-Stop NameNode Removes Hadoop’s Single Point of Failure

We’re pleased to announce the release of the WANdisco Non-Stop NameNode, the only 100% uptime solution for Apache Hadoop. Built on our Non-Stop patented technology, Hadoop’s NameNode is no longer a single point of failure, delivering immediate and automatic failover and recovery whenever a server goes offline, without any downtime or data loss.

“This announcement demonstrates our commitment to enterprises looking to deploy Hadoop in their production environments today,” said David Richards, President and CEO of WANdisco. “If the NameNode is unavailable, the Hadoop cluster goes down. With other solutions, a single NameNode server actively supports client requests and complex procedures are required if a failure occurs. The Non-Stop NameNode eliminates those issues and also allows for planned maintenance without downtime. WANdisco provides 100% uptime with unmatched scalability and performance.”

Additional benefits of Non-Stop NameNode include:

  • Every NameNode server is active and supports simultaneous read and write requests.
  • All servers are continuously synchronized.
  • Automatic continuous hot backup.
  • Immediate and automatic recovery after planned or unplanned outages, without the need for administrator intervention.
  • Protection from “split-brain”, where the backup server becomes active before the active server is completely offline, which can result in data corruption.
  • Full support for HBase.
  • Works with Apache Hadoop 2.0 and CDH 4.1.
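
One common, general way to guard against split-brain is to require agreement from a strict majority of servers before any server acts as active. The sketch below illustrates that idea only; it is not a description of how Non-Stop NameNode is implemented, and the function name and vote counts are hypothetical.

```python
def may_act_as_active(votes_received, cluster_size):
    """Return True only if a strict majority of the cluster agrees.

    Two disjoint groups of servers cannot both hold a strict majority,
    so two sides of a network partition cannot both become active.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return votes_received > cluster_size // 2
```

In a five-server cluster split three to two by a network failure, only the side with three servers passes this check; in a four-server cluster split evenly, neither side does.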

“Hadoop was not originally developed to support real-time, mission critical applications, and thus its inherent single point of failure was not a major issue of concern,” said Jeff Kelly, Big Data Analyst at Wikibon. “But as Hadoop gains mainstream adoption, traditional enterprises rightly are looking to Hadoop to support both batch analytics and mission critical apps. With WANdisco’s unique Non-Stop NameNode approach, enterprises can feel confident that mission critical applications running on Hadoop, and specifically HBase, are not at risk of data loss due to a NameNode failure because, in fact, there is no single NameNode. This is a major step forward for Hadoop.”

You can learn more about the Non-Stop NameNode at the product page, where you can also claim your free trial.

If you’d like to get first-hand experience of the Non-Stop NameNode and are attending the Strata Conference in Santa Clara this week, you can find us at booth 317, where members of the WANdisco team will be doing live demos of Non-Stop NameNode throughout the event.

WANdisco Announces Non-Stop Hadoop Alliance Partner Program

We’re pleased to announce the launch of our Non-Stop Alliance Partner Program to provide Industry, Technology and Strategic Partners with the competitive advantage required to compete and win in the multi-billion dollar Big Data market.

There are three partner categories:

  • For Industry Partners, which include consultants, system integrators and VARs, the program provides access to customers who are ready to deploy and the competitive advantage necessary to grow business through referral and resale tracks.
  • For Technology Partners, including software and hardware vendors, the program accelerates time-to-market through Non-Stop certification and reference-integrated solutions.
  • For Strategic Partners, the program offers access to WANdisco’s non-stop technology for integrated Hadoop solutions (OEM and MSP).

Founding Partners participating in the Non-Stop Alliance Partner Program include Hyve Solutions and SUSE.

“Hyve Solutions is excited to be a founding member of WANdisco’s Non-Stop Alliance Partner Program,” said Steve Ichinaga, Senior Vice President and General Manager of Hyve Solutions. “The integration of WANdisco and SUSE’s technology with Hyve Solutions storage and server platforms gives enterprise companies an ideal way to deploy Big Data environments with non-stop uptime quickly and effectively into their datacenters.”

“Linux is the undisputed operating system of choice for high performance computing. For two decades, SUSE has provided reliable, interoperable Linux and cloud infrastructure solutions to help top global organizations achieve maximum performance and scalability,” said Michael Miller, vice president of global alliances and marketing, SUSE.  “We’re delighted to be a Non-Stop Strategic Technology Founding Partner to deliver highly available Hadoop solutions to organizations looking to solve business challenges with emerging data technologies.”

Find out more about joining the WANdisco Non-Stop Alliance Partner Program or view our full list of partners.

Hadoop Console: Simplified Hadoop for the Enterprise

We are pleased to announce the latest release in our string of Big Data announcements: the WANdisco Hadoop Console (WHC). WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

This innovative Big Data solution offers enterprise users:

  • An S3-enabled HDFS option for securely migrating from Amazon’s public cloud to a private in-house cloud
  • An intuitive UI that makes it easy to install, monitor and manage Hadoop clusters
  • Full support for Amazon S3 features (metadata tagging, data object versioning, snapshots, etc.)
  • The option to implement WHC in either a virtual or physical server environment
  • Improved server efficiency
  • Full support for HBase

“WANdisco is addressing important issues with this product including the need to simplify Hadoop implementation and management as well as public to private cloud migration,” said John Webster, senior partner at storage research firm Evaluator Group. “Enterprises that may have been on the fence about bringing their cloud applications private can now do so in a way that addresses concerns about both data security and costs.”

More information about WHC is available from the WANdisco Hadoop Console product page. Interested parties can also download our Big Data whitepapers and datasheets, or request a free trial of WHC. Professional support for our Big Data solutions is also available.

This latest Big Data announcement follows the launch of our WANdisco Distro, the world’s first production-ready version of Apache Hadoop 2.