AltoStor Blog

Continuous Availability versus High Availability

Wikipedia’s page on Continuous Availability is available here:

http://en.wikipedia.org/wiki/Continuous_availability

A quick perusal tells us that High Availability can be ‘accomplished by providing redundancy or quickly restarting failed components’. This is very different from ‘Continuously Available’ systems that enable continuous operation through planned and unplanned outages of one or more components.

As large global organizations move from using Hadoop for batch storage and retrieval to mission critical real-time applications where the cost of even one minute of downtime is unacceptable, mere high availability will not be enough.

Solutions such as HDFS NameNode High Availability (NameNode HA) that come with Apache Hadoop 2.0 and Hadoop distributions based on it are subject to downtimes of 5 to 15 minutes. In addition, NameNode HA is limited to a single data center, and only one NameNode can be active at a time, creating a performance as well as an availability bottleneck. Deployments that incorporate WANdisco Non-Stop Hadoop are not subject to any downtime, regardless of whether a single NameNode server or an entire data center goes offline. There is no need for maintenance windows with Non-Stop Hadoop, since you can simply bring down the NameNode servers one at a time and perform your maintenance operations. The remaining active NameNodes continue to support real-time client applications as well as batch jobs.

The business advantages of Continuously Available, multi-data center aware systems are well known to IT decision makers. Here are some examples that illustrate how both real-time and batch applications can benefit and new use cases can be supported:

  • A Batch Big Data DAG is a chain of applications wherein the output of a preceding job is used as the input to a subsequent job. At companies such as Yahoo, these DAGs take six to eight hours to run, and they are run every day. Fifteen minutes of NameNode downtime may cause one of these jobs to fail. As a result of this single failure, the entire DAG may not run to completion, creating delays that can last many hours.
  • Global clickstream analysis applications that enable businesses to see and respond to customer behavior or detect potentially fraudulent activity in real-time.
  • A web site or service built to use HBase as a backing store will be down if the HDFS underlying HBase goes down when the NameNode fails. This is likely to result in lost revenue and erode customer goodwill.  Non-Stop Hadoop eliminates this risk.
  • Continuous Availability systems such as WANdisco Non-Stop Hadoop can be administered with fewer staff. This is because the failure of one out of five NameNodes is not an emergency event; it can be dealt with by staff during regular business hours. Significant staffing cost savings can be achieved, since Continuously Available systems do not require 24×7 sysadmin coverage. In addition, in a distributed multi-data center environment, Non-Stop Hadoop can be managed from one location.
  • There are no passive or standby servers or data centers that essentially sit idle until disaster strikes.  All servers are active and provide full read and write access to the same data at every location.

See a demo of Non-Stop Hadoop for Cloudera and Non-Stop Hadoop for Hortonworks in action and read what leading industry analysts like Gartner’s Merv Adrian have to say about the need for continuous Hadoop availability.

 


About Jagane Sundar

Running the SLive Test on WANdisco Distro

The SLive test is a stress test designed to simulate distributed operations and load on the NameNode by utilizing the MapReduce paradigm. It was designed by Konstantin Shvachko and introduced into the Apache Hadoop project in 2010 by him and others. It is now one of the many stress tests we run here at WANdisco when testing our distribution, WANdisco Distro (WDD).

You can read the original paper about how this test works here:
https://issues.apache.org/jira/secure/attachment/12448004/SLiveTest.pdf
You can view the associated Apache JIRA for the introduction of this test here:
https://issues.apache.org/jira/browse/HDFS-708

This post provides a short tutorial on running the SLive test on your own Hadoop 2 cluster with YARN / MapReduce. Before we begin, please make sure you are logged in as the ‘hdfs’ user:

su - hdfs

The first order of business is to become familiar with the parameters supported by the stress test.

The percentage of operation distribution parameters:
-create <num> -delete <num> -rename <num> -read <num>  -append <num> -ls <num> -mkdir <num>

Stress test property parameters:
-blockSize <min,max> -readSize <min,max> -writeSize <min,max> -files <total>

The first set of parameters controls “how many of this kind of operation do you want?” For example, if you want to simulate just a create and delete scenario, with no reads or writes, you would run the test with -create 50 -delete 50 (or any other percentages that add up to 100) and either set the others in that first set to 0 or simply omit them; unspecified operations default to 0.

The second set of parameters controls properties that apply to the entire test: “How many files do you want to create?” and “What are the smallest and largest sizes you want each block in a file to be?” They can be ignored for the most part, except for “-blockSize”. Using the default block size, which is 64 megabytes, may cause your run of the SLive test to take longer. In order to make this a speedy tutorial, we will use small block sizes. Please note that block sizes must be multiples of 512 bytes. We will use 4096 bytes in this tutorial.
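For example, a run that exercises only creates and deletes on small blocks might look like this (a sketch; the percentages and file count are yours to choose, as long as the percentages add up to 100):

hadoop org.apache.hadoop.fs.slive.SliveTest -create 50 -delete 50 -read 0 -append 0 -rename 0 -ls 0 -mkdir 0 -blockSize 4096,4096 -files 1000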

There are other parameters available, but they are not necessary for a basic understanding and run of this stress test. You can refer to the document at the top of this entry if your curiosity about the other parameters is getting the better of you, or you can run:

hadoop org.apache.hadoop.fs.slive.SliveTest --help

The second step is to understand how to run the test. Although it is advised NOT to do this just yet, you can run the test with default parameters by executing the following command:

hadoop org.apache.hadoop.fs.slive.SliveTest

However, since there is no initial data in the cluster, most, if not all, of the operations in the resulting report would be failures. Run the following to initialize the cluster with 10,000 files, all with a tiny 4096-byte block size, in order to achieve a quick run of the SLive test:

hadoop org.apache.hadoop.fs.slive.SliveTest -create 100 -delete 0 -rename 0 -read 0 -append 0 -ls 0 -mkdir 0 -blockSize 4096,4096 -files 10000

On a cluster with 1 NameNode and 3 DataNodes, running this command should take no longer than about 3 to 4 minutes. If it is taking too long, you can try re-running with a lower “-files” value and/or a smaller “-blockSize”.

After you have initialized the cluster with data, you will need to delete the output directory that your previous SLive test run had created:

hadoop fs -rmr /test/slive/slive/output

You will need to do this after every SLive test run; otherwise your next attempt will fail, telling you that the output directory for your requested run already exists.
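Note that -rmr is deprecated on newer Hadoop 2 builds; the equivalent command there is:

hadoop fs -rm -r /test/slive/slive/output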

You can now run the default test, which performs an equal distribution of creates, deletes, reads, and other operations across the cluster:

hadoop org.apache.hadoop.fs.slive.SliveTest

Or you can specify the parameters of your own choosing and customize your own load to stress test with! That is the purpose of the test, after all. Enjoy!
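As a sketch, here is one way you might shape a read-heavy workload against the files created earlier (the mix is arbitrary; any percentages that add up to 100 will do):

hadoop org.apache.hadoop.fs.slive.SliveTest -create 10 -read 60 -ls 30 -delete 0 -append 0 -rename 0 -mkdir 0 -blockSize 4096,4096 -files 10000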

Here are the results obtained from our own in-house run of the SLive test, for comparison with your own. I ran the following command after initialization:

hadoop org.apache.hadoop.fs.slive.SliveTest -blockSize 4096,4096 -files 10000

And I got the following results:

13/02/11 11:00:36 INFO slive.SliveTest: Reporting on job:
13/02/11 11:00:36 INFO slive.SliveTest: Writing report using contents of /test/slive/slive/output
13/02/11 11:00:36 INFO slive.SliveTest: Report results being placed to logging output and to file /home/hdfs/part-0000
13/02/11 11:00:36 INFO slive.ReportWriter: Basic report for operation type AppendOp
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “bytes_written” = 4317184
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “failures” = 1
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “files_not_found” = 365
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “milliseconds_taken” = 59813
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “op_count” = 1420
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “successes” = 1054
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “bytes_written” = 0.067 MB/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “op_count” = 23.741 operations/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “successes” = 17.622 successes/sec
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Basic report for operation type CreateOp
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “bytes_written” = 1490944
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “failures” = 1056
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “milliseconds_taken” = 19029
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “op_count” = 1420
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “successes” = 364
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “bytes_written” = 0.053 MB/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “op_count” = 74.623 operations/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “successes” = 19.129 successes/sec
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Basic report for operation type DeleteOp
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “failures” = 365
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “milliseconds_taken” = 4905
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “op_count” = 1420
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “successes” = 1055
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “op_count” = 289.501 operations/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “successes” = 215.087 successes/sec
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Basic report for operation type ListOp
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “dir_entries” = 1167
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “files_not_found” = 1145
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “milliseconds_taken” = 536
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “op_count” = 1420
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “successes” = 275
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “dir_entries” = 2177.239 directory entries/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “op_count” = 2649.254 operations/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “successes” = 513.06 successes/sec
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Basic report for operation type MkdirOp
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “milliseconds_taken” = 5631
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “op_count” = 1420
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “successes” = 1420
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “op_count” = 252.175 operations/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “successes” = 252.175 successes/sec
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Basic report for operation type ReadOp
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “bad_files” = 1
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “bytes_read” = 25437917184
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “chunks_unverified” = 0
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “chunks_verified” = 3188125200
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “files_not_found” = 342
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “milliseconds_taken” = 268754
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “op_count” = 1420
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “successes” = 1077
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “bytes_read” = 90.265 MB/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “op_count” = 5.284 operations/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “successes” = 4.007 successes/sec
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Basic report for operation type RenameOp
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “failures” = 1165
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “milliseconds_taken” = 1130
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “op_count” = 1420
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “successes” = 255
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “op_count” = 1256.637 operations/sec
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “successes” = 225.664 successes/sec
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Basic report for operation type SliveMapper
13/02/11 11:00:36 INFO slive.ReportWriter: ————-
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “milliseconds_taken” = 765862
13/02/11 11:00:36 INFO slive.ReportWriter: Measurement “op_count” = 9940
13/02/11 11:00:36 INFO slive.ReportWriter: Rate for measurement “op_count” = 12.979 operations/sec
13/02/11 11:00:36 INFO slive.ReportWriter: ————-


About Plamen Jeliazkov

Introduction to Hadoop 2, with a simple tool for generating Hadoop 2 config files

Introduction to Hadoop 2
Core Hadoop 2 consists of the distributed filesystem HDFS and the compute framework YARN.

HDFS is a distributed filesystem that can be used to store anywhere from a few gigabytes to many petabytes of data. It is distributed in the sense that it utilizes a number of slave servers, ranging from three to a few thousand, to store and serve files.

YARN is the compute framework for Hadoop 2. It manages the distribution of compute jobs to the very same slave servers that store the HDFS data, so that jobs can access HDFS data locally rather than reaching out over the network.

Naturally, traditional software written to run on a single server will not work on Hadoop. New software needs to be developed using a programming paradigm called MapReduce. Hadoop’s native MapReduce framework uses Java; however, Hadoop MR programs can be written in almost any language, as the sketch below illustrates. Also, higher-level languages such as Pig may be used to write scripts that compile into Hadoop MR jobs.
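For instance, Hadoop Streaming lets you plug in any executable as the mapper and reducer. A minimal sketch (the streaming jar location and the input/output paths are assumptions that depend on your installation):

# count input lines using ordinary shell tools as the mapper and reducer
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hdfs/input -output /user/hdfs/output \
  -mapper /bin/cat -reducer '/usr/bin/wc -l'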

Hadoop 2 Server Daemons
HDFS:  HDFS consists of a master metadata server called the NameNode, and a number of slave servers called DataNodes.

The NameNode function is provided by a single Java daemon, the NameNode, which runs on just one machine: the master. DataNode functionality is provided by a Java daemon called the DataNode that runs on each slave server.

Since the NameNode function is provided by a single Java daemon, it turns out to be a Single Point of Failure (SPOF). Open source Hadoop 2 has a number of ways to keep a standby server waiting to take over the function of the NameNode daemon, should the single NameNode fail. All of these standby solutions take 5 to 15 minutes to fail over. While this failover is underway, batch MR jobs on the cluster will fail. Further, an active HBase cluster with a high write load may not necessarily survive the NameNode failover. Our company, WANdisco, has a commercial product called the Non-Stop NameNode that solves this SPOF problem.

Configuration parameters for the HDFS daemons, the NameNode and DataNode, are all stored in a file called hdfs-site.xml. At the end of this blog entry, I have included a simple Java program that generates all the config files necessary for running a Hadoop 2 cluster, including an hdfs-site.xml.
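As a point of reference, a minimal hand-written hdfs-site.xml covering just the two directory properties described later in this post could be generated like this (a sketch; the property names are the standard Hadoop 2 ones, and the paths are the nominal locations mentioned in the parameter list below):

cat > hdfs-site.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- where the NameNode keeps its metadata -->
  <property><name>dfs.namenode.name.dir</name><value>/var/hadoop/name</value></property>
  <!-- where each DataNode keeps its block data -->
  <property><name>dfs.datanode.data.dir</name><value>/var/hadoop/data</value></property>
</configuration>
EOF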

YARN:
The YARN framework has a single master daemon called the YARN ResourceManager that runs on a master node, and a YARN NodeManager on each of the slave nodes. Additionally, YARN has a single proxy server and a single MapReduce job history server. As indicated by the name, the function of the MapReduce job history server is to store and serve a history of the MapReduce jobs that were run on the cluster.

The configuration for all of these daemons is stored in yarn-site.xml.
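A correspondingly minimal yarn-site.xml might be sketched the same way (the yarn.resourcemanager.address property is an assumption on my part, chosen to match the Resource Manager RPC port in the table below; check your distribution's defaults before relying on it, and substitute your own ResourceManager hostname):

cat > yarn-site.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- where NodeManagers and clients find the ResourceManager -->
  <property><name>yarn.resourcemanager.address</name><value>yarnrm.example.com:8032</value></property>
</configuration>
EOF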

Daemons that comprise core Hadoop (HDFS and YARN)

Daemon Name     | Number      | Description                                      | Web Port | RPC Port(s)
NameNode        | 1           | HDFS metadata server, usually run on the master  | 50070    | 8020
DataNode        | 1 per slave | HDFS data server, one per slave server           | 50075    | 50010 (data transfer RPC), 50020 (block metadata RPC)
ResourceManager | 1           | YARN ResourceManager, usually on a master server | 8088     | 8030 (Scheduler RPC), 8031 (Resource Tracker RPC), 8032 (Resource Manager RPC), 8033 (Admin RPC)
NodeManager     | 1 per slave | YARN NodeManager, one per slave                  | 8042     | 8040 (Localizer), 8041 (NodeManager RPC)
ProxyServer     | 1           | YARN proxy server                                | 8034     | -
JobHistory      | 1           | MapReduce job history server                     | 19888    | 10020
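Once a cluster is up, the web ports above give a quick way to confirm that each daemon is alive; a rough sketch (replace master with your own master hostname):

jps                                  # on each node: lists the running Hadoop Java daemons
curl -s http://master:50070/ >/dev/null && echo "NameNode web UI is up"
curl -s http://master:8088/ >/dev/null && echo "ResourceManager web UI is up"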

A Simple Program for generating Hadoop 2 Config files
This is a simple program that generates core-site.xml, hdfs-site.xml, yarn-site.xml and capacity-scheduler.xml. You need to supply this program with the following information:

  1. nnHost: HDFS Name Node Server hostname
  2. nnMetadataDir: The directory on the NameNode server’s local filesystem where the NameNode metadata will be stored; nominally /var/hadoop/name. If you are using our WDD RPMs, note that all documentation will refer to this directory.
  3. dnDataDir: The directory on each DataNode or slave machine where HDFS data blocks are stored. This location should be on the biggest disk on the DataNode; nominally /var/hadoop/data. If you are using our WDD RPMs, note that this is the location that all of our documentation will refer to.
  4. yarnRmHost: YARN Resource Manager hostname

Download the following jar: makeconf
Here is an example run:

$ java -classpath makeconf.jar com.wandisco.hadoop.makeconf.MakeHadoopConf nnHost=hdfs.hadoop.wandisco nnMetadataDir=/var/hadoop/name dnDataDir=/var/hadoop/data yarnRmHost=yarn.hadoop.wandisco

Download the source code for this simple program here: makeconf-source

Incidentally, if you are looking for an easy way to get Hadoop 2, try our free Hadoop distro – the WANdisco Distro (WDD). It is a packaged, tested and certified version of the latest Hadoop 2.

 


About Jagane Sundar

WANdisco’s January Roundup

Happy new year from WANdisco!

This month we have plenty of news related to our move into the exciting world of Apache Hadoop. Not only did another veteran Hadoop developer join our ever-expanding team of experts, but we announced a partnership with Cloudera, and WANdisco CEO David Richards and Vice President of Big Data Jagane Sundar met with Wikibon’s lead analyst for an in-depth discussion on active-active big data deployments.


You may have heard that the AltoStor founders and core Apache Hadoop creators, Dr. Konstantin Shvachko and Jagane Sundar, joined WANdisco last year. Now we’re excited to announce that another veteran Hadoop developer has joined our Big Data team. Dr. Konstantin Boudnik is the founder of Apache Bigtop and was a member of the original Hadoop development team. Dr. Boudnik will act as WANdisco’s Director of Big Data Distribution, leading WANdisco’s Big Data team in the rollout of certified Hadoop binaries and a graphical user interface. Dr. Boudnik will ensure quality control and stability of the Hadoop open source code.

“In building our Big Data team, we’ve been seeking Hadoop visionaries and authorities who demonstrate leadership and originality,” said David Richards, CEO of WANdisco. “Konstantin Boudnik clearly fits that description, and we’re honored that he’s chosen to join our team. He brings great professionalism and distribution expertise to WANdisco.”

Also on the Big Data front, CEO David Richards and Vice President of Big Data Jagane Sundar spoke to Wikibon’s lead analyst about our upcoming solution for active-active big data deployments.

“We can take our secret sauce, which is this patented active-active replication algorithm, and apply it to Hadoop to make it bullet-proof for enterprise deployments,” said David Richards. “We have something coming out called the Non-Stop NameNode … that will ensure that Hadoop stays up 100% of the time, guaranteed.”

Watch the ‘WANdisco Hardening Hadoop for the Enterprise’ video in full, or read Wikibon’s Lead Big Data Analyst Jeff Kelly’s post about the upcoming Non-Stop NameNode.

Capping off our Big Data announcements, WANdisco is now an authorized member of the Cloudera Connect Partner Program. This program focuses on accelerating the innovative use of Apache Hadoop for a range of business applications.

“We are pleased to welcome WANdisco into the Cloudera Connect network of valued service and solution providers for Apache Hadoop and look forward to working together to bring the power of Big Data to more enterprises,” said Tim Stevens, Vice President of Business and Corporate Development at Cloudera. “As a trusted partner, we will equip WANdisco with the tools and resources necessary to support, manage and innovate with Apache Hadoop-based solutions.”

As a member of Cloudera Connect, we are proud to add Cloudera’s extensive tools, use case insight and resources to the expertise of our core Hadoop committers.

You can learn more about this program at Cloudera’s website and by reading the official announcement in full.


On the Subversion side of things, the SVN community announced their first release of 2013, with an update to the Subversion 1.6 series.

Apache Subversion 1.6.20 includes some useful fixes for 1.6.x users:

  • Vary: header added to GET responses
  • Fix fs_fs to cleanup after failed rep transmission.
  • A fix for an assert with SVNAutoVersioning in mod_dav_svn

Full details on Apache Subversion 1.6.20 can be found in the Changes file. As always, the latest, certified binaries can be downloaded for free from our website, along with the latest release of the Subversion 1.7 series.

How many developers can a single Apache Subversion server support? In his recent blog post, James Creasy discussed how DConE replication technology can support Subversion deployments of 20,000 or more developers.

“While impressive, DConE is not magic,” writes James. “What DConE delivers is a completely fault tolerant, mathematically ideal coordination engine for performing WAN connected replication.”

In another new DConE post, James explains where DConE fits into the ‘software engineering vs. computer science’ debate, and warns that “in the world of distributed computing, you’d better come armed with deep knowledge of the science.”

Finally, WANdisco China, a Wholly Foreign Owned Enterprise (WFOE), was announced this month, following WANdisco’s first deal in China with major telecommunications equipment company Huawei. From this new office we’ll be providing sales, training, consulting and 24/7 customer support for WANdisco software solutions sold in China, and we are excited to be expanding our activities within this region.

“We view China as an emerging and high growth market for WANdisco,” said David Richards. “It was a natural progression to establish our Chengdu office as a WFOE and ramp up staff there, as so many companies have operations in the country. We are excited about this announcement and look forward to the growth opportunities this brings.”

To keep up with all the latest WANdisco news, be sure to follow us on Twitter.

 

Design of the Hadoop HDFS NameNode: Part 1 – Request processing

[Diagram: NameNode Design Part 1 – Client Request Processing]

HDFS is Hadoop’s file system. It is a distributed file system in that it uses a multitude of machines to implement its functionality. Contrast that with NTFS, FAT32, ext3, etc., which are all single-machine filesystems.

HDFS is architected such that the metadata, i.e. the information about file names, directories, permissions, etc., is separated from the user data. HDFS consists of the NameNode, which is HDFS’s metadata server, and DataNodes, where user data is stored. There can be only one active instance of the NameNode. A number of DataNodes (a handful to several thousand) can be part of the HDFS cluster served by this single NameNode.

Here is how a client RPC request flows through the Hadoop HDFS NameNode. This pertains to the Hadoop trunk code base as of Dec 2, 2012, i.e. a few months after Hadoop 2.0.2-alpha was released.

The Hadoop NameNode receives requests from HDFS clients in the form of Hadoop RPC requests over a TCP connection. Typical client requests include mkdir, getBlockLocations, create file, etc. Remember that HDFS separates metadata from actual file data, and that the NameNode is the metadata server. Hence, these requests are pure metadata requests; no data transfer is involved. The following diagram traces the path of an HDFS client request through the NameNode. The various thread pools used by the system, the locks taken and released by these threads, the queues used, etc. are described in detail below.

 

  • As shown in the diagram, a Listener object listens to the TCP port serving RPC requests from the client. It accepts new connections from clients, and adds them to the Server object’s connectionList
  • Next, a number of RPC Reader threads read requests from the connections in connectionList, decode the RPC requests, and add them to the rpc call queue – Server.callQueue.
  • Now, the actual worker threads kick in – these are the Handler threads. The threads pick up RPC calls and process them. The processing involves the following:
    • First grab the write lock for the namespace
    • Change the in-memory namespace
    • Write to the in-memory FSEdits log (journal)
  • Now, release the write lock on the namespace. Note that the journal has not been sync’d yet – this means we cannot return success to the RPC client yet
  • Next, each Handler thread calls logSync. Upon returning from this call, it is guaranteed that the logfile modifications have been synced to disk. Exactly how this is guaranteed is messy. Here are the details:
    • Every time an edit entry is written to the edits log, a unique txid is assigned to that specific edit. The Handler retrieves this txid and saves it; it is later used to verify whether this specific edit log entry has been synced to disk
    • When logSync is called by a Handler, it first checks to see if the last sync’d log edit entry is greater than the txid of the edit log just finished by the Handler. If the Handler’s edit log txid is less than the last sync’d txid, then the Handler can mark the RPC call as complete. If the Handler’s edit log txid is greater than the last sync’d txid, then the Handler has to do one of the following things:
      • It has to grab the sync lock and sync all transactions
      • If it cannot grab the sync lock, then it waits 1000ms and tries again in a loop
      • At this point, the log entry for the transaction made by this Handler has been persisted. The Handler can now mark the RPC as complete.
    • Now, the single Responder thread picks up completed RPCs and returns the result of the RPC call to the RPC client. Note that the Responder thread uses NIO to asynchronously send responses back to waiting clients. Hence one thread is sufficient.

There is one thing about this design that bothers me:

  • The threads that wait for their txid to sync sleep 1000ms, check, sleep 1000ms, check, and continue polling in this fashion. It may make sense to replace this polling mechanism with an actual wait/notify mechanism.

That’s all in this writeup, folks.


About Jagane Sundar

We Just Acquired Big Data / Hadoop Company AltoStor.

I believe that the combination of AltoStor’s expertise and WANdisco’s patented active-active replication technology is the proverbial ‘marriage-made-in-heaven’.  The AltoStor acquisition will enable us to launch products into the highly lucrative Big Data / Hadoop market early next year.

So how lucrative is this market?  Well, I recently read an interesting article from Wikibon, “Big Data: Hadoop, Business Analytics and Beyond”, that reiterated what we already knew. Big Data isn’t a ‘might happen next year’ thing. No, it’s here today. To steal a quote from the excellent article: “Make no mistake: Big Data is the new definitive source of competitive advantage across all industries. Enterprises and technology vendors that dismiss Big Data as a passing fad do so at their peril and, in our opinion, will soon find themselves struggling to keep up with more forward-thinking rivals…. For those organizations that understand and embrace the new reality of Big Data, the possibilities for new innovation, improved agility, and increased profitability are nearly endless.”

So why did we acquire AltoStor?

First off, the founders (Dr. Konstantin Shvachko and Jagane Sundar) are really good guys. This was an ‘old-school’ acquisition. An initial deal was struck very quickly with a handshake. Both sides could see very clear value, so doing the deal was incredibly simple. I love the fact that they wanted stock as consideration; that’s real proof that they see significant long-term value creation rather than short-term gain.

For WANdisco, Big Data is a big market. We can see clear synergy between our unique, patented active-active replication technology and the creation of Hadoop high availability (HA) solutions. This is one of the reasons why AltoStor was so attractive to us. They have unique knowledge in the space:

  • The AltoStor founders have been working on Hadoop since its inception in 2006 at Yahoo. Konstantin was part of Doug Cutting’s team that created and implemented Hadoop. His focus was massive scale, performance and availability of Hadoop, developing the Hadoop Distributed File System (HDFS). He then went on to eBay, where he implemented Hadoop.

  • The founders are intimately aware of the problem WANdisco is planning to solve around Hadoop HA, and hence understand the value of the solution for large-scale Big Data replication over a Wide Area Network.

  • Finally, AltoStor is developing a product, slated for release in Q1 2013, that will significantly simplify the deployment of Hadoop / Big Data for enterprises.

Following the acquisition we now expect to have products available in the first quarter of 2013.  That’s very good news.

There’s going to be a lot of noise in this space over the coming months and years. Many will jump on the ‘bandwagon’, making all sorts of lavish claims to be ‘the big data this’ and ‘the big data that’. It always happens in hype-cycles like this. In reality, most are just companies repurposing existing legacy products and slapping a new label on them. This is NOT one of those. We are building from the ground up with unique knowledge and information that only a few in the world have (the amount of brain-power in the room during some of the early design meetings was frightening!).

In 2005, when we founded WANdisco, my peers would tell me that active-active replication over a Wide Area Network was impossible. Well, we’ve got hundreds of thousands of users using the technology for core development every day. Applying this technology to Hadoop is groundbreaking, and I think it will change the way the industry views network storage. We like making the impossible possible at WANdisco.


About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world’s most promising technology companies. Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of AltoStor, which accelerated the development of WANdisco’s first products for the Big Data market. The majority of WANdisco’s core technology is now produced out of the company’s flourishing software development base in David’s hometown of Sheffield, England, and in Belfast, Northern Ireland. David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on the London Stock Exchange (WAND:LSE), raising over £24m to drive business growth. With over 15 years' executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in Enterprise Software and is recognised as an industry leader in Enterprise Application Integration and its standards. David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail. Specialties: IPOs, startups, entrepreneurship, CEO, visionary, investor, board member, advisor, venture capital, offshore development, financing, M&A.

Setting up Amanda to backup a linux box to Amazon S3

Here is my experience with compiling and configuring Amanda to back up to the cloud (S3).

First, compiling and installing Amanda

  1. Create a user ‘amanda’ in group ‘backup’. Log in as this user
  2. Download amanda-3.3.2.tar.gz and untar it
  3. As user amanda, run ‘./configure --enable-s3-device’ and ‘make’
  4. As user root, run ‘make install’
  5. Now, this next instruction is basically a hack to work around some packaging bugs in Amanda:
    • chgrp backup /usr/local/sbin/am*

Next, creating a backup configuration called ‘MyConfig’ for backing up /etc to a virtual tape in the directory /amanda

  1. As root, create a directory /amanda, and chown it to amanda:backup
  2. Run the following commands as user amanda:
    • mkdir -p /amanda/vtapes/slot{1,2,3,4}
    • mkdir -p /amanda/holding
    • mkdir -p /amanda/state/{curinfo,log,index}
  3. As root, create a directory ‘/usr/local/etc/amanda/MyConfig’, and chown it to amanda:backup.
  4. As user amanda, create a file /usr/local/etc/amanda/MyConfig/amanda.conf with the following contents:

 

org "MyConfig"
infofile "/amanda/state/curinfo"
logdir "/amanda/state/log"
indexdir "/amanda/state/index"
dumpuser "amanda"

tpchanger "chg-disk:/amanda/vtapes"
labelstr "MyData[0-9][0-9]"
autolabel "MyData%%" EMPTY VOLUME_ERROR
tapecycle 4
dumpcycle 3 days
amrecover_changer "changer"

tapetype "TEST-TAPE"
define tapetype TEST-TAPE {
  length 100 mbytes
  filemark 4 kbytes
}

define dumptype simple-gnutar-local {
    auth "local"
    compress none
    program "GNUTAR"
}

holdingdisk hd1 {
    directory "/amanda/holding"
    use 50 mbytes
    chunksize 1 mbyte
}
  5. Create a file /usr/local/etc/amanda/MyConfig/disklist with the following contents:
localhost /etc simple-gnutar-local
  6. As root, make a directory /usr/local/var/amanda/gnutar-lists, and change ownership of this directory to amanda:backup
  7. Next, as user amanda, run the program ‘amcheck MyConfig’. It should finish its report with ‘Client check: 1 host checked in 2.064 seconds.  0 problems found.’
  8. Next, run ‘amdump MyConfig’. The program should exit with return value 0
  9. Finally, run ‘amreport MyConfig’ to get the status of your backup effort!

That’s it for testing a local backup to the virtual tape /amanda. Now, on to trying this with S3 as the tape backend.

  • Create a new config directory /usr/local/etc/amanda/MyS3Config and, within it, a file amanda.conf with the following contents:
org "MyS3Config"
infofile "/amanda/state/curinfo"
logdir "/amanda/state/log"
indexdir "/amanda/state/index"
dumpuser "amanda"

# amazonaws S3
device_property "S3_ACCESS_KEY" "YOUR_AMAZON_KEY_ID_HERE"
device_property "S3_SECRET_KEY" "YOUR_AMAZON_SECRET_KEY_HERE"
device_property "S3_SSL" "YES"
tpchanger "chg-multi:s3:YOUR_NAME-backups/MyS3Config1/slot-{01,02,03,04,05,06,07,08,09,10}"
changerfile  "s3-statefile"

tapetype S3
define tapetype S3 {
    comment "S3 Bucket"
    length 10240 gigabytes # Bucket size 10TB
}

define dumptype simple-gnutar-local {
    auth "local"
    compress none
    program "GNUTAR"
}

holdingdisk hd1 {
    directory "/amanda/holding"
    use 50 mbytes
    chunksize 1 mbyte
}
  • Enter your S3 Access Key ID and Secret Key, and change YOUR_NAME to your name in the above config example
  • Now run the command
    • amlabel MyS3Config MyS3Config-1 slot 1
  •  If this works fine, then run the following command to create 9 more slots
    • for i in 2 3 4 5 6 7 8 9 10; do amlabel MyS3Config MyS3Config-$i slot $i; done;
  • Finally run the following to verify:
    • amdevcheck MyS3Config s3:jaganes-backups/MyS3Config1/slot-10
  • Now you are ready to run a backup. Define one for the /etc directory on your machine by adding the file /usr/local/etc/amanda/MyS3Config/disklist with the following line:
    • localhost /etc simple-gnutar-local
  • Try running a backup using the command:
    • amdump MyS3Config
  • Check the return value of the command, and if that is 0, check the contents of the backup bucket on S3 using S3fox or some such tool.

That’s all, folks. You have created a backup of your Linux machine’s /etc directory on Amazon’s S3 cloud storage service.
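If you want to double-check from the Amanda side as well, the standard reporting tools work the same way against the S3 configuration (a sketch; both commands ship with Amanda 3.x):

amreport MyS3Config
amadmin MyS3Config find localhost /etc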

It would be interesting to try this with Amazon’s Glacier service, which is much cheaper than S3.


About Jagane Sundar

Running hdfs from the freshly built 0.23.3 source tree in distributed mode

Download the Hadoop Creation Scripts used in this post here

Here is what I did to run the newly built Hadoop in distributed mode using three CentOS VMs. I have made these VMs, and the scripts that I used for creating the Hadoop cluster, available for download in this blog post.

First, setting up three VMs for running Hadoop

  • Install ESXi on a suitable machine. I used a 4 core machine with 8GB of RAM and two 300GB disks
  • Create three 32 bit CentOS 6.3 VMs. Each VM has 32GB of disk space, and 2GB of RAM. To the CentOS installer, I specified ‘Basic Server’ as the installation type.
  • The root password is altostor
  • I assigned IP addresses 10.0.0.175, 10.0.0.176 and 10.0.0.177 to the three virtual machines. I assigned the hostnames master.altostor.net, slave0.altostor.net and slave1.altostor.net to these machines and created a hosts file.
  • I set up passwordless ssh so that the ‘root’ user can ssh from master.altostor.net to slave0 and slave1 without typing in a password. The web is full of guides for this, but in brief:
    •  Turn off selinux on the three VMs by editing /etc/selinux/config and setting SELINUX=disabled.
    • Run ‘ssh-keygen -t rsa’ on the master
    • Create a /root/.ssh directory on slave0 and slave1 and set its permissions to 700
    • scp the file /root/.ssh/id_rsa.pub to slave0 and slave1 as /root/.ssh/authorized_keys. Set permissions of authorized_keys to 644. On the master node itself, copy the id_rsa.pub as authorized_keys.
    • From master, test that you can ssh into ‘slave0’, ‘slave1’ and into ‘master’ itself
    • Create a group hadoop and a user hdfs in this new group on all three machines

Next, creating the Hadoop cluster. Download the archive with the scripts and config files necessary for installing Hadoop from the link at the top of this post, unpack it on the master, drop in the Hadoop binary hadoop-0.23.3.tar.gz, and run ‘./create-hadoop.sh’. The script does the following (a few sanity checks you can run afterwards are sketched after the list):

  • master setup:
    • kill all java processes
    • delete data directory, pid directory and log directory
    • Copy config files (config file templates are included with the script tar file) and customize
    • Create the slaves file with slave0 and slave1 in it.
    • formats the HDFS filesystem
    • starts up the namenode and creates /user in hdfs. Note that the NameNode java process is running as the linux user root
  • slave setup
    • kills all running java processes
    • removes hadoop data, pid and log directory
    • scp hadoop binaries directory over from master to slave
    • starts up the hadoop DataNode process as linux user root
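Once create-hadoop.sh finishes, a few quick sanity checks (a sketch; run as root on the master, from wherever the script unpacked the hadoop-0.23.3 directory):

jps                            # on the master: expect a NameNode process
ssh slave0 jps                 # on each slave: expect a DataNode process
./bin/hdfs dfsadmin -report    # lists the live DataNodes and their capacity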

 

 


About Jagane Sundar

Running hdfs from the freshly built hadoop 0.23.3 in pseudo distributed mode

So, here’s what I did to run the freshly built hadoop 0.23.3 bits in pseudo distributed mode (this is the mode where each of the hadoop daemons runs in its own process in a single physical/virtual machine).

First I configured passwordless ssh back into the same machine (localhost). I needed to turn off selinux on this CentOS 6.3 VM in order to accomplish that. Seems like selinux is working very hard to make CentOS/Redhat completely unusable. I edited /etc/selinux/config and changed the line SELINUX=enforcing to SELINUX=disabled. Reboot, and then ‘ssh-keygen’ and then ‘ssh-copy-id -i ~/.ssh/id_rsa.pub jagane@localhost’

Now, I untarred the file <src_root>/hadoop-dist/target/hadoop-0.23.3.tar.gz into /opt, then created the directories /opt/hadoop-0.23.3/conf, /opt/nn and /opt/dn, and created the following files in /opt/hadoop-0.23.3/conf.

/opt/hadoop-0.23.3/conf/core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:8020</value>
 </property>
</configuration>

/opt/hadoop-0.23.3/conf/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
 <name>dfs.replication</name>
 <value>1</value>
 </property>
 <property>
 <name>dfs.namenode.name.dir</name>
 <value>/opt/nn</value>
 </property>
 <property>
 <name>dfs.datanode.data.dir</name>
 <value>/opt/dn</value>
 </property>
</configuration>

/opt/hadoop-0.23.3/conf/hadoop-env.sh

export JAVA_HOME=/usr/java/java
export HADOOP_HOME=/opt/hadoop-0.23.3
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf/
export YARN_CONF_DIR=${HADOOP_HOME}/conf/

Now run the following command to format hdfs

$ (cd /opt/hadoop-0.23.3; ./bin/hdfs namenode -format)

Next, startup the namenode as follows:

$ (cd /opt/hadoop-0.23.3; ./sbin/hadoop-daemon.sh --config /opt/hadoop-0.23.3/conf start namenode)

Finally, start up the datanode

$ (cd /opt/hadoop-0.23.3; ./sbin/hadoop-daemon.sh --config /opt/hadoop-0.23.3/conf start datanode)

At this point, running the command jps will show you the datanode and the namenode running.
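To go one step further than jps, you can run a quick smoke test against the new filesystem (a sketch using the same install prefix and config directory as above):

$ (cd /opt/hadoop-0.23.3; ./bin/hdfs --config /opt/hadoop-0.23.3/conf dfs -mkdir /tmp)
$ (cd /opt/hadoop-0.23.3; ./bin/hdfs --config /opt/hadoop-0.23.3/conf dfs -put /etc/hosts /tmp/)
$ (cd /opt/hadoop-0.23.3; ./bin/hdfs --config /opt/hadoop-0.23.3/conf dfs -ls /tmp)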

That’s all for now.


About Jagane Sundar

Compiling Hadoop 0.23.3 from source

Just a quick blog entry – here’s what I did to compile hadoop 0.23.3 from source.

  • Install CentOS 6.3 in a VM using VMware Player. I used a VM with 100GB of disk space and 1GB of memory. More memory will probably help.
  • Download the source bundle, and untar it
  • yum groupinstall "Development Tools"
  • change uid for the user ‘jagane’ from 500 to 1000
  • yum install cmake
  • yum install openssl
  • yum install openssl-devel
  • Download the Google protobuf sources; then ./configure, make, make install.
  • install the latest maven
  • mvn clean install package -Pdist -Dtar -Pnative -DskipTests=true

Look for the built tar.gz files in hadoop-dist/target. That’s all.
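A quick way to confirm the build actually produced the distribution tarball (the exact file name may differ slightly depending on the version you built):

ls -lh hadoop-dist/target/hadoop-0.23.3.tar.gz
tar tzf hadoop-dist/target/hadoop-0.23.3.tar.gz | head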


About Jagane Sundar