Big Data ETL Across Multiple Data Centers

Scientific applications, weather forecasting, click-stream analysis, web crawling, and social networking applications often have several distributed data sources, i.e., big data is collected in separate data center locations or even across the Internet.

In these cases, the most efficient architecture for running extract, transform, load (ETL) jobs over the entire data set becomes nontrivial.

Hadoop provides the Hadoop Distributed File System (HDFS) for storage and YARN (Yet Another Resource Negotiator) as the programming model in Hadoop 2.0. ETL jobs use the MapReduce programming model to run on the YARN framework.

Though these are adequate for a single data center, there is a clear need to enhance them for multi-data center environments. In these instances, it is important to provide active-active redundancy for YARN and HDFS across data centers. Here’s why:

1. Bringing compute to data

Hadoop’s architectural advantage lies in bringing compute to data. Providing active-active (global) YARN accomplishes that on top of global HDFS across data centers.

2. Minimizing traffic on a WAN link

There are three types of data analytics schemes:

a) High-throughput analytics where the output data of a MapReduce job is small compared to the input.

Examples include weblogs, word count, etc.

b) Zero-throughput analytics where the output data of a MapReduce job is equal to the input. A sort operation is a good example of a job of this type.

c) Balloon-throughput analytics where the output is much larger than the input.

Local YARN can crunch the data and use global HDFS to redistribute for high throughput analytics. Keep in mind that this might require another MapReduce job running on the output results, however, which can add traffic to the WAN link. Global YARN mitigates this even further by distributing the computational load.

Last but not least, fault tolerance is required at the server, rack, and data center levels. Passive redundancy solutions can cause days of downtime before resuming. Active-active redundant YARN and HDFS provide zero-downtime solutions for MapReduce jobs and data.

To summarize, it is imperative for mission-critical applications to have active-active redundancy for HDFS and YARN. Not only does this protect data and prevent downtime, but it also allows big data to be processed at an accelerated rate by taking advantage of the aggregated CPU, network and storage of all servers across datacenters.

– Gurumurthy Yeleswarapu, Director of Engineering, WANdisco

0 Responses to “Big Data ETL Across Multiple Data Centers”

  • No Comments

Leave a Reply