Big data analytics and cloud computing are the two biggest technology trends today. They are two of the four key pillars in IDC's Third Platform definition. With the growing popularity of public clouds such as AWS and Azure, enterprises of all sizes are seriously looking at running any and all workloads in the cloud to achieve business agility, cost savings, and faster innovation.
As enterprises start evaluating analytics on Hadoop, one of the questions they frequently ask is: "Can we run Hadoop in the cloud without any negative trade-offs?" We strongly believe that in the long term Hadoop will live in the hybrid cloud. However, important considerations need to be addressed to make cloud deployment successful.
At a fundamental level, there are three options for running Hadoop analytics in the cloud.
Option 1: Hadoop-as-a-service in the public cloud. Solutions such as Amazon's EMR (Elastic MapReduce) and Azure HDInsight claim to provide a quick and easy way to run MapReduce and Spark without having to manually install the Hadoop cluster in the cloud.
Option 2: Pre-built Hadoop in the public cloud marketplace. Hadoop distributions (such as Cloudera CDH, IBM BigInsights, MapR, Hortonworks HDP) can be launched and run on public clouds such as AWS, Rackspace, MS Azure, and IBM Softlayer.
Option 3: Build your own Hadoop in the public cloud. Public clouds offer Infrastructure-as-a-Service (IaaS) solutions such as AWS EC2 that can be leveraged to build and manage your own Hadoop cluster in the cloud.
All three options can be good for various analytics use cases and strongly complement the on-premises Hadoop deployment on bare metal or private cloud infrastructure. For example, on-premises Hadoop deployment is a good choice where the source data lives on-premises; this option typically requires ETL from various discrete sources and ranges from hundreds of terabytes to several petabytes in capacity. Public-cloud-based Hadoop deployments are a good option when the data is generated in the cloud (e.g., analyzing Twitter data) or for on-demand analytics (provided it is cost-effective, secure, and easy to regularly migrate the source data from on-premises to the cloud).
We strongly believe that ultimately Hadoop will live in the hybrid cloud. There are plenty of on-premises deployments ramping up to hundreds of nodes as the design considerations are better understood. However, we are also starting to see some early-stage customer success stories involving public-cloud-based Hadoop deployments.
When thinking about Hadoop analytics in the cloud, you must answer some key questions. The remainder of this article focuses on what questions to ask when choosing a public cloud solution for Hadoop analytics.
Question 1: Does the cloud infrastructure guarantee consistent performance?
Traditionally, Hadoop provides architecture guidelines for ensuring consistent performance so that business insights can be achieved faster. In a public-cloud deployment, it is important to understand if the cloud provider can guarantee performance and know the associated costs for such performance. You may be sharing the underlying infrastructure with other enterprises. As a result, you will likely have no control over which physical server the VM is using and what other VMs (yours or other customers) are running on the same physical server. You may run into a noisy neighbor problem if VMs from other customers run rogue on the servers on which your VMs are running and there are no quality of service (QoS) policies in place.
Question 2: Does the cloud option offer high availability similar to the on-premises Hadoop deployment?
Hadoop provides architecture guidelines to ensure high availability against hardware failures. In a cloud-based deployment, you typically do not have access to the "rack awareness" information that can be configured in the NameNode for an on-premises deployment. In a cloud deployment, it is therefore important to understand how high availability is maintained, especially for protecting against rack failures.
Question 3: Can the cloud offer flexible and cost-effective scaling of computing resources?
Hadoop requires linear scaling of computing resources as the data you want to analyze keeps growing exponentially (doubling every 18 months according to research). It is important to understand the cost implications of scaling the computing infrastructure and to carefully pick computing-resource capacity. Not all compute nodes are created equal. There is a buffet of different compute nodes: some are heavy on processors while others give you more RAM. Hadoop is a compute-intensive platform and the clusters tend to be heavily utilized, so it is important to pick compute nodes with beefier processors and higher amounts of RAM.
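For context, on-premises rack awareness is commonly set up by pointing the `net.topology.script.file.name` property in core-site.xml at an executable script. Hadoop passes one or more host addresses as arguments and reads back one rack path per address. The following is a minimal sketch of such a script; the addresses and rack names in it are hypothetical placeholders, not values from any real cluster.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack-topology script (host-to-rack map is hypothetical)."""
import sys

# Hypothetical mapping for illustration; a real script would usually load
# this from a site-specific file maintained by the cluster operators.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Rack path returned for any host that is not in the mapping.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts, mapping=RACK_BY_HOST, default=DEFAULT_RACK):
    """Return one rack path per host, in the same order as the input."""
    return [mapping.get(host, default) for host in hosts]


if __name__ == "__main__":
    # Hadoop supplies the hosts as command-line arguments and expects
    # one rack path per line on standard output.
    for rack in resolve_racks(sys.argv[1:]):
        print(rack)
```

In a public cloud the physical placement behind each VM is usually hidden, so there is nothing reliable to put in such a mapping. That is why the availability question has to be answered by the provider.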
Sometimes you can upgrade your compute nodes to a higher level, but that's not true of all the compute nodes that are available. Some premium nodes are only available in certain data centers and therefore cannot be upgraded to if your existing compute nodes are provisioned in another data center.
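To make the scaling question concrete, the sketch below projects data volume under the 18-month doubling assumption mentioned above and estimates how many nodes would be needed. It assumes, purely for illustration, that nodes are sized by storage capacity alone; the starting volume, copy count, and per-node capacity are hypothetical inputs.

```python
import math


def projected_data_tb(initial_tb, months, doubling_months=18):
    """Project data volume assuming it doubles every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


def nodes_needed(data_tb, copies, usable_tb_per_node):
    """Estimate node count when nodes are sized by storage capacity only."""
    return math.ceil(data_tb * copies / usable_tb_per_node)


if __name__ == "__main__":
    # Hypothetical inputs: 100 TB today, 3 copies, 48 TB usable per node.
    for months in (0, 18, 36):
        data = projected_data_tb(100, months)
        print(months, "months:", data, "TB ->", nodes_needed(data, 3, 48), "nodes")
```

Running the same projection against each instance type's price gives a rough view of how the compute bill grows over the planning horizon.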
Question 4: Does the cloud offer guaranteed network bandwidth required for Hadoop operations?
Ensuring high availability and data-loss prevention in Hadoop on DAS storage requires making three copies of the data, replicated over a dedicated 10GbE network between the nodes. Highly reliable and high-performance, enterprise-grade storage options for Hadoop (e.g., NetApp E-Series storage array) require only two copies of data, thereby reducing the network bandwidth requirements by 33 percent compared to the DAS-based deployment.
Hadoop also requires guaranteed network bandwidth for efficiently running MapReduce jobs because the ultimate goal is to finish the MapReduce jobs quickly to achieve business insights. For a cloud-based deployment, you must understand whether a guaranteed network bandwidth option is available and what it will cost. Typically in cloud deployments, the physical network is securely shared between multiple tenants, so it is critical that you understand any QoS policies to ensure network bandwidth availability.
Question 5: Can the cloud offer flexible and cost-effective scaling of storage?
Capacity and performance are the two important storage considerations for scaling Hadoop. From the capacity perspective, traditional Hadoop deployment requires replicating data three times to protect against data loss resulting from physical hardware failures. This means you need to factor in three times the storage capacity requirements in addition to the network bandwidth requirements to create three copies of data.
Find out how the cloud cost-per-GB compares with the on-premises deployment, taking the 3x redundancy factor into consideration. Shared storage options such as the NetApp E-Series [full disclosure: Mr. Joshi works for NetApp] require only two copies of data, and the NetApp FAS with NFS Connector for Hadoop in-place analytics requires only one copy of the data. Therefore, to save money, you could host a shared storage solution in a co-located facility and leverage the compute nodes from public clouds such as AWS and Azure.
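The effect of the copy count on capacity and on cost per usable gigabyte is simple arithmetic, sketched below. The usable volume and the raw price per GB are hypothetical numbers chosen only to show the calculation; substitute real quotes from your provider or storage vendor.

```python
def raw_capacity_tb(usable_tb, copies):
    """Raw capacity required to hold `usable_tb` of data in `copies` copies."""
    return usable_tb * copies


def cost_per_usable_gb(price_per_raw_gb, copies):
    """Effective price of one usable GB once every copy is paid for."""
    return price_per_raw_gb * copies


if __name__ == "__main__":
    # Hypothetical inputs: 500 TB of usable data at 0.03 per raw GB.
    for copies in (3, 2, 1):
        print(
            copies, "copies:",
            raw_capacity_tb(500, copies), "TB raw,",
            round(cost_per_usable_gb(0.03, copies), 4), "per usable GB",
        )
```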
From a performance perspective, Hadoop by design assumes the availability of high-bandwidth storage to quickly perform sequential reads and writes of large-block I/O (64K, 128K, and 256K) to complete the jobs faster. Flash media (e.g. SSDs with high IOPS and ultra-low latency) can help quickly complete Hadoop jobs. Performance can be ensured with on-premises deployment by using several servers with DAS storage (SSDs for performance) or by using the high-bandwidth shared storage options such as the NetApp E-Series/EF All-Flash Array that maintain performance even during disk drive failures and RAID rebuild operations. With cloud-based deployments, you must understand how high bandwidth, low latency, and high IOPS will be guaranteed and know the associated costs.
Question 6: Is a data encryption option available in the cloud-based Hadoop solution?
Data encryption is becoming a key corporate security policy and government mandate, more so for industries such as healthcare. Therefore, find out if the cloud-based Hadoop deployment natively supports data encryption and learn the associated performance, scaling, and pricing implications.
Question 7: How easy and economical is it to get the data in and out of the cloud for Hadoop analytics?
Cloud providers use different pricing models for ingesting, storing, and moving data out of the cloud. Also, there are pricing implications around the WAN network required to transfer the data to the cloud. Not every feature you need is available in every cloud location, so you may need to provision your Hadoop cluster in certain locations -- close or far from your on-premises data center. Get a detailed understanding of how much it will cost to move the data in and out of the appropriate cloud location. Some storage vendors offer data fabric solutions allowing you to easily replicate the data between on-premises and co-location facilities, and leverage compute farms from the public cloud. Such options can help you save money because you are not copying the data in and out of the AWS cloud, yet you still leverage the cloud's compute benefits.
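A back-of-the-envelope estimate of transfer time and egress charges can help frame this question before you request detailed pricing. The sketch below uses decimal units (1 TB = 1,000 GB), and its link speed, utilization, and per-GB price are hypothetical inputs rather than any provider's actual rates.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_tb, link_mbps, utilization=0.8):
    """Days needed to move `data_tb` over a WAN link of `link_mbps` megabits/s.

    `utilization` is the fraction of the link actually available for the copy.
    """
    bits_to_move = data_tb * 1e12 * 8
    effective_bits_per_second = link_mbps * 1e6 * utilization
    return bits_to_move / effective_bits_per_second / SECONDS_PER_DAY


def egress_cost(data_tb, price_per_gb):
    """Charge for moving `data_tb` out of the cloud at a flat per-GB price."""
    return data_tb * 1000 * price_per_gb


if __name__ == "__main__":
    # Hypothetical inputs: 100 TB over a 1 Gbit/s link at 80% utilization,
    # with a flat egress price of 0.09 per GB.
    print(round(transfer_days(100, 1000), 1), "days to transfer")
    print(round(egress_cost(100, 0.09), 2), "egress charge")
```

Real pricing is often tiered and may differ by region and direction, so treat the flat rate here as a simplification.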
Question 8: How easy is it to manage the Hadoop infrastructure in the cloud?
As the deployment scales, you may have hundreds to thousands of nodes in the Hadoop cluster, with petabytes of data that may also require backups and disaster recovery. It becomes important to understand the management implications of scaling your Hadoop cluster in the cloud.
Hadoop-based big data analytics is gaining traction with a majority of early adopters who started their deployment in an on-premises data center. The public cloud has definitely matured today compared to just a few years ago. We are seeing enterprises across different industry segments starting to explore public cloud for Hadoop for a variety of reasons. We hope that this article provided valuable insight into the key things to consider when looking at public cloud for Hadoop. Our belief is that in the long term, Hadoop will live in the hybrid cloud with both the on-premises and cloud deployment options providing value for different use cases.