Big Data in the Cloud using Cloudify

Posted By: Nati Shalom on March 20, 2012


Edd Dumbill wrote an interesting article on O'Reilly Radar covering the current solutions for running Big Data in the cloud:

Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing, which clouds can readily provide.

Edd touched briefly on the role of PaaS in delivering Big Data applications in the cloud:

Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to configure or scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.

Even though Edd’s article covers all the different forms of running Big Data on private and public clouds, it focuses mainly on the public cloud offerings from Amazon, Microsoft, and Google.

In this post, I want to cover more specifically how I see cloud application platforms (PaaS) evolving to support Big Data. I’ll refer specifically to Cloudify, which was designed primarily to support Big Data applications.

 


Background

Most of the PaaS solutions out there started by focusing on simple web application deployments in Ruby, Java, and Node.js. Unlike other PaaS solutions, Cloudify was designed with Big Data as its primary target: we started by supporting popular NoSQL clusters such as Cassandra and MongoDB, as well as providing the equivalent of Amazon RDS through recipes for MySQL. Our goal was to make Big Data deployments a first-class citizen within Cloudify. To this end, when you download Cloudify you'll notice that all of our examples come with pre-integrated Big Data deployments.

There are a couple of reasons that led us to this decision:

  • Managing large data clusters is a core expertise at GigaSpaces

Most people know GigaSpaces for our In-Memory Data Grid solution, XAP (eXtreme Application Platform). Over the past 10 years, as our customer deployments grew substantially, we realized that strong automation and cluster management are as critical as handling data consistency, performance, and latency in our data-grid product. In a large cluster, if something breaks it is practically impossible to handle the failure through manual procedures. For that reason, we developed a lot of IP around automating data cluster deployment, which resulted in a unique self-managed data cluster.

  • Cloudify is a natural evolution of GigaSpaces Data Cluster

When we built Cloudify, it made a lot of sense to take the IP we had developed for managing the GigaSpaces cluster and generalize it so it would fit any other framework. In this way, we could leverage years of experience and development in this area and gain a significant head start.

  • Big Data applications are complex

Big Data applications tend to be fairly complex, which makes them an ideal candidate for the sort of automation and management that Cloudify can offer.

  • Big Data applications have a lot in common with XAP applications

Both need automated data management, failover, and recovery; both fit large cluster deployments; and both share similar partitioning and clustering architectures.

What makes a Big Data platform different from any other application

Most existing orchestration systems were designed to handle stateless processes. Moving data is a completely different ballgame, as you need to think about:

  • Primary and backup dependencies
  • Availability: moving data without losing it
  • Moving processes to the data rather than the other way around
  • Data replication within and across sites

Automating any of these processes through general orchestration tooling such as Chef or RightScale can become a fairly involved and complex undertaking, with many pitfalls in handling edge scenarios, for example the handshake process that is often involved in automating a data-node failure, including split-brain scenarios.

In Cloudify, we were able to carve much of that logic away from the user: for example, Cloudify automatically ensures that primaries and backups won’t run on the same node, or in the same data center in a disaster-recovery setup. You don't need to do anything but tag your cluster nodes with a zone tag.
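To illustrate the kind of invariant involved, here is a minimal sketch (not Cloudify's actual placement code; node and zone names are made up) of a placement rule that refuses to co-locate a partition's primary and backup in the same zone:

```python
# Illustrative sketch only -- not Cloudify's real implementation.
# Nodes carry a zone tag; the backup must land in a different zone
# than the primary, so a single zone failure can't take out both.

def place_backup(primary_node, nodes, zone_of):
    """Return a node for the backup outside the primary's zone."""
    primary_zone = zone_of[primary_node]
    for node in nodes:
        if node != primary_node and zone_of[node] != primary_zone:
            return node
    raise RuntimeError("no node available outside zone %r" % primary_zone)

# Hypothetical cluster: two nodes in zone-1, one in zone-2.
zone_of = {"node-a": "zone-1", "node-b": "zone-1", "node-c": "zone-2"}
backup = place_backup("node-a", ["node-a", "node-b", "node-c"], zone_of)
# backup is "node-c" -- the only node outside zone-1
```

The point is that the operator only supplies the zone tags; the anti-affinity logic itself is the platform's job.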

Managing Big Data applications != Managing Big Data storage

Managing data clusters is one thing; being able to process the data is yet another challenge we need to think about when dealing with application platforms, as I noted in one of my earlier posts.

The main challenge is that the data-processing logic is often managed with completely different scaling, availability, and monitoring tools than those used for the Big Data deployment itself. This silo thinking leads to a whole set of complexities, starting with multiple managers, each detecting failure or scaling events in its own way and quite often ending up in conflict with one another. The sheer number of moving parts makes the entire deployment pretty much a complete mess.

Built-in support for in-memory stream-based processing

In an earlier post, 5 Big Data predictions for 2012, Edd pointed out that streaming data processing is going to be one of the main trends for Big Data. 

Over the next few years, we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.

Because Cloudify is also part of XAP, it already comes with built-in support for streaming Big Data processing. This means that building your own Facebook- or Twitter-like real-time analytics can be as simple as writing small scripts that handle the analytics counters. All the rest, i.e. scalability, availability, automation, cloud portability, management, and monitoring, is covered by Cloudify, as noted in this and this use case.
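As a rough sketch of the kind of counter logic such a user script would contain (hypothetical, not taken from the Cloudify examples), a hashtag counter over a stream of incoming tweets could look like:

```python
# Hypothetical sketch of the analytics-counter script a user writes;
# the platform would handle scaling and failover around it.
from collections import Counter

def update_counters(counters, tweet):
    """Count hashtag occurrences from one incoming tweet."""
    for word in tweet.split():
        if word.startswith("#"):
            counters[word.lower()] += 1
    return counters

counters = Counter()
for tweet in ["#bigdata rocks", "Cloudify + #BigData on any #cloud"]:
    update_counters(counters, tweet)
# counters now holds {"#bigdata": 2, "#cloud": 1}
```

In the real Twitter example, the feed and the counters live in the in-memory cluster rather than a local loop, but the user-supplied logic stays about this small.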

Examples of Big Data applications running on Cloudify

In the list below, I have put together a few references and examples to make it easier for you to get started. The first reference points to a simpler scenario that lets you use Cloudify to deploy your Big Data database as a service. The other three references are full application-stack deployments in which the data-processing and web tiers of the application are managed together with the Big Data database.

Running a Big Data 'Database as a Service'

Cloudify comes with built-in recipes for Cassandra and MongoDB, as well as Solr (a popular search engine), which make it easy to deploy these database clusters on your local machine, in your data center, or on a private/public cloud through a single command. In this way, you can use Cloudify to automate the database tier.

Spring Travel application with Cassandra

This example demonstrates the deployment of a Java-based commerce application with Cassandra as the database.

The example includes recipes that provision the Cassandra database, create a schema, load the data, and then spawn a Tomcat container, automatically injecting the reference to Cassandra. It then shows custom management and monitoring of the application, all through a single command. A recent videocast showing how the travel application works on the HP OpenStack cloud is available here.
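The ordering above (database first, then schema and data, then the container that depends on them) is essentially a dependency-ordered walk over the recipe's steps. As a minimal sketch with made-up step names (Cloudify drives the real lifecycle from its recipes), the idea is:

```python
# Sketch of dependency-ordered deployment; step names are illustrative,
# not the actual Cloudify recipe vocabulary.

def deploy_order(steps, deps):
    """Return steps in an order that satisfies the given dependencies."""
    ordered, done = [], set()

    def visit(step):
        for dep in deps.get(step, []):
            visit(dep)
        if step not in done:
            done.add(step)
            ordered.append(step)

    for step in steps:
        visit(step)
    return ordered

steps = ["tomcat", "load_data", "create_schema", "cassandra"]
deps = {
    "create_schema": ["cassandra"],
    "load_data": ["create_schema"],
    "tomcat": ["load_data"],  # Tomcat gets the Cassandra reference injected
}
order = deploy_order(steps, deps)
# order == ["cassandra", "create_schema", "load_data", "tomcat"]
```

Declaring the dependencies once and letting the platform derive the order is what makes "a single command" possible.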

Pet Clinic example with MongoDB

The Pet Clinic example does pretty much the same thing, only using a sharded MongoDB cluster.

Twitter Real Time analytics example for Big Data 

The Twitter example shows how you can attach real-time stream-based processing to handle live Twitter feeds, and how you can manage both the stream-processing cluster and the Big Data cluster using Cloudify and run them on any cloud. The entire source code for this example is available on GitHub.

Give it a try

To try out any of the examples, you'll need to download the Cloudify (latest) or (stable) build. Cloudify ships with the first three examples as part of the distribution, under the recipes or examples directory. To try them out, simply follow the steps in the Cloudify Quick Start Guide.

