A recent O’Reilly Radar article on big data in the cloud included a mention of RightScale and its ability to orchestrate server management across multiple clouds. It prompted me to think about the role of RightScale in supporting the implementation of big data solutions for our users. But first, what exactly is big data? Edd Dumbill, a contributor to O’Reilly Radar, offers this explanation:
“Big data is data that exceeds the processing capacity of conventional database systems [because it] is too big, moves too fast, or doesn’t fit the structures of your database architectures.”
A commonly accepted way to identify big data is to determine if it meets any of these three key criteria:
Volume – The quantity of data is too large to fit into conventional database structures. There is a tremendous amount of information to be gained by analyzing data stored on social networks, for example, but you have to analyze all of the data to extract the portion relevant to your need. Here are a few examples of the volume of data stored in social networks:
- Facebook is expected to have more than 1 billion users by August 2012, handles 40 billion photos, and generates 10 TB of log data per day.
- Twitter has more than 100 million users and generates some 7 TB of tweet data per day.
Velocity – Data is being generated at a meteoric pace. In many cases, data needs to be processed in real time or its value is diminished tremendously. Example use cases include:
- Sensor Networks – A single airplane engine generates more than 10 TB of data every 30 minutes. In the past, this data was deleted at the end of each flight. How would you feel if you knew that your airline was instead using this data to proactively monitor the health of its engines and replace them before they fail?
- Stock Market – For every trading session, the NYSE captures 1 TB of trade information. In this case, processing stale data is worse than processing no data at all.
Variety – Did you know that 80 percent of the world’s data is unstructured? This data does not easily fit into the structures of a relational database, so storing and analyzing it requires a new approach. Two examples of unstructured data mining are:
- Transaction Records – A recent Forbes article highlights how Target tailors marketing campaigns to pregnant women. The company accomplishes this by analyzing customer transaction records and assigning each shopper a “pregnancy prediction.”
- Social Media Accounts – There is invaluable data contained in Facebook and LinkedIn users’ posts, emails, and text messages. But conventional data analytics platforms do not easily allow for storage or analysis of free-form text.
What is the connection between big data and cloud computing?
Analyzing big data can potentially lead to big profits, but is it worth the time and money? I recently attended a seminar where cloud computing was described as “the democratization of software.” I think this description is particularly apt for any discussion of big data because, historically, analyzing a large data set required significant investments of both time and money. By using servers on a pay-as-you-go basis, you can not only reduce your costs but also bring more compute resources to bear and analyze your data faster. For example, let’s say you want to run a monthly report with Hadoop that takes 100 compute hours to complete.
Option 1: Build a 10-node Hadoop cluster
- Your job will take 10 hours to complete.
- Your cluster will be used for 10 hours each month and will sit idle for the remaining 720 hours or so.
Option 2: Build a 100-node Hadoop cluster in the cloud
- Your job will take 1 hour to complete.
- Your cluster will be terminated after the job is complete, and you will only be charged for the compute resources used.
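The arithmetic behind the two options can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that go beyond the example itself: the job is assumed to scale perfectly across nodes, a month is taken to be about 730 hours (the yearly average), the 10-node cluster is assumed to run around the clock, and billing is assumed to be per node-hour. The function names are hypothetical, and no prices are included because they vary by cloud.

```python
# Rough comparison of an always-on cluster with a pay-as-you-go cluster.
# Assumptions (not from any specific cloud or from Hadoop itself):
#   - the job parallelizes perfectly across nodes
#   - an average month has about 730 hours
#   - you pay per node-hour for as long as a node is running

HOURS_PER_MONTH = 730  # assumed average month length in hours


def job_hours(total_compute_hours, nodes):
    """Wall-clock hours needed to finish the job on the given number of nodes."""
    return total_compute_hours / nodes


def idle_hours(total_compute_hours, nodes):
    """Hours per month an always-on cluster sits unused."""
    return HOURS_PER_MONTH - job_hours(total_compute_hours, nodes)


def node_hours_billed(total_compute_hours, nodes, always_on):
    """Node-hours paid for each month under the assumptions above."""
    if always_on:
        return nodes * HOURS_PER_MONTH
    return nodes * job_hours(total_compute_hours, nodes)


work = 100  # compute hours for the monthly report

for label, nodes, always_on in [
    ("Option 1 (10 nodes, always on)", 10, True),
    ("Option 2 (100 nodes, pay per use)", 100, False),
]:
    print(label)
    print("  job duration (hours):", job_hours(work, nodes))
    print("  node-hours billed per month:", node_hours_billed(work, nodes, always_on))
```

Under these assumptions, the always-on cluster pays for thousands of node-hours to do 100 node-hours of work, while the on-demand cluster pays only for the 100 it uses. Real jobs rarely scale perfectly, so treat the numbers as an upper bound on the benefit.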
With Option 2, you can gain access to the same tools everyone else has at a fraction of the cost. There is no need for a multi-million dollar data warehousing solution. RightScale and our partner, IBM, offer a solution for performing intensive business analytics, batch processing, and grid computing in the cloud of your choice. IBM has developed a set of RightScale ServerTemplates™ for its Hadoop-based BigInsights product. The best part is that the BigInsights Basic Edition ServerTemplates are available at no charge for data sets up to 10 TB. IBM also offers BigInsights Enterprise Edition ServerTemplates, which include many additional tools and features. These ServerTemplates are available in the RightScale MultiCloud Marketplace.