In the era of big-data, organizations rely on large sets of data to help them run, quantify, and grow their operations.
Big-data takes business analytics beyond databases and into the realm of terabytes and even petabytes of unstructured data. Furthermore, this data no longer resides in one location. With cloud computing and so many reams of social media and clickstream activity available publicly, this information is truly distributed.
The scale becomes clear when we look at the figures some organizations have published:
IBM recently released a study showing that users create over 2.5 quintillion bytes of data every day. It also found that 90 percent of all the data in the world has been created over the last two years.
Retail giants like Walmart have IT systems that process more than a million customer transactions every hour. Furthermore, because of its size and the amount of product it carries, Walmart has to manage over 2.5 petabytes of data.
Database administrators have been forced to find new and creative ways to manage and analyze this vast amount of information. To help in that quest, there are several important open-source data management options that large organizations should evaluate:
Apache Hadoop. One of the technologies that evolved into the standard in big-data management is Apache Hadoop. This open-source data management framework is a workhorse for information- and compute-intensive distributed applications. The flexibility of the Hadoop platform allows it to run on commodity hardware and it supports structured, semi-structured, and even unstructured datasets.
Apache HBase. This big-data management platform is modeled on Google’s Bigtable design. As an open-source, Java-coded, distributed database, HBase is designed to run on top of the already widely used Hadoop environment. Apache HBase was adopted by Facebook to help with its messaging platform needs.
MongoDB. This solid platform has been growing in popularity. MongoDB was originally created by 10gen, a company founded by former DoubleClick executives, and is now being used by several companies as an integration tool for big-data management. Designed as an open-source NoSQL database, MongoDB stores data as JSON-like documents, a format that is easy for coders and machines alike to work with. Currently, organizations such as The New York Times, Craigslist, and a few others have adopted MongoDB to help them control big datasets.
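To make the "JSON-like" idea concrete, here is a minimal sketch in plain Python. It uses only the standard library, not MongoDB or its driver, and the clickstream events and field names (`user`, `page`, `items`, `referrer`) are hypothetical. It only illustrates that records in a document-style dataset need not share the same fields, and that a simple aggregation can still run across them.

```python
import json
from collections import Counter

# Hypothetical clickstream events; the field names are illustrative only.
# Note that the documents do not all have the same shape.
events = [
    {"user": "u1", "page": "/home", "ts": "2013-01-01T10:00:00Z"},
    {"user": "u2", "page": "/cart", "ts": "2013-01-01T10:00:05Z",
     "items": [{"sku": "A1", "qty": 2}]},
    {"user": "u1", "page": "/home", "ts": "2013-01-01T10:01:00Z",
     "referrer": "search"},
]


def round_trip(docs):
    """Serialize documents to JSON text and parse them back unchanged."""
    return json.loads(json.dumps(docs))


def page_counts(docs):
    """Count views per page, skipping documents that lack a 'page' field."""
    return Counter(doc["page"] for doc in docs if "page" in doc)


if __name__ == "__main__":
    restored = round_trip(events)
    for page, count in page_counts(restored).most_common():
        print(page, count)
```

A real document database adds indexing, querying, and distribution on top of this model. The sketch only shows why a schema-flexible record format suits data such as clickstream logs.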
Our new “data-on-demand” society has yielded vast amounts of information. Whether these are photos on social networks or international retail transactions, the amount of good, quantifiable data is increasing. The only way to control this growth is to quickly deploy an efficient management solution.
I know there are a lot of other open-source big-data options out there. Where have you seen success and what have you been using?
Are we really controlling the growth or are we surrendering to simple human behavior?
I've had this ticking away in the back of my head for years now. I can't get past the idea that, at least when it comes to data, most people are hoarders.
An example: I encountered a person who was collecting clickstream data from a busy, multinational website. (This was back when clickstream was just starting to be a thing, maybe even a bit before that.) This fellow collected the data (the means aren't important), exported it and then burned it onto CDR because he didn't have enough storage on his computers. He then put the CDR in his desk. This was a daily task for him. When we first interacted with him, he'd been doing this for months. He had a desk full of CDRs. He had no system for importing that data back into something that could do analysis, or even just perform simple searches. He didn't have a good idea of what metrics people would really be interested in. He had a lot of data he was convinced was useful. He was collecting the data because he could and the costs of storage weren't "too much". He was hoping that someone would come up with a reason to have the data, all the while performing his daily ritual.
IIRC, he stopped his project when we looked at the costs of creating a system and standing it up. (That would have been a beastly clustered RDBMS system with a second, identical warm failover system. SAN storage was required. There were chargebacks. All of that means "expensive" and time-consuming, and possibly unnecessary, but that was our standard.)
Nowadays, he would chuck that into a big data system running on cheaply scalable, virtualized, open-source software. He might not even need IT support, if he can buy cloud resources on a corporate expense account.
@hash.era - Well, I'd imagine there are a few ways to manage big data. In fact, big storage vendors are jumping into the big data game. There is direct integration with big data analytics engines already from vendors like EMC and NetApp.
These big data and BI engines can be resource intensive. Now, from a backup perspective - some of the more refined editions of big data management solutions aren't always free. As mentioned earlier, bringing in a solution that has backup, redundancy and even mirroring will probably cost an enterprise license.
@Michael.Steinhart - That's a great question. Many smaller organizations looking to start out are trying the free versions of these products. Remember, the entire Apache Hadoop "platform" is now commonly considered to consist of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS), as well as related projects such as Apache Hive and Apache HBase.
The product itself can be deployed at no charge. In many cases, it's the integration with other systems that comes with a cost. Furthermore, the ability to tie into other paid products will depend on the edition of the Hadoop platform that's being deployed.
Let's look at a specific example: MapR can be obtained for free or through two purchased editions. The paid editions offer more features like instant node recovery, volume-based data management for tables, and snapshots for tables. Paid versions also include support.
I wanted to mention one additional tool that I tried called CouchDB. It is also free and something that I was introduced to while I was working with Apple. It was enough to spark my interest in NoSQL, so I keep up with it even now. I always follow tools that relate to data.
Thanks for this helpful overview, Bill. I've seen Hadoop available free and also available with value-adds like support, deployment templates, and vertical-focused integration. Which do you recommend for a company that wants to leverage the platform for BI?