In the era of big data, organizations rely on large datasets to help them run, quantify, and grow their operations.
Big data takes business analytics beyond traditional databases and into the realm of terabytes and even petabytes of unstructured data. Furthermore, this data no longer resides in one location. With cloud computing and the sheer volume of social media and clickstream activity that is publicly available, this information is truly distributed.
The scale becomes clear when we look at the figures reported by some organizations:
- IBM recently released a study showing that users create over 2.5 quintillion bytes (2.5 exabytes) of data every day. It also found that 90 percent of all the data in the world has been created in the last two years.
- Retail giants like Walmart have IT systems that process more than a million customer transactions every hour. Furthermore, because of its size and the range of products it carries, Walmart has to manage over 2.5 petabytes of data.
Database administrators have been forced to find new and creative ways to manage and analyze this vast amount of information. To help in that quest, there are several important open-source data management options that large organizations should evaluate:
- Apache Hadoop. One of the technologies that has evolved into a standard in big-data management is Apache Hadoop. This open-source framework is a workhorse for data- and compute-intensive distributed applications. Hadoop runs on clusters of commodity hardware and supports structured, semi-structured, and even unstructured datasets.
- Apache HBase. This big-data management platform is modeled after Google’s Bigtable design. As an open-source, distributed database written in Java, HBase is designed to run on top of the already widely used Hadoop environment. Apache HBase was adopted by Facebook to help with its messaging platform needs.
- MongoDB. This solid platform has been growing in popularity. MongoDB was originally created by 10gen, a company started by former DoubleClick founders and engineers, and is now being used by several companies for big-data management. An open-source, document-oriented NoSQL database, MongoDB stores data as JSON-like documents, which are easy for coders and machines alike to work with. Currently, organizations such as The New York Times, Craigslist, and a few others have adopted MongoDB to help them control big datasets.
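To make the "JSON-like" idea a little more concrete, here is a minimal sketch in plain Python. It does not use MongoDB or its driver; the sample records and the `find` helper are hypothetical and only illustrate how records in the same collection can carry different fields and still be queried by field value.

```python
import json

# Hypothetical records: documents in one collection need not share the same fields.
articles = [
    {"_id": 1, "title": "Election results", "tags": ["politics"],
     "author": {"name": "A. Reporter"}},
    {"_id": 2, "title": "Apartment listing", "price": 1200,
     "tags": ["housing", "rentals"]},
    {"_id": 3, "title": "Used bicycle", "price": 150},
]


def find(collection, **criteria):
    """Return documents whose top-level fields satisfy every criterion.

    A list-valued field matches when it contains the requested value;
    any other field matches on equality. This is an illustrative helper,
    not MongoDB's query API.
    """
    def matches(doc):
        for key, wanted in criteria.items():
            if key not in doc:
                return False
            value = doc[key]
            if isinstance(value, list):
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True

    return [doc for doc in collection if matches(doc)]


housing = find(articles, tags="housing")   # documents tagged "housing"
cheap = find(articles, price=150)          # documents priced at exactly 150

# The same structures serialize to JSON text and back without loss.
round_trip = json.loads(json.dumps(articles))
```

Because each record describes itself, a new field such as `price` can appear on some documents without a schema change, which is a large part of why this style suits semi-structured data.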
Our new “data-on-demand” society has yielded vast amounts of information. Whether these are photos on social networks or international retail transactions, the amount of good, quantifiable data is increasing. The only way to control this growth is to quickly deploy an efficient management solution.
I know there are a lot of other open-source big-data options out there. Where have you seen success and what have you been using?