Big Data on ice
August 05, 2014
For every 30 minutes of operation, a Boeing jet engine generates 10 terabytes of data. That’s 10 trillion bytes, or roughly the digitized equivalent of the entire printed collection of the Library of Congress. Multiply that by the number of jet engines propelling more than 87,000 flights over the U.S. every day, and you’ve got yourself a lot of data.
Despite speculation by some that Big Data won’t translate into the analytical cure-all it’s been cracked up to be, it’s still inextricably linked to the Internet of Things (IoT). If projections from the likes of Cisco are correct, by the year 2020 we will have roughly 50 billion Internet-enabled “things” constantly chirping away, and although they may not all generate data at a jet engine clip, that definitely constitutes a data tsunami.
So, what are we going to do with all of that data? Harvested information must have value at some level, otherwise it wouldn’t have been collected in the first place. On the other hand, not all data is created equal. The vast majority of data collected will likely be archived and forgotten until it’s needed for the occasional report, accessed once or twice, and then forgotten again. According to research from the Enterprise Strategy Group, this “infrequently accessed” information (also known as Tier 3 or “cold” data) accounts for as much as 80 percent of recorded data. And although the average cost of storage has declined sharply over the years, pennies per gigabyte add up very quickly at Big Data scale.
The obvious answer to this information overload is cold data storage: alternatives that are cheaper and offer more capacity than the systems used for data accessed on a regular basis. As a result, companies have typically chosen one of two solutions: the time-honored tape library or, more recently, the cloud.
Tape libraries have been in use for decades and are excellent for storing large quantities of data at extremely low cost. They can also be considered “green” because tape drives only spin when in use (which saves power), and being located on-premises allows cold data to be accessed relatively quickly. However, tape libraries also have drawbacks, including considerable up-front expense for mid- to large-scale storage systems, difficult remote access, the possibility of tape degradation, and the vulnerability of maintaining archives in a single, on-site location (instead of “data tsunami,” think “data” and “tsunami”).
More recently, companies have also started exploring storage possibilities in the cloud, which addresses some of the shortcomings of tape libraries by providing virtually unlimited storage space, low cost, and off-site redundancy that protects against theft, natural disasters, and the like. The disadvantages of cloud solutions, though, are that retrieving data is often very time consuming and can become costly depending on how much is retrieved. For example, services like Amazon Glacier require 3-5 hours to stage a data set for retrieval (after which it is available for download for 24 hours), and charge by the gigabyte if more than 5 percent of your data is retrieved in a given month.
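To put that retrieval latency in perspective, here is a minimal sketch of pulling a single archive back out of Glacier using the boto3 SDK (not covered in the article itself); the vault name and archive ID are placeholders, and the multi-hour wait shows up as a polling loop on the job status.

```python
# Hypothetical sketch: retrieving one archive from an Amazon Glacier vault
# via boto3. Vault name and archive ID are placeholders.
import time
import boto3

glacier = boto3.client("glacier")
VAULT = "cold-data-vault"          # placeholder vault name
ARCHIVE_ID = "EXAMPLE-ARCHIVE-ID"  # placeholder archive ID

# Ask Glacier to stage the archive for download (typically 3-5 hours).
job = glacier.initiate_job(
    accountId="-",  # "-" means the account that owns the credentials
    vaultName=VAULT,
    jobParameters={"Type": "archive-retrieval", "ArchiveId": ARCHIVE_ID},
)
job_id = job["jobId"]

# Poll until the retrieval job completes, then download within the
# 24-hour availability window mentioned above.
while not glacier.describe_job(accountId="-", vaultName=VAULT, jobId=job_id)["Completed"]:
    time.sleep(900)  # check every 15 minutes

output = glacier.get_job_output(accountId="-", vaultName=VAULT, jobId=job_id)
with open("retrieved-archive.bin", "wb") as f:
    f.write(output["body"].read())
```

Per-gigabyte retrieval charges would apply on top of this if more than the monthly free allowance is pulled back.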
An improvement, it seems, would exist at the intersection of the two, and incorporate hardware and software elements that optimize access while ensuring the lowest possible cost per gigabyte of storage.
Cold storage: Big Data on ice
Software-defined storage (SDS) is relatively new terminology, but from a technology standpoint it is similar to software-defined networking (SDN) in that hardware logic is abstracted into a software layer that manages the storage infrastructure. In essence, this means that storage features and services (such as deduplication, replication, snapshots, and thin provisioning) can be virtualized, enabling converged storage architectures that run on commodity hardware. It therefore becomes possible to implement cost-effective storage strategies that combine the accessibility and efficiency of tape libraries with the scalability and remote capabilities of the cloud.
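As a rough illustration of the kind of policy logic an SDS layer can virtualize, the sketch below demotes objects to a cold tier after a period without access; the tier names, 90-day threshold, and catalog structure are invented for the example and do not correspond to any particular product.

```python
# Illustrative sketch (not from any specific SDS product): a simple tiering
# policy that moves objects between a "hot" tier and a "cold" tier based on
# how recently they were accessed.
from dataclasses import dataclass
from datetime import datetime, timedelta

COLD_THRESHOLD = timedelta(days=90)  # assumed demotion policy

@dataclass
class StoredObject:
    name: str
    size_gb: float
    last_access: datetime
    tier: str = "hot"  # "hot" (Tier 1) or "cold" (Tier 3)

def apply_tiering_policy(catalog, now=None):
    """Demote objects that have gone cold; keep recently touched data hot."""
    now = now or datetime.utcnow()
    for obj in catalog:
        idle = now - obj.last_access
        if obj.tier == "hot" and idle > COLD_THRESHOLD:
            obj.tier = "cold"   # e.g. replicate to spun-down disk, free hot capacity
        elif obj.tier == "cold" and idle < COLD_THRESHOLD:
            obj.tier = "hot"    # recently retrieved data stays warm for a while
    return catalog

# Example: a stale quarterly report is demoted, today's telemetry stays hot.
catalog = [
    StoredObject("q1-report.parquet", 120.0, datetime.utcnow() - timedelta(days=200)),
    StoredObject("telemetry-today.log", 4.5, datetime.utcnow()),
]
for obj in apply_tiering_policy(catalog):
    print(obj.name, "->", obj.tier)
```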
For example, RGS Cold Storage powered by Storiant is an on-premise storage solution for Tier 3 data that’s based on off-the-shelf hardware from RGS, a business unit of Avnet, Inc. (Figure 1). The cabinet-level appliances are fully integrated with 60 HDD bays that offer petabyte-scale capacity, and leverage the OpenZFS-based Storiant software (formerly SageCloud) to interface with a private cloud. The Storiant data management software also improves access performance, yielding retrieval times as fast as 30 seconds for data in a stagnant state, while allowing the HDDs to spin down when not in use to significantly reduce power consumption. At $0.01 per gigabyte of storage per month, the scalable RGS Cold Storage architecture is cost-optimized for the majority of Big Data deployments.
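For a sense of how that $0.01 per gigabyte per month figure scales, a quick back-of-the-envelope calculation (the 1-petabyte archive size is an arbitrary illustration, not a figure from the article):

```python
# Back-of-the-envelope cost at the quoted $0.01 per GB per month.
cost_per_gb_month = 0.01      # USD, as quoted for RGS Cold Storage
archive_size_gb = 1_000_000   # 1 PB expressed in GB (decimal units), for illustration

monthly = archive_size_gb * cost_per_gb_month
print(f"1 PB of cold data: ${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")
# -> 1 PB of cold data: $10,000/month, $120,000/year
```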
While storage management technologies such as SDS help set the stage for valuable business analytics, they also ensure that financial and compute resources remain available for the “Tier 1” data that is accessed and processed on a regular basis. In an environment where too much information can realistically become a bad thing, it’s important to keep some of it in the deep freeze. For more information on SDS, visit www.openstack.org.