New Data Types Drive New and Enriched Data-Storage Solutions
IBM Research is changing how unstructured data is stored.
By Jim Utsler | 10/01/2020
Businesses are acquiring data in new ways. Today, databases are often automatically updated based on new data streams, including Internet of Things (IoT) and portable devices.
These technologies are changing data types. Unstructured data, for example, whether in the form of images, video or music, doesn’t necessarily fit the relational database model, though that model remains valuable. Additionally, advanced analytics, AI, deep learning and cloud computing are changing how data is exploited. All of this together is pushing the adoption of newer data capture, retention and access solutions, such as those associated with object stores.
This benefits organizations, particularly in hybrid cloud and hybrid multicloud environments. By using object stores as part of their larger data storage environment, businesses can save on hardware and query costs, scale effortlessly, lower query times and improve overall data security.
“To a degree, this was initially driven by tools such as Apache Hadoop and its MapReduce programming model, and the more modern Apache Spark,” says Michael Factor, IBM Fellow and IBM Research global co-lead for Hybrid Data strategy. “Today, a set of open formats is available for storing rectangular data, which is data that is stored as rows and columns. These formats make rectangular data processable, but they can also result in almost unlimited data sizes: terabytes, tens of terabytes and beyond. Data being treated this way often comes from webs of IoT devices that are gathering, for instance, data about the weather.” The sheer amount of data demands new methods to access only the data needed for specific tasks.
Object Store Integration
But methods must be implemented to manage all forms of data, independent of the data processing tool, such as a relational database or MapReduce, and of the use case, including analytics, business intelligence and business process automation. This is where object stores come into play. They act as a centralized place to collect, store and manage data assets that generate business value.
“Everything gets integrated within the object store. For example, when logging information is collected, it can be stored in the object store where there is no need to worry about scaling to store data and where costs are much lower than traditional on-prem storage solutions,” Factor says. “A lot of this type of data had traditionally been stored in data lakes, which have been based on the Hadoop Distributed File System (HDFS). But this historically put the data and the compute together.”
This colocation of data and compute had unfortunate side effects, however. To scale out such a system, an organization had to deploy another server, with perhaps six disk drives plus CPU and memory, even if all it needed was more storage capacity, and it had to do so every time it gathered additional data. And if an organization had cold, unused data, it had essentially purchased more compute resources, including servers and memory, than it actually needed. The solution was divorcing the data from the compute.
As Factor explains, “This separation was, in part, driven by public clouds and the growth of network speeds. But critically, it’s been driven by the fact that object stores allow for really cost-effective data storage. It currently costs around two cents per gigabyte per month in the public cloud, and in many cases, you can even get it lower depending on how actively you’re using the data. Further, with object storage in the cloud, there is no need to worry about provisioning physical capacity to store more data; one just consumes capacity as needed and the cloud provider worries about the underlying physical disks.”
Cost savings also come in the form of not having to break the bank on on-premise server and storage hardware. When an organization needs to process data in the cloud, it can simply spin up its required compute resources. Once it’s completed its data processing, it can take the resources down. As a result, the only thing the organization’s paying for over time is the actual space consumed by its data versus having to have a series of servers that might be sitting idle if they’re not doing any computations and just hosting data-storage disks.
Additionally, object stores can hold data in native formats to support big data and analytics tools. Hadoop and Spark jobs can directly access object stores through the S3a or Stocator connectors via IBM Analytics Engine. IBM itself uses these techniques against IBM Cloud® Object Storage for its operational and analytics needs.
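As a sketch of what that direct access looks like in practice, the S3a connector is typically wired up through a handful of Hadoop configuration properties. The endpoint, credential variables and bucket name below are hypothetical placeholders, not values from the article:

```shell
# Point the S3a connector at an S3-compatible object store
# (endpoint, credentials and bucket are hypothetical placeholders).
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=https://s3.example.cloud \
  --conf spark.hadoop.fs.s3a.access.key=$ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=$SECRET_KEY \
  my_job.py

# Inside my_job.py, the object store is then addressed like a filesystem:
#   df = spark.read.parquet("s3a://my-bucket/weather/")
```

Once configured this way, the compute layer can be spun up and torn down independently of the data, which is the decoupling the article describes.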
Running big data analytics can be very time-consuming depending on the number of data objects that are in the store. The longer the query runs and the more bytes that need to be processed, the more expensive the analytics. By cutting down the amount of data that's scanned in queries, organizations can realize huge cost savings and improved system performance.
One of the first steps toward achieving this is the move to data layout designs such as Parquet formatting. Parquet uses a columnar format as opposed to the traditional row-based CSV format. Because the data is laid out as columns, a query run against an object, using a service such as Cloud SQL Query or a framework such as Spark, can become very specific. If, for example, someone needs access to Columns 1 and 2, they can retrieve the data held only within those columns. This cuts down on unnecessarily long and potentially fruitless reading of data from the object store.
“Say I want data on the blood type of people between the ages of 20 and 30. To process this and get the answer, I need to look at the age and blood type columns. I don’t need to know anything about any of the other columns,” Factor explains. “With formats such as CSV, if I had a table that was 300 GB large, 10 GB for each of my 30 columns, I would have to run through all 300 GB. Now, if I store this in a column-based format like Parquet, I need to retrieve only 20 GB, 10 for blood type and 10 for age, assuming that all of the columns are the same size. That’s a huge value and a major reason why people are moving to this format.”
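The effect Factor describes can be shown with a toy model, scaled down so it runs instantly. This is not Parquet itself, just an illustration of why a columnar layout lets a two-column query skip the bytes of the other 28 columns that a row-based scan must touch:

```python
# Toy illustration (not Parquet itself) of why columnar layout cuts I/O.
# Loosely modeled on the 30-column table from Factor's example.
import csv
import io

# A small table: age, blood_type, plus 28 filler columns.
rows = [{"age": 20 + i % 40, "blood_type": "ABO"[i % 3],
         **{f"col{j}": "x" * 8 for j in range(28)}}
        for i in range(1000)]

# Row-based (CSV-like) layout: a query must scan every byte of every row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
row_bytes_scanned = len(buf.getvalue())

# Column-based layout: the same query touches only the two columns it needs.
columns = {name: [r[name] for r in rows] for name in rows[0]}
col_bytes_scanned = sum(
    len(str(v)) for name in ("age", "blood_type") for v in columns[name])

print(row_bytes_scanned, col_bytes_scanned)
```

The columnar scan reads a small fraction of the bytes the row scan does, which is the ratio Factor's 20 GB versus 300 GB example captures.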
Running in tandem with this, data skipping indexes can help users further improve the performance of their queries, while also improving cost savings. A data skipping index can help a framework such as Spark figure out which data isn't relevant and simply skip over it. Using this solution while also putting all the data in an almost infinitely large object store gives users the elasticity to ensure they can actually access the data they want and pass over anything that’s extraneous.
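A minimal sketch of the idea: keep min/max metadata for each object, and read an object only when the query predicate could possibly match inside it. The object names, values and the range predicate here are illustrative, not from the article:

```python
# Minimal sketch of a data-skipping index: per-object min/max metadata
# lets a query skip whole objects without reading them.
# Object names and values are hypothetical.
objects = {
    "weather/part-0001": {"rows": [22, 25, 28], "min": 22, "max": 28},
    "weather/part-0002": {"rows": [55, 60, 64], "min": 55, "max": 64},
    "weather/part-0003": {"rows": [18, 29, 41], "min": 18, "max": 41},
}

def query_between(lo, hi):
    """Return matching rows, reading only objects whose [min, max] overlaps."""
    scanned, hits = [], []
    for name, meta in objects.items():
        if meta["max"] < lo or meta["min"] > hi:
            continue              # skip: the index proves no match inside
        scanned.append(name)      # only now pay to read the object's rows
        hits += [v for v in meta["rows"] if lo <= v <= hi]
    return scanned, hits

scanned, hits = query_between(20, 30)
print(scanned)  # part-0002 is never read; its range can't contain a match
print(hits)
```

Real implementations store this metadata alongside the objects, but the payoff is the same: the fewer objects actually read, the cheaper and faster the query.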
“My team has been working with data they got from The Weather Company. They’re looking at how to conduct very efficient queries over five years’ worth of data that's now sitting in object stores,” Factor says. “Depending on network speeds and the amount of data, it can be a lot to read. Using a data skipping index to read only those objects they specifically require allowed them to get value from all of this stored data in a fraction of the time.”
Encryption and Decryption
But what happens if someone wants to encrypt data in the Parquet columnar format? The default approach is to encrypt the whole file before it’s stored to disk or cache. Unfortunately, that means reading even a single column requires decrypting all of the data, leaving everything potentially exposed. Because of this, the value of being able to retrieve an individual column is lost.
“To address this, the team worked on an enhancement to the Parquet standard,” Factor notes. “We developed the idea of Parquet Modular Encryption, an effort led by IBM, although others certainly contributed to it. The notion behind it is that users can define separate keys for each particular column, and the columns are encrypted individually. We have integrated Parquet Modular Encryption with IBM Analytics Engine.”
Now, users can retrieve individual columns and decrypt them, but only if they have the key particular to that column. The administrator can use key access to control which particular columns a user can see. For example, someone might be able to view the contents of Columns 2 and 3, but not Column 1, for which somebody else may have been issued a key. Because those columns are isolated and require access to different keys, the entire data set doesn’t have to be decrypted and exposed.
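A toy sketch in the spirit of this model: each column is encrypted under its own key, so a reader can decrypt only the columns it holds keys for. The hash-based keystream below is purely illustrative; the actual Parquet Modular Encryption standard uses AES with key metadata and footer protection, and the column values and key names here are made up:

```python
# Toy per-column encryption illustrating the Parquet Modular Encryption
# idea: one key per column, so readers see only columns they hold keys for.
# The SHA-256 keystream is a stand-in for real AES; values are made up.
import hashlib

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """XOR data against a hash-derived keystream (symmetric toy cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

columns = {"age": b"23,27,21", "blood_type": b"A,O,B", "name": b"ann,bob,cho"}
keys = {"age": b"key-age", "blood_type": b"key-bt", "name": b"key-name"}

# Writer: encrypt every column under its own key.
encrypted = {c: keystream_xor(keys[c], v) for c, v in columns.items()}

# Reader: holds only the age and blood_type keys, so only those columns
# decrypt; the name column stays opaque ciphertext without its key.
reader_keys = {"age": keys["age"], "blood_type": keys["blood_type"]}
visible = {c: keystream_xor(reader_keys[c], encrypted[c]) for c in reader_keys}
print(visible)
```

Because each column's ciphertext is independent, nothing outside the reader's key set is ever decrypted, which is the access-control property the article describes.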
Another way of controlling access involves using IBM Watson® Knowledge Catalog, which enables organizations to define governance policies around the data. This includes defining who can see selective data and which columns need to be redacted before a user can access the data.
A Continual Process
As many organizations are keenly aware, data has indeed changed over the years. It’s no longer letters and numbers being input into traditional databases by data entry clerks. The client-server paradigm, followed by the rise of unstructured data, fundamentally changed that model, as did the advent of advanced analytics, AI, deep learning and cloud computing. And new data storage methods need to reflect that, including in the ever more complex realm of data security.
“This isn’t a situation where one night we woke up and things had suddenly changed,” Factor remarks. “It’s a continual process, with different data types becoming so significant and undeniable that parts of the underlying data storage infrastructure have to be altered to meet evolving data demands.”
Fortunately, organizations can benefit from these changes. For example, object stores can lower hardware and query costs, improve scalability and elasticity, and increase overall data security. And technologies such as the Parquet format, IBM's Cloud SQL Query service, Parquet Modular Encryption, IBM Analytics Engine, and Watson Knowledge Catalog only bolster these new storage paradigms.
So You Think Tape Is Dead?
I have heard from many technology experts over the last few years that tape is dead. Well, tape is not dead, but it sure has changed.
From the days of round reels to many, many different types of tape cartridges, we’re down to essentially two types of physical tape used in general data processing. That doesn’t mean tape is dead. Rather, it has evolved. Today, the vast majority of “tape” usage is virtual tape.
Virtual tape is a system that emulates physical tape on a processing and storage device. The emulation is important because, in many cases, the host writing the “tape” doesn’t know it’s virtual.
Virtual tapes have many advantages over physical tape, not the least of which is the ability to send a copy of that virtual tape to another site over the wire rather than physically carrying it (and possibly losing it). Tape isn’t done evolving. Virtual tape is now moving into the cloud. Cloud-based virtual tape allows processing environments to migrate to the cloud without major process changes.
CTO, Dynamic Solutions International
Chris has been working with computer storage systems for over 40 years.
Jim Utsler, senior writer, has been writing for IBM since the mid-1990s.