Use Machine Learning to Make Storage Smarter
With the goal of helping us learn more about the universe, IBM and the Netherlands Institute for Radio Astronomy (ASTRON) are collaborating on the IT roadmap behind the Square Kilometre Array (SKA), which will be the largest radio telescope in the world (bit.ly/2iywTpL).
As part of this effort, called the DOME project, IBM developed a pizza-box-sized data center that uses a fraction of the energy required by a typical data center to lower SKA computing costs (bit.ly/2jWgHut). Now, IBM is working on the best way to access and analyze the petabytes of data the SKA will gather daily.
“If you put everything on flash drives, you’re going to quickly eat up your IT budget. You must be smart about what you store long term versus short term, and the storage medium.”
—Giovanni Cherubini, data storage scientist, IBM
IBM Systems Magazine sat down with Vinodh Venkatesan and Giovanni Cherubini, IBM data storage scientists, who explain why the best way likely involves cognitive storage.
Inspired by the human brain, cognitive storage uses a combination of data popularity and data value. This means users can teach storage systems what to remember and what to forget, which can significantly lower storage costs, whether cognitive storage is used to scan the universe or to track customer behavior.
IBM Systems Magazine (ISM): How is the DOME project related to cognitive storage?
Vinodh Venkatesan (VV): Our DOME project partner, ASTRON, is part of a consortium that’s designing and building the SKA. When it’s finished by 2024, one of the main challenges will be the amount of data ASTRON is planning to collect—close to a petabyte a day. At these data ingestion rates, the storage system cost goes through the roof. Within this data, astronomers may find answers to some fundamental questions about the universe, but this won’t be possible if the system is too expensive to operate.
Some parts of the data collected are more valuable than others. This got our IBM Research team thinking about how our brains work. We collect a lot of information every day with our eyes, ears and so on, but we don’t remember everything. We remember only what seems important or relevant, and forget things that aren’t. Our brain automatically does this classification.
We thought, “Why not apply these principles in a large-scale data storage system? Why not, as data comes into the system, analyze it, decide whether it’s likely important and then choose what type of medium to store it on—whether it’s flash, hard drives or tape—and the type of redundancies involved, such as how many backups you need?” Then, system performance and reliability are optimized to take data value into account.
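To make that idea concrete, here is a minimal sketch, in Python, of the kind of value-driven selection described above. The tier names, value thresholds and replica counts are illustrative assumptions, not details of IBM's system.

```python
# A minimal sketch: as data arrives, an estimated value drives the choice of
# storage medium and redundancy. Tier names, thresholds and replica counts
# are illustrative assumptions, not part of IBM's actual system.
from dataclasses import dataclass

@dataclass
class StoragePolicy:
    medium: str    # e.g. "flash", "disk" or "tape"
    replicas: int  # how many redundant copies to keep

def select_policy(estimated_value: float) -> StoragePolicy:
    """Map an estimated data value in [0, 1] to a storage policy."""
    if estimated_value > 0.8:
        return StoragePolicy(medium="flash", replicas=3)
    if estimated_value > 0.4:
        return StoragePolicy(medium="disk", replicas=2)
    return StoragePolicy(medium="tape", replicas=1)

print(select_policy(0.9))  # StoragePolicy(medium='flash', replicas=3)
print(select_policy(0.1))  # StoragePolicy(medium='tape', replicas=1)
```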
ISM: How does this differ from current storage systems?
Giovanni Cherubini (GC): Current systems already do some of this optimization, with respect to data popularity; they keep track of how frequently each file or piece of data is accessed. They might move frequently accessed data—hot data—to faster devices such as flash, or they might move data to other storage media as it becomes colder. We’re looking at adding this new notion of data value because, although some correlation might exist between the value and popularity of the same piece of data, not all data that’s popular is valuable and vice versa.
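For contrast, a popularity-only policy of the kind current systems use might look like this short sketch; the access thresholds and tier names are again illustrative assumptions.

```python
# A sketch of popularity-based tiering: data is placed purely by how often
# it is accessed, with no notion of value. Thresholds are invented.
def tier_by_popularity(accesses_last_30_days: int) -> str:
    """Place data on a medium based only on recent access counts."""
    if accesses_last_30_days >= 100:
        return "flash"   # hot data
    if accesses_last_30_days >= 5:
        return "disk"    # warm data
    return "tape"        # cold data
```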
To capture value, we add what we call the computing/analytics units and a selector, which include a learning system that users can employ to teach the system what’s important and what’s not.
For example, if a user is presented with a sample of the files he owns, he can say, “These files are important. These files are not so important.” The learning system then looks at the file content and metadata to see whether patterns exist between the metadata and the stated importance—it’s essentially teaching itself. Whenever a new file comes in, the system looks at its metadata and detects patterns similar to those identified as important or not important. This produces an estimate of the data value. The selector then uses that estimate to decide what storage policy to apply to the new data: what medium to store it on and what type of redundancy should be involved.
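A rough sketch of that learning step follows, using scikit-learn and a generic classifier as stand-ins; the metadata fields, labels and model choice are invented for illustration, and the actual work uses an information bottleneck method rather than a random forest.

```python
# A sketch of the learning step: a user labels a sample of files as important
# or not, and a classifier trained on file metadata estimates the value of
# new files. The metadata values below are hypothetical.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Metadata for a user-labeled sample of files (hypothetical values).
labeled_metadata = [
    {"extension": "fits", "owner": "astro1", "size_kb": 20480, "depth": 4},
    {"extension": "tmp",  "owner": "astro1", "size_kb": 12,    "depth": 2},
    {"extension": "fits", "owner": "astro2", "size_kb": 40960, "depth": 5},
    {"extension": "log",  "owner": "astro2", "size_kb": 8,     "depth": 3},
]
labels = [1, 0, 1, 0]  # 1 = important, 0 = not important (user-supplied)

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(labeled_metadata)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, labels)

# For a new file, the predicted probability of the "important" class serves
# as an estimated value, which the selector can turn into a storage policy.
new_file = {"extension": "fits", "owner": "astro1", "size_kb": 30720, "depth": 4}
estimated_value = model.predict_proba(vectorizer.transform([new_file]))[0][1]
print(f"estimated value: {estimated_value:.2f}")
```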
Another important part of this system is the feedback loop, which periodically reassesses data value to determine whether data has been misclassified as important or unimportant. It can also track changes in data value over time. Some data, what we call must-keep data, may be critical to a business and must be reliably stored for extended periods of time. Other data has an expiration date, such as data collected for regulatory reasons that must be kept for a certain number of years.
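Such a feedback loop could be sketched as follows; the catalog entries, value model and migration helper are hypothetical placeholders rather than parts of a real system.

```python
# A sketch of the feedback loop: periodically re-estimate each file's value
# and migrate it if its storage policy should change. catalog, value_model,
# select_policy and migrate are hypothetical placeholders.
def reassess(catalog, value_model, select_policy, migrate):
    """Re-estimate value for every file and apply the new storage policy."""
    for entry in catalog:                        # entry holds metadata + current policy
        new_value = value_model(entry.metadata)  # value may rise or fall over time
        new_policy = select_policy(new_value)
        if new_policy != entry.policy:
            migrate(entry, new_policy)           # e.g. tape -> disk if value grew
            entry.policy = new_policy
```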
Business and experimental data is another type for which we think cognitive storage might be most appropriate. Businesses collect this all the time, and it usually has immediate value. For example, data collected by an online store about what’s in your shopping cart can be used for billing, checkout, shipping and so on. But it might also have value later: analytics can be performed on shopping cart data from many consumers over long periods of time to gain insight into which customers buy what, to identify markets, etc. This is how data values can change over time, and why reassessment comes into play. That’s a key aspect of our cognitive storage concept.
ISM: As part of your paper, “Cognitive Storage for Big Data” in IEEE Computer magazine, you did a cognitive storage demonstration that was nearly 100 percent accurate. What was involved?
VV: Our test server had about 1.7 million files on it, and we asked our colleagues to label the files as belonging to one project or another. The users classified the files into three projects. Then metadata was collected from the file system: file size, extension and name, the path to the file in the directory structure, the user who owns the file, the group to which the user belongs, access permission for the file, and so on.
We fed this information to a supervised machine-learning algorithm called the information bottleneck. It sees the metadata corresponding to each file, such as the file size, the file’s owner and the extension, along with the project the file belongs to. Of the 1.7 million files we had, approximately 170,000 were used to train the system. Each of those files was labeled with the project it belonged to, and the learning algorithm searched for patterns in the metadata that would indicate which project a file belonged to. Keep in mind that one of the projects had only 157 files out of the total of 1.7 million, while others had hundreds of thousands of files.
We chose 10 percent of the files from each of these projects to train the system using the information bottleneck. Then we took the remainder of the files and asked the system to predict, based on what it had learned, which project a file it hadn’t seen before belonged to. Because we had the project information for all 1.7 million files, we could compare the algorithm’s prediction with the actual project and determine whether it was correct. You could then assign different values to each of these projects, so that one project is valued more highly than another, and a file’s estimated value would follow naturally from the project it’s predicted to belong to.
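The evaluation can be sketched roughly as follows, with scikit-learn's train/test split and a generic classifier standing in for the information bottleneck algorithm; metadata_features and project_labels are assumed to be prepared along the lines of the earlier sketch.

```python
# A sketch of the evaluation: train on roughly 10 percent of the labeled
# files, predict the project of the remaining 90 percent, and compare the
# predictions with the known project labels.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def evaluate(model, metadata_features, project_labels):
    # Hold out 90 percent of the files; stratify so the smallest project
    # (157 files in the study) is still represented in the training set.
    X_train, X_test, y_train, y_test = train_test_split(
        metadata_features, project_labels,
        train_size=0.10, stratify=project_labels, random_state=0)

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return accuracy_score(y_test, predictions)
```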
For ASTRON, the data sets are basically observations of the skies, so we have metadata that gives us information about the project investigator, who conducted an observation, what part of the sky investigators were looking at, what sources and galaxies were observed—all sorts of information. But this idea has applications beyond astronomy. Any large enterprise with big data could benefit. One reason for publishing the paper (bit.ly/2ixbOYH) was to interest enterprises in beta testing so we can fine-tune the cognitive storage system to find the value of data in an enterprise context.
ISM: Do you think a seamless integration will occur between cognitive storage and cognitive computing?
GC: Yes, we think this technology integrates very well with IBM Watson* technology. For example, some APIs already available through Watson technology could be used for performing some of the analytics function in cognitive storage. Think about the Internet of Things in terms of data collection, such as weather or traffic. You can’t manually tackle that anymore—too much data exists, and it has different values—so you need some assistance in knowing what to file and where to file it. If you put everything on flash drives, you’re going to quickly eat up your IT budget. You must be smart about what you store long term versus short term, and the storage medium.
ISM: When do you think this technology will be available for clients?
GC: If we have a prototype in a year or less, another year or so will probably be needed to move to the next level. Meanwhile, we could start beta testing with some partners. Maybe two or three years down the road, we’ll have a product that’ll be available for clients.