New grant aims to create better algorithms to manage big data by getting “non-real”
Massive amounts of data are being generated, transmitted, received, processed, and stored at an unprecedented scale every day. The sheer amount of data available is outstripping our ability to make use of it.
Professors Laura Balzano and Hessam Mahdavifar are developing new ways to compress data as part of an investment from the DOE’s Office of Advanced Scientific Computing Research. This research is expected to yield new insights at the frontiers of physics, chemistry, biology, and other domains, such as energy systems and genomics.
The work they are doing involves developing “randomized algorithms,” described by the DOE as algorithms that include some form of sampling or randomness in their approach to dealing with massive data, enabling predictive modeling and simulation and supporting scientific analysis. The project is called “Get Non-Real: Randomized Sketching for High-Dimensional Non-Real-Valued Data.”
We asked Balzano and Mahdavifar to describe their research in more detail.
Q&A with Laura Balzano and Hessam Mahdavifar
How would you describe this project?
Massive amounts of data are being generated, transmitted, received, processed, and stored at an unprecedented scale every day. When these data are too big even to store, we need to compress them by removing redundancy.
Innovative compression algorithms are revolutionizing the data-processing pipeline by allowing this redundant information to be compressed significantly.
In this project, we will investigate a powerful class of such compression algorithms, namely randomized sketching algorithms, in new settings involving data types beyond the typical assumption that data entries are real-valued.
What is randomized sketching?
In general, sketching algorithms can be seen as forms of compression that map the original data to sketched data requiring substantially less space to store, or less communication bandwidth to transmit, while preserving certain pre-specified properties of the data.
Randomized algorithms are algorithms that employ randomness in their procedure. This means that if we run the algorithm multiple times with the same input, we may end up with a different result each time.
We use randomization to create methods that work well on many different datasets. Such methods are known to be powerful in a wide range of scientific domains, including all aspects of data processing, storage, and communications.
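As a concrete, simplified illustration of the idea (not the specific algorithms being developed in this project), the snippet below sketches a high-dimensional vector with a random Gaussian matrix; the sizes and the choice of a Gaussian sketching matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not project parameters):
# each data vector has d = 10,000 features; we keep a sketch of only m = 200 numbers.
d, m = 10_000, 200

x = rng.standard_normal(d)                    # a high-dimensional data vector
S = rng.standard_normal((m, d)) / np.sqrt(m)  # random Gaussian sketching matrix

y = S @ x  # the sketch: m numbers instead of d

# The random sketch approximately preserves a pre-specified property of the data,
# here the Euclidean norm of x, up to a small relative error.
print(np.linalg.norm(x), np.linalg.norm(y))
```

Because the sketching matrix is random, rerunning the code with a different seed gives a different sketch, yet the preserved property stays approximately the same.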
What do you mean by real-valued data?
Real-valued data is a dataset where each entry belongs to the set of real numbers. In other words, each entry in real-valued data can take any number from minus infinity to plus infinity in a continuous fashion. That includes irrational numbers, which cannot be stored exactly on a computer. So while the real-valued assumption is useful for mathematical theory, it is not practical for real-life data.
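A quick way to see that limitation (an illustrative Python snippet, not part of the project itself):

```python
import math

# A computer stores only a finite-precision approximation of the irrational
# number sqrt(2), so squaring the stored value does not return exactly 2.
print(math.sqrt(2) ** 2)  # 2.0000000000000004
```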
What are some examples of non-real-valued data?
Non-real-valued data includes integers (…, -2, -1, 0, 1, 2, 3, …), binary values (0 and 1), ordinal information (rankings of items in a list), and categorical data (for example, the state where you live, which takes one of 50 possible values). Many data sensing and collection modalities in scientific and technological applications are best modeled mathematically as data types other than real-valued data, since they may be constrained in their range as well as in the specific values they can take.
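As a small illustration of these data types (the readings, thresholds, and labels below are made up purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sensor readings: real-valued in theory, finite precision in practice.
readings = rng.normal(loc=20.0, scale=5.0, size=8)  # e.g. temperatures in degrees C

# 4-bit uniform quantization: each reading becomes one of 16 integer levels.
lo, hi, levels = 0.0, 40.0, 16
quantized = np.clip(
    np.round((readings - lo) / (hi - lo) * (levels - 1)), 0, levels - 1
).astype(int)

binary = (readings > 20.0).astype(int)            # binary data (0 / 1)
ordinal = np.argsort(np.argsort(readings))        # ordinal data: rank of each reading
categorical = np.array(["MI", "OH", "MI", "CA"])  # categorical data (state labels)

print(quantized, binary, ordinal, categorical, sep="\n")
```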
What do you mean by high-dimensional data, and what are some examples and applications?
Each data sample is often represented by a sequence of entries, or features; in a mathematical setting, such sequences are referred to as vectors. A vector can be considered an element of a vector space whose dimension equals the length of the vector, i.e., the number of features. High-dimensional means that the number of dimensions, or features, is much higher than what our systems can typically process or store. For example, in smart grids involving sensor networks of smart meters, it is estimated that 100 billion readings are collected each day globally in order to achieve fine-grained monitoring and scheduling for power grid systems. In another example, in medical imaging, a CT scan collects 559 million measurements in order to reconstruct an image of the inside of your body. In genomics, a single individual’s genome requires 3 billion bases (A, C, G, T) to be stored.
In general, it is often very difficult, if not impossible, to process and store the raw high-dimensional data. It is therefore much more efficient to map the data to low-dimensional structures in order to simplify the computational algorithms involving the data and to reduce the cost of storing, transmitting, or receiving it.
What do you mean by low-dimensional structures, and what are some examples and applications?
Oftentimes, although the data is high-dimensional, there are many dependencies between different features, making it possible to represent the data in a low-dimensional space in such a way that the low-dimensional representation preserves the most meaningful properties of the original data. Such a low-dimensional representation is also referred to as a low-dimensional structure.
This is true in smart grids, where smart meters take readings at short intervals and the majority of those readings are almost identical over extended periods of time. Similarly, this is true in medical imaging, where most of an image is water or empty space between organs, and images of organs have repeated textures and patterns. Also in genomics: while one genome requires 3 billion bases, 99% of the genome is shared across humans.
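To make that concrete, here is a minimal sketch, using a synthetic low-rank dataset with illustrative sizes (not actual smart-meter data), of how a truncated SVD exposes and exploits such low-dimensional structure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "smart meter" matrix: 1,000 meters x 500 time steps, built (by assumption)
# from only 5 underlying usage patterns plus a little noise -> approximately rank 5.
n_meters, n_times, rank = 1_000, 500, 5
patterns = rng.standard_normal((rank, n_times))
weights = rng.standard_normal((n_meters, rank))
data = weights @ patterns + 0.01 * rng.standard_normal((n_meters, n_times))

# Truncated SVD: keep only the top-5 singular components.
U, s, Vt = np.linalg.svd(data, full_matrices=False)
approx = U[:, :rank] * s[:rank] @ Vt[:rank]

# The relative error is tiny, yet the low-dimensional representation stores
# far fewer numbers than the original matrix.
rel_err = np.linalg.norm(data - approx) / np.linalg.norm(data)
original_size = data.size
compressed_size = U[:, :rank].size + rank + Vt[:rank].size
print(rel_err, original_size, compressed_size)
```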
What are some of the potential applications for your research?
The potential applications are vast and encompass almost any everyday scenario involving large volumes of data, including healthcare, retail and customer service, and banking, as well as more specific engineering and science applications such as genetics, neuroscience, and environmental sensing.
For example, in environmental sensing applications involving sensor networks, the data transmitted and collected by the dispersed sensors are usually quantized, due to bandwidth and energy constraints, before being sent over the network. The quantized data fall into the category of non-real-valued data, necessitating novel techniques for extracting low-dimensional structures from the massive amounts of data generated by these networks.
Other examples, including weather monitoring systems, swarm robotics, and smart grids, involve networks of nodes that need to continuously share real-time data, infer states, and arrive at a consensus for local decision-making. In such scenarios, sketching can provide efficient compression of the quantized data generated at each node.
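A minimal sketch of that idea, under illustrative assumptions (a shared random sign matrix, two hypothetical nodes, and made-up sizes; this is not the scheme being developed in the project):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: each node holds a long vector of quantized (integer-valued)
# measurements and shares only a short sketch of it.
d, m = 4_096, 64                      # illustrative sizes, not project parameters
S = rng.choice([-1, 1], size=(m, d))  # shared random sign (Rademacher) sketching matrix

def sketch(node_data):
    """Compress a node's quantized vector from d entries to an m-entry sketch."""
    return S @ node_data / np.sqrt(m)

node_a = rng.integers(0, 16, size=d)  # 4-bit quantized readings at node A
node_b = rng.integers(0, 16, size=d)  # 4-bit quantized readings at node B

# Sketches are linear, so nodes can compare or aggregate data in the sketched domain,
# e.g. estimating how far apart their measurements are without exchanging all d values.
estimated_distance = np.linalg.norm(sketch(node_a) - sketch(node_b))
true_distance = np.linalg.norm(node_a - node_b)
print(estimated_distance, true_distance)
```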
At a more fundamental level, this project has the potential to expand the current understanding of the fundamental limits of compression and to significantly reduce the cost of storing and processing massive datasets. We are very excited about this project!
Previous announcement
“Get non-Real”: Department of Energy grant funds novel research in High-Performance Algorithms at U-M (by Mariana Carrasco-Teja, Michigan Institute for Computational Discovery & Engineering)