Oklahoma State University's Technology Business Assessment Group recently announced it will fund research on an approach to information protection called data shuffling. The project is led by Professor Rathindra Sarathy of OSU's Department of Management Science and Information Systems, who explains to us just what data shuffling is and why it could be coming to your network soon.
Can you give me a quick layman's explanation of data shuffling, then a little more technical one for our readers in IT security? Also, how's it different from encryption?
Data shuffling (US patent 7200757) belongs to a class of data masking techniques that try to protect confidential numerical data while retaining its analytical value. Let us say that you want to provide confidential salary data to an analyst. The goal is to try to answer questions such as "Controlling for experience, education and other factors, is there a difference in salary between male and female managers?" or "What are the best predictors of salary among variables such as Age, Sex, Experience, Education, Race, etc.?"
You do not want to provide the original salary data to the analyst, for obvious confidentiality reasons. Even if you remove personally identifiable information before providing the original confidential data, security is not assured since it is usually easy to identify an individual if you know their characteristics. Conventional encryption techniques would not be of value, since the unencrypted original salary is necessary to perform analysis. Hence, one approach is to try to modify the numbers (masking the numbers) before you provide them to the analyst. Data shuffling would intelligently re-assign the original salary numbers such that the results of the analysis come out correct. Simultaneously it prevents you from associating the original salary numbers with the correct individuals. The real power of data shuffling shows up when you want to maintain complicated relationships among several variables, including both confidential and nonconfidential, such as in the second question above.
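To make the idea of re-assigning values concrete, here is a minimal Python sketch. It is not the patented procedure, only a simplified rank-based illustration under stated assumptions: a single nonconfidential variable (experience), a straight-line relationship with salary, and made-up data. The names shuffle_confidential and pearson, and all the numbers, are hypothetical and exist only for this example.

```python
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def shuffle_confidential(x, y, seed=0):
    """Reassign the values in y across records (simplified illustration).

    Assumption: y depends roughly linearly on one nonconfidential variable x.
    """
    rng = random.Random(seed)
    n = len(y)

    # Step 1: fit a least-squares line y = intercept + slope * x.
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Step 2: build synthetic values by giving each record the fitted value
    # plus a residual taken from some other, randomly chosen record.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    rng.shuffle(residuals)
    synthetic = [intercept + slope * xi + r for xi, r in zip(x, residuals)]

    # Step 3: hand out the ORIGINAL values in the rank order of the synthetic
    # ones, so the smallest original goes to the smallest synthetic, and so on.
    order = sorted(range(n), key=lambda i: synthetic[i])
    sorted_y = sorted(y)
    shuffled = [0.0] * n
    for rank, record in enumerate(order):
        shuffled[record] = sorted_y[rank]
    return shuffled


# Made-up demonstration data: years of experience and a salary that grows with it.
demo_rng = random.Random(42)
experience = [demo_rng.uniform(0, 30) for _ in range(200)]
salary = [40000 + 2000 * e + demo_rng.gauss(0, 8000) for e in experience]

masked = shuffle_confidential(experience, salary, seed=1)

# Compare the experience/salary correlation before and after masking.
print(round(pearson(experience, salary), 3), round(pearson(experience, masked), 3))
```

Because the masked column contains exactly the original salary figures, totals, averages and the overall distribution are unchanged; only the assignment of figures to individuals differs. In this toy setting the relationship with experience is kept approximately, not exactly, and relationships with variables left out of the model are not protected at all.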
Data shuffling isn't something we've written about, though I do see a fair number of references to it on the Web. Do you have a sense of how hot a concept this is now?
Several researchers are working on data masking concepts. Data shuffling is a particular method of data masking that we have patented. We believe that it has strong potential. Unfortunately, organizations have not realized the power of data shuffling and the potential benefits that come from using this approach. Our main thrust in the next two years will be to educate and promote the benefits of data shuffling.
As for commercial products, there are a couple of data masking products in the marketplace. But unlike data shuffling, they handle only fairly simplistic situations. As a result, the masked data does not offer the same assurance of quality that data shuffling provides.
I saw a presentation you did that focused on protecting data in healthcare settings. Is that where you see data shuffling taking hold initially, or what other vertical markets do you think are especially good fits?
Healthcare is definitely one of our current focus areas, but there are many other applications where it can be useful, such as insurance claims data or other types of financial analysis. In fact, data shuffling can be used in any situation where an organization wishes to analyze or share any confidential data.