





Abstract: In the first half of the talk, I will discuss data mining technologies that can result in better browsing and searching. Consider the problem of merging documents from different categorizations (taxonomies) into a single master categorization. Current classifiers ignore the implicit similarity information present in the source categorizations. I will show that by incorporating this information into the classification model, classification accuracy can be substantially improved. Next, I will demonstrate novel search technology that treats numbers as first-class objects, and thus yields dramatically better results than current web search engines when searching over product descriptions or other number-rich documents. In the second half of the talk, I will cover the exciting new research area of privacy-preserving data mining, which allows us to build accurate data mining models without access to precise information in individual data records, thus finessing the potential conflict between privacy and data mining.
Abstract: The goal of privacy-preserving data mining is to develop accurate models without access to precise information in individual data records, thus finessing the conflict between privacy and data mining. In this talk, I will give an introduction to the techniques underlying privacy-preserving data mining, and then discuss several application domains. In particular, recent events have led to an increased interest in applying data mining to security-related problems, leading to interesting technical challenges at the intersection of privacy, security, and data mining.
Abstract: Electronic commerce, the web, and privacy concerns have led to interesting new directions for data mining research. Electronic commerce applications have given rise to new twists on traditional data mining problems such as classification and information extraction, and have also posed new challenges. Growing concerns about privacy have led to an exciting new area of research: privacy-preserving data mining. The goal of privacy-preserving data mining is to develop accurate models without access to precise information in individual data records, thus resolving the potential conflict between privacy and data mining. I will give an overview of current research in this area, and present several challenging open problems.
Abstract: A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question: can we develop accurate models over aggregated data while preserving privacy at the level of individual data records?
To illustrate the idea of privacy-preserving data mining, we consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel Bayesian reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is close to the accuracy of classifiers built with the original data.
Next, we focus on two temporal mining problems: mining sequential patterns, and discovering trends over time. We discuss the applicability of the above techniques to these problems, and show that protecting privacy at the individual level while still discovering sequential patterns and trends is a challenging open research problem.
Abstract: The problem of mining association rules has received considerable research and industry attention, with widespread applications ranging from market basket analysis to detecting redundant medical tests. The original problem formulation has been extended in many directions, including the incorporation of taxonomies, quantitative associations, long associations, sequential patterns, and finding only those associations that satisfy user-specified constraints. There has also been much work on fast algorithms for both the original formulation and the above variations. In this talk, I will describe the problem and the most popular algorithmic approach to solving it, followed by an overview of the various extensions and current work on the topic, and conclude by presenting some open research topics.
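As a rough illustration of the problem this abstract describes, the sketch below finds frequent itemsets level by level and then derives rules from them. It assumes an Apriori-style search, which the abstract itself does not name; the function names, thresholds, and toy market baskets are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for all frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count the fraction of baskets that contain each candidate.
        level = {}
        for cand in candidates:
            support = sum(1 for basket in baskets if cand <= basket) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, keeping only
        # those whose every k-subset is itself frequent.
        k += 1
        joined = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in level
                    for sub in combinations(union, k - 1)
                ):
                    joined.add(union)
        candidates = list(joined)
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules above the threshold."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), r):
                lhs = frozenset(lhs_items)
                # Every subset of a frequent itemset is frequent, so the
                # lookup below always succeeds.
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules


# Hypothetical toy baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = frequent_itemsets(baskets, min_support=0.6)
for lhs, rhs, support, confidence in association_rules(frequent, 0.9):
    print(sorted(lhs), "->", sorted(rhs), support, confidence)
```

The pruning step relies on the fact that an itemset can only be frequent if all of its subsets are, which is what keeps the candidate set small at each level.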
Abstract: A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question: since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records?
We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
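The following is a minimal sketch of the perturb-then-reconstruct idea, not necessarily the procedure presented in the talk. It assumes additive Gaussian noise with a known standard deviation and an iterative Bayes-style update over a fixed histogram; the function names, bin count, and synthetic data are all hypothetical.

```python
import math
import random


def perturb(values, sigma, rng):
    """Add independent Gaussian noise to every value (the randomization step)."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def reconstruct_distribution(perturbed, sigma, lo, hi, bins=10, iterations=50):
    """Estimate the histogram of the original values from perturbed values.

    Returns a list of `bins` probabilities that sum to 1.
    """
    width = (hi - lo) / bins
    mids = [lo + (b + 0.5) * width for b in range(bins)]
    # Noise likelihood of each observation given each bin midpoint. The
    # Gaussian normalizing constant is omitted because it cancels below.
    likelihood = [
        [math.exp(-((w - m) ** 2) / (2 * sigma ** 2)) for m in mids]
        for w in perturbed
    ]
    estimate = [1.0 / bins] * bins  # start from a uniform guess
    for _ in range(iterations):
        updated = [0.0] * bins
        for row in likelihood:
            denom = sum(l * p for l, p in zip(row, estimate))
            if denom == 0.0:
                continue  # observation too far from every bin to be informative
            # Posterior probability that this observation came from each bin.
            for b in range(bins):
                updated[b] += row[b] * estimate[b] / denom
        total = sum(updated)
        estimate = [u / total for u in updated]
    return estimate


# Synthetic bimodal data, for illustration only.
rng = random.Random(0)
original = [rng.uniform(0.1, 0.3) for _ in range(500)]
original += [rng.uniform(0.7, 0.9) for _ in range(500)]
noisy = perturb(original, sigma=0.2, rng=rng)
estimate = reconstruct_distribution(noisy, sigma=0.2, lo=0.0, hi=1.0)
print([round(p, 3) for p in estimate])
```

Only the perturbed values and the noise parameters are used in the reconstruction. The individual original values are never recovered; the output is an estimate of their overall distribution, which a classifier could then use in place of the raw data.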