





| Papers Abstracts Citations Patents |
Abstract: Literature on information integration across databases tacitly assumes that the data in each database can be revealed to the other databases. However, there is an increasing need for sharing information across autonomous entities in such a way that no information apart from the answer to the query is revealed. We formalize the notion of minimal information sharing across private databases, and develop protocols for intersection, equijoin, intersection size, and equijoin size. We also show how new applications can be built using the proposed protocols.
Abstract:
There has been increasing interest in the problem of building
accurate data mining models over aggregate data, while protecting
privacy at the level of individual records. One approach for this
problem is to randomize the values in individual records, and only
disclose the randomized values. The model is then built over the
randomized data, after first compensating for the randomization (at
the aggregate level). This approach is potentially vulnerable to
privacy breaches: based on the distribution of the data, one may be
able to learn with high confidence that some of the randomized
records satisfy a specified property, even though privacy is
preserved on average.
In this paper, we present a new formulation of privacy breaches,
together with a methodology, ``amplification'', for limiting
them. Unlike earlier approaches, amplification makes it is possible
to guarantee limits on privacy breaches without any knowledge of the
distribution of the original data. We instantiate this methodology
for the problem of mining association rules, and modify the
algorithm from ESAG02 to limit
privacy breaches without knowledge of the data distribution. Next,
we address the problem that the amount of randomization required to
avoid privacy breaches (when mining association rules) results in
very long transactions. By using pseudorandom generators and
carefully choosing seeds such that the desired items from the
original transaction are present in the randomized transaction, we
can send just the seed instead of the transaction, resulting in a
dramatic drop in communication and storage cost. Finally, we define
new information measures that take privacy breaches into account
when quantifying the amount of privacy preserved by randomization.
Abstract: The Platform for Privacy Preferences (P3P) is the most significant effort currently underway to enable web users to gain control over their private information. The designers of P3P simultaneously designed a preference language called APPEL to allow users to express their privacy preferences, thus enabling automatic matching of privacy preferences against P3P policies. Unfortunately subtle interactions between P3P and APPEL result in serious problems when using APPEL: Users can only directly specify what is unacceptable in a policy, not what is acceptable; simple preferences are hard to express; and writing APPEL preferences is error prone. We show that these problems follow from a fundamental design choice made by APPEL, and cannot be solved without completely redesigning the language. Therefore we explore alternatives to APPEL that can overcome these problems. In particular, we showthat XPath serves quite nicely as a preference language and solves all the above problems. We identify the minimal subset of XPath that is needed, thus allowing matching programs to potentially use a smaller memory footprint. We also give an APPEL to XPath translator that shows that XPath is as expressive as APPEL.
Abstract: Recent advances in information retrieval over hyperlinked corpora have convincingly demonstrated that links carry less noisy informa-tion than text. We investigate the feasibility of applying link-based methods in new applications domains. The specific application we consider is to partition authors into opposite camps within a given topic in the context of newsgroups. A typical newsgroup posting consists of one or more quoted lines from another posting followed by the opinion of the author. This social behavior gives rise to a network in which the vertices are individuals and the links represent "responded-to" relationships. An interesting characteristic of many newsgroups is that people more frequently respond to a message when they disagree than when they agree. This behavior is in sharp contrast to the WWW link graph, where linkage is an indicator of agreement or common interest. By analyzing the graph structure of the responses, we are able to effectively classify people into opposite camps. In contrast, methods based on statistical analysis of text yield low accuracy on such datasets because the vocabulary used by the two sides tends to be largely identical, and many newsgroup postings consist of relatively few words of text.
Abstract: Platform for Privacy Preferences (P3P) is the most significant effort currently underway to enable web users to gain control over their private information. P3P provides mechanisms for web site owners to express their privacy policies in a standard format that a user can programmatically check against her privacy preferences to decide whether to release her data to the web site. We discuss architectural alternatives for implementing P3P and present a server-centric implementation that reuses database querying technology, as opposed to the prevailing client-centric implementations based on specialized engines. Not only does the proposed implementation have qualitative advantages, our experiments indicate that it performs significantly better than the sole public-domain client-centric implementation and that the latency introduced by preference matching is small enough for real-world deployments of P3P.
Abstract: The Hippocratic Oath has guided the conduct of physicians for centuries. Inspired by its tenet of preserving privacy, we argue that future database systems must include responsibility for the privacy of data they manage as a founding tenet. We enunciate the key privacy principles for such Hippocratic database systems. We propose a strawman design for Hippocratic databases, identify the technical challenges and problems in designing such databases, and suggest some approaches that may lead to solutions. Our hope is that this paper will serve to catalyze a fruitful and exciting direction for future database research.
Abstract: We present a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve privacy of individual transactions. While it is feasible to recover association rules and preserve privacy using a straightforward ``uniform'' randomization, the discovered rules can unfortunately be exploited to find privacy breaches. We analyze the nature of privacy breaches and propose a class of randomization operators that are much more effective than uniform randomization in limiting the breaches. We derive formulae for an unbiased support estimator and its variance, which allow us to recover itemset supports from randomized datasets, and show how to incorporate these formulae into mining algorithms. Finally, we present experimental results that validate the algorithm by applying it on real datasets.
Abstract: A large fraction of the useful web comprises of specification documents that largely consist of <attribute name, numeric value> pairs embedded in text. Examples include product information, classified advertisements, resumes, etc. The approach taken in the past to search these documents by first establishing correspondences between values and their names has achieved limited success because of the difficulty of extracting this information from free text. We propose a new approach that does not require this correspondence to be accurately established. Provided the data has ``low reflectivity'', we can do effective search even if the values in the data have not been assigned attribute names and the user has omitted attribute names in the query. We give algorithms and indexing structures for implementing the search. We also show how hints (i.e., imprecise, partial correspondences) from automatic data extraction techniques can be incorporated into our approach for better accuracy on high reflectivity datasets. Finally, we validate our approach by showing that we get high precision in our answers on real datasets from a variety of domains.
Abstract: We address the problem of integrating documents from different sources into a master catalog. This problem is pervasive in web marketplaces and portals. Current technology for automating this process consists of building a classifier that uses the categorization of documents in the master catalog to construct a model for predicting the category of unknown documents. Our key insight is that many of the data sources have their own categorization, and classification accuracy can be improved by factoring in the implicit information in these source categorizations. We show how a Naive Bayes classification can be enhanced to incorporate the similarity information present in source catalogs. Our analysis and empirical evaluation show substantial improvement in the accuracy of catalog integration.
Abstract: Many websites have a hierarchical organization of content. This organization may be quite different from the organization expected by visitors to the website. In particular, it is often unclear where a specific document is located. In this paper, we propose an algorithm to automatically find pages in a website whose location is different from where visitors expect to find them. The key insight is that visitors will backtrack if they do not find the information where they expect it: the point from where they backtrack is the expected location for the page. We present an algorithm for discovering such expected locations that can handle page caching by the browser. Expected locations with a significant number of hits are then presented to the website administrator. We also present algorithms for selecting expected locations (for adding navigation links) to optimize the benefit to the website or the visitor. We ran our algorithm on the Wharton business school website and found that even on this small website, there were many pages with expected locations different from their actual location.
Abstract:
A fruitful direction for future data mining research will be the
development of techniques that incorporate privacy concerns.
Specifically, we address the following question. Since the primary task
in data mining is the development of models about aggregated data, can
we develop accurate models without access to precise information in
individual data records?
We consider the concrete case of building a decision-tree classifier
from training data in which the values of individual records have
been perturbed. The resulting data records look very different from
the original records and the distribution of data values is also
very different from the original distribution. While it is not
possible to accurately estimate original values in individual data
records, we propose a novel reconstruction procedure to accurately
estimate the distribution of original data values. By using these
reconstructed distributions, we are able to build classifiers whose
accuracy is comparable to the accuracy of classifiers built with the
original data.
Abstract: We describe Athena: a system for creating, exploiting, and maintaining a hierarchical arrangement of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models.
Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, under-weighting long documents, and over-weighting author and subject.
We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.
Abstract: Association rule algorithms can produce a very large number of output patterns. This has raised questions of whether the set of discovered rules ``overfit'' the data because all the patterns that satisfy some constraints are generated (the Bonferroni effect). In other words, the question is whether some of the rules are ``false discoveries'' that are not statistically significant. We present a novel approach for estimating the number of ``false discoveries'' at any cutoff level. Empirical evaluation shows that on typical datasets the fraction of rules that may be false discoveries is very small. A bonus of this work is that the statistical significance measures we compute are a good basis for ordering the rules for presentation to users, since they correspond to the statistical ``surprise'' of the rule. We also show how to compute confidence intervals for the support and confidence of an association rule, enabling the rule to be used predictively on future data.
Abstract: The problem of discovering association rules has received considerable research attention and several fast algorithms for mining association rules have been developed. In practice, users are often interested in a subset of association rules. For example, they may only want rules that contain a specific item or rules that contain children of a specific item in a hierarchy. While such constraints can be applied as a post-processing step, integrating them into the mining algorithm can dramatically reduce the execution time. We consider the problem of integrating constraints that are boolean expressions over the presence or absence of items into the association discovery algorithm. We present three integrated algorithms for mining association rules with item constraints and discuss their tradeoffs.
Many real-life problems require a partial classification of the data. We use the term ``partial classification'' to describe the discovery of models that show characteristics of the data classes, but may not cover all classes and all examples of any given class. Complete classification may be infeasible or undesirable when there are a very large number of class attributes, most attributes values are missing, or the class distribution is highly skewed and the user is interested in understanding the low-frequency class. We show how association rules can be used for partial classification in such domains, and present two case studies: reducing telecommunications order failures and detecting redundant medical tests.
Abstract: We describe a system we developed for identifying trends in text documents collected over a period of time. Trends can be used, for example, to discover that a company is shifting interests from one domain to another. Our system uses several data mining techniques in novel ways and demonstrates a method in which to visualize the trends. We also give experiences from applying this system to the IBM Patent Server, a database of U.S. patents.
Abstract: A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation operations: SUM and MAX. These two operations cover techniques required for most popular aggregation operations, such as those supported by SQL.
For range-sum queries, the essential idea is to precompute some auxiliary information (prefix sums) that is used to answer ad hoc queries at run-time. By maintaining auxiliary information which is of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the sub-cube circumscribed by a query. Alternatively, one can keep auxiliary information which is 1/(b^d) of the size of the d-dimensional data cube. Response to a range query may now require access to some cells of the data cube in addition to the access to the auxiliary information, but the overall time complexity is typically reduced significantly. We also discuss how the precomputed information is incrementally updated by batching updates to the data cube. Finally, we present algorithms for choosing the subset of the data cube dimensions for which the auxiliary information is computed and the blocking factor to use for each such subset.
Our approach to answering range-max queries is based on precomputed max over balanced hierarchical tree structures. We use a branch-and-bound-like procedure to speed up the finding of max in a region. We also show that with a branch-and-bound procedure, the average-case complexity is much smaller than the worst-case complexity.
Abstract: We present a new index structure, called the epsilon-K-D-B tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence the index scales to high-dimensional data. Empirical evaluation, using synthetic and real-life datasets, shows the join time for the epsilon-K-D-B tree is 3 to 10 times less than the join time for the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the epsilon-K-D-B tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the epsilon-K-D-B tree.
Abstract: We introduce the problem of mining association rules in large relational tables containing both quantitative and categorical attributes. An example of such an association might be ``10% of married people between age 50 and 60 have at least 2 cars''. We deal with quantitative attributes by fine-partitioning the values of the attribute and then combining adjacent partitions as necessary. We introduce measures of partial completeness which quantify the information lost due to partitioning. A direct application of this technique can generate too many similar rules. We tackle this problem by using a ``greater-than-expected-value'' interest measure to identify the interesting rules in the output. We give an algorithm for mining such quantitative association rules. Finally, we describe the results of using this approach on a real-life dataset.
Abstract:
Given a database of sequences, where each sequence is a list of
transactions ordered by transaction-time, and each transaction is a
set of items, the problem of mining sequential patterns is to
discover all sequential patterns with a user-specified minimum
support, where the support of a pattern is the number of
data-sequences that contain the pattern. An example of a sequential
pattern is ``5% of customers bought `Foundation' and `Ringworld' in
one transaction, followed by `Second Foundation' in a later
transaction''. We generalize the problem as follows. First, we add
time constraints that specify a minimum and/or maximum time period
between adjacent elements in a pattern. Second, we relax the
restriction that the items in an element of a sequential pattern
must come from the same transaction, instead allowing the items to
be present in a set of transactions whose transaction-times are
within a user-specified time window. Third, given a user-defined
taxonomy (is-a hierarchy) on items, we allow sequential
patterns to include items across all levels of the taxonomy.
We present GSP, a new algorithm that discovers these generalized
sequential patterns. Empirical evaluation using synthetic and
real-life data indicates that GSP is much faster than the earlier
AprioriAll algorithm. GSP scales linearly with the
number of data-sequences, and has very good scale-up properties with
respect to the average data-sequence size.
Abstract: We introduce the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy ( is-a hierarchy) on the items, we find associations between items at any level of the taxonomy. For example, given a taxonomy that says that jackets is-a outerwear is-a clothes, we may infer a rule that ``people who buy outerwear tend to buy shoes''. This rule may hold even if rules that ``people who buy jackets tend to buy shoes'', and ``people who buy clothes tend to buy shoes'' do not hold. An obvious solution to the problem is to add all ancestors of each item in a transaction to the transaction, and then run any of the algorithms for mining association rules on these ``extended transactions''. However, this ``Basic'' algorithm is not very fast; we present two algorithms, Cumulate and EstMerge, which run 2 to 5 times faster than Basic (and more than 100 times faster on one real-life dataset). We also present a new interest-measure for rules which uses the information in the taxonomy. Given a user-specified ``minimum-interest-level'', this measure prunes a large number of redundant rules; 40% to 60% of all the rules were pruned on two real-life datasets.
Abstract: We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.
Abstract: We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.