HomePublicationsTalksBioPhotosLinks
Papers       Abstracts       Citations       Patents

Abstracts

  1. R. Agrawal, A. Evfimievski and R. Srikant, "Information Sharing across Private Databases", Proc. of the ACM SIGMOD Conference on Management of Data, San Diego, California, June 2003.

    Abstract: Literature on information integration across databases tacitly assumes that the data in each database can be revealed to the other databases. However, there is an increasing need for sharing information across autonomous entities in such a way that no information apart from the answer to the query is revealed. We formalize the notion of minimal information sharing across private databases, and develop protocols for intersection, equijoin, intersection size, and equijoin size. We also show how new applications can be built using the proposed protocols.


  2. A. Evfimievski, J. Gehrke and R. Srikant, "Limiting Privacy Breaches in Privacy Preserving Data Mining", Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, San Diego, California, June 2003.

    Abstract: There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average.
    In this paper, we present a new formulation of privacy breaches, together with a methodology, ``amplification'', for limiting them. Unlike earlier approaches, amplification makes it is possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from ESAG02 to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.


  3. R. Agrawal, J. Kiernan, R. Srikant and Y. Xu, "An XPath-based Preference Language for P3P", to appear in Proc. of the Twelfth Int'l World Wide Web Conference, Budapest, Hungary, May 2003.

    Abstract: The Platform for Privacy Preferences (P3P) is the most significant effort currently underway to enable web users to gain control over their private information. The designers of P3P simultaneously designed a preference language called APPEL to allow users to express their privacy preferences, thus enabling automatic matching of privacy preferences against P3P policies. Unfortunately subtle interactions between P3P and APPEL result in serious problems when using APPEL: Users can only directly specify what is unacceptable in a policy, not what is acceptable; simple preferences are hard to express; and writing APPEL preferences is error prone. We show that these problems follow from a fundamental design choice made by APPEL, and cannot be solved without completely redesigning the language. Therefore we explore alternatives to APPEL that can overcome these problems. In particular, we showthat XPath serves quite nicely as a preference language and solves all the above problems. We identify the minimal subset of XPath that is needed, thus allowing matching programs to potentially use a smaller memory footprint. We also give an APPEL to XPath translator that shows that XPath is as expressive as APPEL.


  4. R. Agrawal, S. Rajagopalan, R. Srikant and Y. Xu, "Mining Newsgroups Using Networks Arising From Social Behavior", to appear in Proc. of the Twelfth Int'l World Wide Web Conference, Budapest, Hungary, May 2003.

    Abstract: Recent advances in information retrieval over hyperlinked corpora have convincingly demonstrated that links carry less noisy informa-tion than text. We investigate the feasibility of applying link-based methods in new applications domains. The specific application we consider is to partition authors into opposite camps within a given topic in the context of newsgroups. A typical newsgroup posting consists of one or more quoted lines from another posting followed by the opinion of the author. This social behavior gives rise to a network in which the vertices are individuals and the links represent "responded-to" relationships. An interesting characteristic of many newsgroups is that people more frequently respond to a message when they disagree than when they agree. This behavior is in sharp contrast to the WWW link graph, where linkage is an indicator of agreement or common interest. By analyzing the graph structure of the responses, we are able to effectively classify people into opposite camps. In contrast, methods based on statistical analysis of text yield low accuracy on such datasets because the vocabulary used by the two sides tends to be largely identical, and many newsgroup postings consist of relatively few words of text.


  5. R. Agrawal, J. Kiernan, R. Srikant and Y. Xu, "Implementing P3P using Database Technology", Proc. of the 19th Int'l Conference on Data Engineering, Bangalore, India, March 2003.

    Abstract: Platform for Privacy Preferences (P3P) is the most significant effort currently underway to enable web users to gain control over their private information. P3P provides mechanisms for web site owners to express their privacy policies in a standard format that a user can programmatically check against her privacy preferences to decide whether to release her data to the web site. We discuss architectural alternatives for implementing P3P and present a server-centric implementation that reuses database querying technology, as opposed to the prevailing client-centric implementations based on specialized engines. Not only does the proposed implementation have qualitative advantages, our experiments indicate that it performs significantly better than the sole public-domain client-centric implementation and that the latency introduced by preference matching is small enough for real-world deployments of P3P.


  6. R. Agrawal, J. Kiernan, R. Srikant and Y. Xu, "Hippocratic Databases", Proc. of the 28th Int'l Conference on Very Large Data Bases, Hong Kong, China, August 2002.

    Abstract: The Hippocratic Oath has guided the conduct of physicians for centuries. Inspired by its tenet of preserving privacy, we argue that future database systems must include responsibility for the privacy of data they manage as a founding tenet. We enunciate the key privacy principles for such Hippocratic database systems. We propose a strawman design for Hippocratic databases, identify the technical challenges and problems in designing such databases, and suggest some approaches that may lead to solutions. Our hope is that this paper will serve to catalyze a fruitful and exciting direction for future database research.


  7. A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke, "Privacy Preserving Mining of Association Rules", Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery in Databases and Data Mining, Edmonton, Canada, July 2002.

    Abstract: We present a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve privacy of individual transactions. While it is feasible to recover association rules and preserve privacy using a straightforward ``uniform'' randomization, the discovered rules can unfortunately be exploited to find privacy breaches. We analyze the nature of privacy breaches and propose a class of randomization operators that are much more effective than uniform randomization in limiting the breaches. We derive formulae for an unbiased support estimator and its variance, which allow us to recover itemset supports from randomized datasets, and show how to incorporate these formulae into mining algorithms. Finally, we present experimental results that validate the algorithm by applying it on real datasets.


  8. R. Agrawal and R. Srikant, "Searching with Numbers", Proc. of the Eleventh Int'l World Wide Web Conference, Honolulu, Hawaii, May 2002.

    Abstract: A large fraction of the useful web comprises of specification documents that largely consist of <attribute name, numeric value> pairs embedded in text. Examples include product information, classified advertisements, resumes, etc. The approach taken in the past to search these documents by first establishing correspondences between values and their names has achieved limited success because of the difficulty of extracting this information from free text. We propose a new approach that does not require this correspondence to be accurately established. Provided the data has ``low reflectivity'', we can do effective search even if the values in the data have not been assigned attribute names and the user has omitted attribute names in the query. We give algorithms and indexing structures for implementing the search. We also show how hints (i.e., imprecise, partial correspondences) from automatic data extraction techniques can be incorporated into our approach for better accuracy on high reflectivity datasets. Finally, we validate our approach by showing that we get high precision in our answers on real datasets from a variety of domains.


  9. R. Agrawal and R. Srikant, "On Integrating Catalogs", Proc. of the Tenth Int'l World Wide Web Conference, Hong Kong, May 2001.

    Abstract: We address the problem of integrating documents from different sources into a master catalog. This problem is pervasive in web marketplaces and portals. Current technology for automating this process consists of building a classifier that uses the categorization of documents in the master catalog to construct a model for predicting the category of unknown documents. Our key insight is that many of the data sources have their own categorization, and classification accuracy can be improved by factoring in the implicit information in these source categorizations. We show how a Naive Bayes classification can be enhanced to incorporate the similarity information present in source catalogs. Our analysis and empirical evaluation show substantial improvement in the accuracy of catalog integration.


  10. R. Srikant and Y. Yang, "Mining Web Logs to Improve Website Organization", Proc. of the Tenth Int'l World Wide Web Conference, Hong Kong, May 2001.

    Abstract: Many websites have a hierarchical organization of content. This organization may be quite different from the organization expected by visitors to the website. In particular, it is often unclear where a specific document is located. In this paper, we propose an algorithm to automatically find pages in a website whose location is different from where visitors expect to find them. The key insight is that visitors will backtrack if they do not find the information where they expect it: the point from where they backtrack is the expected location for the page. We present an algorithm for discovering such expected locations that can handle page caching by the browser. Expected locations with a significant number of hits are then presented to the website administrator. We also present algorithms for selecting expected locations (for adding navigation links) to optimize the benefit to the website or the visitor. We ran our algorithm on the Wharton business school website and found that even on this small website, there were many pages with expected locations different from their actual location.


  11. R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining", Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.

    Abstract: A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records?
    We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.


  12. R. Agrawal, R.J. Bayardo and R. Srikant, "Athena: Mining-based Interactive Management of Text Databases", Proc. of the Seventh Int'l Conf. on Extending Database Technology (EDBT), Konstanz, Germany, March 2000.

    Abstract: We describe Athena: a system for creating, exploiting, and maintaining a hierarchical arrangement of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models.

    Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, under-weighting long documents, and over-weighting author and subject.

    We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.


  13. N. Megiddo, R. Srikant, "Discovering Predictive Association Rules", Proc. of the 4th Int'l Conference on Knowledge Discovery in Databases and Data Mining, New York, August 1998.

    Abstract: Association rule algorithms can produce a very large number of output patterns. This has raised questions of whether the set of discovered rules ``overfit'' the data because all the patterns that satisfy some constraints are generated (the Bonferroni effect). In other words, the question is whether some of the rules are ``false discoveries'' that are not statistically significant. We present a novel approach for estimating the number of ``false discoveries'' at any cutoff level. Empirical evaluation shows that on typical datasets the fraction of rules that may be false discoveries is very small. A bonus of this work is that the statistical significance measures we compute are a good basis for ordering the rules for presentation to users, since they correspond to the statistical ``surprise'' of the rule. We also show how to compute confidence intervals for the support and confidence of an association rule, enabling the rule to be used predictively on future data.


  14. R. Srikant, Q. Vu, R. Agrawal, "Mining Association Rules with Item Constraints", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, August, 1997.

    Abstract: The problem of discovering association rules has received considerable research attention and several fast algorithms for mining association rules have been developed. In practice, users are often interested in a subset of association rules. For example, they may only want rules that contain a specific item or rules that contain children of a specific item in a hierarchy. While such constraints can be applied as a post-processing step, integrating them into the mining algorithm can dramatically reduce the execution time. We consider the problem of integrating constraints that are boolean expressions over the presence or absence of items into the association discovery algorithm. We present three integrated algorithms for mining association rules with item constraints and discuss their tradeoffs.


  15. K. Ali, S. Manganaris, R. Srikant, "Partial Classification using Association Rules", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, August, 1997.

    Many real-life problems require a partial classification of the data. We use the term ``partial classification'' to describe the discovery of models that show characteristics of the data classes, but may not cover all classes and all examples of any given class. Complete classification may be infeasible or undesirable when there are a very large number of class attributes, most attributes values are missing, or the class distribution is highly skewed and the user is interested in understanding the low-frequency class. We show how association rules can be used for partial classification in such domains, and present two case studies: reducing telecommunications order failures and detecting redundant medical tests.


  16. B. Lent, R. Agrawal, R. Srikant, "Discovering Trends in Text Databases", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, August, 1997.

    Abstract: We describe a system we developed for identifying trends in text documents collected over a period of time. Trends can be used, for example, to discover that a company is shifting interests from one domain to another. Our system uses several data mining techniques in novel ways and demonstrates a method in which to visualize the trends. We also give experiences from applying this system to the IBM Patent Server, a database of U.S. patents.


  17. D. Ho, R. Agrawal, N. Meggido and R. Srikant, "Range Queries in OLAP Data Cubes", Proc. of the ACM SIGMOD Conference on Management of Data, Tucson, Arizona, May 1997.

    Abstract: A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation operations: SUM and MAX. These two operations cover techniques required for most popular aggregation operations, such as those supported by SQL.

    For range-sum queries, the essential idea is to precompute some auxiliary information (prefix sums) that is used to answer ad hoc queries at run-time. By maintaining auxiliary information which is of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the sub-cube circumscribed by a query. Alternatively, one can keep auxiliary information which is 1/(b^d) of the size of the d-dimensional data cube. Response to a range query may now require access to some cells of the data cube in addition to the access to the auxiliary information, but the overall time complexity is typically reduced significantly. We also discuss how the precomputed information is incrementally updated by batching updates to the data cube. Finally, we present algorithms for choosing the subset of the data cube dimensions for which the auxiliary information is computed and the blocking factor to use for each such subset.

    Our approach to answering range-max queries is based on precomputed max over balanced hierarchical tree structures. We use a branch-and-bound-like procedure to speed up the finding of max in a region. We also show that with a branch-and-bound procedure, the average-case complexity is much smaller than the worst-case complexity.


  18. K. Shim, R. Srikant and R. Agrawal, "High-dimensional Similarity Joins", Proc. of the 13th Int'l Conference on Data Engineering, Birmingham, U.K., April 1997.

    Abstract: We present a new index structure, called the epsilon-K-D-B tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence the index scales to high-dimensional data. Empirical evaluation, using synthetic and real-life datasets, shows the join time for the epsilon-K-D-B tree is 3 to 10 times less than the join time for the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the epsilon-K-D-B tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the epsilon-K-D-B tree.


  19. R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant, "The Quest Data Mining System", Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.


  20. R. Srikant and R. Agrawal, "Mining Quantitative Association Rules in Large Relational Tables", Proc. of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.

    Abstract: We introduce the problem of mining association rules in large relational tables containing both quantitative and categorical attributes. An example of such an association might be ``10% of married people between age 50 and 60 have at least 2 cars''. We deal with quantitative attributes by fine-partitioning the values of the attribute and then combining adjacent partitions as necessary. We introduce measures of partial completeness which quantify the information lost due to partitioning. A direct application of this technique can generate too many similar rules. We tackle this problem by using a ``greater-than-expected-value'' interest measure to identify the interesting rules in the output. We give an algorithm for mining such quantitative association rules. Finally, we describe the results of using this approach on a real-life dataset.


  21. R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", Proc. of the Fifth Int'l Conf. on Extending Database Technology (EDBT), Avignon, France, March 1996.

    Abstract: Given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items, the problem of mining sequential patterns is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is ``5% of customers bought `Foundation' and `Ringworld' in one transaction, followed by `Second Foundation' in a later transaction''. We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy.
    We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and real-life data indicates that GSP is much faster than the earlier AprioriAll algorithm. GSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average data-sequence size.


  22. R. Srikant and R. Agrawal, "Mining Generalized Association Rules", Proc. of the 21st Int'l Conf. on Very Large Databases, Zurich, Switzerland, Sep. 1995.

    Abstract: We introduce the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy ( is-a hierarchy) on the items, we find associations between items at any level of the taxonomy. For example, given a taxonomy that says that jackets is-a outerwear is-a clothes, we may infer a rule that ``people who buy outerwear tend to buy shoes''. This rule may hold even if rules that ``people who buy jackets tend to buy shoes'', and ``people who buy clothes tend to buy shoes'' do not hold. An obvious solution to the problem is to add all ancestors of each item in a transaction to the transaction, and then run any of the algorithms for mining association rules on these ``extended transactions''. However, this ``Basic'' algorithm is not very fast; we present two algorithms, Cumulate and EstMerge, which run 2 to 5 times faster than Basic (and more than 100 times faster on one real-life dataset). We also present a new interest-measure for rules which uses the information in the taxonomy. Given a user-specified ``minimum-interest-level'', this measure prunes a large number of redundant rules; 40% to 60% of all the rules were pruned on two real-life datasets.


  23. R. Agrawal and R. Srikant, "Mining Sequential Patterns", Proc. of the 11th Int'l Conference on Data Engineering, Taiwan, March 1995.

    Abstract: We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.


  24. R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conf. on Very Large Databases, Santiago, Chile, Sep. 1994.

    Abstract: We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.