[SQL Cost-based Query Optimization in Data Mining Environment]
by
Acknowledgement
I would take this opportunity to thank my research supervisor, family and friends for their support and guidance without which this research would not have been possible.
Declaration
I, [type your full first names and surname here], declare that the contents of this dissertation/thesis represent my own unaided work, and that the dissertation/thesis has not previously been submitted for academic examination towards any qualification. Furthermore, it represents my own opinions and not necessarily those of the University.
Signed __________________ Date _________________
ABSTRACT
Mining frequent patterns across multiple datasets has received a lot of research interest recently. In this paper, we investigate cost-based query optimization approaches to efficiently evaluate such mining tasks. Specifically, we make the following contributions: 1) We present a rich class of queries on mining frequent itemsets across multiple datasets supported by a SQL-based mechanism. 2) We present an approach to enumerate all possible query plans for the mining queries, and develop a dynamic programming approach and a branch-and-bound approach based on the enumeration algorithm to find optimal query plans with the least mining cost. 3) We introduce models to estimate the cost of individual mining operators. 4) We evaluate our query optimization techniques on both real and synthetic datasets and show significant performance improvements.
5.1 Dynamic Programming forOptimizedQuery Plan Generation25
5.2 Using Branch-and-Bound toGenerateOptimal Query Plans27
5.3 Cost Estimation29
6. EXPERIMENTAL EVALUATION32
6.1 Datasets33
6.2 Experimental Settings33
6.3 Experimental Results34
7. RELATED WORK36
8. CONCLUSIONS37
REFERENCES39
Approach for SQL Cost-Based Query Optimization in Data Mining Environment
1. INTRODUCTION
In the last decade, frequent pattern mining has emerged as a very useful class of techniques for analyzing datasets from a variety of domains, including retail transactions, DNA sequences, chemical compounds, and XML documents. At this time, the major relational DBMSs have all implemented data mining tools including a frequent itemset operator (Oracle 10g, IBM DB2, SQL Server 9.0). Most existing research and techniques focus on mining frequent patterns in a single dataset. Recently, there has been increasing research interest in mining multiple datasets. While query optimization of simpler multi-dataset queries has been studied extensively, only a few heuristic approaches have been suggested for multi-dataset frequent itemset queries, with limited results. We show that a costbased approach using query plan enumerating and pruning can answer such queries substantially faster for very little planning cost. Given the huge potential speedup and low planning cost, we expect the optimization of frequent pattern mining queries to become a necessary part of OLAP database engines over time, as has been the case with traditional query optimization for many years. (Zaki and Li 2007 225)
We investigated an important class of frequent pattern mining tasks involving the discovery of interesting patterns from multiple, distinct datasets (Listed as the Q1-type query in Section ...