SQL Cost-Based Query Optimization In Data Mining Environment

[SQL Cost-based Query Optimization in Data Mining Environment]

by

Acknowledgement

I would take this opportunity to thank my research supervisor, family and friends for their support and guidance without which this research would not have been possible.

Declaration

I, [type your full first names and surname here], declare that the contents of this dissertation/thesis represent my own unaided work, and that the dissertation/thesis has not previously been submitted for academic examination towards any qualification. Furthermore, it represents my own opinions and not necessarily those of the University.

Signed __________________ Date _________________

ABSTRACT

Mining frequent patterns across multiple datasets has received a lot of research interest recently. In this paper, we investigate cost-based query optimization approaches to efficiently evaluate such mining tasks. Specifically, we make the following contributions: 1) We present a rich class of queries on mining frequent itemsets across multiple datasets supported by a SQL-based mechanism. 2) We present an approach to enumerate all possible query plans for the mining queries, and develop a dynamic programming approach and a branch-and-bound approach based on the enumeration algorithm to find optimal query plans with the least mining cost. 3) We introduce models to estimate the cost of individual mining operators. 4) We evaluate our query optimization techniques on both real and synthetic datasets and show significant performance improvements.

Table of Contents

ABSTRACT1

1. INTRODUCTION3

2. SQL EXTENSIONS FORMININGACROSS MULTIPLE DATASETS5

3. ALGEBRA AND QUERY EVALUATION9

3.1 Basic Algebra for Queries9

3.2 Basic Tools for Query Optimization11

3.3 M-table: A Unified Query Evaluation Scheme12

4. QUERY PLAN ENUMERATION15

4.1 Query Plan Space15

4.2 Partial Orders and Equivalent Query Plans17

4.3 Reducing Search Space20

4.4 Enumeration Algorithm23

5. OPTIMIZINGQUERY PLANS WITH COST ESTIMATION25

5.1 Dynamic Programming forOptimizedQuery Plan Generation25

5.2 Using Branch-and-Bound toGenerateOptimal Query Plans27

5.3 Cost Estimation29

6. EXPERIMENTAL EVALUATION32

6.1 Datasets33

6.2 Experimental Settings33

6.3 Experimental Results34

7. RELATED WORK36

8. CONCLUSIONS37

REFERENCES39

Approach for SQL Cost-Based Query Optimization in Data Mining Environment

1. INTRODUCTION

In the last decade, frequent pattern mining has emerged as a very useful class of techniques for analyzing datasets from a variety of domains, including retail transactions, DNA sequences, chemical compounds, and XML documents. At this time, the major relational DBMSs have all implemented data mining tools including a frequent itemset operator (Oracle 10g, IBM DB2, SQL Server 9.0). Most existing research and techniques focus on mining frequent patterns in a single dataset. Recently, there has been increasing research interest in mining multiple datasets. While query optimization of simpler multi-dataset queries has been studied extensively, only a few heuristic approaches have been suggested for multi-dataset frequent itemset queries, with limited results. We show that a costbased approach using query plan enumerating and pruning can answer such queries substantially faster for very little planning cost. Given the huge potential speedup and low planning cost, we expect the optimization of frequent pattern mining queries to become a necessary part of OLAP database engines over time, as has been the case with traditional query optimization for many years. (Zaki and Li 2007 225)

We investigated an important class of frequent pattern mining tasks involving the discovery of interesting patterns from multiple, distinct datasets (Listed as the Q1-type query in Section ...

SQL Cost-Based Query Optimization In Data Mining Environment

SQL For Oracle

SQL

The Role Of Statistics In...

Structured Query Language...

Finance Queries