A hybrid hash join process joins data rows from two tables which have at least one common data column by partitioning the data rows based on the values in the common data column(s), creating data structures to decrease search time for matching rows, and recovering full data buffers using a unique buffer management methodology. A smaller one of the two tables is designated as an outer table and a larger one of the two tables as an inner table. The hybrid hash join process determines which rows in the inner and outer tables satisfy selection criteria; the rows that satisfy the selection criteria are referred to as inner hit rows and outer hit rows. The hybrid hash join process assigns the inner and outer hit rows to corresponding inner and outer partitions, respectively. Buffer overflow in the outer partitions is handled by linking empty buffers to the outer partitions until all buffers are used (Dobb, 2004, 20-26).
Buffer space is recovered by writing outer hit rows for a selected partition to mass storage. Buffer overflow in the inner partitions is handled by either writing the inner rows in the buffer to mass storage or by searching the corresponding outer partition for matches. The outer hit rows in the corresponding outer partition are allocated to entries in a data structure which is then probed to find matches for the inner hit rows in the buffer. Matching outer hit rows for any inner hit rows written to mass storage are found by repeatedly reading into memory inner and outer hit rows and probing a data structure created from the memory-resident outer hit rows in a partition. Optionally, binary trees are built from the entries in the data structure to speed up the probing procedure. The hybrid hash join process uses at least one hashing algorithm to assign hit rows to partitions, to allocate rows to entries in the data structures, and to probe the data structures for matches.
Disucssion
Join operations fall into two basic categories: an "equijoin" operation in which the selection criteria contains an equality condition, such as requiring a match on one or more columns common to two tables, as opposed to other join operations specify either less-than or greater-than conditions. In either case, the algorithm that performs the join operation must compare each row in one table with each row in every other table designated in the join operation (IEEE Transactions, 2005, 776-784). In a computer system with insufficient memory to hold all the tables in memory, a join operation becomes quite costly in terms of file I/O time. Equijoin operations are of particular focus for reducing processing costs as they are the most prevalent type of ad-hoc query issued against a relational database in a transactional environment.
Various algorithms have been developed to minimize the processing costs associated with join operations. Three major types currently in use are nested-loops, sort-merge, and hash-based. All of the existing algorithms in these categories have flaws, ...