Summary Report: An Overview of Query Optimization in Relational Systems
Submitted by: Ivan Frederick
Course Advisor: Dr. Susan D. Urban
Course: Advanced Database Management Systems
Broad Overview: The literature on query optimization is vast and spans several decades. This paper does not attempt to cover all of that research; it focuses on SQL query optimization in relational database systems, presenting the foundations and basics along with the significant current work in the field.

Introduction: Among relational query languages, which provide a high-level declarative interface for accessing data, SQL has become the standard. The query optimizer and the query execution engine are the major components of query evaluation. The query execution engine implements a set of physical operators, such as index scan and sequential scan; each operator takes one or more data streams as input and produces a data stream as output, and a tree of such operators is referred to as an execution plan. The query optimizer, which provides the input for the query execution engine, takes a parsed representation of a SQL query and produces an efficient execution plan. This is a difficult task because an algebraic representation can be transformed into many equivalent logical representations, each of which may correspond to many operator trees, and the resulting plans can differ widely in cost. To solve this problem, the optimizer must define a search space, a cost estimation technique, and an enumeration algorithm. An optimizer is considered good if its search space contains low-cost plans, its cost estimation is accurate, and its enumeration algorithm is efficient.

System-R Optimizer: This model advanced the state of query optimization. SPJ (select-project-join) queries, which encapsulate conjunctive queries, are a subset of the queries this model handles. System-R restricts attention to linear sequences of joins, where each join can have either a nested-loop or a sort-merge implementation.
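The two join implementations mentioned above can be sketched in a few lines (a minimal Python illustration, not System-R's actual code; it assumes each relation is a list of tuples joined on a single key position):

```python
# Minimal sketches of the two join implementations System-R considers.
# Assumption: each relation is a list of tuples, joined on one key column.

def nested_loop_join(outer, inner, outer_key, inner_key):
    """For every outer tuple, scan the entire inner relation."""
    result = []
    for o in outer:
        for i in inner:
            if o[outer_key] == i[inner_key]:
                result.append(o + i)
    return result

def sort_merge_join(left, right, left_key, right_key):
    """Sort both inputs on the join key, then merge them in one pass."""
    left = sorted(left, key=lambda t: t[left_key])
    right = sorted(right, key=lambda t: t[right_key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lv, rv = left[i][left_key], right[j][right_key]
        if lv < rv:
            i += 1
        elif lv > rv:
            j += 1
        else:
            # Emit matches for the current left tuple against the
            # whole group of right tuples sharing this key value.
            j2 = j
            while j2 < len(right) and right[j2][right_key] == lv:
                result.append(left[i] + right[j2])
                j2 += 1
            i += 1
    return result
```

Both produce the same result; they differ in cost profile, which is exactly what the cost model described next must capture.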
The cost model estimates the costs of partial and complete plans using statistics on relations and indexes, formulas for estimating the selectivity of predicates, and formulas for estimating the CPU and I/O costs and the output data stream size of every operator. Dynamic programming with interesting orders is the enumeration technique: the dynamic programming approach is faster than the naïve approach, and it derives the optimality of a plan from the optimality of its subexpressions.

Search Space: The search space depends on the available algebraic transformations and physical operators. Transformations do not necessarily reduce cost, so they must be applied in a cost-based manner by the enumeration algorithm for good results. Query trees and query graphs are approaches for analyzing the structure of queries; their nodes carry additional information, and they may not be suitable for all queries.

Commuting Between Operators: One operation in focus here is join sequencing, which need not always be linear since joins are commutative and associative. Deferring Cartesian products, although it usually reduces cost, may in some cases result in poor performance. Outer joins and joins do not freely commute,
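The dynamic programming idea can be illustrated with a small sketch (the cardinalities and the 10% join-selectivity model are invented for illustration; the search is restricted to left-deep plans and ignores interesting orders):

```python
from itertools import combinations

# Hypothetical base-relation cardinalities (assumptions for illustration).
CARD = {'A': 100, 'B': 10, 'C': 1000}

def join_card(card_left, card_right):
    # Toy selectivity model: every join keeps 10% of the cross product.
    return card_left * card_right * 0.1

def best_left_deep_plan(relations):
    """System-R style DP: the best plan for each subset of relations is
    built from the best plans of its subsets (left-deep trees only)."""
    # best maps a subset to (cost, output cardinality, plan string).
    best = {frozenset([r]): (0.0, CARD[r], r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            for r in subset:  # r is the last relation joined in
                rest = s - {r}
                cost_rest, card_rest, plan_rest = best[rest]
                card = join_card(card_rest, CARD[r])
                cost = cost_rest + card_rest * CARD[r]  # toy join cost
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, card, f'({plan_rest} JOIN {r})')
    return best[frozenset(relations)]

cost, card, plan = best_left_deep_plan(['A', 'B', 'C'])
```

The key property is the one named above: the optimal plan for {A, B, C} is assembled only from optimal plans for its subsets, so each subset is optimized once instead of once per join order.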
but an identity in which a “block of joins” precedes a “block of outerjoins” can be used for better efficiency. Pushing the group-by down below the join also helps by reducing the number of tuples to be joined, and thereby the cost.

Reducing Multi-Block Queries to Single Block: For queries in which one or more relations are views, each defined by a conjunctive query, simply unfolding the view definitions yields a single-block SQL query. In the presence of a group-by operator, the group-by must be pulled up and the joins reordered to achieve optimality. When the inner query block contains no variables from the outer query block, it needs to be evaluated only once, even if there is more than one subquery.

Using a Semijoin-like Technique for Optimizing Multi-Block Queries: Semijoins are used so that all tuples are taken from the first relation but only the relevant tuples from the second relation, rather than all of them. This avoids redundant computation in nested queries and thus reduces cost.

Statistics and Cost Estimation: The algebraic expressions in a query can be implemented in different ways, so the least costly way with respect to resources such as CPU time, memory, and bandwidth must be selected; an optimizer is only as good as its cost estimates. The basic strategy is to collect statistical summaries of the data and then determine, for each operator, the properties of its output data stream and its cost of execution, where the statistical summary is a logical property and the cost is a physical property.

Statistical Summaries of Data: The necessary information for tables includes the number of tuples, the cost of data scans, memory requirements, and the number of joins. Many systems use a histogram to capture the data distribution of a column, where the histogram divides the column's values into k buckets. If histograms are not available, the min and max values can be used instead. For enterprise databases with voluminous data, the ability to estimate statistical parameters accurately and efficiently is needed.
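The semijoin operation itself is easy to sketch (a minimal Python illustration, assuming relations as lists of tuples; a real system would work over indexed or partitioned data):

```python
def semijoin(left, right, left_key, right_key):
    """R semijoin S: the tuples of R that have at least one match in S.
    Only the join-key values of S are needed, not its full tuples."""
    keys = {t[right_key] for t in right}
    return [t for t in left if t[left_key] in keys]
```

Because only the key set of the second relation is materialized, the inner block's work is done once up front rather than being re-evaluated for every outer tuple, which is the redundancy the technique avoids.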
One approach to estimating histograms is sampling. Because information on the base data alone is not sufficient, information must also be derived for the outputs of operators such as selection; ad-hoc constants are used in the absence of histograms.

Cost Computation: Cost is estimated from the CPU, I/O, and communication costs, which are then combined into an overall metric used to decide the best possible plan. In addition to the physical properties of the input data streams and selectivity, modeling buffer pool utilization also plays a key role in determining cost; this includes choosing different buffer pool hit ratios based on the level of an index and adjusting buffer utilization.
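Combining the component costs into a single overall metric can be caricatured as a weighted sum (the weights and formulas below are invented for illustration and are not taken from any actual system):

```python
# Toy cost model: fold CPU, I/O and communication costs into one metric.
# The weights are hypothetical tuning knobs, not values from a real system.
W_CPU, W_IO, W_COMM = 1.0, 10.0, 5.0

def operator_cost(tuples_processed, pages_read, bytes_shipped, buffer_hit_ratio):
    cpu = tuples_processed                       # one unit per tuple touched
    io = pages_read * (1.0 - buffer_hit_ratio)   # only misses cause real I/O
    comm = bytes_shipped / 4096                  # shipped data, in page units
    return W_CPU * cpu + W_IO * io + W_COMM * comm
```

Note how the assumed buffer hit ratio scales the I/O term, which is the point made above: the same physical plan can have very different estimated costs under different buffer utilization assumptions.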
Enumeration Architectures: Enumeration involves choosing an inexpensive execution plan for a given query from the search space. A challenging and desirable software-engineering goal is to build extensible optimizers that can adapt to changes in the search space, such as the addition of transformations or physical operators; however, such optimizers are difficult to develop. Two examples of such optimizers, along with their commonalities, such as the use of cost functions, a rule engine, and exposed knobs, are discussed in detail.

Starburst: It has a structural representation called the Query Graph Model and two phases: a query rewrite phase, in which transformations take place, and a plan optimization phase, in which execution plans annotated with estimated costs and physical properties are chosen.

Volcano/Cascades: Transformation and implementation rules are used along with memoization, which supports dynamic programming by checking whether a task has already been executed; rule application is guided by a promise value. There is also only a single optimization phase, with goal-driven application of rules.

Distributed and Parallel Databases: Distributed databases evolved into replicated architectures, where maintaining consistency among replicas is the major issue, and parallel databases, where communication among processors for data exchange and the need for re-partitioning and scheduling are the major issues.

User-Defined Functions: Stored procedures, popular in relational systems, are another area of concern. They are used to reduce client-server communication and to incorporate application semantics into querying. Issues arise in determining the cost model and in the enumeration algorithms.

Materialized Views and Other Optimization Issues: Using the results of materialized views in the optimizer poses two fundamental issues: a) reformulating queries to use the views, which has been addressed for single-block SQL but not for more complex queries,
b) addressing the two-step process of optimization. Deferring the generation of complete plans and handling fuzzy queries are other areas of concern.

Conclusion: Query optimization is not only about transformations and equivalence; the infrastructure for optimization is itself difficult to build. Designing effective transformations is hard, and defining cost metrics and enumeration algorithms is even harder, so significant problems still persist. A strong understanding of the engineering framework is therefore required to improve query optimization.