![]() |
DML Statements and Performance
|
Reference Manual |
Performance and OptimizationA DBMS often has a choice about the access path for retrieving data. For example, the DBMS may use an index (fast lookup for specific entries) or scan the entire table to retrieve the appropriate rows. In addition, in statements in which two tables are joined, the DBMS can choose which table to examine first. Optimization means that DBMS makes the best (optimal) choice of access paths and join order. True query optimization means that the DBMS will usually make a good choice regardless of how the query is written. The optimizer does not necessarily make the best choice, just a good one. Cloudscape's query optimizer can use indexes to improve the performance of DML (data manipulation language) statements such as queries, updates, and deletes. It also makes decisions about join order, type of join, and a few other matters. This section gives an overview of the Cloudscape optimizer and discusses performance issues in the execution of DML statements. This section covers the following topics: Index Use and Access PathsIf you define an index on a column or columns, the query optimizer can use the index to find data in the column more quickly. Cloudscape automatically creates indexes to back up primary key, foreign key, and unique constraints, so those indexes are always available to the optimizer, as well as those that you explicitly create with the CREATE INDEX command. The way Cloudscape gets to the data--via an index or directly via the table--is called the access path. This section covers the following topics:
What Is an Index?An index is a database structure that provides quick lookup of data in a column or columns of a table. It is probably best described through examples. For example, the Flights table in the toursDB database has three indexes:
This means there are three separate structures that provide shortcuts into the Flights table. Let's look at one of those structures, OrigIndex. Table 4-1 shows a portion of the Flights table from toursDB. Figure 4-1 Partial view of the Flights table OrigIndex stores every value in the orig_airport column, plus information on how to retrieve the entire corresponding row for each value, as shown in Figure 4-2:
Figure 4-2 The index on the orig_airport column helps Cloudscape find the rows for which orig_airport is equal to a specific value. When an index includes more than one column, the first column is the main one by which the entries are ordered. For example, the index on (flight_id, segment_number) is ordered first by flight_id. If there is more than one flight_id of the same value, those entries are then ordered by segment_number. An excerpt from the entries in the index might look like this: 'AA1111' 1
Indexes are helpful only sometimes. This particular index is useful when a statement's WHERE clause is looking for rows for which the value of orig_airport is some specific value or range of values. SELECTs, UPDATEs, and DELETEs can all have WHERE clauses. For example, OrigIndex is helpful for statements such as the following:
SELECT *
SELECT *
SELECT * DestIndex is helpful for statements such as the following:
SELECT * The primary key index (on flight_id and segment_number) is helpful for statements such as the following:
SELECT *
SELECT *
SELECT * The next section discusses why the indexes are helpful for these statements but not for others. What's Optimizable?As you learned in the previous section, Cloudscape may be able to use an index on a column to find data more quickly. If Cloudscape can use an index for a statement, that statement is said to be optimizable. The statements shown in the preceding section allow Cloudscape to use the index because their WHERE clauses provide start and stop conditions. That is, they tell Cloudscape the point at which to begin its scan of the index and where to end the scan. For example, a statement with a WHERE clause looking for rows for which the orig_airport value is less than BBB means that Cloudscape must begin the scan at the beginning of the index; it can end the scan at BBB. This means that it avoids scanning the index for most of the entries. An index scan that uses start or stop conditions is called a matching index scan. NOTE: A WHERE clause can have more than one part. Parts are linked with the word AND or OR. Each part is called a predicate. WHERE clauses with predicates joined by OR are not optimizable. WHERE clauses with predicates joined by AND are optimizable if at least one of the predicates is optimizable. For example:
SELECT * FROM Flights In this example, the first predicate is optimizable; the second predicate is not. Therefore, the statement is optimizable. NOTE: In a few cases, a WHERE clause with predicates joined by OR can be transformed into an optimizable statement. See OR Transformations. Directly Optimizable PredicatesSome predicates provide clear-cut starting and stopping points. A predicate provides start or stop conditions, and is therefore optimizable, when:
WHERE orig_airport.toString() = 'SFO'
WHERE flight_id = 'AA1200' AND flight_date = CURRENT_DATE
Indirectly Optimizable PredicatesSome predicates are transformed internally into ones that provide starting and stopping points and are therefore optimizable. Predicates that use the following comparison operators can be transformed internally into optimizable predicates: For details on these and other transformations, see Appendix A, "Internal Language Transformations". JoinsJoins specified by the JOIN keyword are optimizable. This means that Cloudscape can use an index on the inner table of the join (start and stop conditions are being supplied implicitly by the rows in the outer table). Note that joins built using traditional predicates are also optimizable. For example, the following statement is optimizable:
SELECT * FROM Countries, Cities Covering IndexesEven when there is no definite starting or stopping point for an index scan, an index may speed up the execution of a query if the index covers the query. An index covers the query if all the columns specified in the query are part of the index. These are the columns that are all columns referenced in the query, not just columns in a WHERE clause. If so, Cloudscape never has to go to the data pages at all, but can retrieve all data through index access alone. For example, in the following queries, OrigIndex covers the query:
SELECT orig_airport
SELECT DISTINCT orig_airport.toLowerCase() Cloudscape can get all required data out of the index instead of from the table. Single-Column Index ExamplesThe following queries do not provide start and stop conditions for a scan of OrigIndex, the index on the orig_airport column in Flights. However, some of these queries allow Cloudscape to do an index rather than a table scan because the index covers the query.
-- Cloudscape would scan entire table; comparison is not with a
-- Cloudscape would scan entire table; <> operator is not optimizable
-- not valid operator for matching index scan
-- method invocation is not simple column reference Multiple-Column Index ExampleThe following queries do provide start and stop conditions for a scan of the primary key index on the flight_id and segment_number columns in Flights:
-- the where clause compares both columns with valid
-- the first column is in a valid comparison
-- LIKE is transformed into >= and <=, providing
-- segment_number is in the index, but it's not the first column;
-- Cloudscape would scan entire table; comparison of first column Useful Indexes Can Use QualifiersMatching index scans can use qualifiers that further restrict the result set. Remember that a WHERE clause that contains at least one optimizable predicate is optimizable. Nonoptimizable predicates may be useful in other ways.
SELECT * The second predicate is not optimizable, but the first predicate is. The second predicate becomes a qualification for which Cloudscape evaluates the entries in the index as it traverses it.
When a Table Scan Is BetterSometimes a table scan is the most efficient way to access data, even if a potentially useful index is available. For example, if the statement returns virtually all the data in the table, it is more efficient to go straight to the table instead of looking values up in an index, because then Cloudscape is able to avoid the intermediate step of retrieving the rows from the index lookup values.
SELECT * In the Flights table, most of the airport codes begin with letters that are less than Z. Depending on the number of rows in the table, it is probably more efficient for Cloudscape to go straight to the table to retrieve the appropriate rows. However, for the following query, Cloudscape uses the index:
SELECT * Only a few flights have airport codes that begin with a letter less than B. Indexes Have a Cost for Inserts, Updates, and DeletesCloudscape has to do work to maintain indexes. If you insert into or delete from a table, the system has to insert or delete rows in all the indexes on the table. If you update a table, the system has to maintain those indexes that are on the columns being updated. So having a lot of indexes can speed up select statements, but slow down inserts, updates, and deletes. NOTE: Updates and deletes with WHERE clauses can use indexes for scans, even if the indexed column is being updated. Updatable cursors cannot use indexes on updatable columns. How Indexes Affect CursorsFor updatable cursors, the query optimizer avoids any index that includes an updatable column. Specifying a list of column names in the FOR UPDATE clause allows the optimizer to choose an index on any column not specified. To restrict the columns that are updatable through the cursor, you can specify a list of column names in the FOR UPDATE clause. For example, the following statement cannot use an index during execution and instead must scan the entire table:
-- Cloudscape cannot use the index on flight_id because However, the following statement can use indexes:
-- Cloudscape can use the index on flight_id because it is not an Joins and PerformanceJoins, SQL-J statements in which Cloudscape selects data from two or more tables using one or more key columns from each table, can vary widely in performance. Factors that affect the performance of joins are join order, indexes, and join strategy. Join Order OverviewThe Cloudscape optimizer usually makes a good choice about join order. This section discusses the performance implications of join order. In a join operation involving two tables, Cloudscape scans the tables in a particular order. Cloudscape accesses rows in one table first, and this table is now called the outer table. Then, for each qualifying row in the outer table, Cloudscape looks for matching rows in the second table, which is called the inner table. Cloudscape accesses the outer table once, and the inner table probably many times (depending on how many rows in the outer table qualify). This leads to a few general rules of thumb about join order:
Figure 4-3 In a query that joins table a with table b with no restrictions in a WHERE clause, Cloudscape scans table a once, and for each qualifying row (four rows), it looks for data in table b. If there is no useful index on table b, Cloudscape scans the entire table each time. If there is a useful index on table b (on the joining column, fruit), Cloudscape can do a quick index lookup for each row (four lookups).
Join StrategiesThe default join strategy in Cloudscape is called a nested loop. For each qualifying row in the outer table, Cloudscape uses the appropriate access path (index or table) to find the matching rows in the inner table. Another available type of join in Cloudscape is called a hash join. For joins of this type, Cloudscape constructs a hash table representing all the selected columns of the inner table. For each qualifying row in the outer table, Cloudscape does a quick lookup on the hash table to get the inner table data. Cloudscape has to scan the inner table or index only once, to build the hash table. Nested loop joins are preferable in most situations. Hash joins are preferable in situations in which the inner table values are unique and there are many qualifying rows in the outer table. Hash joins require that the statement's WHERE clause be an optimizable equijoin:
The hash table for a hash join is held in memory and if it gets big enough, it can cause the JVM to run out of memory. The optimizer makes a very rough estimate of the amount of memory required to make the hash table. If it estimates that the amount of memory required would exceed 1 MB, the optimizer chooses a nested loop join instead. NOTE: Cloudscape allows multiple columns as hash keys. Cloudscape's Cost-Based OptimizationIn Version 3.0, the query optimizer makes cost-based decisions to determine:
About the Optimizer's Choice of Access PathIn deciding whether to use an index or to do a table scan, the optimizer estimates the number of rows that will be read (see When a Table Scan Is Better). Table scans always require scanning the entire table, so all the rows in the table will be read. The optimizer knows how many rows are in the table and does not need to estimate the number (see Optimizer Accuracy). When an index is available, the optimizer must estimate the number of rows that will be read. The optimizer will make the most accurate cost estimates about whether to use an index to access the data in cases when the search values are known. When the exact start and stop conditions are known at compilation time, the optimizer makes a very precise estimate of the number of rows that will be read. If the index is unique, and the WHERE clause involves an = or IS NULL comparison to all the columns in the index, the optimizer knows that only a single row will be read. In other circumstances, such as in the case when the statement's WHERE clause involves dynamic parameters that are known only at execution time and not at compilation time, or when the statement involves a join, the optimizer has to make a rougher estimate of the number of rows that will be read. For example, the optimizer will make an accurate estimate of the cost of the following statement:
SELECT * because the search value, 'SFO', is known. The optimizer will be able to make an accurate estimate of the cost of using the index orig_index. As explained in the section When a Table Scan Is Better, it may not be the best plan if the value 'SFO' appears in a large proportion of column orig_airport. Stored prepared statements created or recompiled with sample values also allow the optimizer's decision to be well informed. On the other hand, in the following statements, the search values are not known in advance:
-- dynamic parameters
-- joins
-- complex search conditions In the above SELECT statements, the optimizer's decision as to whether to use indexes is less well informed. Estimating Row Counts for Unknown Search ValuesThe way the optimizer estimates the number of rows that will be read when search values are unknown in advance is as follows: It multiplies the number of rows in the table by the selectivity for a particular operation. An operation's selectivity is a fixed number that attempts to describe the percentage of rows that will probably be returned; it may not correspond to the actual selectivity of the operation in every case. It is an assumption hard-wired into the Cloudscape system. For example, if a particular operation has a selectivity of .5 (50%) and there are 100 rows in the index, the optimizer estimates that it will have to touch 50 rows. It uses this estimate in its cost estimate. Table 4-1 shows what selectivity is assumed for various operations.
About the Optimizer's Choice of Join OrderThe optimizer chooses the optimal join order as well as the optimal index for each table. The join order can affect which index is the best choice. The optimizer may choose an index as the access path for a table if it is the inner table, but not if it is the outer table (and there are no further qualifications). The optimizer chooses the join order of tables only in simple FROM clauses. Most joins using the JOIN keyword are flattened into simple joins, so the optimizer chooses their join order. The optimizer does not choose the join order for outer joins; it uses the order specified in the statement. When selecting a join order, the optimizer takes into account:
NOTE: Cloudscape does transitive closure on qualifications. For details, see Transitive Closure. Join Order Case StudyFor example, consider the following situation: The Flights table (as you know) stores information about flight segments. It has a primary key on the flight_id and segment_number columns. This primary key constraint is backed up by a unique index on those columns. The FlightAvailability table, which stores information about the availability of flight segments on particular days, may store several rows for a particular row in the Flights table (one for each date). You want to see information about all the flights, and you issue the following query:
SELECT * First imagine the situation in which there are no useful indexes on the FlightAvailability table. Using the join order with FlightAvailability as the outer table and Flights as the inner table is cheaper because it allows the flight_id/segment_number columns from FlightAvailability to be used to probe into and find matching rows in Flights, using the primary key index on Flights.flight_id and Flights.segment_number. This is preferable to the opposite join order--with Flights as the outer table and FlightAvailability as the inner table--because in that case, for each row in Flights, the system would have to scan the entire FlightAvailability table to find the matching rows (since there is no useful index--one on the flight_id/segment_number columns). Second, imagine the situation in which there is a useful index on the FlightAvailability table (this is actually the case in the sample database). FlightAvailability has a primary key index on flight_id, segment_number, and booking_date. In that index, the flight_id-segment_number combination is not unique, since there is a one-to-many correspondence between the Flights table and the FlightAvailability table. However, the index is still very useful for finding rows with particular flight_id/segment_number values.
SELECT * Although the difference in cost is smaller, it is still cheaper for the Flights table to be the inner table, because its index is unique, whereas FlightAvailability's index is not. That is because it is cheaper for Cloudscape to step through a unique index than through a non-unique index. About the Optimizer's Choice of Join StrategyThe optimizer compares the cost of choosing a hash join (if a hash join is possible) to the cost of choosing a nested loop join and chooses the cheaper strategy. For information about when hash joins are possible, see Join Strategies. In some cases, the size of the hash table that Cloudscape would have to build is prohibitive and can cause the JVM to run out of memory. For this reason, the optimizer has an upper limit on the size of a table on which it will consider a hash join. It will not consider a hash join for a statement if it estimates that the size of the hash table would exceed 1 MB. The optimizer's estimates of size of hash tables are approximate only. NOTE: This limit does not apply if the query explicitly specifies hash join in an optimizer override (see Overriding the Optimizer's Choice of Join Strategy). About the Optimizer's Choice of Sort AvoidanceSome SQL statements require that data be ordered, including those with ORDER BY, GROUP BY, and DISTINCT. MIN() and MAX() aggregates also require ordering of data. Cloudscape can sometime avoid sorting steps for: Cloudscape can also perform the following optimizations, but they are not based on cost: Cost-Based ORDER BY Sort AvoidanceUsually, sorting requires an extra step to put the data into the right order. This extra step can be avoided for data that are already in the right order. For example, if a single-table query has an ORDER BY on a single column, and there is an index on that column, sorting can be avoided if Cloudscape uses the index as the access path. Where possible, Cloudscape's query compiler transforms an SQL-J statement internally into one that avoids this extra step. For information about internal transformations, see Sort Avoidance. This transformation, if it occurs, happens before optimization. After any such transformations are made, the optimizer can do its part to help avoid a separate sorting step by choosing an already sorted access path. It compares the cost of using that path with the cost of sorting. In Version 3.0, it does this for statements that use an ORDER BY clause in the following situations:
ORDER BY specifies a priority of ordering of columns in a result set. For example, ORDER BY X, Y means that column X has a more significant ordering than column Y. The situations that allow Cloudscape to avoid a separate ordering step for statements with ORDER BY clauses are:
-- The comparison of segment_number And because of transitive closure, this means that even more complex statements can avoid sorting. For example:
-- transitive closure means that Cloudscape will For more information about transitive closure and other statement transformations, see Appendix A, "Internal Language Transformations".
SELECT * In this case, the values clauses is flattened so that s.x is known to be a constant, and sorting can be avoided. About the Optimizer's Selection of Lock GranularityWhen a system is configured for row-level locking, the optimizer decides whether to use table-level locking or row-level locking for each table in each DML statement. The optimizer bases this decision on the number of rows read or written for each table, and on whether a full conglomerate scan is done for each table. NOTE: When you have turned off row-level locking for your system, Cloudscape always uses table-level locking. The optimizer's decisions are ignored. The first goal of the optimizer's decision is concurrency; wherever possible, the optimizer chooses row-level locking. However, row-level locking uses a lot of resources and may have a negative impact on performance. Sometimes row-level locking does not provide much more concurrency than table-level locking. In those situations, the optimizer may choose to escalate the locking scheme from row-level locking to table-level locking to improve performance. For example, if a connection is configured for TRANSACTION_SERIALIZABLE isolation, the optimizer chooses table-level locking for the following statement:
SELECT * To satisfy the isolation requirements, Cloudscape would have to lock all the rows in both the FlightAvailability and the Flights tables. Locking both the tables would be cheaper, would provide the same isolation, and would allow the same concurrency. NOTE: You can force lock escalation for specific tables when you create or alter them with a SET LOCKING clause. For these tables, Cloudscape always chooses table-level locking. For more information, see CREATE TABLE statement in the Cloudscape Reference Manual. How the Optimizer Makes Its Decision at Compile TimeThe optimizer determines the statement execution plan when a statement is compiled. A statement's isolation level may change at execution time; the optimizer cannot know what the statement's runtime isolation level will be. Therefore, the optimizer makes a decision about the lock granularity (whether to lock rows or entire tables) for a statement assuming that the statement will have TRANSACTION_SERIALIZABLE isolation, since that is the most restrictive. The optimizer uses the following rules:
Here are some examples that illustrate the choices the optimizer makes at compile time:
-- Optimizer chooses table locking for full scan
-- Small number of rows matched,
-- Large number of rows matched,
-- Presume other_table is inner table
-- Row locking, because only one row inserted
-- Table locking on my_table, because of -- Table locking on my_table because of bulk insert Lock Escalation ThresholdThe system property cloudscape.locks.escalationThreshold determines the threshold for number of rows touched for a particular table above which the optimizer will choose table-level locking. The default value of this property is 5000. For large systems, set this property to a higher value. For smaller systems, lower it. This property also sets the threshold for transaction-based lock escalation (see Transaction-Based Lock Escalation). Runtime OverridesThe optimizer's assumptions about isolation level would diminish the benefits of having a lower isolation level if its decisions could not be overridden. Before executing statements, the runtime system makes the following override:
For all other statements (which require exclusive locks), the system lets the optimizer's compile-time decisions stand. NOTE: For more information about lock escalation, see Locking and Performance. About the Optimizer's Selection of Bulk FetchWhen Cloudscape retrieves data from a conglomerate, it can fetch more than one row at a time. Fetching more than one row at a time is called bulk fetch. By default, Cloudscape fetches 16 rows at a time. Bulk fetch is faster than retrieving one row at a time when a large number of rows qualify for each scan of the table or index. Bulk fetch uses extra memory to hold the prefetched rows, so it should be avoided in situations in which memory is scarce. Bulk fetch is automatically turned off for updatable cursors, for hash joins, for statements in which the scan returns a single row, and for subqueries. It is useful, however, for table scans or index range scans:
SELECT * The default size for bulk fetch (16 rows) typically provides good performance. Overriding the OptimizerThe optimizer usually makes a good choice about access path and join order. In addition, the default join strategy (nested loop) and the default bulk fetch buffer size (16) are usually appropriate. However, in some complex situations, the optimizer may not make the best choice, and the defaults may not be optimal. Cloudscape allows you to add various PROPERTIES clauses to an SQL-J statement to force a particular choice or to choose a value other than the default. You can override the optimizer's choice of: In addition, you can tune the ways the optimizer and the system escalate locking to improve performance. To override the optimizer, use an optimizer override property in the SQL-J statement's FROM clause or after a particular tableExpression. See Chapter 6, "Optimizer Overrides", for complete details on optimizer override properties. Overriding the Optimizer's Choice of Access PathYou may want to force a particular access path such as a table scan. For example:
SELECT * See the index and constraint properties in Chapter 6, "Optimizer Overrides". Overriding the Optimizer's Choice of Join OrderThe joinOrder property in the FROM clause allows you to "fix" a particular join order--the order of items as they appear in the FROM clause. Otherwise, the optimizer makes its own choice about join order. For example:
SELECT Country, City See the joinOrder property in Chapter 6, "Optimizer Overrides". Overriding the Optimizer's Choice of Join StrategyCloudscape allows you to override the optimizer's choice of join strategy in a PROPERTIES clause. In most cases, the optimizer makes a good choice of join strategy. For example:
SELECT * See the joinStrategy property in Chapter 6, "Optimizer Overrides". Overriding the Optimizer's Choice of Fetch SizeCloudscape allows you to override the default bulk-fetch buffer size (16) in a PROPERTIES clause. In future releases, the optimizer will automatically choose the best bulk-fetch size for each table in a query.
SELECT * See the bulkFetch property in Chapter 6, "Optimizer Overrides". Tuning Lock GranularityAs explained in the section About the Optimizer's Selection of Lock Granularity, you can tune the ways both the optimizer and the runtime system escalate locks from row-level locking to table-level locking to improve performance. See also cloudscape.locks.escalationThreshold. Optimizer AccuracyThe optimizer estimates are not guaranteed to be accurate. The optimizer uses a stored row count to determine the number of rows in a table, which is maintained automatically by the system. However, it is not stored for tables that were created prior to Version 1.5 and upgraded. This could cause the optimizer to come up with suboptimal plans for these tables. Normally, an updated value is stored in the database whenever the database goes through an orderly shutdown (as long as the database is not read-only). Stored row counts become inaccurate if there is a non-orderly shutdown (for example, a power failure or other type of system crash). You can correct the optimizer's row count without shutting down the system; Cloudscape sets the stored row count for a table to the correct value whenever a query that does a full scan on the base conglomerate finishes. For example, executing the following query sets the row count for table Flights to the correct value:
SELECT * Cloudscape also sets the stored row count on a table to the correct value whenever a user creates a new index or primary key, unique, or foreign key constraint on the table. This value is not guaranteed to be written to disk. Providing Costing Information for VTIsCloudscape allows you to make costing information available to the optimizer with some optional extensions. NOTE: You can specify that a VTI class can only be instantiated once during the compilation and execution of a statement. VTIs can be chosen for hash joins. By implementing the optional interface COM.cloudscape.vti.VTICosting, your VTI class can provide the following information to the Cloudscape optimizer:
For information about programming VTI classes to provide such information, see Providing Costing Information in the Cloudscape Developer's Guide. |
||||||||||||||||||||||
|
![]() Cloudscape Version 3.6 For information about Cloudscape technical support, go to: www.cloudscape.com/support/.Copyright © 1998, 1999, 2000 Informix Software, Inc. All rights reserved. |