SQL
Home Up Search Trademarks how to use

For best results: this site requires that cookies be enabled for proper operation - see Legal Page for more info

 

Select Any of These

SQL

LAST UPDATED: 08 November 2007 18:26:36 -0600

IN GENERAL, FAVOR DATABASE JOINS OVER APPLICATION LOGIC

A popular misconception when processing database data is that it is more efficient to merge and filter data in a program rather than filtering and joining the data in a database--in SQL. This misconception would have SQL statements that include only a small number of tables and incomplete filtering criteria in each. They would return rows to a program that would then merge, filter, and process them.

Moving data across database address spaces to application address spaces (and in more extreme cases, across the network) consumes a large amount of system resources. It is often more efficient to perform much of the work in the database and return fewer rows to the application program.

Where possible, perform data filtering and joining in the database. The coding will be easier and performance often will improve. Of course, this needs to be analyzed and tested on a case-by-case basis.

INSERTS, UPDATES, AND DELETES--AND VIEWS

A common popular misconception is that Inserts, Updates, and Deletes cannot be performed on Views. This is only partly true. These statements generally cannot be performed on Views that are "Join Views." An example of a Join View is shown by the Join Select statement below within the "Create View" syntax.

 

Create view Party_Member_View as

Select column1, column2, column3

From Party A, Party B

Where A.party_id = B.member_party_id;

DB2 and SQL Server will not allow Inserts, Updates, and Deletes on this View. Oracle also will not allow it as default behavior, but "Instead of" triggers can be created to allow the statements to be processed.

 

 

HOW MANY TABLES SHOULD BE JOINED IN ONE SQL STATEMENT?

The answer used to be three--maybe four. This has changed. The real limits are those of your DBMS and the performance requirements of the system you are implementing. It is possible to have a large number of tables joined that perform well if only a small number of rows are processed.

It is more important to understand the data being processed and overall system performance requirements when determining the best way to code joins. Arbitrary rules that limit the number of tables included in a join are meant to protect system performance, but, in fact, may have the adverse effect. It is more important to consider and analyze each SQL statement than it is to have these types of rigid rules.

Write your join the way you naturally would like to, then run Explain (DB2 and Oracle) or Showplan (SQL Server) on it and test the query on a reasonably sized database to ensure that it works as expected.

 

 

THE USE OF "NOT" IN A QUERY

A popular misconception is that the word NOT should not be used in the Where clause of a query. This idea stems from the fact that the statement NOT is often not indexable by many databases today. For example, a case where this is true is the following SQL statement, where the column "customer_name" is indexed:

Select * >From Customer Where customer_name NOT like 'JONES%';

In this example, the index on customer_name will not be used. But notice the following SQL statement, where customer_id is indexed:

Select * >From Customer Where customer_id between 100000 and 120000 and customer_name NOT like 'JONES%';

In the above example, an index on customer_id may be used while the NOT statement also eliminates some rows. The NOT statement can be useful in eliminating rows at the database level.

 

 

 

MORE USES OF "NOT" IN A QUERY

In some cases where only indexed columns are needed, the NOT statement can use an index. An example of this is shown below.

Consider the case of an index where two columns, customer_id and customer_name, are indexed in one index, and these are the only columns needed in a query. The SQL looks like the following:

Select customer_id, customer_name >From Customer Where customer_name NOT like 'JONES%';

In this case, the index with both columns may be scanned if the number of blocks/pages to be read is considerably less than if the table were read.

 

 

 

AVOIDING "NOT" IN A QUERY

In some cases, the NOT statement should be avoided in a query. These situations crop up when it might be easier to write the statement as NOT, but you know that index access will be much more efficient when a positive, more complex statement is chosen.

Consider the following situation: You want to see all customers outside of New York City when you know that 99% of customers are in New York, and you know that you want to use an index to perform this search. The simplest statement to write is

Select * >From Customer Where City NOT like 'NEW YORK';

This will scan the entire table, while returning only 1% of the rows. The statement can be rewritten in a fashion similar to the following:

Select * >From Customer Where (City between ' ' and 'NEW YORJ') or (City between 'NEW YORKA' and 'ZZZZZZZZZ');

This should be tested with an Explain or showplan to see if the index is used. If not, consider the use of host variables as arguments in the query.

 

BITMAPPED INDEXES

An interesting and useful type of index that is little understood is a 'bitmap' or 'bitwise' index. This is implemented in most popular relational databases as a standard option. It works a little differently than the standard B-tree index.

With the bitmap index, a bit of 0 or 1 is set for existence and non-existence of a value.

A good example of using a bitmap index is a column that is a flag with values Yes or No. Assume ten rows with six values 'Y' and four values 'N'. The columns look like this:

YNNYNYYYYN

The bitmap index looks like this:

for value 'Y' 1001011110 for value 'N' 0110100001

So, separate bitmaps are stored for each distinct value of a column, and compression is used to provide efficiency with this type of index.

This is a valuable option to consider for speeding up SQL queries. In a future tip, we will look at when bitmap indexes should be used.

 

 

BITMAPPED INDEXES--WHAT ARE THEY GOOD FOR?

Bitmap indexes as described in yesterday's tip have the following advantages:

· They are very efficient for queries on large tables with relatively few distinct values and high query access.

· They are quick on aggregate (count) queries.

· They can be used in any order and in any combination, compared to concatenated indexes.

· They are very compact due to compression (more compact than concatenated B-tree indexes).

· Merging of multiple bitmaps is very fast.

· 'Star joins' can make great use of bitmap indexes.

· Many databases can use bitmap and B-tree indexes in the same query.

Bitmaps have many advantages. Look up their use in your database of choice.

 

 

 

BITMAPPED INDEXES--WHAT ARE THE DISADVANTAGES?

Bitmaps should not be used in the following situations:

range queries or columns with a high number of distinct values (though this limit is increasing due to compression techniques) cases where locking is required at the row level. Oracle (for instance) cannot lock a single bit, so locking is at the block level for a bitmap index and--due to compression and the high density of bitmap indexes (one of its strengths)--thousands of rows can be locked by a single seemingly efficient SQL insert, update, or delete statement. any tables where the bitmapped column has a high number of insert, update, or delete statements against the bitmapped columns range queries, which cannot take advantage of bitmap indexes Please review the specifics of bitmap indexes as supplied by the database and release that you are running to fine-tune these tips.

--------------------------------------------------------------------------------

PERFORMING A UNION

The Union statement allows the values of two SQL statements to be combined. An example of Union is shown below:

Select customer_id, customer_name >From Customer Where customer_name like 'JONES%' Union Select customer_id, customer_name >From Customer Where customer_name like 'SMITH%';

The above statement could be coded as an 'OR' in a single SQL select, but the Union helps to simplify and optimize some types of SQL statements.

In a Union, the number of columns and data types must be the same, though the table names and column names can be different.

Union is not the same as Union All. With a Union All, all rows selected by all queries are retrieved. In contrast, the Union statement strips out duplicate rows and returns only distinct rows from all queries in the Union statements.

--------------------------------------------------------------------------------

 

 

PERFORMING UNION WHERE COLUMNS DO NOT MATCH

With the Union statement, the number of columns and data types of each must match. In other words, the following is an invalid statement, since the first Select in the Union has more columns than the second:

Select col01, col02, col03, col04 >From Table1 Union Select col06, col07 >From Table2;

How can this be corrected when Table2 has only two columns? If col03 and col04 are of data type char(4), the following can be used:

Select col01, col02, col03, col04 >From Table1 Union Select col06, col07, ' ', ' ' >From Table2;

The data types and numbers of columns in both Select statements are the same, making this statement legal. In other cases, NULL values may be used and data type conversions may be needed to write a legal Union statement.

 

 

GETTING A ROW WITH THE MAX VALUE

Finding a row in a table with a maximum value in a certain column seems to be a simple thing to do. Let's say, for example, that we want to retrieve all columns in the Sales table for the row with the largest sale amount as defined in column sale_amount. This needs to be written as a subquery where the subselect finds the row with the max(sale_amount) and the outer query then retrieves all columns for that row. Assuming the sale_id is unique in the Sales table, this query can be written as follows:

Select s1.sale_id, s1.col02,...s1.coln, s1.sale_amount >From Sales S1 Where s1.sale_amount in (Select max(s2.sale_amount) >From Sales S2);

This would retrieve only the single row with the maximum sale_amount in it.

--------------------------------------------------------------------------------

SOME INDEXING TIPS

Consider the following indexing tips when you are optimizing your SQL statements.

· For oltp (online transaction processing with a high number of insert, update, and delete statements), indexing will almost always be B-tree.

· Limit indexing on columns that are frequently changed (insert, update, delete).

· If an Update statement does not change columns that are indexed, then the high cost of index maintenance is avoided.

· Question cases where tables with high change activity have more than three indexes. This is not necessarily a problem, but it is a rough rule-of-thumb that can be used to highlight potential problems.

Tomorrow we will look at some more indexing tips.

 

 

MORE INDEXING TIPS

More things to think about when you are trying to speed up your SQL through the use of indexes are:

· Make sure that all table access is supported by indexes (unless the table is small, such as fewer than six blocks).

· Reorg/rebuild and 'runstats'(DB2)/'analyze... compute statistics' (oracle) or 'update statistics' (SQL Server) indexes very frequently.

· The order of columns in indexes is very important: most commonly used columns at front of index and cardinality (uniqueness) of leading columns should be high.

· When joining tables, if the join is performed using more than one column with a concatenated index, then the join will perform more efficiently if the index columns are in the same order on both tables. Performance will be even further enhanced if the Table data is in the same order as those indexes (if the cluster ratio is high).

 

 

 

SEARCHING FOR AND COMPARING NULL VALUES

Null values are a controversial and often discussed item in database theory. To simplify this matter, the issue around a Null value is that we are never sure if a column is Null because it does not apply or because it simply is unknown. In other words, a Null value can be interpreted in many ways. Because of this, one Null value can never be considered to be equal to, not equal to, greater than, or less than another, because we cannot compare one indeterminate value to another. Therefore, the following statement will never return a value--even if the customer_name is NULL:

Select * >From Customer Where customer_name = '';

When comparing Null values, use the 'IS NULL' value rather than ''. The following corrects this statement:

Select * >From Customer Where customer_name IS NULL;

 

 

 

NULL INDEXES

Be careful of index entries that are completely null--an index entry may not be kept. So, queries using IS NULL logic will be forced to perform a table scan. This is true in Oracle, and you should verify that it will not cause an issue with your database and application.

 

 

 

USE PROPER SEARCH COLUMNS IN WHERE CLAUSES

Where clauses in a Select statement should include the leading (at least the first) columns of a concatenated index. Consider an index that has three columns: last_name, first_name, middle_initial. If the majority of queries will use at least the last_name column, then this should be the first column of the index. An example of this Select statement is shown below:

Select last_name, first_name >From Customer Where last_name = 'JONES';

If the last_name column were not the first column in the index, then either an entire index scan or table scan would need to be performed. With last_name as the first column of the table, then a matching index lookup may be performed.

Consider how the SQL should be written when creating indexes and, conversely, consider the order of columns in an index when writing your SQL.

 

 

 

USING CONCATENATED INDEX PROPERLY

You need to create only one index on a table that includes all columns needed. Queries needing a subset of the columns will use this index as long as the leading columns are used as predicates in the query.

For example, if an index has columns column01, column02, and column03, then any of the three following queries will use this one index effectively:

Select * >From table_name Where column01 = :host_variable1 and column02 = :host_variable2 and column03 = :host_variable3;

Select * >From table_name Where column01 = :host_variable1 and column02 = :host_variable2;

Select * >From table_name Where column01 = :host_variable1;

In fact, if the cardinality (degree of uniqueness) of column01 is very high, then the following query may also use this index effectively:

Select * >From table_name Where column01 = :host_variable1 and column03 = :host_variable3;

Notice, however, that if the query used only columns column02 and column03, then an index on column02 and column03 may be needed. Also, if another query using only column03 were created, then a third index with only column03 would need to be created--or alternatively, a single index could be created in order for column03, column02 to satisfy both queries.

Note that at most three indexes would need to be created, each with a different leading column and each with fewer columns. In many cases, the creation of only one index will solve your performance problems.

 

 

 

WHEN TO USE AN INDEX

Indexes will not solve all performance problems. In fact, databases have very efficient built-in features to scan entire tables. As a (very) rough rule of thumb, if more than 20% of a table is being read, then a table scan will often be faster than an index-based lookup.

Some database features include read-ahead or prefetch capabilities, where the next set of blocks is read from physical storage before it is needed by the database. Another high performance feature is the ability to perform parallel queries by breaking up a large query into several small ones.

In general, consider creating an index for performance reasons if a query returns a small number of rows and the query is run frequently.

 

 

 

COMPARING DIFFERENT DATA TYPES

When performing a join of two tables, make sure that the data types of the join columns match. If the data types do not match, then several things may happen (depending on the database, type of query, and data type incompatibility issues being faced).

· A join operation that should use an index may not use the index.

· The statement may not be valid and may return a compile or runtime error.

· The absolute case is that the statement may run and return invalid results.

Make sure that columns that are to be joined are of the same data type, length, and precision (number of decimal places).

 

 

 

WHY DOES MY QUERY RETURN NO ROWS?

You run a very simple query and return no rows. This makes no sense to you. You have run controlled tests, and the data in your Where clause is exactly the same as the data in the database. You also know that all of the data is uppercase, so that is not an issue. Your pseudo-code looks as follows:

declare :host_variable char(20); set :host_variable := 'JONES'; Select * >From Customer Where customer_name = :host_variable;

This returns no rows even though you know that 'JONES' is in the database. Check the data type of customer_name. Depending on the database you use, there is a good chance that the data type is varchar (varchar2), and the trailing spaces cannot be matched. Make sure that your data types match those of the database exactly.

 

 

CREATING A TABLE QUICKLY

You want to create a copy of a table quickly using the Create Table table_name As Select... statement, but you want only the table structure. In other words, you do not want to move all of the rows from the source table to the target table--you want the target table to be empty. How do you accomplish this? Use the following statement:

Create Table new_table as Select * from old_table Where 0 = 1;

The Where 0 = 1 is false and ensures that no rows will be found from the source table. The new_table will be created empty.

 

 

 

CURSORS AND WHY THEY ARE NEEDED

The SQL language has been designed to work on sets (tables) rather than on one row or record at a time. This does not match the structure of most programming languages such as COBOL, C, C++, and Java, which operate on one row at a time. Host variables can handle only one value at a time, and table values cannot be read directly into arrays. To resolve this mismatch, an SQL feature called Cursors was created to provide a bridge between the database 'set' structure and the programming language 'row-at-a-time' one. It is, in effect, a pointer to the row you want processed in a table.

A named cursor can be declared for an SQL statement as follows:

Declare Cursor Read_Customer for Select last_name, first_name >From Customer;

This cursor can then be opened with the 'open cursor' statement and navigated or processed one row at a time with the Fetch statement, which is performed repetitively until all required rows have been read. The cursor is then closed.

 

 

 

OPEN CURSOR STATEMENT In a procedural language such as C, COBOL, or Java, SQL statements can be processed through cursors (as seen in our previous tip). Once a cursor has been declared, it can then be opened through the Open Cursor statement. An example of this is shown below:

Declare Cursor Read_Customer for Select last_name, first_name >From Customer;

Open Cursor Read_Customer;

The rows aren't returned to the program. This statement prepares the program to return rows to it by the next Fetch statement. If a sort is required to return rows, then the sort operation may be performed at Open Cursor time before the first Fetch is performed.

Open cursors only when they are needed. Do not automatically open all cursors that might be used as a default action.

 

 

FETCH STATEMENT AND SCROLLING

Once a cursor has been declared, a pointer can be set to the first row to be returned, and all remaining rows can be returned through the Fetch statement. The Fetch positions a pointer to the next row of a result set and returns the single row to the calling program. Program host variables are used to return the individual columns, as shown here:

Fetch Read_Customer into :last_name, first_name;

The above Fetch gets two columns and puts them into the :last_name and :first_name variables in the program. These are preceded with colons to distinguish the host variables from table columns.

 

 

CLOSE CURSOR

Cursors are explicitly closed when a program completes execution. However, close a cursor as soon as you've finished with it. This will free valuable system resources. Also, only close those cursors that have been opened. The Close Cursor statement is shown here:

Close Cursor Read_Customer;

Do this rather than letting the program completion implicitly close a cursor.

 

 

OPEN CURSOR WITH HOLD (DB2)

In DB2, open cursors are implicitly closed when a Commit is performed. This causes problems with reopening and repositioning a cursor when the next Fetch statement is processed. To keep a cursor open and maintain cursor position for Fetches across Commit statements, use the With Hold option when the cursor is declared, as follows:

Declare Cursor Read_Customer with hold for Select last_name, first_name >From Customer;

 

 

PERFORMING UPDATES WITH CURSORS

When you know that a cursor subsequently will be used to update rows, use the For Update Of option when declaring the cursor. This prevents other programs from changing data between the time you read a row and the time that you perform the update. Also, only name the columns to be updated when specifying For Update. An example of this is shown here:

Declare Cursor Read_Customer for Select last_name, first_name >From Customer For Update first_name;

 

 

THE WHERE CURRENT OF CLAUSE

When either updating or deleting rows that were initially read through a cursor, use the Where Current Of clause to state that you want to update or delete the row that the cursor is pointing to. For example:

Update Customer set first_name = 'David' where current of Read_Customer;

In this example, a row was read (Fetched) by the Read_Customer cursor and subsequently updated by the statement above.

 

 

CURSORS AND SQL SERVER

Microsoft SQL Server provides the capability to create cursors using ANSI SQL or Transact-SQL syntax. The two cannot be used together. When possible, use the ANSI SQL syntax, since this will provide more code portability across databases and across versions of SQL Server.

 

 

STATIC VERSUS DYNAMIC SQL

There are two basic types of SQL statements: Static SQL and Dynamic SQL. Static SQL is the most common and is used when the SQL statement can be written with hardcoded values or with host variables. The SQL statement and Where clause are well understood and can be determined before the program is run.

In some cases, however, it is difficult to determine the best way to write a SQL statement before a program is actually run, because it can be difficult to cover all the combinations of SQL and resulting program logic that need to be written for some situations. An example of this is a window that allows users to input many combinations of search criteria on the screen--for example, first name, last name, city, street, postal (zip) code, or any combination of these, including range searches.

A good solution to this type of problem is to build the SQL and Where clause dynamically, based on the values entered on the screen. This Dynamic SQL can benefit both program structure and performance when used in the correct way.

There are four types of Dynamic SQL: execute immediate; non-select; fixed-list select; and varying-list select.

The next several tips will discuss how to build Dynamic SQL.

 

 

DYNAMIC SQL WITH EXECUTE IMMEDIATE

The simplest way to code Dynamic SQL is to use the execute immediate feature. All that needs to be done to perform an execute immediate is to place a SQL statement in a host variable and execute it using the execute immediate syntax. One big disadvantage of execute immediate statements is that they cannot use input or output parameters, so this statement cannot be used for a Select statement. It can be used for Insert, Update, and Delete, however, and it can be used to perform administrative functions such as security grants and revokes. An example of this is shown here:

set v_sql_statement = "delete from customer where last_name = 'JONES'" exec sql execute immediate :v_sql_statement end-exec

The execute immediate statement is especially good for simple, one-time tasks. For statements that must be run many times, it must be "Prepared" for every execution. In other words, the overhead of Prepare syntax checking and parsing must be performed with every single execution.

 

 

NON-SELECT DYNAMIC SQL: PREPARE AND PARAMETER MARKERS

Non-Select Dynamic SQL offers the same type of functionality as execute immediate, with one important advantage: the Prepare and Execute statements can be kept separate. This allows a statement to be Prepared once and then executed many times within a loop. The Prepare is performed only when the statement is actually issued, avoiding unnecessary overhead (compared to the implicit Prepare that is performed by execute immediate).

These types of statements can use a feature called a parameter marker, which allows for different values to be placed into the Where clause of an already Prepared statement for different executions. For example:

set :v_sql_statement = "Delete from Customer where customer_id = ?"

exec SQL Prepare customer_delete_statement from :v_sql_statement;

loop until condition_reached set :v_cust_id = 1; exec SQL customer_delete_statement using :v_cust_id; end loop

These statements show how the Prepare is performed outside a loop and how the Where clause predicates can be set inside the loop for each execution of the already prepared statement. This is accomplished through the parameter marker "?" and the variable v_cust_id (which is used to plug a value into the statement at execution time).

We've covered two important features of Dynamic SQL: the Prepare statement and parameter markers.

 

 

DYNAMIC SQL: FIXED-LIST SELECTS AND CURSORS

The coding of this feature builds on the concepts used in the Non-Select Dynamic SQL we discussed in our previous tip. We will once again take advantage of the Prepare statement and parameter markers. We will, however, need to provide the capability to handle output variables. These statements are relatively simple, since the number of input and output variables are known. Unlike our previous examples, we will be able to use Select statements with all of the capabilities provided here.

Consider the case of a SQL statement with two input variables that returns two output values and multiple rows. In this case, we need to write the SQL statement with parameter markers for the input variables. We can then Prepare the statement, and, once the Prepare has been performed, a cursor can be created using the Prepared statement. The cursor is needed to accommodate the multiple rows that were returned. A Fetch is then performed using the cursor inside a loop, and the results are returned to host variables. An example is shown here:

set :v_sql_statement = "Select last_name, first_name From Customer Where last_name = ? and first_name = ?";

exec SQL Prepare Read_Customer from :v_sql_statement;

exec SQL Declare Cursor Customer_Cursor FROM Read_Customer;

set :v_in_last_name = 'SMITH'; set :v_in_first_name = 'JOSEPH'; exec SQL Open Customer_Cursor using :v_in_last_name, :v_in_first_name;

loop until no_more_rows Fetch Customer_Cursor into :v_out_last_name, :v_out_first_name; end loop

The syntax used by your favorite database may differ from that used here, but the steps performed will be the same.

 

 

VARYING-LIST DYNAMIC SQL AND THE DESCRIPTOR AREA

The most flexible--and most complex--type of SQL to write is varying-list SQL, where the number of input parameters and output parameters is unknown. To accomplish this, we need to take advantage of a SQL Descriptor Area that describes the SQL statement as well as the number of values it inputs and returns.

To create a Descriptor for a statement where 15 parameters may be needed for input and output variables, code the Allocate Descriptor statements as follows:

Exec SQL Allocate Descriptor in_customer_desc_area with max 15; Exec SQL Allocate Descriptor out_customer_desc_area with max 15;

To Describe Input and Output parameters, use the Describe Input and Describe Output statements, along with a Prepared SQL statement, as follows:

set :v_sql_statement = "Select last_name, first_name From Customer Where last_name = ? and first_name = ?";

exec SQL Prepare Prepared_Read_Customer from :v_sql_statement;

Then, input and output parameter information can be retrieved by analyzing the Prepared SQL statement with Describe Output and Describe Input commands, as shown:

Exec SQL Describe Output Prepared_Read_Customer Using SQL Descriptor out_customer_desc_area;

Exec SQL Describe Input Prepared_Read_Customer Using SQL Descriptor in_customer_desc_area;

These statements put information about the variables into their respective SQL Descriptor areas. Their values must then be supplied using the Set Descriptor statement. To set the values, supply the Descriptor name, the number of parameters, and the data value. For example:

Exec SQL set Descriptor in_customer_desc_area value 1 DATA 'SMITH'; Exec SQL set Descriptor in_customer_desc_area value 2 DATA 'JOSEPH';

The SQL statement can then be run as follows:

Exec SQL Execute Prepared_Read_Customer Using in_customer_desc_area INTO out_customer_desc_area;

Again, this syntax may not match the syntax of your database of choice, but the steps performed will be the same. Describing SQL and working with varying lists of input and output values can be complicated, and you should fully understand the exact implementation by your DBMS before you decide to use this feature.

An example of how different syntax is used to support this feature is the SQLDA (SQL Descriptor Area) approach that is used by IBM's DB2.

 

DISADVANTAGES OF DYNAMIC SQL

Dynamic SQL has the advantages of providing a great deal of flexibility in coding SQL and simplifying program logic when many combinations of input variables are possible. However, there are some disadvantages to consider when deciding whether to use Dynamic SQL. Some of these are:

Dynamic SQL can be difficult to code and error-prone. Understand it completely before using it in a production program. Security can be an issue if the end user is given direct access to a table rather than an application or Binding ID. Some DBMSs get around this through SQL checking performed at Bind time rather than run-time. Performance can suffer from the extra steps that may need to be repeatedly performed at run-time. Some of these include steps to compile, parse, bind, optimize, and perform security checking of the statement. Consider carefully whether Dynamic SQL is right for you. Almost all of your production SQL will not require it.

--------------------------------------------------------------------------------

 

 

 

WHAT IS A CLUSTER RATIO?

Cluster ratios are an important factor to consider when analyzing indexes and table access. The cluster ratio is the percentage of table data that is sorted in the same order as a specific index.

An example of this is shown by two indexes on the Customer Table. Index1 is on the Customer_id, which is in random order and has a cluster ratio of 40 percent. So, the table is not ordered by Customer_id. This is not a problem, since all access through this index is a direct read and does not access many rows in a single query.

The second index, called Index2, is on the column Customer_Name and has a cluster ratio of 100. In other words, the Customer Table data is sorted in Customer_Name order. This is important, because a large number of searches by Customer_Name use wildcards (such as "Select * from Customer where Customer_Name Like 'SM%'"), and it is valuable to access index entries adjacent to each other and data entries that are also adjacent to each other in the same order. This will, in fact, lower the number of index and data blocks that need to be read to satisfy a query on Customer_Name and will improve performance.

Consider cluster ratios and the ways that your tables are queried when deciding how table data should be ordered.

--------------------------------------------------------------------------------

 

GENERATING SQL WITH SQL

In some situations, you will need to run quick SQL on a database--but you don't have the SQL at your fingertips. Some functions can be generated through a SQL query using imbedded literals in a SQL statement.

For example, let's say that you are asked to grant Select access to all of your tables to another user in the system. In other words, issue the Grant Select to userid_of_choice on TABLE statement. This can be generated quickly in two steps:

1. Run SQL against the database catalog using embedded literals. 2. Use the output from step 1 as the SQL input to step 2.

Here's an example of generating this in Oracle:

SELECT 'grant select on ' || table_name || ' to userid_of_choice;' From user_tables;

A similar approach can be used in DB2 using the systables catalog table and in SQL Server and Sybase using sysobjects.

--------------------------------------------------------------------------------

 

CARTESIAN PRODUCTS

An unfortunate result of SQL Joins with incorrect Where clauses is the cartesian product. This can occur when two tables are joined without using a Where clause.

Consider a case where an Employee table with 250 rows is joined to a Department table with 15 rows. If performed properly, the result set will include 250 rows, since each employee belongs only to one department. The correct SQL would look like this:

Select emp.*, Dept.* >From Employee emp, Department dept Where emp.department_id = dept.department_id;

A cartesian product is created when the Where clause is removed from this statement so that it is coded as follows:

Select emp.*, Dept.* >From Employee emp, Department dept;

With this SQL statement, every Employee row will be joined to every Department row. This must occur, since it is impossible for the database to determine how the rows will be matched up. The statement will therefore return 250*15 rows, for a final incorrect result set of 3,750 rows. This number isn't too huge; there are cases where millions of rows can be returned.

This incorrect result of every row from one table being joined to every row from another table is known as a cartesian product. It is also the source of many hilarious jokes between DBAs, such as: "What did you do? Create a cartesian product?!?!?"--unbelievably funny.

--------------------------------------------------------------------------------

 

ODBC--WHAT IS IT?

Microsoft's Open Database Connectivity (ODBC) is a widely used method to allow applications and tools to connect to different database types using a common API and database calls. This is based on X/Open and ISO Call Level Interface (CLI) specs that separate database connectivity into multiple layers. They can be thought of simply as a server-side and a client-side layer. This allows vendors to plug different database interfaces on the server side and different client interfaces on the client side.

ODBC allows for one set of interfaces to the database, one set of database access syntax, one set of data types, and one set of error codes by your applications to different database types.

--------------------------------------------------------------------------------

 

FOUR THINGS THAT MAKE UP ODBC

Open Database Connectivity (ODBC) consists of four layers:

1. An ODBC application, which calls the ODBC interface. Apps can be written in tools such as Visual Basic and PowerBuilder. They connect to the database, issue SQL, retrieve results from the database, check return codes, and disconnect from the database. These call ODBC functions that submit SQL requests and retrieve results.

2. The ODBC driver manager is called by the application, and this DLL in turn loads the appropriate ODBC driver to be used for this app. It routes function calls from the applications to the proper ODBC driver.

3. ODBC drivers are specific to a database. There are ODBC drivers for all major relational databases, such as SQL Server, Adaptive Server, Oracle, DB2, UDB, and Informix. Many nonrelational databases also support ODBC drivers to overcome their lack of SQL support. The ODBC driver processes ODBC function calls, submits requests to the database, and returns results to the driver manager.

4. The data source is the actual database to connect to. This would specify the exact database to connect to and the userid to connect with. In other words, a single ODBC driver may be used to set up many data sources.

--------------------------------------------------------------------------------

 

BE CAREFUL OF POTENTIAL ODBC OVERHEAD

ODBC is slower than native APIs for connecting to most databases. When not accessing a Microsoft database, use the native APIs, when possible. For example, with Oracle, ODBC sits on top of Net8 or SQL/Net and adds an extra layer to database access and connectivity.

For Microsoft SQL Server, however, ODBC sits at the same layer as DB-lib on the network stack, and the extra layers of code are avoided. This is the most popular database driver with SQL Server, but it will be surpassed by OLE DB, which is still relatively new.

--------------------------------------------------------------------------------

 

ODBC 3.0

ODBC 3.0 is the latest version of ODBC. It has new functions to simplify connections, initialization, and processing of server-side code such as stored procedures. When obtaining ODBC drivers from vendors, inquire about the level of ODBC compliance that their drivers support. Compliance with ODBC 3.0 may solve many problems that you encounter with ODBC.

--------------------------------------------------------------------------------

WHERE CAN I GET ODBC DRIVERS?

ODBC drivers can be purchased or obtained from many sources. They come with most major databases and can be downloaded from most major database Web sites such as IBM, Microsoft, Oracle, Sybase, and Informix. Third-party products are also available from companies such as Intersolv (now called Merant). Performance of different products varies, so test the different drivers if this is a concern.

--------------------------------------------------------------------------------

 

PARALLEL QUERY

Many of today's popular databases support a feature that allows database access to be performed in parallel. The specifics of how each database does this differ between implementations, but the basics are the same.

In the case of a query that is accessing a large number of rows, the Parallel Query Server will split the query into multiple parallel streams that read separate partitions of data. These partitions may be physical in that they are separate physical files. The partitioning may also be logical in that the database engine separates the data into two or more ranges of keys that span a single physical file. These parallel processes that read the data may be thought of as "producers," in that they gather the data and pass them on to the next stage.

The second stage of a Parallel Query implementation is where multiple query processes can perform sorting and other similar operations in parallel. These can be thought of as "consumers" of the data.

The final step is where a single Parallel Query coordinator gathers the data from the multiple query process consumers and merges all rows from the two or more consumer engines and returns the results to the application that issued the original query.

This processing can make large queries and table scans perform very quickly. The SQL usually does not need to be altered by the developer to take advantage of this feature, although Database Hints can sometimes be helpful. However, the database designer--or DBA--can assist this process by partitioning tables and implementing a database on a multi-CPU SMP machine or on an MPP-based architecture.

--------------------------------------------------------------------------------

 

INDEX MERGES

In some situations, more than one index can participate in a query through multi-index access. Consider the case of a Customer table that has two indexes. The first contains the column customer_name and the second contains the column postal_code. If a query needs to be written to return rows for anyone with a last name of SMITH or with a postal code of A1A1A1, the Select statement may be written as shown here:

Select * >From Customer Where customer_name = 'SMITH' or postal_code = 'A1A1A1';

Based on this query and the indexes on the Customer table, both indexes may be used. The customer_name index will be scanned for all SMITHs and the postal_code index will be scanned for A1A1A1. The successful index "hits" will all retrieve ROWIDs or RIDs, which subsequently will be merged. Once this index merge is complete, the table data rows will then be accessed.

Most major relational databases support this feature, but index access depends greatly on the characteristics of the data in the table and the indexes that have been created.

--------------------------------------------------------------------------------

 

TABLE ACCESS THROUGH VIEWS, SYNONYMS, AND ALIASES

Database tables can be accessed directly by naming the table itself in a query. In some cases, however, you may want to "redirect" access to a table by having a name used in a SQL statement point to a different physical table. This can be accomplished through the use of views, synonyms, and aliases.

Using one of these features can deliver the following capabilities:

renaming columns and tables without affecting application programs accessing tables with a different owner ID than the one used to issue the query querying a different database in a transparent manner to the querying application combining multiple joined tables to look like a single table and doing this in an optimized manner implementing security Views are implemented by all major relational databases; synonyms and aliases are unique implementation features.

--------------------------------------------------------------------------------

 

THE USE OF SYNONYMS

Although they may appear to be the same at first, views and synonyms have very different features.

Synonyms are simply a way of renaming a table or view to give it a private name and to reference an object without specifying its owner. They can be used to access remote objects so that the remote object appears to be local to the querying application. A synonym can be created on a table as follows:

Create synonym Synonym_Name for Owner.Table_Name;

Security cannot be issued directly on the synonym, since access must be granted on the underlying table or view. Also, notice that specific columns cannot be chosen in the synonym definition. It is merely a table or view rename.

In Oracle, a special feature called "public" synonyms can be used to allow all users to access a specific object (table or view) through a public synonym without needing to know the object's owner or creator ID.

Except for public synonyms, synonyms are personal objects that cannot be shared across userids.

--------------------------------------------------------------------------------

 

VIEWS

A view can be thought of as a "virtual" table. It has its own Owner (or Creator) name and can have its own View name as well as its own Column names. The view can be created on other tables and views. Views can also be created as joins between multiple tables. An example of view creation is shown here:

Create View as Select * from Customer, Customer_Address where customer_cust_id = cust_addr_cust_id;

Here are some important points to note regarding views:

Grants may be issued on views. Inserts, updates, and deletes may be issued on views except in the following cases: They are join views (except in the case of Oracle's Instead Of triggers); a column not included in the view either is nullable or has default values; functions are used in the view; or a view contains a constant value. Views may be shared and used by multiple userids when security has been accessed to those userids. Views can be used to simplify SQL processing. Views can be used to derive values based on table columns. When objects are dropped, dependent views are also dropped or disabled.

--------------------------------------------------------------------------------

 

ALIASES IN DB2

Aliases are used in DB2 as a distributed synonym. They allow access to distributed objects in a manner that is transparent to the issuing SQL statement.

These are different than synonyms in that multiple users can use the same alias, where users can use only their own synonyms (except for the case of Oracle's public synonyms). Also, when a table is dropped in DB2, aliases are dropped, but synonyms are not (in Oracle, synonyms are not dropped and may be used to access remote objects). Sample Create Alias syntax is shown here:

Create Alias for Table_name/View_name;

--------------------------------------------------------------------------------

 

GENERATING KEYS IN SQL SERVER

Unique keys may be created on a table in SQL Server using the Identity feature. This is a column property that allows a Not Null column to be automatically set and incremented when a new row is inserted. It is important to note that the Identity column cannot be nullable. An example of this feature is shown in the Product table here:

Create table Product (product_id int Identity (50000,1) not null, product_description char(50));

Note that the Identity column has an initial value of 50000 for the first insert and will be incremented by 1 for all subsequent Insert statements. The cust_id column also does not need to be referenced in Insert statements, since it will automatically be set.

--------------------------------------------------------------------------------

 

SETTING GLOBAL UNIQUE KEYS IN SQL SERVER

A feature similar to Identity is implemented through the uniqueidentifier property, which stores a global, unique, 32-digit hexadecimal value in the column that uses it. When implemented, this column is 16 bytes long. An example of this feature is shown in the Customer table here:

Create table Product (product_id uniqueidentifier, product_description char(50));

To insert rows into this table, you must use the newid() function to set the uniqueidentifier. An example of an insert into the Product table is

Insert into Product (product_id, product_description) Values (newid(), 'Database Product');

Create the uniqueidentifier as Not Null to use it as a primary key.

--------------------------------------------------------------------------------

 

GENERATING KEYS IN ORACLE

New keys can be generated in Oracle through a feature called a sequence. These are objects (such as tables, views, and indexes) that can be used by multiple users to generate primary keys. A sequence is created using the Create Sequence statement as shown here:

CREATE SEQUENCE Customer_Seq Increment by 1 Minvalue 1 Maxvalue 1000000000 Nocycle Start with 1 Cache 200 Order;

This sequence is called Customer_Seq and can be used to create a primary key such as customer_id on the Customer table. This sequence starts at 1, is incremented by 1, has a maximum value of 1 billion, and is processed in the order of request for a new sequence. Also, the next 200 entries are cached in memory for quicker retrieval by applications.

To use sequences, use the Oracle pseudocolumns Currval and Nextval. To see the current value of a sequence, you can use this Select statement:

Select customer_Seq.currval >From Dual;

To insert a row into the Customer table using the next sequence value, the Insert statement can be coded as follows:

Insert into Customer (customer_id, customer_name) Values (customer_seq.nextval, 'Database Company');

This simple method for generating new keys in Oracle databases is an Oracle-specific feature.

--------------------------------------------------------------------------------

 

OUTER JOINS--WHAT ARE THEY?

The most common and simple type of join to perform is an inner join. In this join, qualifying rows from one table are returned when they match qualifying rows from another. It is also referred to as a natural join. An example of this is shown in a join between two tables: Customer and Invoice, where we want to see all invoices for all customers named SMITH:

Select * >From Customer C, Invoice I Where C.customer_id = I.customer_id and C.customer_name = 'SMITH';

This will return all customers named SMITH that have been invoiced. Suppose, however, that you wanted to see all customers named SMITH regardless of whether they have been invoiced. To accomplish this, we need to do an outer join between Customer and Invoice. An example of this (using ANSI SQL) is shown here:

Select * >From CustomerC LEFT OUTER JOIN Invoice I ON C.customer_id = I.customer_id Where C.customer_name = 'SMITH';

An outer join, therefore, is a join between two tables where you want to see rows from at least one of the tables, regardless of whether a related row exists on the other table.

--------------------------------------------------------------------------------

 

LEFT OUTER JOIN AND RIGHT OUTER JOIN

As stated in our previous tip, an outer join is a join between two tables where you want to see rows from at least one of the tables regardless of whether a related row exists on the other table. If the table that contains the rows that you want to see is on the left of the query, it is called a left outer join; if it is in the right, it is a right outer join.

An example of a left outer join is one where all customers will be returned, and those with invoices will also have their invoice information retrieved. If a customer does not have invoices, its customer information will be retrieved anyway, and the invoice information will be null. This can be written as a left outer join, as shown here:

Select * >From Customer C LEFT OUTER JOIN Invoice I ON C.customer_id = I.customer_id;

It can also be written as a right outer join, like this:

Select * >From Invoice I RIGHT OUTER JOIN Customer C ON I.customer_id = C.customer_id;

Left and right outer joins are opposite concepts, but they can be used to accomplish the same thing.

--------------------------------------------------------------------------------

 

FULL OUTER JOIN

A full outer join is one where all rows from Table 1 are joined to Table 2 and all rows from both tables (matching the Where criteria) will be shown whether or not there is a row in the other table. An example of this is a case where we want to show all customers that live in a specific Zip code and all invoices with a value greater than $100,000, regardless of the Zip code they are in. This code can be written as follows:

Select * >From Customer C FULL OUTER JOIN Invoice I ON C.customer_id = I.customer_id Where C.zip_code = 12345 or I.total_value >= 100000;

This could return the following results:

CUST_ID ZIP_CODE INVOICE_ID INVOICE VALUE ------- --------- ---------- -------------- 1 12345 70 100 2 67890 80 200000 null null 85 175000 7 12345 null null This shows the following cases:

1) Row 1 with Cust_Id 1 has the correct Zip code (12345) as well as an invoice, even though the invoice value is less than $100,000. 2) Row 2 with Cust_Id 2 has an invoice value of $200,000, and this--as well as the customer information--are shown, even though the customer is in a different Zip code than requested in the query. 3) Row 3 with Invoice_Id = 85 does not have the corresponding customer information (obviously a data integrity problem, especially when we don't know about one of our largest customers), so the customer information is shown as having null values. 4) Row 4 with Cust_Id 7 is shown, even though it has no invoices, because it has the correct Zip code of 12345.

So, you can see that the full outer join combines all data that would be shown through a natural, left outer, and right outer join.

--------------------------------------------------------------------------------

 

DATABASE AND OUTER JOINS

When coding outer joins, favor left outer joins over right outer joins. Joins will be easier to understand when they are performed the same way. Some databases (such as DB2) will convert a right outer join to a left outer join, so the Explain may be confusing to read once this conversion has taken place.

DB2 and SQL Server syntax are ANSI-92 compliant, such that an outer join may be coded like this:

Select * From Customer C FULL OUTER JOIN INVOICE I ON C.customer_id = I.customer_id;

Oracle syntax is different and is not ANSI-compliant. To perform an outer join so that all customers are shown regardless of whether they have invoices, code a (+) beside the Invoice table name. This left outer join would be written as shown:

Select * From Customer C, INVOICE I ON C.customer_id = I.customer_id (+);

--------------------------------------------------------------------------------

 

OUTER JOINS AND UNIONS

Outer joins may be coded as a series of SQL statements using the Union operator. Consider the following full outer join:

Select c.cust_id, c.zip_code, i.invoice_id, i.invoice_value >From Customer C FULL OUTER JOIN Invoice I ON C.customer_id = I.customer_id Where C.zip_code = 12345 or I.total_value >= 100000;

Assume that this query returns these results:

CUST_ID ZIP_CODE INVOICE_ID INVOICE VALUE ------- --------- ---------- -------------- 1 12345 70 100 2 67890 80 200000 null null 85 175000 7 12345 null null This outer join could also be coded like this:

Select c.cust_id, c.zip_code, i.invoice_id, i.invoice_value >From Customer C, Invoice I Where (C.zip_code = 12345 or I.total_value >= 100000) and c.customer_id = i.customer_id UNION Select c.cust_id, c.zip_code, NULL, NULL >From Customer C2 Where C2.zip_code = 12345 and NOT EXISTS (Select INV.cust_id >From Invoice inv Where inv.cust_id = c2.cust_id) UNION Select NULL, NULL, i.invoice_id, i.invoice_value >From Invoice I2 Where I2.total_value >= 1000000 and NOT EXISTS (Select cust.cust_id >From Customer cust Where cust.cust_id = I2.cust_id);

In these Union statements, the first Select performs an inner join, the second Select gets all customers in zip_code 12345 without invoices, and the third Select gets all invoices with total_value >= 100000. Note that the left outer join and right outer join are performed as correlated Subselects.

--------------------------------------------------------------------------------

 

DECIDING ON OUTER JOIN VS. UNION/SUBSELECTS

We have two ways of performing outer joins: The Outer Join syntax can be used, or the SQL can be written as a series of Select statements that are Unioned together, where one Select performs an inner join and the others can simulate the outer join by performing a Subselect with the NOT Exists clause. Both can return the same results, so which should you use?

The Outer Join statements are much more efficient than the approach that combines Union statements and Subselects. The reason for this is that the Outer Join syntax allows the optimizer to derive a plan that processes each table only one time. The Union + Subselect approach requires the tables involved to be processed multiple times--once for each Select statement.

If you are coding SQL to run on multiple databases, consider using the Union + Subselect approach, since this will be portable across databases, while Outer Join syntax is not. Remember to consider the performance impact when making this decision.

--------------------------------------------------------------------------------

 

DATABASE OPTIMIZERS

A common feature of relational databases is the cost-based optimizer. The optimizer uses statistics stored in a database catalog and combines these with incoming SQL statements to determine which access paths to choose. The types of statistics that are used by the optimizer are number of rows, number of pages containing rows, data distribution/skewing statistics as shown through histograms, index leaf page count, cardinality of the first index column, cardinality of the entire index, index cluster ratio, and number of index levels.

Other environmental information will also be used by the optimizer. Some of these things include the CPU type and speed, as well as data buffer/cache size, sort area size, and storage information.

Based on this information, the optimizer will then calculate the CPU cost and I/O costs of executing a SQL statement using different access methods. The method with the lowest cost will be chosen.

All relational databases use different algorithms and information to determine access paths. Their approaches, however, are similar, and the access paths chosen are also very alike.

--------------------------------------------------------------------------------

 

OPTIMIZING JOINS--NOT ALL JOINS ARE THE SAME

The ability to join two or more tables using SQL syntax is simple. The most basic form of this join between two tables is shown here:

Select * >From Table1 T1, Table2 T2 Where T1.Table1_Id = T2.Table1_id;

Although this syntax is simple and relatively consistent across DBMSs, the access path chosen by a relational database (RDBMS) may vary considerably because of the database optimization logic, database statistics, and the DBMS itself. A specific join may also be chosen if an SQL "hint" is used to ask the optimizer to use that join technique.

A join can be processed in many ways. Some of the most common are nested-loop, sort-merge (aka merge-scan), hash, star, reformatting, and hybrid joins. This is an important concept to understand, and we will be reviewing the different join techniques over the next several days.

--------------------------------------------------------------------------------

 

SORT-MERGE JOINS

The sort-merge join--also called a merge-scan--is a very popular join performed when accessing a large number of rows. It is also seen in background tasks, batch processing, and decision-support systems.

Take an example of two tables being joined and returning a large number of rows (say, thousands), as follows:

Select * >From Table1 T1, Table2 T2 Where T1.Table1_Id = T2.Table1_id;

The merge-scan join will be chosen because the database has detected that a large number of rows need to be processed, and it may also notice that index access to the rows is not efficient, since the data is not clustered (ordered) efficiently for this join. Here are the steps to perform this type of join:

1) Pick an inner and outer table. 2) Access the inner table; choose the rows that match the predicates in the Where clause of the SQL statement. 3) Sort the rows retrieved from the inner table by the joining columns and store these as a temporary table (this step may not be performed if data is ordered by the keys, and efficient index access can be performed). 4) The outer table may also need to be sorted by the joining columns so that both tables to be joined are sorted in the same manner (this step is also optional and depends on whether the outer table is already well ordered by the keys and whether efficient index access can be ). 5) Read both outer and inner tables (these may be the sorted temporary tables created in previous steps), choosing rows that match the join criteria. This operation is very quick, since both tables are sorted in the same manner and Database Prefetch can be used. 6) Optionally sort the data one more time if a Sort was performed (for example, an Order By clause), using columns that are not the same as were used to perform the join.

The merge join can be deceivingly fast because of database prefetch capabilities and the fact that each table is accessed only one time each.

--------------------------------------------------------------------------------

 

HASH JOIN

The hash join is being implemented in most major DBMSs and is a very efficient join. With the hash join, one table is chosen as the outer table. This is the larger of the two tables in the join--and the other is chosen as the inner table. Both tables are broken into sections, and the inner table is "hashed." This hashing provides an algorithmic pointer that makes data access very efficient. The database attempts to keep the inner table in memory since it will be "scanned" many times.

The outer rows that match the query predicates are then selected, and for each outer table row chosen, hashing is performed on the key, and the hash value is used to quickly find the matching row in the inner table.

This join often can outperform a sort-merge join, particularly when one table is much larger than another. No sorting is performed, and memory usage is efficient for this join.

--------------------------------------------------------------------------------

 

STAR JOINS--PART 1 OF 2

A particular type of join common to data marts and data warehouses is known as the star join, or star query. This is a join of a large Fact table with two or more smaller tables commonly called Dimensions. Fact tables may be thought of as having transactional properties. An example of a Fact table is Sales, which contains keys and some measures. This is commonly a large table with millions--and sometimes billions--of rows. The Dimension tables are used to describe the Fact table. Examples of Dimension tables are Customer, Product, and Time. Star queries get their name because there is a single Fact table in the middle, and the smaller Dimension tables are directly related to the Fact table. There are two ways that relational databases commonly perform star joins. The first will be discussed in today's tip, and we'll cover the second in our next tip.

Consider the case of a central Fact table and three small Dimension tables related to the Fact table. The first and most simple approach for resolving a star query is to query the three Dimension tables using the predicates in the Where clause to filter out unnecessary rows. The resulting rows from the three Dimension tables can be joined, forming a Cartesian product. This is not as bad as it sounds, since these tables are relatively small, and the rows matching the Where criteria account for a subset of these. This Cartesian product can be sorted and stored as a temporary table. A merge-scan join can then be performed between the resulting temporary table and the Fact table.

Next time we'll look at another approach that the optimizer can take toward resolving star queries.

--------------------------------------------------------------------------------

 

STAR JOINS--PART 2 OF 2

A database optimizer can take a more complex approach toward star joins. Consider the case of the central Fact table that is being joined to three smaller Dimension tables as shown here:

Select * >From Fact, Dim1, Dim2, Dim3 Where Fact.dim1_id = Dim1.id and Fact.dim2_id = Dim2.id and Fact.dim3_id = Dim3.id and Dim1.name like :input_variable_name and Dim2.Descriprion between :input_var1 and :input_var2 and Dim3.Text < :input_variable_text;

The optimizer can transform this by adding three subqueries so that the new transformed query looks like this:

Select * >From Fact, Dim1, Dim2, Dim3 Where Fact.dim1_id = Dim1.id and Fact.dim2_id = Dim2.id and Fact.dim3_id = Dim3.id and Dim1.name like :input_variable_name and Dim2.Descriprion between :input_var1 and :input_var2 and Dim3.Text < :input_variable_text and Fact.dim1_id in (Select dim1.id from dim1 where dim1.name like :input_variable_name) and Fact.dim2_id in (Select dim2.id from dim2 where Dim2.descriprion between :input_var1 and :input_var2) and Fact.dim3_id in (Select dim3.id from dim3 where Dim3.Text < :input_variable_text);

Given this, the Subselects are performed first. If bitmap indexes exist on the Fact join columns, then the bitmap index entries that result from each subquery can be merged (in this case ANDed) together, and the Fact table data can be retrieved using the resulting index values. This operation is very quick. The Fact entries retrieved can then be joined to the Dimension tables to complete the query. Using this approach, a Cartesian product is not required.

Of course, there is one other way that a database may choose to perform a star query--and that is to break the query down to traditional nested-loop, sort-merge, and hash joins as it would with any other query.

--------------------------------------------------------------------------------

 

REFORMATTING JOINS

This join technique is used by Sybase Adaptive Server and many versions of Microsoft SQL Server.

Reformatting is chosen in cases where a suitable index cannot be found on an inner table, and where the inner table is large enough that repeated table scans are not a viable option. The reformatting option creates an internal table from the inner table and a temporary clustered index on it. This clustered index is created on the join column, and the join can then be performed between the outer table and the new temporary clustered index.

This is a variation on the sort-merge join, and it has similar pros and cons.

--------------------------------------------------------------------------------

 

LIST PREFETCH

Accessing indexed data that is not clustered (for example, not ordered by the index) can be inefficient. One way of making this type of data access more efficient is to use a technique called List Prefetch. With this approach, an index is used to retrieve the RIDs (rowids) of the corresponding table row. Once these RIDs have all been accessed, they are sorted by the RID so that they are in order of the physical block of data that must be accessed. Armed with this information, the database can then access table data sequentially, accessing each data block in order--and only once. The database can also take advantage of Sequential Prefetch, where many blocks of data may be read in a single I/O.

In the case where an Order By clause exists in the SQL, a sort may be required to put the data in correct order.

--------------------------------------------------------------------------------

 

MULTIBLOCK I/O

All major relational databases (RDBMSs) have a feature that allows for Multiblock I/O and, in some cases, Sequential Prefetch.

Multiblock I/O allows for a single I/O to read many blocks of data at one time. An example of this is a case where Multiblock I/O is set to 64, allowing for 64 blocks to be read as a single I/O. This feature is excellent when large volumes of data are read by a query, since it means that almost all data to be read will already exist in the database buffer/cache when it is needed. The only time we need to wait for a physical I/O is every 64th block, when the next group of blocks is to be read.

In Oracle, this is an initialization parameter that can be set to different values.

--------------------------------------------------------------------------------

 

SEQUENTIAL PREFETCH

An enhancement to Multiblock I/O is Sequential Prefetch. This uses the same feature as Multiblock I/O in that several blocks will be read into the buffer at one time, rather than the database reading a single block at a time. For increased system efficiency, the number of blocks being read should be a multiple of the database page size. Therefore, numbers such as 32 or 64 blocks are commonly read at one time.

Sequential Prefetch is an improvement over Multiblock I/O because the next group of blocks will be read as the current blocks are processed. Thus, as rows begin to be processed from this set of blocks (say, 64 blocks of data), the next 64 are being read into the buffer, so the database stays ahead of all physical I/O needs. This structure eliminates the need to wait for any physical database I/O, since all needed data has already been read into the buffer. In Multiblock I/O, a wait for a physical read may occur every 64 blocks.

--------------------------------------------------------------------------------

 

SEQUENTIAL DETECTION

In some cases, a database may detect that sequential processing is being performed and may turn on Sequential Prefetch to speed up a query. This is called Sequential Detection. The following sample algorithm as used by DB2 demonstrates this point.

If the Sequential Detector value is 32, then in cases where nine out of 16 pages (think of the pages as blocks) are sequential out of 32 pages, then sequential detection is turned on automatically by the DBMS, and the next group of pages (or blocks) will be read.

The algorithm for sequential detector S is: If (S/4 +1) out of S/2 are accessed sequentially within a group of S, then Sequential Detection can be turned on. Similarly, Sequential Detection can be turned off if this condition is not met.

--------------------------------------------------------------------------------

 

THE HYBRID JOIN

The hybrid join is unique to DB2 and is a mixture of the nested-loop join and list-prefetch features. The hybrid join works as follows.

Consider the SQL statement where a medium-sized (for example, hundreds of rows) result set will be returned:

Select * >From Customer outer, Invoice inner Where outer.customer_name = inner.customer_name and outer.customer_name = 'SMITH';

The hybrid join is chosen and the following steps will be performed. First, access the outer table by index or by sorting the data so that outer-table access will be in the order of the join column on the inner table. The outer table will be accessed by customer_name.

Once an outer table row is retrieved, access the corresponding index rows in the inner table. Combine the inner-table RID (rowid) from the inner-table index, and attach this to the row of data from the outer-table rows retrieved. A temporary table has been created that consists of the outer-table row and the inner-table RID.

Next, sort the temporary table by the inner-table RID. This action puts the temporary table in the same order as the inner-table rows. Join the temporary table to the inner table using the RID on the temporary table row to get the row on the inner table. This will use List Prefetch to access the inner table. Concatenate the inner-table columns to the outer-table columns in the temporary table.

If an Order By existed in the Query (that is not the same as the joined columns), then sort the resulting rows before returning them to the calling programs.

--------------------------------------------------------------------------------

 

COMPARING HYBRID, NESTED-LOOP, AND MERGE JOINS

The hybrid join is useful for joins involving a medium-sized number of rows and where a qualifying index exists on the inner table.

For larger joins, the merge scan will be used; for smaller ones, the nested-loop join technique is the most efficient. The hybrid join "sweet spot" is small, and a slight shift in data volumes can render it less efficient than an alternate join technique. This is likely the reason why database vendors--other than IBM--have not picked up on this technique.

--------------------------------------------------------------------------------

 

ANTI-JOINS--PART 1 OF 3

Anti-join describes a join that returns rows from one table that do not exist in the other. There are many ways to perform anti-joins with SQL. You can use the following expressions to do this: NOT IN, NOT EXISTS, OUTER JOIN, MINUS, and EXCEPT.

We have already covered the first three statements in previous tips. These are implemented in most major RDBMSs. The last two have more limited implementations. The MINUS keyword is not ANSI-compliant, while the EXCEPT keyword is compliant with the ANSI SQL2 standard. These will be covered in the next two SQL tips.

--------------------------------------------------------------------------------

 

ANTI-JOINS--PART 2 OF 3: USING MINUS

Some databases use the MINUS operator to return rows from one table that do not exist in another. Here is an example of a SQL statement using MINUS:

Select customer_id, customer_name from Customer MINUS Select customer_id, customer_name from Invoice;

This statement will show all Customers that have not been invoiced. Notice that when using MINUS, the number of columns and column types must be the same for the two queries. In this way, it is similar to the way that a Union statement operates. It is also limiting, since all of the same columns and values used in the statements must exist in both tables--therefore, there are many cases where the MINUS operator will not be useful.

The implementation of the MINUS operator is implemented in Oracle.

--------------------------------------------------------------------------------

 

ANTI-JOINS--PART 3 OF 3: USING EXCEPT

The EXCEPT statement is included in the ANSI SQL2 standard, although it is not implemented in many popular databases that advertise themselves as fully SQL2 compliant.

Here is an example that lists all Customers that have not been invoiced:

Select customer_id, customer_name from Customer EXCEPT Select customer_id, customer_name from Invoice;

With the above example, the number of columns and column values must be identical. The EXCEPT statement has two additional features. With EXCEPT, we can use the EXCEPT CORRESPONDING or the EXCEPT CORRESPONDING BY (column list).

With EXCEPT CORRESPONDING, all rows from the first Select statement will be shown if they are not in the second SQL statement, and the matching is done for all columns that have the same name and data type on both tables.

With EXCEPT CORRESPONDING BY (column list), all rows from the first Select statement will be shown if they are not in the second SQL statement, and the matching is done for all columns in the column list--and these must be of the same name and date type on both tables.

The EXCEPT condition therefore has some important features to compare data between two tables.

--------------------------------------------------------------------------------

 

JOINING TABLES USING INTERSECT

Tables may be joined using many approaches. The most common approach is to use a natural join, as shown below:

Select c.customer_id >From Customer c, Invoice i Where c.customer_id = i.customer_id;

Another way to accomplish the same thing is to use the INTERSECT operator. Using this approach, the above statement could be written as

Select customer_id from Customer INTERSECT Select customer_id from Invoice;

As is the case with the EXCEPT clause, the INTERSECT operator has two additional features. With INTERSECT, we can use the INTERSECT CORRESPONDING or the INTERSECT CORRESPONDING BY (column list).

With INTERSECT CORRESPONDING, all rows from the first Select statement will be shown if they are in the second SQL statement, and the matching is done for all columns that have the same name and data type on both tables.

With INTERSECT CORRESPONDING BY (column list), all rows from the first Select statement will be shown if they are in the second SQL statement, and the matching is done for all columns in the column list--and these must be of the same name and date type on both tables.

The INTERSECT condition therefore has some important features to compare data between two tables.

--------------------------------------------------------------------------------

 

DYNAMIC SQL WITH DB2

DB2 for OS/390 (now called DB2 Universal Database--or UDB) has some important features that can be used to implement dynamic SQL and improve performance. Some of these important performance features are listed below.

Use the keepdynamic bind option. This keeps dynamic SQL prepared across commits, eliminating the need to perform the relatively expensive prepare operation after every commit has been performed. Use dynamic SQL with actual values rather than host variables. This will allow the optimizer to perform more accurate cost estimates. In DB2, use dynamicrules(bind) to perform security checking at bind time rather than having this done at run-time. This eliminates the need to grant select, insert, update, and delete access to tables for individual users. With this feature, the user can be granted access to a Plan or Package. Using these features with DB2 will improve system security and performance.

--------------------------------------------------------------------------------

 

B-TREE INDEXES--WHAT ARE THEY?

All popular RDBMSs use B-tree indexes as their most common access path. B-tree stands for Balanced tree--the database attempts to balance the levels across multiple branches. This index is a hierarchical index that contains a header block (some DBMSs use the term "page" or "node" for the term "block") at the top level. The header (root) block contains keys and pointers to the intermediate (branch) blocks, which in turn contain keys and pointers to the leaf block. At the leaf level, all keys are entered with pointers (rowids) to the table data. Every row is represented at the leaf level, while every leaf block has an individual entry at the intermediate level and every intermediate-level block has an entry at the header level.

--------------------------------------------------------------------------------

 

B-TREE INDEXES--HOW ARE THEY POPULATED?

A B-tree index keeps all index entries in order. Individual entries will be kept in physically continuous blocks but may become logically continuous if a block fills up and an index split must be performed. Within a leaf block, the index entries are not kept in physical order but in logical order.

As the table and index are populated, only a header block is first populated. At this stage, the B-tree index is a one-level (and one-block) index. Once this becomes full, leaf blocks are created, and the header block is transformed to point to the first two new leaf blocks. This is now a two-level index. Once the header block again becomes full, intermediate blocks are created. The header block is transformed to point to each intermediate block, and the intermediate blocks are transformed to point to the leaf blocks. The leaf blocks still point to the individual rows on the table. At this stage, we have a three-level index, which is useful for accessing millions of rows. Once the intermediate-level blocks become full, an extra level of intermediate-level blocks can be created, making this a four-level index.

--------------------------------------------------------------------------------

 

B-TREE INDEXES--THE IMPORTANCE OF THE NUMBER OF LEVELS

Keeping track of the number of levels in B-tree index structures is important, since this can affect the number of logical and physical reads that are required to access a row of data. Accessing a single row of data using a three-level unique index and a perfectly written query requires that four blocks of data be accessed--the index header, the intermediate-level block, leaf blocks, and the table data. Once an index moves to four levels, an extra I/O may be required. When you see this occur, reorganize your indexes, which could bring the indexes back to a three-level one. In fact, indexes should be reorganized regularly to remove any index block-splitting that may have occurred.

You will most commonly see three-level indexes in your database.

--------------------------------------------------------------------------------

 

B-TREE INDEXES--WHAT ARE THEY GOOD FOR?

B-tree indexes are good for most common structured data types that do not have large numbers of duplicate values. In particular, B-trees are ideally suited for systems that have the following characteristics:

Data types are structured having types of Number and Character. The index columns have a relatively high cardinality--that is, a high degree of uniqueness. A great deal of data access is random or retrieves relatively few numbers of rows. B-trees are also ideally suited for these circumstances:

Queries that perform small-range scans of data. Large-range scans of data where the table data is ordered (clustered) by the index being used. Select access and controlled insert, update, and delete activity. On the other hand, B-trees are not ideal for the following circumstances:

Indexed columns with low cardinalities (degree of uniqueness). Unstructured data. For example, specialized data types (such as spatial data) require that alternative indexing techniques be adopted. Large-range scans of table data where the data is not ordered by the index. In this case, reading the table data and sorting it is more efficient than using a B-tree index. High volumes of insert, update, and delete activity on indexes that impact performance. To conclude, B-trees support most of our business indexing requirements, but in some cases, they may degrade performance.

--------------------------------------------------------------------------------

 

INSERT, UPDATE, AND DELETE PERFORMANCE ON INDEXES

Changing column values that are indexed by B-tree indexes affects system performance. In today's tip, we will cover the performance impact that Delete statements have on B-tree indexes.

Deleting a row that has indexed values has a relatively large impact on system performance. When a row of data is physically deleted, the index entries must be deleted as well, which means that the entry on the leaf block (which points to the data row) must be deleted. Remember that an intermediate block (assuming a three-level B-tree index) will have a single entry that contains a high key for each leaf block; if the row being deleted happens to be the entry on the intermediate block, it must be deleted and a new value must be placed in the intermediate block that corresponds to the new high value on the leaf block. Similarly, the index header block contains a single high value for every intermediate block. If the intermediate-block value that was changed was also the highest value for the intermediate block, then the header block must be changed to enter the new high value of the intermediate block.

This potential cascade-up of index-entry deletions slows performance and can even cause locking issues.

Next time, we will discuss the effect of inserting new entries in an index.

--------------------------------------------------------------------------------

 

INSERTING AN INDEXED VALUE

Inserting a row into a table incurs significant index maintenance overhead.

When an indexed row is inserted, a new index leaf-block entry must be added. This entry must be placed in the correct leaf block, since the index entries are always kept in order. Some shifting of entries in the leaf block may be needed to keep the values in that block in the correct logical sequence. If the new entry is the new high value in the leaf block, then the intermediate block (assuming a three-level B-tree index) will need to be changed to put the new value pointing to this leaf block in place of the previous high value. Similarly, if this is the new high value of the intermediate block, the header block must be changed to replace the old high value pointing to the intermediate block with the new one.

If this were not complicated enough, an insert into an already full leaf block causes a leaf-block split. In this situation, half of the leaf-block entries remain in the same place and a new leaf block will be created with the other half of the entries; the new inserted value will be placed on the correct leaf block. This causes two changes to occur on the intermediate block. First, the original leaf block may need to have the original high value replaced with the new one. Second, the new leaf block will need to be added to the intermediate block. This may also cascade-up to the header block.

In the case that an intermediate block is full and a new entry is being added, a "block split" may also occur at the intermediate-block level with similar changes being performed at the header block.

Index inserts can affect both system performance as well as locking contention. It is also easy to see why regular index reorganization goes a long way toward solving system performance problems.

--------------------------------------------------------------------------------

 

THE IMPACT OF UPDATING INDEXED COLUMNS

You can think of updating an indexed value in the same manner as performing a delete and an insert of the indexed value. An update can incur all of the overhead associated with deletes and inserts that were covered in earlier tips.

When updating columns in your system, be aware of indexed columns and don't include columns that do not change in your Update statements because of the added system overhead that may be incurred.

--------------------------------------------------------------------------------

 

THE LOGICAL DELETE

If your system performs a large number of deletes and the performance of these physical deletes (that is, performing the deletes with the SQL Delete statement) makes system performance inadequate, an alternate design approach may be used. This approach uses the concept of the "logical delete."

With this approach, whenever a delete needs to be performed, a special column (call it audit_change_type) will be updated to D. >From now on, this row will be treated as if it were not there. This can be accomplished by coding programs to ignore all rows with an audit_change_type of D or to access the table through a view that filters out these rows.

A background task that archives these rows and physically deletes these logically deleted rows can then be scheduled to run during times when the system is quiet. Table or index reorgs can also be scheduled to correspond with the archive/purge process.

This process avoids heavy index maintenance activity that deletes incur during busy system times and gives you more control over database performance.

--------------------------------------------------------------------------------

 

HANDLING A LARGE NUMBER OF INSERTS

You can use several approaches to relieve the overhead that high insert activity imposes on a system.

The first and most simple approach is to provide a lot of free space on tables and indexes that have a large number of inserts. This can alleviate some of the block-split activity that occurs on indexes and is particularly effective when insert activity is random. There is no benefit to this method when insert activity is sequential, since the high values will always go to the end of the pageset (table or index).

Another approach is to drop problem indexes before the insert activity occurs and to re-create the indexes after the heavy activity has ended. In this way, index maintenance will be avoided and the indexes will stay well organized.

--------------------------------------------------------------------------------

CLUSTERING INDEXES AND CLUSTER RATIO

Clustering indexes are indexes created on tables that are in the same order as the table data. Because of this, only a single clustering index can be created on a table--and, of course, other nonclustering indexes may also be created on a table. A clustering index may be unique or nonunique.

An example of a clustered index is the case of a Customer table where a clustering index is created on the customer_name column. In this case, the table data will be ordered by customer_name. This is extremely useful for systems that want to retrieve data by customer_name, want to use an index to access the data, and want customer_name ranges to be physically located on a small number of index and table blocks.

Database implementations of this technique vary, and it is important to understand the clustering implementation of your favorite DBMS. In upcoming tips, we will cover DB2, Sybase, SQL Server, and Oracle implementations of this tip.

--------------------------------------------------------------------------------

 

CLUSTERING INDEXES IN DB2

When a clustering index is created on a DB2 table, it is created in the same manner as other indexes. In other words, the index is implemented separately from the table data. Only one clustering index can be created on a DB2 table. Once the clustering index has been created, DB2 will attempt to place any new data in clustering order. This is possible when there is significant free space in the table block where a data row should be placed to keep it in the correct order. If the correct data block (or "page") is full, then DB2 will attempt to place the data row in a close block/page that will likely be read when Sequential Prefetch is performed along with other values similar to this.

In other words, index entries are always ordered and DB2 will attempt to order all table data, but this is not guaranteed. A table reorg will put all data back in clustered order. A table is considered to be clustered if it is 95 percent in clustering order.

--------------------------------------------------------------------------------

CLUSTERING INDEXES IN SQL SERVER AND SYBASE

Clustering indexes are implemented the same way in Microsoft's SQL Server and Sybase's Adaptive Server. They are referred to as clustered indexes. As was the case with DB2, there can be no more than one clustered index implemented on a table. The difference with this implementation is that the table's data rows are actually the index leaf pages. In other words, rather than the index leaf page pointing to a data row, the table data is appended to the clustering key on the leaf page. Because of this, the data is physically ordered in the same sequence as the index.

These indexes are excellent when many range scans will be performed in the same order as the clustered index. In cases of frequent inserts and deletes, careful thought must be given to the amount of fillfactor (free space) to avoid fragmentation as much as possible. The clustered indexes should be carefully monitored to determine when table reorgs should be performed.

--------------------------------------------------------------------------------

CLUSTERING INDEXES IN ORACLE

Oracle assigns a different meaning to the term index cluster than do other databases. In Oracle, clusters and cluster keys can be used to allow more than one table to physically place related rows into the same segment. This speeds up joins, since the data from multiple tables can be read in the same physical and logical I/O.

Index clusters in Oracle have limited use. These are useful only if the tables using the index cluster are almost always accessed together. If they are not accessed together, then table scans of a single table require more blocks to be read. These also have additional insert overhead to keep the cluster ordered properly. One other important thing to consider is that the cluster's join performance may not be a lot better than table joins. Use this feature only in extreme situations.

Oracle offers another option called index-only tables, which we will examine in our next tip.

----------------------------------------------

 

INDEX-ONLY TABLES IN ORACLE

Oracle offers the new (in version 8) feature of index-only tables (also referred to as IOT and index-ordered tables). These are similar to Sybase and Microsoft's implementation of clustered indexes, since the table data is stored in the index leaf page. This forces the data into the physical order of the index. One major disadvantage of these tables is than no other indexes can be created on them, since a rowid is not associated with the data row. One feature of these is to create a PCTTHRESHOLD, which is the percentage of rowlength to be stored in the B*-tree. The columns frequently accessed should be at the beginning of the table and have a total length smaller than this PCTTHRESHOLD value. The remaining columns will be placed in an overflow tablespace.

Index-only tables have limited use in their current implementations and should be used only when you are sure that just one index will be needed on the table and that the order of the data is very important.

----------------------------------------------

PARTITIONED TABLES--WHAT ARE THEY?

Table partitioning is the capability to implement the data from a single table across multiple datasets. This partitioning can be done in a sophisticated manner as implemented by a DBMS (database management system), or it can be application-controlled. Partitioning is usually done to improve performance in working with very large tables and to facilitate database availability and parallelism.

An example of this would be to partition a table by month so that new data will always be implemented on a brand-new dataset and the oldest months' data can be archived easily and deleted by removing the oldest dataset. This can also be used to allow database utilities to be performed on a specific partition, allowing greater database availability and greater utility performance.

Some common partitioning schemes are vertical/horizontal, hash, round-robin, and range partitions. The implementation of these as well as the major DBMS implementation of them will be covered in the next several tips.

----------------------------------------------

 

VERTICAL AND HORIZONTAL PARTITIONING

The two overriding partitioning approaches that other approaches fall within are horizontal and vertical partitioning. In vertical partitioning, a table is split by columns. In other words, some columns will be implemented in one table and other columns may be implemented in other tables. Vertical partitioning is useful when large performance, space, availability, or backup and recovery benefits will be realized. This type of partitioning is inherent to a system's architecture and design, and applications must be coded to handle this type of partitioning. It is possible but not likely that applications using this data could encounter problems because of the partitioning strategy.

Horizontal partitioning is the situation where some rows from an entity (or object) are implemented in one table, and the remainder of the rows are implemented in other tables. This is the most common type of partitioning, since an entire row is always stored together, and since many DBMSs support this type of partitioning without applications needing to be aware of it.

Common DBMS partitioning approaches of hash, round-robin, and range partitioning are all forms of horizontal partitioning that are implemented by DBMSs. These approaches are generally favored and can often be implemented transparently to an application.

----------------------------------------------

 

PARTITIONING BY DBMS

Some common partitioning schemes are implemented using standard and--in some cases--advanced DBMS features. These are the preferred approach, since they are supported by the DBMS vendor and can be implemented in a manner that is transparent to the application.

Round-robin partitioning can be found in massive parallel processing (MPP) systems and parallel server implementations, where the DBMS spreads a single table across many disks transparent to an application. As each row is inserted to a table, it is assigned the next table partition--or fragment--and this continues with each insert until the last partition is reached and the first partition is used with the next insert. This can greatly speed data inserts and can speed up certain types of select, update, and delete activity by spreading I/O across multiple datasets and disk drives.

Hash partitioning is similar to round-robin partitioning, except that a hashing algorithm is used to determine which table partition (dataset and drive) a particular row can be inserted to, read from, and updated. This partitioning method can be very fast for inserts, selects, updates, and deletes of data, since the same algorithm that was used to place data on an insert can be used to locate the data in a subsequent operation.

Range partitioning is performed when the partition is based on a key value. Ranges of key values are set up for each table partition (dataset) to determine where to locate a row. One common use of this is to partition data by date range to simplify data backup and recovery as well as data archiving. The downside of date-range partitioning is that all current activity will be performed on a single partition.

----------------------------------------------

 

PARTITIONED TABLES IN DB2

DB2 does its standard data partitioning through range partitioning, which is performed through a partitioned tablespace that a table is created in. It is therefore the tablespace that is partitioned--not the table--with each part implemented on a different dataset. A single table is created on the partitioned tablespace, and a special clustering index is created to do two things: It orders the data by the key that will be used to set data ranges, and it determines the high and low keys of each partition range.

The clustering index itself is partitioned with the same ranges as the partitioned tablespace, which allows partition independence. Partition independence provides many benefits: - Individual partitions can be locked. - Individual partitions can be taken online and offline. - Utilities can be run in parallel against individual partitions.

Partitioned tablespaces have existed in DB2 for more than ten years, and this is the most mature partitioning method in the industry.

DB2's MPP database offerings provide more complex partitioning methods than those discussed above.

----------------------------------------------

 

PARTITIONED TABLES IN ORACLE

Oracle has a partitioning implementation similar to IBM's DB2. This is a relatively new offering with Oracle's version 8 release. Range partitioning is performed on a table with each partition implemented on a different file. This partitioning benefits query parallelism as well as utility independence and system availability. Spanning indexes (that is, indexes that span multiple partitions) may be created as global or local indexes. The local option may be used for any index. This allows the index to be partitioned in the same manner as the table, which is important to provide true partition independence.

Partition independence allows an individual partition to be brought offline while others remain online, and it allows utilities to be run against multiple partitions at the same time. It has many benefits for large tables and is almost necessary for very large databases.

Oracle's partitioning is an optional item that must be purchased separately from the DBMS.

----------------------------------------------

APPLICATION/MANUAL PARTITIONING

Data partitioning can be performed without the benefit of DBMS-implemented partitioning. This is done by "manually" partitioning data. In other words, data that would normally belong to one table can be divided by row range and/or columns into multiple tables. The biggest difference between this implementation and DBMS-implemented partitioning is that the application must be written to use the proper partitions with application partitioning. So, if a single table is split into two or more tables, the application must know which table or tables to access. Database views may be used to make the implementation transparent to an application, but in most cases, inserts, updates, and deletes will need to be performed to the proper tables. Also, database referential integrity implementation must now deal with the multiple tables.

Undoing application partitioning can be difficult. After a few years, the original reasons for performing partitioning may no longer exist, but to undo the partitioning may require major application system changes. Therefore, the pros and cons of application partitioning should be carefully weighed before implementation.

----------------------------------------------

 

DEADLOCKS

Deadlocks--sometimes called the "deadly embrace"--occur when two systems try to gain access to data that the other system has locked.

For example, suppose System A has Row A of Table A locked, and System B has Row B of Table B locked. If System A then tries to acquire a lock on Row B of Table B (say, to perform an update) and System B then tries to acquire a lock on Row A of Table A, neither system can move forward in its processing. They are in a deadlock situation. The database will time out one of the two systems and allow the other system to continue.

Deadlocks usually can be avoided by having systems access tables in the same order and by trying to have data on different tables clustered in the same order. This is not always possible and deadlocks can still occur, but it will remove most of deadlocks that your system would otherwise have acquired.

----------------------------------------------

THE DATABASE MARKETPLACE--OBJECT RELATIONAL DATABASES

The database marketplace is undergoing another major shift in architecture as many vendors embrace the concept of implementing object-relational databases (ORDBMSs). This merging of relational and object database technology is the answer to the long-asked question of which architecture would win the database war--relational or object technology? Object technology is still built into popular relational databases, and this trend picked up momentum with the Informix purchase of Illustra and the introduction of Data Blades into the Informix product line. This purchase spurred Oracle to add some similar object features into its Oracle 8 offering called data cartridges, while IBM's strong ORDBMS includes relational extenders. Sybase is using its Adaptive Server architecture and Microsoft is relying on OLE-DB and ADO to gain a foothold in this market through different architectures.

ORDBMSs include all those features previously included in relational databases, along with new features such as user-defined functions, user-defined datatypes, user-defined operators, and user-defined indexing. These are combined with enhancements to SQL as introduced through ANSI's SQL-3 standard. This change, combined with the introduction of Java and XML into databases, is rejuvenating the database marketplace and positioning ORDBMSs as key components of any corporate and Internet-based system architecture.

----------------------------------------------

 

USING DATE COLUMNS IN TABLES AND SQL

When implementing date and time in your database, take advantage of built-in DateTime datatypes rather than implementing these in character fields. This will provide the advantage of using built-in DateTime functions rather than having to code your own.

SQL-92 offers four items relating to date and time: - Date includes only the date. - Time includes only the time. - Timestamp includes the date and time. - Interval is an interval between two date/time datatypes.

Standard functions that can be used are - Datetime - Datetime--for example, 2000-03-31 - 2000-03-29 Day = 2 - Datetime +- an Interval--for example, 2000-03-31 + Interval '1' Day = 2000-04-01 - Interval */ number--for example, '3' DAY * 3 = 9

We will look at the implementation of date and time datatypes by popular DBMSs in upcoming tips.

----------------------------------------------

 

DATE AND TIME IN SQL SERVER

Microsoft's SQL Server and Sybases Adaptive Server have similar implementations of date and time. The datatype used is Datetime and stores both the date and time components in it. The date includes the century while the time component stores time to the millisecond (to an accuracy of three milliseconds).

Date and time arithmetic can be performed so that one date can be added to or subtracted from another.

One unique feature of SQL Server's Datetime datatype is the case where a two-digit year is supplied. If this year is 50 or over, SQL Server appends 19 to the year and if the two-digit year is less than 50, SQL Server appends 20 to the year, assuming this to be the 21st century.

It is important to note that the default base date is January 1, 1900, and the default base time is 12:00 midnight. So, if you only insert a time to a Datetime column, the date will be Jan. 1, 1900, and if you insert only the date, the time will be stored as 12:00 midnight.

----------------------------------------------

DATE AND TIME IN DB2

DB2 provides three datatypes to support date and time: Date, Time, and Timestamp. When date and time values are to be used independently, then the Date and Time datatypes should both be used. They will use less space (seven rather than ten bytes) and have more formatting options than the Timestamp datatype. However, if date and time arithmetic must be performed together, a single Timestamp column is easier to use than two columns for date and time are. Also, for greater time precision, the Timestamp datatype wins. A Timestamp column stores time to the microsecond (six decimal places) while time only stores data to the second.

DB2 Date and Time datatypes offer more options, and care should be taken to determine how your Date and Time columns will be used before deciding which datatype to use. If you feel that date and time will be needed together, then you should opt for the Timestamp datatype.

----------------------------------------------

THE DATABASE MARKETPLACE--WHO HAS THE MARKETSHARE?

Who are the major players in today's database marketplace? According to Dataquest's surveys earlier this year, IBM leads Oracle by less than 1 percent in total RDBMS (Relational Database Management System) revenue, and both companies hover at approximately 30 percent marketshare. This includes all platforms--from NT and UNIX to mainframes, where a lot of IBM's revenue comes from. Next on the chart is Microsoft, which has about 10 percent marketshare by revenue. The next tier sees Informix and Sybase with 4 to 5 percent marketshare each.

These numbers are difficult to interpret and somewhat misleading, since the numbers change dramatically when the number of licenses sold are counted. One trend worth noting: These are the top five RDBMSs (and ORDBMS--Object-RDBMS), and IBM, Oracle, and Microsoft are the big three to watch in the next couple of years.

----------------------------------------------

DATA MODELING

Before we ever get to code SQL on an existing database, a great deal of planning must be done to ensure that the database is designed properly so that systems (and SQL) can operate on it. Much of this design work comes in three distinct steps.

First, a data architecture is done. This can be in the form of a conceptual model that covers the entire enterprise. This "blueprint" can be used to determine where new data fits in, and it can be used to help guide the detailed modelers to design data for different systems that can fit together.

Second, a logical data model is created for a system. This model must fit within the data architecture and must service all the needs of the system to be built. It is at this stage that the data modeler focuses more on "what is to be done" rather than "how to do it." This model is very detailed. All entity types and attributes are created with definitions and properties (datatype, length, scale, nullability). This data goes through a process called normalization, which we will discuss in our next tip.

Third, the logical data model is transformed into a physical data model. It is here that the physical data modeler must focus on how data is to be accessed. This individual (or group of individuals) must be familiar with the logical data model, the system to be built, and the internals of the database that the data is to reside on. This is the model of the database that will be implemented.

Performing these steps will go a long way toward preventing unwelcome surprises at development and implementation time.

----------------------------------------------

DATA NORMALIZATION--SECOND NORMAL FORM

Once data is in First Normal Form, it is said to be in Second Normal Form when all attributes that are not part of the key are dependent on the entire key.

For example, if an entity has a primary key consisting of two columns and one of the attributes is fully dependent on only one of the columns, it is not in Second Normal Form and the attribute should be moved to another entity type.

----------------------------------------------

NORMALIZATION--THIRD NORMAL FORM

An entity is in Third Normal Form if it is already in Second Normal Form and all nonkey attributes are not dependent on any other nonkey attribute. Therefore, to be in Third Normal Form, an attribute must be dependent on the key, the whole key, and nothing but the key.

To summarize, data is in First Normal Form when all entity types have a primary key, all attributes are atomic, and there are no repeating attributes on an entity type. Data is in Second Normal Form when all attributes that are not part of the key are dependent on the entire key. Data is in Third Normal Form when all nonkey attributes are not dependent on other nonkey attributes.

There are more steps in normalization (fourth and fifth normalization), but they are subtle and the first three are the key ones to know.

----------------------------------------------

DEFAULT COLUMN VALUES IN SQL SERVER

In SQL Server, a column's default value can be specified as part of the Create Table statement. For example:

Create Table PRODUCT (product_id int not null, product_category_name char(50) default 'CLOTHING')

In this example, a value of 'CLOTHING' will be set for the product_category_name column in cases where an insert is performed but this column's value is not set.

----------------------------------------------

DEFAULT COLUMN VALUES IN DB2

IBM's DB2 provides standard default values and user-defined defaults. The standard defaults are as follows: - Numeric datatype = 0 - Character datatype = Spaces - Varchar datatype = zero length string - Date, Time, and Timestamp = current date, time, and timestamp

An example of a table with defaults is shown here:

Create Table PRODUCT (product_id Integer Not Null, product_category_id Integer Default 121, product_category_name char(50) default 'CLOTHING', product_name char(50) not null with DEFAULT)

In this example, any insert performed without the bottom three columns defined will contain a product_category_id of 121, a product_category_name of 'CLOTHING', and a product_name of Spaces.

----------------------------------------------

JAVA AND DATABASES

Many of today's databases implement Java as SQL Persistent Stored Modules rather than implementing only proprietary software. In other words, rather than coding procedural SQL modules such as stored procedures or triggers in such languages as Transact-SQL (Sybase) or PL*SQL (Oracle), you can now write them in Java. This is true for databases such as DB2 UDB, Sybase Adaptive Server, and Oracle 8i.

Java provides the benefits of increased portability of code, better performance, and leveraging a language that is open, popular, object-oriented, and known by many programmers.

Java database access can be performed through JDBC or SQLJ. JavaBlend is the object-relational mapping for Java. We will explore these concepts in upcoming tips.

----------------------------------------------

JAVA AND JDBC

Java Database Connectivity (JDBC) is the Java-based Database Connectivity API for accessing relational databases. It is based on X/Open's Call Level Interface. JDBC itself is written in Java, allowing it to have the same benefits as other Java code. A proposal to extend JDBC to support SQL3 types is under development.

Many interfaces have provided for JDBC. An example of this is the JDBC-ODBC driver provided by JavaSoft. There are also database-independent drivers and drivers that support proprietary connectivity. JDBC allows Java programmers to access most major databases using the same common API in the same way that ODBC does.

----------------------------------------------

JDBC COMMON DRIVER CLASSES

Some common JDBC drivers include the following four classes: - DriverManager Class--This class creates and manages database connections through methods to manage drivers. Some of these methods are registerDriver, deregisterDriver, getConnection, and getDrivers. - Connection Class--This class has methods to prepare SQL and manage transactions. This represents each database connection. Examples are createStatement, prepareStatement, commit, rollback, and close. - Statement Class--The methods in this class execute the SQL statements. This includes executeQuery, execute, executeUpdate, getMoreResults, and getCursorName. - ResultSet Class--The methods in this class process SQL statement results and contain data that is retrieved from the database. Examples include next, getMetaData, and getCursorName.

In our next tip, we will look at an example of some JDBC statements.

----------------------------------------------

JDBC EXAMPLES: SETTING UP THE CONNECTION

To run a SQL statement through JDBC, some work must be performed to set up the database connection. Here's an example of four things needed to create this connection. - Develop a URL (uniform resource locator) to define a database named Production that will be accessed with a JDBC driver accessing an ODBC database. For example: String url = "jdbc:odbc:production"; - A JDBC-ODBC driver must be identified: Class.forName("sun.jdbc.odbc.JdbcOdbcDriver"); - Create the connection with the userid and password: Connection aConn = DriverManager.getConnection(url,"User01","pwd02"); - Create a statement object instance to execute the SQL: Statement aState = aConn.createStatement();

Next time, we'll see the execution of the SQL statement.

----------------------------------------------

JDBC EXAMPLES: RUNNING THE SQL

Once the JDBC database connection has been established, we must define the SQL statement, execute it, and close it. Here is an example of five steps that perform this.

1) Define a string containing the SQL statement: String newQuery = "Select name FROM Product"; 2) Execute the SQL query using the createStatement object instance aState, which was previously defined (in our last tip): ResultSet rSet = aState.executeQuery(newQuery); 3) Get a row of data using the "resultSet next" method, which moves a pointer to the next row of data (similar to a cursor) and returns a boolean value of True if a row exists: boolean more = rSet.next(); 4) Move the data to a String (previously defined) named vProductName: vProductname = rSet.getString(1); 5) Close the connection and statement: rSet.close(); aState.close(); aConn.close;

----------------------------------------------

 

JDBC--PREPARING SQL

When executing SQL multiple times using different bind variables, the statement should be prepared so that it can be reused when it is run multiple times. Here is an example of this:

1) Define a string containing the SQL statement: String prepNewQuery = "Select name FROM Product WHERE productId = ?"; 2) Prepare the statement to be run: PreparedStatement prepState = conn.prepareStatement(prepNewQuery); 3) Set a variable to the product_id to be searched: vProductId = 100009; 4) Set the variable in the statement to be the same as the one above. This is the first and only variable in the statement: prepState.setBigDecimal(1,vProductId); 5) Execute the prepared statement: prepState.executeQuery();

Preparing a statement in this manner will deliver performance benefits, since the SQL statement will not need to be parsed, prepared, or optimized for every value of productId.

----------------------------------------------

 

JDBC AND STORED PROCEDURES

SQL Persistent Stored Modules, such as stored procedures, can be executed through JDBC. To do this, the CallableStatement class must be used. An example is shown below, where the getProductName stored procedure is executed:

1) Prepare the call of the stored procedure: CallableStatement callState = conn.prepareCall("{? = call getCustName(?)}"); 2) Register the output parameter that will contain the name: callState.registerOutParameter(1,Types.Varchar); 3) Set up the input variable that will be passed to the stored procedure using an already defined and set variable for the customer number: callState.setInt(2,vCustNumber); 4) Execute the stored procedure: callState.executeQuery(); 5) The output string can be retrieved by moving the following to a string variable: callState.getString(1);

As you can see, the advantages of database-stored procedures can be maintained when performing JDBC database access.

----------------------------------------------

SQLJ AND JAVA

SQLJ is the ANSI/SQL standard for coding embedded SQL in the Java programming language. This interface was developed by several companies, including IBM, Oracle, JavaSoft, Sybase, Informix, Compaq, Microsoft, XDB, and Cloudscape.

This standard allows for the extension of Java programs to include static SQL. SQLJ translators are utilities that convert SQLJ statements into Java code that can access a database through a call-level interface. This code can then be compiled by a Java compiler.

SQLJ supports only static SQL. To run dynamic SQL, JDBC must be used. SQLJ has mechanisms to enable Java to switch easily between the JDBC and SQLJ environments. SQLJ therefore leverages many of the benefits of JDBC.

----------------------------------------------

SQLJ SYNTAX

Some basic SQLJ syntax is shown below. SQLJ statements have the following characteristics: - They start with #sql and terminate with ; - Host variables begin with : - Text is enclosed in braces {}

Here is an example of a simple SQLJ statement:

#sql {SELECT name into :varName FROM Customer WHERE customer_id = :varCustId};

----------------------------------------------

SQLJ ADVANTAGES

The SQLJ interface is similar to other call-level interface languages and requires less code and is easier to write than JDBC. Because SQLJ allows only static SQL, it has the advantage of performing compile time statement checking rather than incurring run-time checking as JDBC. There is also stronger type checking with SQLJ than there is with JDBC.

One disadvantage of SQLJ is its inability to include dynamic SQL in the native format. However, dynamic SQL can be included through the JDBC interface, which may coexist with SQLJ in the same Java environment.

SQLJ is implemented in many relational databases, such as IBM's DB2 UDB, Oracle 8.1, and Sybase Adaptive Server Anywhere.

SQLJ syntax is more concise than JDBC, as this example, written in SQLJ, shows:

int varCustId; #sql {INSERT into Customer VALUES (:varCustId)};

Here is the same statement written in JDBC:

int varCustId; Statement stmt = conn.prepareStatement (INSERT into Customer VALUES (?)); stmt.setInt(1,varCustId); stmt.execute (); stmt.close();

----------------------------------------------

SQLJ 2.0 FEATURES

The SQLJ 2.0 upgrade implements many new SQL3 features that deliver object benefits to relational databases.

Some new database types included in version 2.0 are user-defined data types, object data types, and Java classes. Support for SQL3 data types such as BLOBs, CLOBs, arrays, references, and structured and distinct types are now included in version 2.0. SQL3 ADTs (abstract/user-defined data types) and metadata for user-defined types have added to the specification. Java classes can also be added as database types.

Another great feature is persistent Java objects stored in databases. This feature allows us to write SQL Persistent Stored Modules (such as stored procedures and triggers) in Java rather than in proprietary languages such as PL/SQL or Transact/SQL.

----------------------------------------------

SQLJ ITERATORS

In application programs that are written in languages other than Java, cursors can be used to return rows from a table to the program. These cursors are a type of pointer that allow an application program to point to a