|
For best results: this site requires that cookies be enabled for proper operation - see Legal Page for more info
|
|
Select Any of These |
SQLLAST UPDATED: 08 March 2009 17:02:10 -0600
IN GENERAL, FAVOR DATABASE JOINS OVER APPLICATION LOGICA popular misconception when processing database data is that it is more efficient to merge and filter data in a program rather than filtering and joining the data in a database--in SQL. This misconception would have SQL statements that include only a small number of tables and incomplete filtering criteria in each. They would return rows to a program that would then merge, filter, and process them. Moving data across database address spaces to application address spaces (and in more extreme cases, across the network) consumes a large amount of system resources. It is often more efficient to perform much of the work in the database and return fewer rows to the application program. Where possible, perform data filtering and joining in the database. The coding will be easier and performance often will improve. Of course, this needs to be analyzed and tested on a case-by-case basis.
INSERTS, UPDATES, AND DELETES--AND VIEWSA common popular misconception is that Inserts, Updates, and Deletes cannot be performed on Views. This is only partly true. These statements generally cannot be performed on Views that are "Join Views." An example of a Join View is shown by the Join Select statement below within the "Create View" syntax.
Create view Party_Member_View as Select column1, column2, column3 From Party A, Party B Where A.party_id = B.member_party_id; DB2 and SQL Server will not allow Inserts, Updates, and Deletes on this View. Oracle also will not allow it as default behavior, but "Instead of" triggers can be created to allow the statements to be processed.
HOW MANY TABLES SHOULD BE JOINED IN ONE SQL STATEMENT? The answer used to be three--maybe four. This has changed. The real limits are those of your DBMS and the performance requirements of the system you are implementing. It is possible to have a large number of tables joined that perform well if only a small number of rows are processed. It is more important to understand the data being processed and overall system performance requirements when determining the best way to code joins. Arbitrary rules that limit the number of tables included in a join are meant to protect system performance, but, in fact, may have the adverse effect. It is more important to consider and analyze each SQL statement than it is to have these types of rigid rules. Write your join the way you naturally would like to, then run Explain (DB2 and Oracle) or Showplan (SQL Server) on it and test the query on a reasonably sized database to ensure that it works as expected.
THE USE OF "NOT" IN A QUERY A popular misconception is that the word NOT should not be used in the Where clause of a query. This idea stems from the fact that the statement NOT is often not indexable by many databases today. For example, a case where this is true is the following SQL statement, where the column "customer_name" is indexed: Select * >From Customer Where customer_name NOT like 'JONES%'; In this example, the index on customer_name will not be used. But notice the following SQL statement, where customer_id is indexed: Select * >From Customer Where customer_id between 100000 and 120000 and customer_name NOT like 'JONES%'; In the above example, an index on customer_id may be used while the NOT statement also eliminates some rows. The NOT statement can be useful in eliminating rows at the database level.
MORE USES OF "NOT" IN A QUERY In some cases where only indexed columns are needed, the NOT statement can use an index. An example of this is shown below. Consider the case of an index where two columns, customer_id and customer_name, are indexed in one index, and these are the only columns needed in a query. The SQL looks like the following: Select customer_id, customer_name >From Customer Where customer_name NOT like 'JONES%'; In this case, the index with both columns may be scanned if the number of blocks/pages to be read is considerably less than if the table were read.
AVOIDING "NOT" IN A QUERY In some cases, the NOT statement should be avoided in a query. These situations crop up when it might be easier to write the statement as NOT, but you know that index access will be much more efficient when a positive, more complex statement is chosen. Consider the following situation: You want to see all customers outside of New York City when you know that 99% of customers are in New York, and you know that you want to use an index to perform this search. The simplest statement to write is Select * >From Customer Where City NOT like 'NEW YORK'; This will scan the entire table, while returning only 1% of the rows. The statement can be rewritten in a fashion similar to the following: Select * >From Customer Where (City between ' ' and 'NEW YORJ') or (City between 'NEW YORKA' and 'ZZZZZZZZZ'); This should be tested with an Explain or showplan to see if the index is used. If not, consider the use of host variables as arguments in the query.
BITMAPPED INDEXES An interesting and useful type of index that is little understood is a 'bitmap' or 'bitwise' index. This is implemented in most popular relational databases as a standard option. It works a little differently than the standard B-tree index. With the bitmap index, a bit of 0 or 1 is set for existence and non-existence of a value. A good example of using a bitmap index is a column that is a flag with values Yes or No. Assume ten rows with six values 'Y' and four values 'N'. The columns look like this: YNNYNYYYYN The bitmap index looks like this: for value 'Y' 1001011110 for value 'N' 0110100001 So, separate bitmaps are stored for each distinct value of a column, and compression is used to provide efficiency with this type of index. This is a valuable option to consider for speeding up SQL queries. In a future tip, we will look at when bitmap indexes should be used.
BITMAPPED INDEXES--WHAT ARE THEY GOOD FOR? Bitmap indexes as described in yesterday's tip have the following advantages: · They are very efficient for queries on large tables with relatively few distinct values and high query access. · They are quick on aggregate (count) queries. · They can be used in any order and in any combination, compared to concatenated indexes. · They are very compact due to compression (more compact than concatenated B-tree indexes). · Merging of multiple bitmaps is very fast. · 'Star joins' can make great use of bitmap indexes. · Many databases can use bitmap and B-tree indexes in the same query. Bitmaps have many advantages. Look up their use in your database of choice.
BITMAPPED INDEXES--WHAT ARE THE DISADVANTAGES? Bitmaps should not be used in the following situations: range queries or columns with a high number of distinct values (though this limit is increasing due to compression techniques) cases where locking is required at the row level. Oracle (for instance) cannot lock a single bit, so locking is at the block level for a bitmap index and--due to compression and the high density of bitmap indexes (one of its strengths)--thousands of rows can be locked by a single seemingly efficient SQL insert, update, or delete statement. any tables where the bitmapped column has a high number of insert, update, or delete statements against the bitmapped columns range queries, which cannot take advantage of bitmap indexes Please review the specifics of bitmap indexes as supplied by the database and release that you are running to fine-tune these tips. -------------------------------------------------------------------------------- PERFORMING A UNION The Union statement allows the values of two SQL statements to be combined. An example of Union is shown below: Select customer_id, customer_name >From Customer Where customer_name like 'JONES%' Union Select customer_id, customer_name >From Customer Where customer_name like 'SMITH%'; The above statement could be coded as an 'OR' in a single SQL select, but the Union helps to simplify and optimize some types of SQL statements. In a Union, the number of columns and data types must be the same, though the table names and column names can be different. Union is not the same as Union All. With a Union All, all rows selected by all queries are retrieved. In contrast, the Union statement strips out duplicate rows and returns only distinct rows from all queries in the Union statements. --------------------------------------------------------------------------------
PERFORMING UNION WHERE COLUMNS DO NOT MATCH With the Union statement, the number of columns and data types of each must match. In other words, the following is an invalid statement, since the first Select in the Union has more columns than the second: Select col01, col02, col03, col04 >From Table1 Union Select col06, col07 >From Table2; How can this be corrected when Table2 has only two columns? If col03 and col04 are of data type char(4), the following can be used: Select col01, col02, col03, col04 >From Table1 Union Select col06, col07, ' ', ' ' >From Table2; The data types and numbers of columns in both Select statements are the same, making this statement legal. In other cases, NULL values may be used and data type conversions may be needed to write a legal Union statement.
GETTING A ROW WITH THE MAX VALUE Finding a row in a table with a maximum value in a certain column seems to be a simple thing to do. Let's say, for example, that we want to retrieve all columns in the Sales table for the row with the largest sale amount as defined in column sale_amount. This needs to be written as a subquery where the subselect finds the row with the max(sale_amount) and the outer query then retrieves all columns for that row. Assuming the sale_id is unique in the Sales table, this query can be written as follows: Select s1.sale_id, s1.col02,...s1.coln, s1.sale_amount >From Sales S1 Where s1.sale_amount in (Select max(s2.sale_amount) >From Sales S2); This would retrieve only the single row with the maximum sale_amount in it. -------------------------------------------------------------------------------- SOME INDEXING TIPS Consider the following indexing tips when you are optimizing your SQL statements. · For oltp (online transaction processing with a high number of insert, update, and delete statements), indexing will almost always be B-tree. · Limit indexing on columns that are frequently changed (insert, update, delete). · If an Update statement does not change columns that are indexed, then the high cost of index maintenance is avoided. · Question cases where tables with high change activity have more than three indexes. This is not necessarily a problem, but it is a rough rule-of-thumb that can be used to highlight potential problems. Tomorrow we will look at some more indexing tips.
MORE INDEXING TIPS More things to think about when you are trying to speed up your SQL through the use of indexes are: · Make sure that all table access is supported by indexes (unless the table is small, such as fewer than six blocks). · Reorg/rebuild and 'runstats'(DB2)/'analyze... compute statistics' (oracle) or 'update statistics' (SQL Server) indexes very frequently. · The order of columns in indexes is very important: most commonly used columns at front of index and cardinality (uniqueness) of leading columns should be high. · When joining tables, if the join is performed using more than one column with a concatenated index, then the join will perform more efficiently if the index columns are in the same order on both tables. Performance will be even further enhanced if the Table data is in the same order as those indexes (if the cluster ratio is high).
SEARCHING FOR AND COMPARING NULL VALUES Null values are a controversial and often discussed item in database theory. To simplify this matter, the issue around a Null value is that we are never sure if a column is Null because it does not apply or because it simply is unknown. In other words, a Null value can be interpreted in many ways. Because of this, one Null value can never be considered to be equal to, not equal to, greater than, or less than another, because we cannot compare one indeterminate value to another. Therefore, the following statement will never return a value--even if the customer_name is NULL: Select * >From Customer Where customer_name = ''; When comparing Null values, use the 'IS NULL' value rather than ''. The following corrects this statement: Select * >From Customer Where customer_name IS NULL;
NULL INDEXES Be careful of index entries that are completely null--an index entry may not be kept. So, queries using IS NULL logic will be forced to perform a table scan. This is true in Oracle, and you should verify that it will not cause an issue with your database and application.
USE PROPER SEARCH COLUMNS IN WHERE CLAUSES Where clauses in a Select statement should include the leading (at least the first) columns of a concatenated index. Consider an index that has three columns: last_name, first_name, middle_initial. If the majority of queries will use at least the last_name column, then this should be the first column of the index. An example of this Select statement is shown below: Select last_name, first_name >From Customer Where last_name = 'JONES'; If the last_name column were not the first column in the index, then either an entire index scan or table scan would need to be performed. With last_name as the first column of the table, then a matching index lookup may be performed. Consider how the SQL should be written when creating indexes and, conversely, consider the order of columns in an index when writing your SQL.
USING CONCATENATED INDEX PROPERLY You need to create only one index on a table that includes all columns needed. Queries needing a subset of the columns will use this index as long as the leading columns are used as predicates in the query. For example, if an index has columns column01, column02, and column03, then any of the three following queries will use this one index effectively: Select * >From table_name Where column01 = :host_variable1 and column02 = :host_variable2 and column03 = :host_variable3; Select * >From table_name Where column01 = :host_variable1 and column02 = :host_variable2; Select * >From table_name Where column01 = :host_variable1; In fact, if the cardinality (degree of uniqueness) of column01 is very high, then the following query may also use this index effectively: Select * >From table_name Where column01 = :host_variable1 and column03 = :host_variable3; Notice, however, that if the query used only columns column02 and column03, then an index on column02 and column03 may be needed. Also, if another query using only column03 were created, then a third index with only column03 would need to be created--or alternatively, a single index could be created in order for column03, column02 to satisfy both queries. Note that at most three indexes would need to be created, each with a different leading column and each with fewer columns. In many cases, the creation of only one index will solve your performance problems.
WHEN TO USE AN INDEX Indexes will not solve all performance problems. In fact, databases have very efficient built-in features to scan entire tables. As a (very) rough rule of thumb, if more than 20% of a table is being read, then a table scan will often be faster than an index-based lookup. Some database features include read-ahead or prefetch capabilities, where the next set of blocks is read from physical storage before it is needed by the database. Another high performance feature is the ability to perform parallel queries by breaking up a large query into several small ones. In general, consider creating an index for performance reasons if a query returns a small number of rows and the query is run frequently.
COMPARING DIFFERENT DATA TYPES When performing a join of two tables, make sure that the data types of the join columns match. If the data types do not match, then several things may happen (depending on the database, type of query, and data type incompatibility issues being faced). · A join operation that should use an index may not use the index. · The statement may not be valid and may return a compile or runtime error. · The absolute case is that the statement may run and return invalid results. Make sure that columns that are to be joined are of the same data type, length, and precision (number of decimal places).
WHY DOES MY QUERY RETURN NO ROWS? You run a very simple query and return no rows. This makes no sense to you. You have run controlled tests, and the data in your Where clause is exactly the same as the data in the database. You also know that all of the data is uppercase, so that is not an issue. Your pseudo-code looks as follows: declare :host_variable char(20); set :host_variable := 'JONES'; Select * >From Customer Where customer_name = :host_variable; This returns no rows even though you know that 'JONES' is in the database. Check the data type of customer_name. Depending on the database you use, there is a good chance that the data type is varchar (varchar2), and the trailing spaces cannot be matched. Make sure that your data types match those of the database exactly.
CREATING A TABLE QUICKLY You want to create a copy of a table quickly using the Create Table table_name As Select... statement, but you want only the table structure. In other words, you do not want to move all of the rows from the source table to the target table--you want the target table to be empty. How do you accomplish this? Use the following statement: Create Table new_table as Select * from old_table Where 0 = 1; The Where 0 = 1 is false and ensures that no rows will be found from the source table. The new_table will be created empty.
CURSORS AND WHY THEY ARE NEEDED The SQL language has been designed to work on sets (tables) rather than on one row or record at a time. This does not match the structure of most programming languages such as COBOL, C, C++, and Java, which operate on one row at a time. Host variables can handle only one value at a time, and table values cannot be read directly into arrays. To resolve this mismatch, an SQL feature called Cursors was created to provide a bridge between the database 'set' structure and the programming language 'row-at-a-time' one. It is, in effect, a pointer to the row you want processed in a table. A named cursor can be declared for an SQL statement as follows: Declare Cursor Read_Customer for Select last_name, first_name >From Customer; This cursor can then be opened with the 'open cursor' statement and navigated or processed one row at a time with the Fetch statement, which is performed repetitively until all required rows have been read. The cursor is then closed.
OPEN CURSOR STATEMENT In a procedural language such as C, COBOL, or Java, SQL statements can be processed through cursors (as seen in our previous tip). Once a cursor has been declared, it can then be opened through the Open Cursor statement. An example of this is shown below: Declare Cursor Read_Customer for Select last_name, first_name >From Customer; Open Cursor Read_Customer; The rows aren't returned to the program. This statement prepares the program to return rows to it by the next Fetch statement. If a sort is required to return rows, then the sort operation may be performed at Open Cursor time before the first Fetch is performed. Open cursors only when they are needed. Do not automatically open all cursors that might be used as a default action.
FETCH STATEMENT AND SCROLLING Once a cursor has been declared, a pointer can be set to the first row to be returned, and all remaining rows can be returned through the Fetch statement. The Fetch positions a pointer to the next row of a result set and returns the single row to the calling program. Program host variables are used to return the individual columns, as shown here: Fetch Read_Customer into :last_name, first_name; The above Fetch gets two columns and puts them into the :last_name and :first_name variables in the program. These are preceded with colons to distinguish the host variables from table columns.
CLOSE CURSOR Cursors are explicitly closed when a program completes execution. However, close a cursor as soon as you've finished with it. This will free valuable system resources. Also, only close those cursors that have been opened. The Close Cursor statement is shown here: Close Cursor Read_Customer; Do this rather than letting the program completion implicitly close a cursor.
OPEN CURSOR WITH HOLD (DB2) In DB2, open cursors are implicitly closed when a Commit is performed. This causes problems with reopening and repositioning a cursor when the next Fetch statement is processed. To keep a cursor open and maintain cursor position for Fetches across Commit statements, use the With Hold option when the cursor is declared, as follows: Declare Cursor Read_Customer with hold for Select last_name, first_name >From Customer;
PERFORMING UPDATES WITH CURSORS When you know that a cursor subsequently will be used to update rows, use the For Update Of option when declaring the cursor. This prevents other programs from changing data between the time you read a row and the time that you perform the update. Also, only name the columns to be updated when specifying For Update. An example of this is shown here: Declare Cursor Read_Customer for Select last_name, first_name >From Customer For Update first_name;
THE WHERE CURRENT OF CLAUSE When either updating or deleting rows that were initially read through a cursor, use the Where Current Of clause to state that you want to update or delete the row that the cursor is pointing to. For example: Update Customer set first_name = 'David' where current of Read_Customer; In this example, a row was read (Fetched) by the Read_Customer cursor and subsequently updated by the statement above.
CURSORS AND SQL SERVER Microsoft SQL Server provides the capability to create cursors using ANSI SQL or Transact-SQL syntax. The two cannot be used together. When possible, use the ANSI SQL syntax, since this will provide more code portability across databases and across versions of SQL Server.
STATIC VERSUS DYNAMIC SQL There are two basic types of SQL statements: Static SQL and Dynamic SQL. Static SQL is the most common and is used when the SQL statement can be written with hardcoded values or with host variables. The SQL statement and Where clause are well understood and can be determined before the program is run. In some cases, however, it is difficult to determine the best way to write a SQL statement before a program is actually run, because it can be difficult to cover all the combinations of SQL and resulting program logic that need to be written for some situations. An example of this is a window that allows users to input many combinations of search criteria on the screen--for example, first name, last name, city, street, postal (zip) code, or any combination of these, including range searches. A good solution to this type of problem is to build the SQL and Where clause dynamically, based on the values entered on the screen. This Dynamic SQL can benefit both program structure and performance when used in the correct way. There are four types of Dynamic SQL: execute immediate; non-select; fixed-list select; and varying-list select. The next several tips will discuss how to build Dynamic SQL.
DYNAMIC SQL WITH EXECUTE IMMEDIATE The simplest way to code Dynamic SQL is to use the execute immediate feature. All that needs to be done to perform an execute immediate is to place a SQL statement in a host variable and execute it using the execute immediate syntax. One big disadvantage of execute immediate statements is that they cannot use input or output parameters, so this statement cannot be used for a Select statement. It can be used for Insert, Update, and Delete, however, and it can be used to perform administrative functions such as security grants and revokes. An example of this is shown here: set v_sql_statement = "delete from customer where last_name = 'JONES'" exec sql execute immediate :v_sql_statement end-exec The execute immediate statement is especially good for simple, one-time tasks. For statements that must be run many times, it must be "Prepared" for every execution. In other words, the overhead of Prepare syntax checking and parsing must be performed with every single execution.
NON-SELECT DYNAMIC SQL: PREPARE AND PARAMETER MARKERS Non-Select Dynamic SQL offers the same type of functionality as execute immediate, with one important advantage: the Prepare and Execute statements can be kept separate. This allows a statement to be Prepared once and then executed many times within a loop. The Prepare is performed only when the statement is actually issued, avoiding unnecessary overhead (compared to the implicit Prepare that is performed by execute immediate). These types of statements can use a feature called a parameter marker, which allows for different values to be placed into the Where clause of an already Prepared statement for different executions. For example: set :v_sql_statement = "Delete from Customer where customer_id = ?" exec SQL Prepare customer_delete_statement from :v_sql_statement; loop until condition_reached set :v_cust_id = 1; exec SQL customer_delete_statement using :v_cust_id; end loop These statements show how the Prepare is performed outside a loop and how the Where clause predicates can be set inside the loop for each execution of the already prepared statement. This is accomplished through the parameter marker "?" and the variable v_cust_id (which is used to plug a value into the statement at execution time). We've covered two important features of Dynamic SQL: the Prepare statement and parameter markers.
DYNAMIC SQL: FIXED-LIST SELECTS AND CURSORS The coding of this feature builds on the concepts used in the Non-Select Dynamic SQL we discussed in our previous tip. We will once again take advantage of the Prepare statement and parameter markers. We will, however, need to provide the capability to handle output variables. These statements are relatively simple, since the number of input and output variables are known. Unlike our previous examples, we will be able to use Select statements with all of the capabilities provided here. Consider the case of a SQL statement with two input variables that returns two output values and multiple rows. In this case, we need to write the SQL statement with parameter markers for the input variables. We can then Prepare the statement, and, once the Prepare has been performed, a cursor can be created using the Prepared statement. The cursor is needed to accommodate the multiple rows that were returned. A Fetch is then performed using the cursor inside a loop, and the results are returned to host variables. An example is shown here: set :v_sql_statement = "Select last_name, first_name From Customer Where last_name = ? and first_name = ?"; exec SQL Prepare Read_Customer from :v_sql_statement; exec SQL Declare Cursor Customer_Cursor FROM Read_Customer; set :v_in_last_name = 'SMITH'; set :v_in_first_name = 'JOSEPH'; exec SQL Open Customer_Cursor using :v_in_last_name, :v_in_first_name; loop until no_more_rows Fetch Customer_Cursor into :v_out_last_name, :v_out_first_name; end loop The syntax used by your favorite database may differ from that used here, but the steps performed will be the same.
VARYING-LIST DYNAMIC SQL AND THE DESCRIPTOR AREA The most flexible--and most complex--type of SQL to write is varying-list SQL, where the number of input parameters and output parameters is unknown. To accomplish this, we need to take advantage of a SQL Descriptor Area that describes the SQL statement as well as the number of values it inputs and returns. To create a Descriptor for a statement where 15 parameters may be needed for input and output variables, code the Allocate Descriptor statements as follows: Exec SQL Allocate Descriptor in_customer_desc_area with max 15; Exec SQL Allocate Descriptor out_customer_desc_area with max 15; To Describe Input and Output parameters, use the Describe Input and Describe Output statements, along with a Prepared SQL statement, as follows: set :v_sql_statement = "Select last_name, first_name From Customer Where last_name = ? and first_name = ?"; exec SQL Prepare Prepared_Read_Customer from :v_sql_statement; Then, input and output parameter information can be retrieved by analyzing the Prepared SQL statement with Describe Output and Describe Input commands, as shown: Exec SQL Describe Output Prepared_Read_Customer Using SQL Descriptor out_customer_desc_area; Exec SQL Describe Input Prepared_Read_Customer Using SQL Descriptor in_customer_desc_area; These statements put information about the variables into their respective SQL Descriptor areas. Their values must then be supplied using the Set Descriptor statement. To set the values, supply the Descriptor name, the number of parameters, and the data value. For example: Exec SQL set Descriptor in_customer_desc_area value 1 DATA 'SMITH'; Exec SQL set Descriptor in_customer_desc_area value 2 DATA 'JOSEPH'; The SQL statement can then be run as follows: Exec SQL Execute Prepared_Read_Customer Using in_customer_desc_area INTO out_customer_desc_area; Again, this syntax may not match the syntax of your database of choice, but the steps performed will be the same. Describing SQL and working with varying lists of input and output values can be complicated, and you should fully understand the exact implementation by your DBMS before you decide to use this feature. An example of how different syntax is used to support this feature is the SQLDA (SQL Descriptor Area) approach that is used by IBM's DB2.
DISADVANTAGES OF DYNAMIC SQL Dynamic SQL has the advantages of providing a great deal of flexibility in coding SQL and simplifying program logic when many combinations of input variables are possible. However, there are some disadvantages to consider when deciding whether to use Dynamic SQL. Some of these are: Dynamic SQL can be difficult to code and error-prone. Understand it completely before using it in a production program. Security can be an issue if the end user is given direct access to a table rather than an application or Binding ID. Some DBMSs get around this through SQL checking performed at Bind time rather than run-time. Performance can suffer from the extra steps that may need to be repeatedly performed at run-time. Some of these include steps to compile, parse, bind, optimize, and perform security checking of the statement. Consider carefully whether Dynamic SQL is right for you. Almost all of your production SQL will not require it. --------------------------------------------------------------------------------
WHAT IS A CLUSTER RATIO? Cluster ratios are an important factor to consider when analyzing indexes and table access. The cluster ratio is the percentage of table data that is sorted in the same order as a specific index. An example of this is shown by two indexes on the Customer Table. Index1 is on the Customer_id, which is in random order and has a cluster ratio of 40 percent. So, the table is not ordered by Customer_id. This is not a problem, since all access through this index is a direct read and does not access many rows in a single query. The second index, called Index2, is on the column Customer_Name and has a cluster ratio of 100. In other words, the Customer Table data is sorted in Customer_Name order. This is important, because a large number of searches by Customer_Name use wildcards (such as "Select * from Customer where Customer_Name Like 'SM%'"), and it is valuable to access index entries adjacent to each other and data entries that are also adjacent to each other in the same order. This will, in fact, lower the number of index and data blocks that need to be read to satisfy a query on Customer_Name and will improve performance. Consider cluster ratios and the ways that your tables are queried when deciding how table data should be ordered. --------------------------------------------------------------------------------
GENERATING SQL WITH SQL In some situations, you will need to run quick SQL on a database--but you don't have the SQL at your fingertips. Some functions can be generated through a SQL query using imbedded literals in a SQL statement. For example, let's say that you are asked to grant Select access to all of your tables to another user in the system. In other words, issue the Grant Select to userid_of_choice on TABLE statement. This can be generated quickly in two steps: 1. Run SQL against the database catalog using embedded literals. 2. Use the output from step 1 as the SQL input to step 2. Here's an example of generating this in Oracle: SELECT 'grant select on ' || table_name || ' to userid_of_choice;' From user_tables; A similar approach can be used in DB2 using the systables catalog table and in SQL Server and Sybase using sysobjects. --------------------------------------------------------------------------------
CARTESIAN PRODUCTS An unfortunate result of SQL Joins with incorrect Where clauses is the cartesian product. This can occur when two tables are joined without using a Where clause. Consider a case where an Employee table with 250 rows is joined to a Department table with 15 rows. If performed properly, the result set will include 250 rows, since each employee belongs only to one department. The correct SQL would look like this: Select emp.*, Dept.* >From Employee emp, Department dept Where emp.department_id = dept.department_id; A cartesian product is created when the Where clause is removed from this statement so that it is coded as follows: Select emp.*, Dept.* >From Employee emp, Department dept; With this SQL statement, every Employee row will be joined to every Department row. This must occur, since it is impossible for the database to determine how the rows will be matched up. The statement will therefore return 250*15 rows, for a final incorrect result set of 3,750 rows. This number isn't too huge; there are cases where millions of rows can be returned. This incorrect result of every row from one table being joined to every row from another table is known as a cartesian product. It is also the source of many hilarious jokes between DBAs, such as: "What did you do? Create a cartesian product?!?!?"--unbelievably funny. --------------------------------------------------------------------------------
ODBC--WHAT IS IT? Microsoft's Open Database Connectivity (ODBC) is a widely used method to allow applications and tools to connect to different database types using a common API and database calls. This is based on X/Open and ISO Call Level Interface (CLI) specs that separate database connectivity into multiple layers. They can be thought of simply as a server-side and a client-side layer. This allows vendors to plug different database interfaces on the server side and different client interfaces on the client side. ODBC allows for one set of interfaces to the database, one set of database access syntax, one set of data types, and one set of error codes by your applications to different database types. --------------------------------------------------------------------------------
FOUR THINGS THAT MAKE UP ODBC Open Database Connectivity (ODBC) consists of four layers: 1. An ODBC application, which calls the ODBC interface. Apps can be written in tools such as Visual Basic and PowerBuilder. They connect to the database, issue SQL, retrieve results from the database, check return codes, and disconnect from the database. These call ODBC functions that submit SQL requests and retrieve results. 2. The ODBC driver manager is called by the application, and this DLL in turn loads the appropriate ODBC driver to be used for this app. It routes function calls from the applications to the proper ODBC driver. 3. ODBC drivers are specific to a database. There are ODBC drivers for all major relational databases, such as SQL Server, Adaptive Server, Oracle, DB2, UDB, and Informix. Many nonrelational databases also support ODBC drivers to overcome their lack of SQL support. The ODBC driver processes ODBC function calls, submits requests to the database, and returns results to the driver manager. 4. The data source is the actual database to connect to. This would specify the exact database to connect to and the userid to connect with. In other words, a single ODBC driver may be used to set up many data sources. --------------------------------------------------------------------------------
BE CAREFUL OF POTENTIAL ODBC OVERHEAD ODBC is slower than native APIs for connecting to most databases. When not accessing a Microsoft database, use the native APIs, when possible. For example, with Oracle, ODBC sits on top of Net8 or SQL/Net and adds an extra layer to database access and connectivity. For Microsoft SQL Server, however, ODBC sits at the same layer as DB-lib on the network stack, and the extra layers of code are avoided. This is the most popular database driver with SQL Server, but it will be surpassed by OLE DB, which is still relatively new. --------------------------------------------------------------------------------
ODBC 3.0 ODBC 3.0 is the latest version of ODBC. It has new functions to simplify connections, initialization, and processing of server-side code such as stored procedures. When obtaining ODBC drivers from vendors, inquire about the level of ODBC compliance that their drivers support. Compliance with ODBC 3.0 may solve many problems that you encounter with ODBC. -------------------------------------------------------------------------------- WHERE CAN I GET ODBC DRIVERS? ODBC drivers can be purchased or obtained from many sources. They come with most major databases and can be downloaded from most major database Web sites such as IBM, Microsoft, Oracle, Sybase, and Informix. Third-party products are also available from companies such as Intersolv (now called Merant). Performance of different products varies, so test the different drivers if this is a concern. --------------------------------------------------------------------------------
PARALLEL QUERY Many of today's popular databases support a feature that allows database access to be performed in parallel. The specifics of how each database does this differ between implementations, but the basics are the same. In the case of a query that is accessing a large number of rows, the Parallel Query Server will split the query into multiple parallel streams that read separate partitions of data. These partitions may be physical in that they are separate physical files. The partitioning may also be logical in that the database engine separates the data into two or more ranges of keys that span a single physical file. These parallel processes that read the data may be thought of as "producers," in that they gather the data and pass them on to the next stage. The second stage of a Parallel Query implementation is where multiple query processes can perform sorting and other similar operations in parallel. These can be thought of as "consumers" of the data. The final step is where a single Parallel Query coordinator gathers the data from the multiple query process consumers and merges all rows from the two or more consumer engines and returns the results to the application that issued the original query. This processing can make large queries and table scans perform very quickly. The SQL usually does not need to be altered by the developer to take advantage of this feature, although Database Hints can sometimes be helpful. However, the database designer--or DBA--can assist this process by partitioning tables and implementing a database on a multi-CPU SMP machine or on an MPP-based architecture. --------------------------------------------------------------------------------
INDEX MERGES In some situations, more than one index can participate in a query through multi-index access. Consider the case of a Customer table that has two indexes. The first contains the column customer_name and the second contains the column postal_code. If a query needs to be written to return rows for anyone with a last name of SMITH or with a postal code of A1A1A1, the Select statement may be written as shown here: Select * >From Customer Where customer_name = 'SMITH' or postal_code = 'A1A1A1'; Based on this query and the indexes on the Customer table, both indexes may be used. The customer_name index will be scanned for all SMITHs and the postal_code index will be scanned for A1A1A1. The successful index "hits" will all retrieve ROWIDs or RIDs, which subsequently will be merged. Once this index merge is complete, the table data rows will then be accessed. Most major relational databases support this feature, but index access depends greatly on the characteristics of the data in the table and the indexes that have been created. --------------------------------------------------------------------------------
TABLE ACCESS THROUGH VIEWS, SYNONYMS, AND ALIASES Database tables can be accessed directly by naming the table itself in a query. In some cases, however, you may want to "redirect" access to a table by having a name used in a SQL statement point to a different physical table. This can be accomplished through the use of views, synonyms, and aliases. Using one of these features can deliver the following capabilities: renaming columns and tables without affecting application programs accessing tables with a different owner ID than the one used to issue the query querying a different database in a transparent manner to the querying application combining multiple joined tables to look like a single table and doing this in an optimized manner implementing security Views are implemented by all major relational databases; synonyms and aliases are unique implementation features. --------------------------------------------------------------------------------
THE USE OF SYNONYMS Although they may appear to be the same at first, views and synonyms have very different features. Synonyms are simply a way of renaming a table or view to give it a private name and to reference an object without specifying its owner. They can be used to access remote objects so that the remote object appears to be local to the querying application. A synonym can be created on a table as follows: Create synonym Synonym_Name for Owner.Table_Name; Security cannot be issued directly on the synonym, since access must be granted on the underlying table or view. Also, notice that specific columns cannot be chosen in the synonym definition. It is merely a table or view rename. In Oracle, a special feature called "public" synonyms can be used to allow all users to access a specific object (table or view) through a public synonym without needing to know the object's owner or creator ID. Except for public synonyms, synonyms are personal objects that cannot be shared across userids. --------------------------------------------------------------------------------
VIEWS A view can be thought of as a "virtual" table. It has its own Owner (or Creator) name and can have its own View name as well as its own Column names. The view can be created on other tables and views. Views can also be created as joins between multiple tables. An example of view creation is shown here: Create View as Select * from Customer, Customer_Address where customer_cust_id = cust_addr_cust_id; Here are some important points to note regarding views: Grants may be issued on views. Inserts, updates, and deletes may be issued on views except in the following cases: They are join views (except in the case of Oracle's Instead Of triggers); a column not included in the view either is nullable or has default values; functions are used in the view; or a view contains a constant value. Views may be shared and used by multiple userids when security has been accessed to those userids. Views can be used to simplify SQL processing. Views can be used to derive values based on table columns. When objects are dropped, dependent views are also dropped or disabled. --------------------------------------------------------------------------------
ALIASES IN DB2 Aliases are used in DB2 as a distributed synonym. They allow access to distributed objects in a manner that is transparent to the issuing SQL statement. These are different than synonyms in that multiple users can use the same alias, where users can use only their own synonyms (except for the case of Oracle's public synonyms). Also, when a table is dropped in DB2, aliases are dropped, but synonyms are not (in Oracle, synonyms are not dropped and may be used to access remote objects). Sample Create Alias syntax is shown here: Create Alias for Table_name/View_name; --------------------------------------------------------------------------------
GENERATING KEYS IN SQL SERVER Unique keys may be created on a table in SQL Server using the Identity feature. This is a column property that allows a Not Null column to be automatically set and incremented when a new row is inserted. It is important to note that the Identity column cannot be nullable. An example of this feature is shown in the Product table here: Create table Product (product_id int Identity (50000,1) not null, product_description char(50)); Note that the Identity column has an initial value of 50000 for the first insert and will be incremented by 1 for all subsequent Insert statements. The cust_id column also does not need to be referenced in Insert statements, since it will automatically be set. --------------------------------------------------------------------------------
SETTING GLOBAL UNIQUE KEYS IN SQL SERVER A feature similar to Identity is implemented through the uniqueidentifier property, which stores a global, unique, 32-digit hexadecimal value in the column that uses it. When implemented, this column is 16 bytes long. An example of this feature is shown in the Customer table here: Create table Product (product_id uniqueidentifier, product_description char(50)); To insert rows into this table, you must use the newid() function to set the uniqueidentifier. An example of an insert into the Product table is Insert into Product (product_id, product_description) Values (newid(), 'Database Product'); Create the uniqueidentifier as Not Null to use it as a primary key. --------------------------------------------------------------------------------
GENERATING KEYS IN ORACLE New keys can be generated in Oracle through a feature called a sequence. These are objects (such as tables, views, and indexes) that can be used by multiple users to generate primary keys. A sequence is created using the Create Sequence statement as shown here: CREATE SEQUENCE Customer_Seq Increment by 1 Minvalue 1 Maxvalue 1000000000 Nocycle Start with 1 Cache 200 Order; This sequence is called Customer_Seq and can be used to create a primary key such as customer_id on the Customer table. This sequence starts at 1, is incremented by 1, has a maximum value of 1 billion, and is processed in the order of request for a new sequence. Also, the next 200 entries are cached in memory for quicker retrieval by applications. To use sequences, use the Oracle pseudocolumns Currval and Nextval. To see the current value of a sequence, you can use this Select statement: Select customer_Seq.currval >From Dual; To insert a row into the Customer table using the next sequence value, the Insert statement can be coded as follows: Insert into Customer (customer_id, customer_name) Values (customer_seq.nextval, 'Database Company'); This simple method for generating new keys in Oracle databases is an Oracle-specific feature. --------------------------------------------------------------------------------
OUTER JOINS--WHAT ARE THEY? The most common and simple type of join to perform is an inner join. In this join, qualifying rows from one table are returned when they match qualifying rows from another. It is also referred to as a natural join. An example of this is shown in a join between two tables: Customer and Invoice, where we want to see all invoices for all customers named SMITH: Select * >From Customer C, Invoice I Where C.customer_id = I.customer_id and C.customer_name = 'SMITH'; This will return all customers named SMITH that have been invoiced. Suppose, however, that you wanted to see all customers named SMITH regardless of whether they have been invoiced. To accomplish this, we need to do an outer join between Customer and Invoice. An example of this (using ANSI SQL) is shown here: Select * >From CustomerC LEFT OUTER JOIN Invoice I ON C.customer_id = I.customer_id Where C.customer_name = 'SMITH'; An outer join, therefore, is a join between two tables where you want to see rows from at least one of the tables, regardless of whether a related row exists on the other table. --------------------------------------------------------------------------------
LEFT OUTER JOIN AND RIGHT OUTER JOIN As stated in our previous tip, an outer join is a join between two tables where you want to see rows from at least one of the tables regardless of whether a related row exists on the other table. If the table that contains the rows that you want to see is on the left of the query, it is called a left outer join; if it is in the right, it is a right outer join. An example of a left outer join is one where all customers will be returned, and those with invoices will also have their invoice information retrieved. If a customer does not have invoices, its customer information will be retrieved anyway, and the invoice information will be null. This can be written as a left outer join, as shown here: Select * >From Customer C LEFT OUTER JOIN Invoice I ON C.customer_id = I.customer_id; It can also be written as a right outer join, like this: Select * >From Invoice I RIGHT OUTER JOIN Customer C ON I.customer_id = C.customer_id; Left and right outer joins are opposite concepts, but they can be used to accomplish the same thing. --------------------------------------------------------------------------------
FULL OUTER JOIN A full outer join is one where all rows from Table 1 are joined to Table 2 and all rows from both tables (matching the Where criteria) will be shown whether or not there is a row in the other table. An example of this is a case where we want to show all customers that live in a specific Zip code and all invoices with a value greater than $100,000, regardless of the Zip code they are in. This code can be written as follows: Select * >From Customer C FULL OUTER JOIN Invoice I ON C.customer_id = I.customer_id Where C.zip_code = 12345 or I.total_value >= 100000; This could return the following results: CUST_ID ZIP_CODE INVOICE_ID INVOICE VALUE ------- --------- ---------- -------------- 1 12345 70 100 2 67890 80 200000 null null 85 175000 7 12345 null null This shows the following cases: 1) Row 1 with Cust_Id 1 has the correct Zip code (12345) as well as an invoice, even though the invoice value is less than $100,000. 2) Row 2 with Cust_Id 2 has an invoice value of $200,000, and this--as well as the customer information--are shown, even though the customer is in a different Zip code than requested in the query. 3) Row 3 with Invoice_Id = 85 does not have the corresponding customer information (obviously a data integrity problem, especially when we don't know about one of our largest customers), so the customer information is shown as having null values. 4) Row 4 with Cust_Id 7 is shown, even though it has no invoices, because it has the correct Zip code of 12345. So, you can see that the full outer join combines all data that would be shown through a natural, left outer, and right outer join. --------------------------------------------------------------------------------
DATABASE AND OUTER JOINS When coding outer joins, favor left outer joins over right outer joins. Joins will be easier to understand when they are performed the same way. Some databases (such as DB2) will convert a right outer join to a left outer join, so the Explain may be confusing to read once this conversion has taken place. DB2 and SQL Server syntax are ANSI-92 compliant, such that an outer join may be coded like this: Select * From Customer C FULL OUTER JOIN INVOICE I ON C.customer_id = I.customer_id; Oracle syntax is different and is not ANSI-compliant. To perform an outer join so that all customers are shown regardless of whether they have invoices, code a (+) beside the Invoice table name. This left outer join would be written as shown: Select * From Customer C, INVOICE I ON C.customer_id = I.customer_id (+); --------------------------------------------------------------------------------
OUTER JOINS AND UNIONS Outer joins may be coded as a series of SQL statements using the Union operator. Consider the following full outer join: Select c.cust_id, c.zip_code, i.invoice_id, i.invoice_value >From Customer C FULL OUTER JOIN Invoice I ON C.customer_id = I.customer_id Where C.zip_code = 12345 or I.total_value >= 100000; Assume that this query returns these results: CUST_ID ZIP_CODE INVOICE_ID INVOICE VALUE ------- --------- ---------- -------------- 1 12345 70 100 2 67890 80 200000 null null 85 175000 7 12345 null null This outer join could also be coded like this: Select c.cust_id, c.zip_code, i.invoice_id, i.invoice_value >From Customer C, Invoice I Where (C.zip_code = 12345 or I.total_value >= 100000) and c.customer_id = i.customer_id UNION Select c.cust_id, c.zip_code, NULL, NULL >From Customer C2 Where C2.zip_code = 12345 and NOT EXISTS (Select INV.cust_id >From Invoice inv Where inv.cust_id = c2.cust_id) UNION Select NULL, NULL, i.invoice_id, i.invoice_value >From Invoice I2 Where I2.total_value >= 1000000 and NOT EXISTS (Select cust.cust_id >From Customer cust Where cust.cust_id = I2.cust_id); In these Union statements, the first Select performs an inner join, the second Select gets all customers in zip_code 12345 without invoices, and the third Select gets all invoices with total_value >= 100000. Note that the left outer join and right outer join are performed as correlated Subselects. --------------------------------------------------------------------------------
DECIDING ON OUTER JOIN VS. UNION/SUBSELECTS We have two ways of performing outer joins: The Outer Join syntax can be used, or the SQL can be written as a series of Select statements that are Unioned together, where one Select performs an inner join and the others can simulate the outer join by performing a Subselect with the NOT Exists clause. Both can return the same results, so which should you use? The Outer Join statements are much more efficient than the approach that combines Union statements and Subselects. The reason for this is that the Outer Join syntax allows the optimizer to derive a plan that processes each table only one time. The Union + Subselect approach requires the tables involved to be processed multiple times--once for each Select statement. If you are coding SQL to run on multiple databases, consider using the Union + Subselect approach, since this will be portable across databases, while Outer Join syntax is not. Remember to consider the performance impact when making this decision. --------------------------------------------------------------------------------
DATABASE OPTIMIZERS A common feature of relational databases is the cost-based optimizer. The optimizer uses statistics stored in a database catalog and combines these with incoming SQL statements to determine which access paths to choose. The types of statistics that are used by the optimizer are number of rows, number of pages containing rows, data distribution/skewing statistics as shown through histograms, index leaf page count, cardinality of the first index column, cardinality of the entire index, index cluster ratio, and number of index levels. Other environmental information will also be used by the optimizer. Some of these things include the CPU type and speed, as well as data buffer/cache size, sort area size, and storage information. Based on this information, the optimizer will then calculate the CPU cost and I/O costs of executing a SQL statement using different access methods. The method with the lowest cost will be chosen. All relational databases use different algorithms and information to determine access paths. Their approaches, however, are similar, and the access paths chosen are also very alike. --------------------------------------------------------------------------------
OPTIMIZING JOINS--NOT ALL JOINS ARE THE SAME The ability to join two or more tables using SQL syntax is simple. The most basic form of this join between two tables is shown here: Select * >From Table1 T1, Table2 T2 Where T1.Table1_Id = T2.Table1_id; Although this syntax is simple and relatively consistent across DBMSs, the access path chosen by a relational database (RDBMS) may vary considerably because of the database optimization logic, database statistics, and the DBMS itself. A specific join may also be chosen if an SQL "hint" is used to ask the optimizer to use that join technique. A join can be processed in many ways. Some of the most common are nested-loop, sort-merge (aka merge-scan), hash, star, reformatting, and hybrid joins. This is an important concept to understand, and we will be reviewing the different join techniques over the next several days. --------------------------------------------------------------------------------
SORT-MERGE JOINS The sort-merge join--also called a merge-scan--is a very popular join performed when accessing a large number of rows. It is also seen in background tasks, batch processing, and decision-support systems. Take an example of two tables being joined and returning a large number of rows (say, thousands), as follows: Select * >From Table1 T1, Table2 T2 Where T1.Table1_Id = T2.Table1_id; The merge-scan join will be chosen because the database has detected that a large number of rows need to be processed, and it may also notice that index access to the rows is not efficient, since the data is not clustered (ordered) efficiently for this join. Here are the steps to perform this type of join: 1) Pick an inner and outer table. 2) Access the inner table; choose the rows that match the predicates in the Where clause of the SQL statement. 3) Sort the rows retrieved from the inner table by the joining columns and store these as a temporary table (this step may not be performed if data is ordered by the keys, and efficient index access can be performed). 4) The outer table may also need to be sorted by the joining columns so that both tables to be joined are sorted in the same manner (this step is also optional and depends on whether the outer table is already well ordered by the keys and whether efficient index access can be ). 5) Read both outer and inner tables (these may be the sorted temporary tables created in previous steps), choosing rows that match the join criteria. This operation is very quick, since both tables are sorted in the same manner and Database Prefetch can be used. 6) Optionally sort the data one more time if a Sort was performed (for example, an Order By clause), using columns that are not the same as were used to perform the join. The merge join can be deceivingly fast because of database prefetch capabilities and the fact that each table is accessed only one time each. --------------------------------------------------------------------------------
HASH JOIN The hash join is being implemented in most major DBMSs and is a very efficient join. With the hash join, one table is chosen as the outer table. This is the larger of the two tables in the join--and the other is chosen as the inner table. Both tables are broken into sections, and the inner table is "hashed." This hashing provides an algorithmic pointer that makes data access very efficient. The database attempts to keep the inner table in memory since it will be "scanned" many times. The outer rows that match the query predicates are then selected, and for each outer table row chosen, hashing is performed on the key, and the hash value is used to quickly find the matching row in the inner table. This join often can outperform a sort-merge join, particularly when one table is much larger than another. No sorting is performed, and memory usage is efficient for this join. --------------------------------------------------------------------------------
STAR JOINS--PART 1 OF 2 A particular type of join common to data marts and data warehouses is known as the star join, or star query. This is a join of a large Fact table with two or more smaller tables commonly called Dimensions. Fact tables may be thought of as having transactional properties. An example of a Fact table is Sales, which contains keys and some measures. This is commonly a large table with millions--and sometimes billions--of rows. The Dimension tables are used to describe the Fact table. Examples of Dimension tables are Customer, Product, and Time. Star queries get their name because there is a single Fact table in the middle, and the smaller Dimension tables are directly related to the Fact table. There are two ways that relational databases commonly perform star joins. The first will be discussed in today's tip, and we'll cover the second in our next tip. Consider the case of a central Fact table and three small Dimension tables related to the Fact table. The first and most simple approach for resolving a star query is to query the three Dimension tables using the predicates in the Where clause to filter out unnecessary rows. The resulting rows from the three Dimension tables can be joined, forming a Cartesian product. This is not as bad as it sounds, since these tables are relatively small, and the rows matching the Where criteria account for a subset of these. This Cartesian product can be sorted and stored as a temporary table. A merge-scan join can then be performed between the resulting temporary table and the Fact table. Next time we'll look at another approach that the optimizer can take toward resolving star queries. --------------------------------------------------------------------------------
STAR JOINS--PART 2 OF 2 A database optimizer can take a more complex approach toward star joins. Consider the case of the central Fact table that is being joined to three smaller Dimension tables as shown here: Select * >From Fact, Dim1, Dim2, Dim3 Where Fact.dim1_id = Dim1.id and Fact.dim2_id = Dim2.id and Fact.dim3_id = Dim3.id and Dim1.name like :input_variable_name and Dim2.Descriprion between :input_var1 and :input_var2 and Dim3.Text < :input_variable_text; The optimizer can transform this by adding three subqueries so that the new transformed query looks like this: Select * >From Fact, Dim1, Dim2, Dim3 Where Fact.dim1_id = Dim1.id and Fact.dim2_id = Dim2.id and Fact.dim3_id = Dim3.id and Dim1.name like :input_variable_name and Dim2.Descriprion between :input_var1 and :input_var2 and Dim3.Text < :input_variable_text and Fact.dim1_id in (Select dim1.id from dim1 where dim1.name like :input_variable_name) and Fact.dim2_id in (Select dim2.id from dim2 where Dim2.descriprion between :input_var1 and :input_var2) and Fact.dim3_id in (Select dim3.id from dim3 where Dim3.Text < :input_variable_text); Given this, the Subselects are performed first. If bitmap indexes exist on the Fact join columns, then the bitmap index entries that result from each subquery can be merged (in this case ANDed) together, and the Fact table data can be retrieved using the resulting index values. This operation is very quick. The Fact entries retrieved can then be joined to the Dimension tables to complete the query. Using this approach, a Cartesian product is not required. Of course, there is one other way that a database may choose to perform a star query--and that is to break the query down to traditional nested-loop, sort-merge, and hash joins as it would with any other query. --------------------------------------------------------------------------------
REFORMATTING JOINS This join technique is used by Sybase Adaptive Server and many versions of Microsoft SQL Server. Reformatting is chosen in cases where a suitable index cannot be found on an inner table, and where the inner table is large enough that repeated table scans are not a viable option. The reformatting option creates an internal table from the inner table and a temporary clustered index on it. This clustered index is created on the join column, and the join can then be performed between the outer table and the new temporary clustered index. This is a variation on the sort-merge join, and it has similar pros and cons. --------------------------------------------------------------------------------
LIST PREFETCH Accessing indexed data that is not clustered (for example, not ordered by the index) can be inefficient. One way of making this type of data access more efficient is to use a technique called List Prefetch. With this approach, an index is used to retrieve the RIDs (rowids) of the corresponding table row. Once these RIDs have all been accessed, they are sorted by the RID so that they are in order of the physical block of data that must be accessed. Armed with this information, the database can then access table data sequentially, accessing each data block in order--and only once. The database can also take advantage of Sequential Prefetch, where many blocks of data may be read in a single I/O. In the case where an Order By clause exists in the SQL, a sort may be required to put the data in correct order. --------------------------------------------------------------------------------
MULTIBLOCK I/O All major relational databases (RDBMSs) have a feature that allows for Multiblock I/O and, in some cases, Sequential Prefetch. Multiblock I/O allows for a single I/O to read many blocks of data at one time. An example of this is a case where Multiblock I/O is set to 64, allowing for 64 blocks to be read as a single I/O. This feature is excellent when large volumes of data are read by a query, since it means that almost all data to be read will already exist in the database buffer/cache when it is needed. The only time we need to wait for a physical I/O is every 64th block, when the next group of blocks is to be read. In Oracle, this is an initialization parameter that can be set to different values. --------------------------------------------------------------------------------
SEQUENTIAL PREFETCH An enhancement to Multiblock I/O is Sequential Prefetch. This uses the same feature as Multiblock I/O in that several blocks will be read into the buffer at one time, rather than the database reading a single block at a time. For increased system efficiency, the number of blocks being read should be a multiple of the database page size. Therefore, numbers such as 32 or 64 blocks are commonly read at one time. Sequential Prefetch is an improvement over Multiblock I/O because the next group of blocks will be read as the current blocks are processed. Thus, as rows begin to be processed from this set of blocks (say, 64 blocks of data), the next 64 are being read into the buffer, so the database stays ahead of all physical I/O needs. This structure eliminates the need to wait for any physical database I/O, since all needed data has already been read into the buffer. In Multiblock I/O, a wait for a physical read may occur every 64 blocks. --------------------------------------------------------------------------------
SEQUENTIAL DETECTION In some cases, a database may detect that sequential processing is being performed and may turn on Sequential Prefetch to speed up a query. This is called Sequential Detection. The following sample algorithm as used by DB2 demonstrates this point. If the Sequential Detector value is 32, then in cases where nine out of 16 pages (think of the pages as blocks) are sequential out of 32 pages, then sequential detection is turned on automatically by the DBMS, and the next group of pages (or blocks) will be read. The algorithm for sequential detector S is: If (S/4 +1) out of S/2 are accessed sequentially within a group of S, then Sequential Detection can be turned on. Similarly, Sequential Detection can be turned off if this condition is not met. --------------------------------------------------------------------------------
THE HYBRID JOIN The hybrid join is unique to DB2 and is a mixture of the nested-loop join and list-prefetch features. The hybrid join works as follows. Consider the SQL statement where a medium-sized (for example, hundreds of rows) result set will be returned: Select * >From Customer outer, Invoice inner Where outer.customer_name = inner.customer_name and outer.customer_name = 'SMITH'; The hybrid join is chosen and the following steps will be performed. First, access the outer table by index or by sorting the data so that outer-table access will be in the order of the join column on the inner table. The outer table will be accessed by customer_name. Once an outer table row is retrieved, access the corresponding index rows in the inner table. Combine the inner-table RID (rowid) from the inner-table index, and attach this to the row of data from the outer-table rows retrieved. A temporary table has been created that consists of the outer-table row and the inner-table RID. Next, sort the temporary table by the inner-table RID. This action puts the temporary table in the same order as the inner-table rows. Join the temporary table to the inner table using the RID on the temporary table row to get the row on the inner table. This will use List Prefetch to access the inner table. Concatenate the inner-table columns to the outer-table columns in the temporary table. If an Order By existed in the Query (that is not the same as the joined columns), then sort the resulting rows before returning them to the calling programs. --------------------------------------------------------------------------------
COMPARING HYBRID, NESTED-LOOP, AND MERGE JOINS The hybrid join is useful for joins involving a medium-sized number of rows and where a qualifying index exists on the inner table. For larger joins, the merge scan will be used; for smaller ones, the nested-loop join technique is the most efficient. The hybrid join "sweet spot" is small, and a slight shift in data volumes can render it less efficient than an alternate join technique. This is likely the reason why database vendors--other than IBM--have not picked up on this technique. --------------------------------------------------------------------------------
ANTI-JOINS--PART 1 OF 3 Anti-join describes a join that returns rows from one table that do not exist in the other. There are many ways to perform anti-joins with SQL. You can use the following expressions to do this: NOT IN, NOT EXISTS, OUTER JOIN, MINUS, and EXCEPT. We have already covered the first three statements in previous tips. These are implemented in most major RDBMSs. The last two have more limited implementations. The MINUS keyword is not ANSI-compliant, while the EXCEPT keyword is compliant with the ANSI SQL2 standard. These will be covered in the next two SQL tips. --------------------------------------------------------------------------------
ANTI-JOINS--PART 2 OF 3: USING MINUS Some databases use the MINUS operator to return rows from one table that do not exist in another. Here is an example of a SQL statement using MINUS: Select customer_id, customer_name from Customer MINUS Select customer_id, customer_name from Invoice; This statement will show all Customers that have not been invoiced. Notice that when using MINUS, the number of columns and column types must be the same for the two queries. In this way, it is similar to the way that a Union statement operates. It is also limiting, since all of the same columns and values used in the statements must exist in both tables--therefore, there are many cases where the MINUS operator will not be useful. The implementation of the MINUS operator is implemented in Oracle. --------------------------------------------------------------------------------
ANTI-JOINS--PART 3 OF 3: USING EXCEPT The EXCEPT statement is included in the ANSI SQL2 standard, although it is not implemented in many popular databases that advertise themselves as fully SQL2 compliant. Here is an example that lists all Customers that have not been invoiced: Select customer_id, customer_name from Customer EXCEPT Select customer_id, customer_name from Invoice; With the above example, the number of columns and column values must be identical. The EXCEPT statement has two additional features. With EXCEPT, we can use the EXCEPT CORRESPONDING or the EXCEPT CORRESPONDING BY (column list). With EXCEPT CORRESPONDING, all rows from the first Select statement will be shown if they are not in the second SQL statement, and the matching is done for all columns that have the same name and data type on both tables. With EXCEPT CORRESPONDING BY (column list), all rows from the first Select statement will be shown if they are not in the second SQL statement, and the matching is done for all columns in the column list--and these must be of the same name and date type on both tables. The EXCEPT condition therefore has some important features to compare data between two tables. --------------------------------------------------------------------------------
JOINING TABLES USING INTERSECT Tables may be joined using many approaches. The most common approach is to use a natural join, as shown below: Select c.customer_id >From Customer c, Invoice i Where c.customer_id = i.customer_id; Another way to accomplish the same thing is to use the INTERSECT operator. Using this approach, the above statement could be written as Select customer_id from Customer INTERSECT Select customer_id from Invoice; As is the case with the EXCEPT clause, the INTERSECT operator has two additional features. With INTERSECT, we can use the INTERSECT CORRESPONDING or the INTERSECT CORRESPONDING BY (column list). With INTERSECT CORRESPONDING, all rows from the first Select statement will be shown if they are in the second SQL statement, and the matching is done for all columns that have the same name and data type on both tables. With INTERSECT CORRESPONDING BY (column list), all rows from the first Select statement will be shown if they are in the second SQL statement, and the matching is done for all columns in the column list--and these must be of the same name and date type on both tables. The INTERSECT condition therefore has some important features to compare data between two tables. --------------------------------------------------------------------------------
DYNAMIC SQL WITH DB2 DB2 for OS/390 (now called DB2 Universal Database--or UDB) has some important features that can be used to implement dynamic SQL and improve performance. Some of these important performance features are listed below. Use the keepdynamic bind option. This keeps dynamic SQL prepared across commits, eliminating the need to perform the relatively expensive prepare operation after every commit has been performed. Use dynamic SQL with actual values rather than host variables. This will allow the optimizer to perform more accurate cost estimates. In DB2, use dynamicrules(bind) to perform security checking at bind time rather than having this done at run-time. This eliminates the need to grant select, insert, update, and delete access to tables for individual users. With this feature, the user can be granted access to a Plan or Package. Using these features with DB2 will improve system security and performance. --------------------------------------------------------------------------------
B-TREE INDEXES--WHAT ARE THEY? All popular RDBMSs use B-tree indexes as their most common access path. B-tree stands for Balanced tree--the database attempts to balance the levels across multiple branches. This index is a hierarchical index that contains a header block (some DBMSs use the term "page" or "node" for the term "block") at the top level. The header (root) block contains keys and pointers to the intermediate (branch) blocks, which in turn contain keys and pointers to the leaf block. At the leaf level, all keys are entered with pointers (rowids) to the table data. Every row is represented at the leaf level, while every leaf block has an individual entry at the intermediate level and every intermediate-level block has an entry at the header level. --------------------------------------------------------------------------------
B-TREE INDEXES--HOW ARE THEY POPULATED? A B-tree index keeps all index entries in order. Individual entries will be kept in physically continuous blocks but may become logically continuous if a block fills up and an index split must be performed. Within a leaf block, the index entries are not kept in physical order but in logical order. As the table and index are populated, only a header block is first populated. At this stage, the B-tree index is a one-level (and one-block) index. Once this becomes full, leaf blocks are created, and the header block is transformed to point to the first two new leaf blocks. This is now a two-level index. Once the header block again becomes full, intermediate blocks are created. The header block is transformed to point to each intermediate block, and the intermediate blocks are transformed to point to the leaf blocks. The leaf blocks still point to the individual rows on the table. At this stage, we have a three-level index, which is useful for accessing millions of rows. Once the intermediate-level blocks become full, an extra level of intermediate-level blocks can be created, making this a four-level index. --------------------------------------------------------------------------------
B-TREE INDEXES--THE IMPORTANCE OF THE NUMBER OF LEVELS Keeping track of the number of levels in B-tree index structures is important, since this can affect the number of logical and physical reads that are required to access a row of data. Accessing a single row of data using a three-level unique index and a perfectly written query requires that four blocks of data be accessed--the index header, the intermediate-level block, leaf blocks, and the table data. Once an index moves to four levels, an extra I/O may be required. When you see this occur, reorganize your indexes, which could bring the indexes back to a three-level one. In fact, indexes should be reorganized regularly to remove any index block-splitting that may have occurred. You will most commonly see three-level indexes in your database. --------------------------------------------------------------------------------
B-TREE INDEXES--WHAT ARE THEY GOOD FOR? B-tree indexes are good for most common structured data types that do not have large numbers of duplicate values. In particular, B-trees are ideally suited for systems that have the following characteristics: Data types are structured having types of Number and Character. The index columns have a relatively high cardinality--that is, a high degree of uniqueness. A great deal of data access is random or retrieves relatively few numbers of rows. B-trees are also ideally suited for these circumstances: Queries that perform small-range scans of data. Large-range scans of data where the table data is ordered (clustered) by the index being used. Select access and controlled insert, update, and delete activity. On the other hand, B-trees are not ideal for the following circumstances: Indexed columns with low cardinalities (degree of uniqueness). Unstructured data. For example, specialized data types (such as spatial data) require that alternative indexing techniques be adopted. Large-range scans of table data where the data is not ordered by the index. In this case, reading the table data and sorting it is more efficient than using a B-tree index. High volumes of insert, update, and delete activity on indexes that impact performance. To conclude, B-trees support most of our business indexing requirements, but in some cases, they may degrade performance. --------------------------------------------------------------------------------
INSERT, UPDATE, AND DELETE PERFORMANCE ON INDEXES Changing column values that are indexed by B-tree indexes affects system performance. In today's tip, we will cover the performance impact that Delete statements have on B-tree indexes. Deleting a row that has indexed values has a relatively large impact on system performance. When a row of data is physically deleted, the index entries must be deleted as well, which means that the entry on the leaf block (which points to the data row) must be deleted. Remember that an intermediate block (assuming a three-level B-tree index) will have a single entry that contains a high key for each leaf block; if the row being deleted happens to be the entry on the intermediate block, it must be deleted and a new value must be placed in the intermediate block that corresponds to the new high value on the leaf block. Similarly, the index header block contains a single high value for every intermediate block. If the intermediate-block value that was changed was also the highest value for the intermediate block, then the header block must be changed to enter the new high value of the intermediate block. This potential cascade-up of index-entry deletions slows performance and can even cause locking issues. Next time, we will discuss the effect of inserting new entries in an index. --------------------------------------------------------------------------------
INSERTING AN INDEXED VALUE Inserting a row into a table incurs significant index maintenance overhead. When an indexed row is inserted, a new index leaf-block entry must be added. This entry must be placed in the correct leaf block, since the index entries are always kept in order. Some shifting of entries in the leaf block may be needed to keep the values in that block in the correct logical sequence. If the new entry is the new high value in the leaf block, then the intermediate block (assuming a three-level B-tree index) will need to be changed to put the new value pointing to this leaf block in place of the previous high value. Similarly, if this is the new high value of the intermediate block, the header block must be changed to replace the old high value pointing to the intermediate block with the new one. If this were not complicated enough, an insert into an already full leaf block causes a leaf-block split. In this situation, half of the leaf-block entries remain in the same place and a new leaf block will be created with the other half of the entries; the new inserted value will be placed on the correct leaf block. This causes two changes to occur on the intermediate block. First, the original leaf block may need to have the original high value replaced with the new one. Second, the new leaf block will need to be added to the intermediate block. This may also cascade-up to the header block. In the case that an intermediate block is full and a new entry is being added, a "block split" may also occur at the intermediate-block level with similar changes being performed at the header block. Index inserts can affect both system performance as well as locking contention. It is also easy to see why regular index reorganization goes a long way toward solving system performance problems. --------------------------------------------------------------------------------
THE IMPACT OF UPDATING INDEXED COLUMNS You can think of updating an indexed value in the same manner as performing a delete and an insert of the indexed value. An update can incur all of the overhead associated with deletes and inserts that were covered in earlier tips. When updating columns in your system, be aware of indexed columns and don't include columns that do not change in your Update statements because of the added system overhead that may be incurred. --------------------------------------------------------------------------------
THE LOGICAL DELETE If your system performs a large number of deletes and the performance of these physical deletes (that is, performing the deletes with the SQL Delete statement) makes system performance inadequate, an alternate design approach may be used. This approach uses the concept of the "logical delete." With this approach, whenever a delete needs to be performed, a special column (call it audit_change_type) will be updated to D. >From now on, this row will be treated as if it were not there. This can be accomplished by coding programs to ignore all rows with an audit_change_type of D or to access the table through a view that filters out these rows. A background task that archives these rows and physically deletes these logically deleted rows can then be scheduled to run during times when the system is quiet. Table or index reorgs can also be scheduled to correspond with the archive/purge process. This process avoids heavy index maintenance activity that deletes incur during busy system times and gives you more control over database performance. --------------------------------------------------------------------------------
HANDLING A LARGE NUMBER OF INSERTS You can use several approaches to relieve the overhead that high insert activity imposes on a system. The first and most simple approach is to provide a lot of free space on tables and indexes that have a large number of inserts. This can alleviate some of the block-split activity that occurs on indexes and is particularly effective when insert activity is random. There is no benefit to this method when insert activity is sequential, since the high values will always go to the end of the pageset (table or index). Another approach is to drop problem indexes before the insert activity occurs and to re-create the indexes after the heavy activity has ended. In this way, index maintenance will be avoided and the indexes will stay well organized. -------------------------------------------------------------------------------- CLUSTERING INDEXES AND CLUSTER RATIO Clustering indexes are indexes created on tables that are in the same order as the table data. Because of this, only a single clustering index can be created on a table--and, of course, other nonclustering indexes may also be created on a table. A clustering index may be unique or nonunique. An example of a clustered index is the case of a Customer table where a clustering index is created on the customer_name column. In this case, the table data will be ordered by customer_name. This is extremely useful for systems that want to retrieve data by customer_name, want to use an index to access the data, and want customer_name ranges to be physically located on a small number of index and table blocks. Database implementations of this technique vary, and it is important to understand the clustering implementation of your favorite DBMS. In upcoming tips, we will cover DB2, Sybase, SQL Server, and Oracle implementations of this tip. --------------------------------------------------------------------------------
CLUSTERING INDEXES IN DB2 When a clustering index is created on a DB2 table, it is created in the same manner as other indexes. In other words, the index is implemented separately from the table data. Only one clustering index can be created on a DB2 table. Once the clustering index has been created, DB2 will attempt to place any new data in clustering order. This is possible when there is significant free space in the table block where a data row should be placed to keep it in the correct order. If the correct data block (or "page") is full, then DB2 will attempt to place the data row in a close block/page that will likely be read when Sequential Prefetch is performed along with other values similar to this. In other words, index entries are always ordered and DB2 will attempt to order all table data, but this is not guaranteed. A table reorg will put all data back in clustered order. A table is considered to be clustered if it is 95 percent in clustering order. -------------------------------------------------------------------------------- CLUSTERING INDEXES IN SQL SERVER AND SYBASE Clustering indexes are implemented the same way in Microsoft's SQL Server and Sybase's Adaptive Server. They are referred to as clustered indexes. As was the case with DB2, there can be no more than one clustered index implemented on a table. The difference with this implementation is that the table's data rows are actually the index leaf pages. In other words, rather than the index leaf page pointing to a data row, the table data is appended to the clustering key on the leaf page. Because of this, the data is physically ordered in the same sequence as the index. These indexes are excellent when many range scans will be performed in the same order as the clustered index. In cases of frequent inserts and deletes, careful thought must be given to the amount of fillfactor (free space) to avoid fragmentation as much as possible. The clustered indexes should be carefully monitored to determine when table reorgs should be performed. -------------------------------------------------------------------------------- CLUSTERING INDEXES IN ORACLE Oracle assigns a different meaning to the term index cluster than do other databases. In Oracle, clusters and cluster keys can be used to allow more than one table to physically place related rows into the same segment. This speeds up joins, since the data from multiple tables can be read in the same physical and logical I/O. Index clusters in Oracle have limited use. These are useful only if the tables using the index cluster are almost always accessed together. If they are not accessed together, then table scans of a single table require more blocks to be read. These also have additional insert overhead to keep the cluster ordered properly. One other important thing to consider is that the cluster's join performance may not be a lot better than table joins. Use this feature only in extreme situations. Oracle offers another option called index-only tables, which we will examine in our next tip. ----------------------------------------------
INDEX-ONLY TABLES IN ORACLE Oracle offers the new (in version 8) feature of index-only tables (also referred to as IOT and index-ordered tables). These are similar to Sybase and Microsoft's implementation of clustered indexes, since the table data is stored in the index leaf page. This forces the data into the physical order of the index. One major disadvantage of these tables is than no other indexes can be created on them, since a rowid is not associated with the data row. One feature of these is to create a PCTTHRESHOLD, which is the percentage of rowlength to be stored in the B*-tree. The columns frequently accessed should be at the beginning of the table and have a total length smaller than this PCTTHRESHOLD value. The remaining columns will be placed in an overflow tablespace. Index-only tables have limited use in their current implementations and should be used only when you are sure that just one index will be needed on the table and that the order of the data is very important. ---------------------------------------------- PARTITIONED TABLES--WHAT ARE THEY? Table partitioning is the capability to implement the data from a single table across multiple datasets. This partitioning can be done in a sophisticated manner as implemented by a DBMS (database management system), or it can be application-controlled. Partitioning is usually done to improve performance in working with very large tables and to facilitate database availability and parallelism. An example of this would be to partition a table by month so that new data will always be implemented on a brand-new dataset and the oldest months' data can be archived easily and deleted by removing the oldest dataset. This can also be used to allow database utilities to be performed on a specific partition, allowing greater database availability and greater utility performance. Some common partitioning schemes are vertical/horizontal, hash, round-robin, and range partitions. The implementation of these as well as the major DBMS implementation of them will be covered in the next several tips. ----------------------------------------------
VERTICAL AND HORIZONTAL PARTITIONING The two overriding partitioning approaches that other approaches fall within are horizontal and vertical partitioning. In vertical partitioning, a table is split by columns. In other words, some columns will be implemented in one table and other columns may be implemented in other tables. Vertical partitioning is useful when large performance, space, availability, or backup and recovery benefits will be realized. This type of partitioning is inherent to a system's architecture and design, and applications must be coded to handle this type of partitioning. It is possible but not likely that applications using this data could encounter problems because of the partitioning strategy. Horizontal partitioning is the situation where some rows from an entity (or object) are implemented in one table, and the remainder of the rows are implemented in other tables. This is the most common type of partitioning, since an entire row is always stored together, and since many DBMSs support this type of partitioning without applications needing to be aware of it. Common DBMS partitioning approaches of hash, round-robin, and range partitioning are all forms of horizontal partitioning that are implemented by DBMSs. These approaches are generally favored and can often be implemented transparently to an application. ----------------------------------------------
PARTITIONING BY DBMS Some common partitioning schemes are implemented using standard and--in some cases--advanced DBMS features. These are the preferred approach, since they are supported by the DBMS vendor and can be implemented in a manner that is transparent to the application. Round-robin partitioning can be found in massive parallel processing (MPP) systems and parallel server implementations, where the DBMS spreads a single table across many disks transparent to an application. As each row is inserted to a table, it is assigned the next table partition--or fragment--and this continues with each insert until the last partition is reached and the first partition is used with the next insert. This can greatly speed data inserts and can speed up certain types of select, update, and delete activity by spreading I/O across multiple datasets and disk drives. Hash partitioning is similar to round-robin partitioning, except that a hashing algorithm is used to determine which table partition (dataset and drive) a particular row can be inserted to, read from, and updated. This partitioning method can be very fast for inserts, selects, updates, and deletes of data, since the same algorithm that was used to place data on an insert can be used to locate the data in a subsequent operation. Range partitioning is performed when the partition is based on a key value. Ranges of key values are set up for each table partition (dataset) to determine where to locate a row. One common use of this is to partition data by date range to simplify data backup and recovery as well as data archiving. The downside of date-range partitioning is that all current activity will be performed on a single partition. ----------------------------------------------
PARTITIONED TABLES IN DB2 DB2 does its standard data partitioning through range partitioning, which is performed through a partitioned tablespace that a table is created in. It is therefore the tablespace that is partitioned--not the table--with each part implemented on a different dataset. A single table is created on the partitioned tablespace, and a special clustering index is created to do two things: It orders the data by the key that will be used to set data ranges, and it determines the high and low keys of each partition range. The clustering index itself is partitioned with the same ranges as the partitioned tablespace, which allows partition independence. Partition independence provides many benefits: - Individual partitions can be locked. - Individual partitions can be taken online and offline. - Utilities can be run in parallel against individual partitions. Partitioned tablespaces have existed in DB2 for more than ten years, and this is the most mature partitioning method in the industry. DB2's MPP database offerings provide more complex partitioning methods than those discussed above. ----------------------------------------------
PARTITIONED TABLES IN ORACLE Oracle has a partitioning implementation similar to IBM's DB2. This is a relatively new offering with Oracle's version 8 release. Range partitioning is performed on a table with each partition implemented on a different file. This partitioning benefits query parallelism as well as utility independence and system availability. Spanning indexes (that is, indexes that span multiple partitions) may be created as global or local indexes. The local option may be used for any index. This allows the index to be partitioned in the same manner as the table, which is important to provide true partition independence. Partition independence allows an individual partition to be brought offline while others remain online, and it allows utilities to be run against multiple partitions at the same time. It has many benefits for large tables and is almost necessary for very large databases. Oracle's partitioning is an optional item that must be purchased separately from the DBMS. ---------------------------------------------- APPLICATION/MANUAL PARTITIONING Data partitioning can be performed without the benefit of DBMS-implemented partitioning. This is done by "manually" partitioning data. In other words, data that would normally belong to one table can be divided by row range and/or columns into multiple tables. The biggest difference between this implementation and DBMS-implemented partitioning is that the application must be written to use the proper partitions with application partitioning. So, if a single table is split into two or more tables, the application must know which table or tables to access. Database views may be used to make the implementation transparent to an application, but in most cases, inserts, updates, and deletes will need to be performed to the proper tables. Also, database referential integrity implementation must now deal with the multiple tables. Undoing application partitioning can be difficult. After a few years, the original reasons for performing partitioning may no longer exist, but to undo the partitioning may require major application system changes. Therefore, the pros and cons of application partitioning should be carefully weighed before implementation. ----------------------------------------------
DEADLOCKS Deadlocks--sometimes called the "deadly embrace"--occur when two systems try to gain access to data that the other system has locked. For example, suppose System A has Row A of Table A locked, and System B has Row B of Table B locked. If System A then tries to acquire a lock on Row B of Table B (say, to perform an update) and System B then tries to acquire a lock on Row A of Table A, neither system can move forward in its processing. They are in a deadlock situation. The database will time out one of the two systems and allow the other system to continue. Deadlocks usually can be avoided by having systems access tables in the same order and by trying to have data on different tables clustered in the same order. This is not always possible and deadlocks can still occur, but it will remove most of deadlocks that your system would otherwise have acquired. ---------------------------------------------- THE DATABASE MARKETPLACE--OBJECT RELATIONAL DATABASES The database marketplace is undergoing another major shift in architecture as many vendors embrace the concept of implementing object-relational databases (ORDBMSs). This merging of relational and object database technology is the answer to the long-asked question of which architecture would win the database war--relational or object technology? Object technology is still built into popular relational databases, and this trend picked up momentum with the Informix purchase of Illustra and the introduction of Data Blades into the Informix product line. This purchase spurred Oracle to add some similar object features into its Oracle 8 offering called data cartridges, while IBM's strong ORDBMS includes relational extenders. Sybase is using its Adaptive Server architecture and Microsoft is relying on OLE-DB and ADO to gain a foothold in this market through different architectures. ORDBMSs include all those features previously included in relational databases, along with new features such as user-defined functions, user-defined datatypes, user-defined operators, and user-defined indexing. These are combined with enhancements to SQL as introduced through ANSI's SQL-3 standard. This change, combined with the introduction of Java and XML into databases, is rejuvenating the database marketplace and positioning ORDBMSs as key components of any corporate and Internet-based system architecture. ----------------------------------------------
USING DATE COLUMNS IN TABLES AND SQL When implementing date and time in your database, take advantage of built-in DateTime datatypes rather than implementing these in character fields. This will provide the advantage of using built-in DateTime functions rather than having to code your own. SQL-92 offers four items relating to date and time: - Date includes only the date. - Time includes only the time. - Timestamp includes the date and time. - Interval is an interval between two date/time datatypes. Standard functions that can be used are - Datetime - Datetime--for example, 2000-03-31 - 2000-03-29 Day = 2 - Datetime +- an Interval--for example, 2000-03-31 + Interval '1' Day = 2000-04-01 - Interval */ number--for example, '3' DAY * 3 = 9 We will look at the implementation of date and time datatypes by popular DBMSs in upcoming tips. ----------------------------------------------
DATE AND TIME IN SQL SERVER Microsoft's SQL Server and Sybases Adaptive Server have similar implementations of date and time. The datatype used is Datetime and stores both the date and time components in it. The date includes the century while the time component stores time to the millisecond (to an accuracy of three milliseconds). Date and time arithmetic can be performed so that one date can be added to or subtracted from another. One unique feature of SQL Server's Datetime datatype is the case where a two-digit year is supplied. If this year is 50 or over, SQL Server appends 19 to the year and if the two-digit year is less than 50, SQL Server appends 20 to the year, assuming this to be the 21st century. It is important to note that the default base date is January 1, 1900, and the default base time is 12:00 midnight. So, if you only insert a time to a Datetime column, the date will be Jan. 1, 1900, and if you insert only the date, the time will be stored as 12:00 midnight. ---------------------------------------------- DATE AND TIME IN DB2 DB2 provides three datatypes to support date and time: Date, Time, and Timestamp. When date and time values are to be used independently, then the Date and Time datatypes should both be used. They will use less space (seven rather than ten bytes) and have more formatting options than the Timestamp datatype. However, if date and time arithmetic must be performed together, a single Timestamp column is easier to use than two columns for date and time are. Also, for greater time precision, the Timestamp datatype wins. A Timestamp column stores time to the microsecond (six decimal places) while time only stores data to the second. DB2 Date and Time datatypes offer more options, and care should be taken to determine how your Date and Time columns will be used before deciding which datatype to use. If you feel that date and time will be needed together, then you should opt for the Timestamp datatype. ---------------------------------------------- THE DATABASE MARKETPLACE--WHO HAS THE MARKETSHARE? Who are the major players in today's database marketplace? According to Dataquest's surveys earlier this year, IBM leads Oracle by less than 1 percent in total RDBMS (Relational Database Management System) revenue, and both companies hover at approximately 30 percent marketshare. This includes all platforms--from NT and UNIX to mainframes, where a lot of IBM's revenue comes from. Next on the chart is Microsoft, which has about 10 percent marketshare by revenue. The next tier sees Informix and Sybase with 4 to 5 percent marketshare each. These numbers are difficult to interpret and somewhat misleading, since the numbers change dramatically when the number of licenses sold are counted. One trend worth noting: These are the top five RDBMSs (and ORDBMS--Object-RDBMS), and IBM, Oracle, and Microsoft are the big three to watch in the next couple of years. ---------------------------------------------- DATA MODELING Before we ever get to code SQL on an existing database, a great deal of planning must be done to ensure that the database is designed properly so that systems (and SQL) can operate on it. Much of this design work comes in three distinct steps. First, a data architecture is done. This can be in the form of a conceptual model that covers the entire enterprise. This "blueprint" can be used to determine where new data fits in, and it can be used to help guide the detailed modelers to design data for different systems that can fit together. Second, a logical data model is created for a system. This model must fit within the data architecture and must service all the needs of the system to be built. It is at this stage that the data modeler focuses more on "what is to be done" rather than "how to do it." This model is very detailed. All entity types and attributes are created with definitions and properties (datatype, length, scale, nullability). This data goes through a process called normalization, which we will discuss in our next tip. Third, the logical data model is transformed into a physical data model. It is here that the physical data modeler must focus on how data is to be accessed. This individual (or group of individuals) must be familiar with the logical data model, the system to be built, and the internals of the database that the data is to reside on. This is the model of the database that will be implemented. Performing these steps will go a long way toward preventing unwelcome surprises at development and implementation time. ---------------------------------------------- DATA NORMALIZATION--SECOND NORMAL FORM Once data is in First Normal Form, it is said to be in Second Normal Form when all attributes that are not part of the key are dependent on the entire key. For example, if an entity has a primary key consisting of two columns and one of the attributes is fully dependent on only one of the columns, it is not in Second Normal Form and the attribute should be moved to another entity type. ---------------------------------------------- NORMALIZATION--THIRD NORMAL FORM An entity is in Third Normal Form if it is already in Second Normal Form and all nonkey attributes are not dependent on any other nonkey attribute. Therefore, to be in Third Normal Form, an attribute must be dependent on the key, the whole key, and nothing but the key. To summarize, data is in First Normal Form when all entity types have a primary key, all attributes are atomic, and there are no repeating attributes on an entity type. Data is in Second Normal Form when all attributes that are not part of the key are dependent on the entire key. Data is in Third Normal Form when all nonkey attributes are not dependent on other nonkey attributes. There are more steps in normalization (fourth and fifth normalization), but they are subtle and the first three are the key ones to know. ---------------------------------------------- DEFAULT COLUMN VALUES IN SQL SERVER In SQL Server, a column's default value can be specified as part of the Create Table statement. For example: Create Table PRODUCT (product_id int not null, product_category_name char(50) default 'CLOTHING') In this example, a value of 'CLOTHING' will be set for the product_category_name column in cases where an insert is performed but this column's value is not set. ---------------------------------------------- DEFAULT COLUMN VALUES IN DB2 IBM's DB2 provides standard default values and user-defined defaults. The standard defaults are as follows: - Numeric datatype = 0 - Character datatype = Spaces - Varchar datatype = zero length string - Date, Time, and Timestamp = current date, time, and timestamp An example of a table with defaults is shown here: Create Table PRODUCT (product_id Integer Not Null, product_category_id Integer Default 121, product_category_name char(50) default 'CLOTHING', product_name char(50) not null with DEFAULT) In this example, any insert performed without the bottom three columns defined will contain a product_category_id of 121, a product_category_name of 'CLOTHING', and a product_name of Spaces. ---------------------------------------------- JAVA AND DATABASES Many of today's databases implement Java as SQL Persistent Stored Modules rather than implementing only proprietary software. In other words, rather than coding procedural SQL modules such as stored procedures or triggers in such languages as Transact-SQL (Sybase) or PL*SQL (Oracle), you can now write them in Java. This is true for databases such as DB2 UDB, Sybase Adaptive Server, and Oracle 8i. Java provides the benefits of increased portability of code, better performance, and leveraging a language that is open, popular, object-oriented, and known by many programmers. Java database access can be performed through JDBC or SQLJ. JavaBlend is the object-relational mapping for Java. We will explore these concepts in upcoming tips. ---------------------------------------------- JAVA AND JDBC Java Database Connectivity (JDBC) is the Java-based Database Connectivity API for accessing relational databases. It is based on X/Open's Call Level Interface. JDBC itself is written in Java, allowing it to have the same benefits as other Java code. A proposal to extend JDBC to support SQL3 types is under development. Many interfaces have provided for JDBC. An example of this is the JDBC-ODBC driver provided by JavaSoft. There are also database-independent drivers and drivers that support proprietary connectivity. JDBC allows Java programmers to access most major databases using the same common API in the same way that ODBC does. ---------------------------------------------- JDBC COMMON DRIVER CLASSES Some common JDBC drivers include the following four classes: - DriverManager Class--This class creates and manages database connections through methods to manage drivers. Some of these methods are registerDriver, deregisterDriver, getConnection, and getDrivers. - Connection Class--This class has methods to prepare SQL and manage transactions. This represents each database connection. Examples are createStatement, prepareStatement, commit, rollback, and close. - Statement Class--The methods in this class execute the SQL statements. This includes executeQuery, execute, executeUpdate, getMoreResults, and getCursorName. - ResultSet Class--The methods in this class process SQL statement results and contain data that is retrieved from the database. Examples include next, getMetaData, and getCursorName. In our next tip, we will look at an example of some JDBC statements. ---------------------------------------------- JDBC EXAMPLES: SETTING UP THE CONNECTION To run a SQL statement through JDBC, some work must be performed to set up the database connection. Here's an example of four things needed to create this connection. - Develop a URL (uniform resource locator) to define a database named Production that will be accessed with a JDBC driver accessing an ODBC database. For example: String url = "jdbc:odbc:production"; - A JDBC-ODBC driver must be identified: Class.forName("sun.jdbc.odbc.JdbcOdbcDriver"); - Create the connection with the userid and password: Connection aConn = DriverManager.getConnection(url,"User01","pwd02"); - Create a statement object instance to execute the SQL: Statement aState = aConn.createStatement(); Next time, we'll see the execution of the SQL statement. ---------------------------------------------- JDBC EXAMPLES: RUNNING THE SQL Once the JDBC database connection has been established, we must define the SQL statement, execute it, and close it. Here is an example of five steps that perform this. 1) Define a string containing the SQL statement: String newQuery = "Select name FROM Product"; 2) Execute the SQL query using the createStatement object instance aState, which was previously defined (in our last tip): ResultSet rSet = aState.executeQuery(newQuery); 3) Get a row of data using the "resultSet next" method, which moves a pointer to the next row of data (similar to a cursor) and returns a boolean value of True if a row exists: boolean more = rSet.next(); 4) Move the data to a String (previously defined) named vProductName: vProductname = rSet.getString(1); 5) Close the connection and statement: rSet.close(); aState.close(); aConn.close; ----------------------------------------------
JDBC--PREPARING SQL When executing SQL multiple times using different bind variables, the statement should be prepared so that it can be reused when it is run multiple times. Here is an example of this: 1) Define a string containing the SQL statement: String prepNewQuery = "Select name FROM Product WHERE productId = ?"; 2) Prepare the statement to be run: PreparedStatement prepState = conn.prepareStatement(prepNewQuery); 3) Set a variable to the product_id to be searched: vProductId = 100009; 4) Set the variable in the statement to be the same as the one above. This is the first and only variable in the statement: prepState.setBigDecimal(1,vProductId); 5) Execute the prepared statement: prepState.executeQuery(); Preparing a statement in this manner will deliver performance benefits, since the SQL statement will not need to be parsed, prepared, or optimized for every value of productId. ----------------------------------------------
JDBC AND STORED PROCEDURES SQL Persistent Stored Modules, such as stored procedures, can be executed through JDBC. To do this, the CallableStatement class must be used. An example is shown below, where the getProductName stored procedure is executed: 1) Prepare the call of the stored procedure: CallableStatement callState = conn.prepareCall("{? = call getCustName(?)}"); 2) Register the output parameter that will contain the name: callState.registerOutParameter(1,Types.Varchar); 3) Set up the input variable that will be passed to the stored procedure using an already defined and set variable for the customer number: callState.setInt(2,vCustNumber); 4) Execute the stored procedure: callState.executeQuery(); 5) The output string can be retrieved by moving the following to a string variable: callState.getString(1); As you can see, the advantages of database-stored procedures can be maintained when performing JDBC database access. ---------------------------------------------- SQLJ AND JAVA SQLJ is the ANSI/SQL standard for coding embedded SQL in the Java programming language. This interface was developed by several companies, including IBM, Oracle, JavaSoft, Sybase, Informix, Compaq, Microsoft, XDB, and Cloudscape. This standard allows for the extension of Java programs to include static SQL. SQLJ translators are utilities that convert SQLJ statements into Java code that can access a database through a call-level interface. This code can then be compiled by a Java compiler. SQLJ supports only static SQL. To run dynamic SQL, JDBC must be used. SQLJ has mechanisms to enable Java to switch easily between the JDBC and SQLJ environments. SQLJ therefore leverages many of the benefits of JDBC. ---------------------------------------------- SQLJ SYNTAX Some basic SQLJ syntax is shown below. SQLJ statements have the following characteristics: - They start with #sql and terminate with ; - Host variables begin with : - Text is enclosed in braces {} Here is an example of a simple SQLJ statement: #sql {SELECT name into :varName FROM Customer WHERE customer_id = :varCustId}; ---------------------------------------------- SQLJ ADVANTAGES The SQLJ interface is similar to other call-level interface languages and requires less code and is easier to write than JDBC. Because SQLJ allows only static SQL, it has the advantage of performing compile time statement checking rather than incurring run-time checking as JDBC. There is also stronger type checking with SQLJ than there is with JDBC. One disadvantage of SQLJ is its inability to include dynamic SQL in the native format. However, dynamic SQL can be included through the JDBC interface, which may coexist with SQLJ in the same Java environment. SQLJ is implemented in many relational databases, such as IBM's DB2 UDB, Oracle 8.1, and Sybase Adaptive Server Anywhere. SQLJ syntax is more concise than JDBC, as this example, written in SQLJ, shows: int varCustId; #sql {INSERT into Customer VALUES (:varCustId)}; Here is the same statement written in JDBC: int varCustId; Statement stmt = conn.prepareStatement (INSERT into Customer VALUES (?)); stmt.setInt(1,varCustId); stmt.execute (); stmt.close(); ---------------------------------------------- SQLJ 2.0 FEATURES The SQLJ 2.0 upgrade implements many new SQL3 features that deliver object benefits to relational databases. Some new database types included in version 2.0 are user-defined data types, object data types, and Java classes. Support for SQL3 data types such as BLOBs, CLOBs, arrays, references, and structured and distinct types are now included in version 2.0. SQL3 ADTs (abstract/user-defined data types) and metadata for user-defined types have added to the specification. Java classes can also be added as database types. Another great feature is persistent Java objects stored in databases. This feature allows us to write SQL Persistent Stored Modules (such as stored procedures and triggers) in Java rather than in proprietary languages such as PL/SQL or Transact/SQL. ---------------------------------------------- SQLJ ITERATORS In application programs that are written in languages other than Java, cursors can be used to return rows from a table to the program. These cursors are a type of pointer that allow an application program to point to a specific row in a table, which can be retrieved or fetched into a program. This concept of a cursor is implemented in SQLJ as a result set iterator, which is a Java object that can be used to retrieve the rows from a table. This iterator can be passed as a parameter to a method. The iterator is, in effect, a strongly typed cursor. An iterator declaration clause is created that specifies the Java data types declarations. These declarations represent the table's columns. The two types of result set iterators are positional and named. Positional iterators are created with columns of the result set iterator corresponding to columns of the result table, in left-to-right order. A named iterator is created with names for each of the iterator columns. These names must match the names of columns in the result table for the query. ---------------------------------------------- SQLJ-NAMED ITERATORS You can use named iterators to select rows from a result table using SQL statements other than fetch statements. When using this approach, specify names for each of the iterator columns. These names must match the names of columns in the table being queried. SQLJ uses the named iterator to create a named iterator class, and within this, an accessor method is created for each named column. These methods are used to retrieve the named column from the table. Unlike positioned iterators, named iterator columns do not need to be in the same order as the columns are in the table being accessed. Let's look at an example that defines a named iterator called byNameCust, with two columns named LastName of type String and CustomerId of type Int. Iterator byNameCust is a class that has accessor methods LastName() and CustomerId() that retrieve data from the Customer table. The code for this is #sql public iterator byNameCust(String LastName, Int CustomerId); Now, declare an object of the byNameCust class as follows: ByNameCust nameiter; Next, create variables for lastName and customerId and assign a SQL statement to an instance of the named iterator class as follows: String lastName; Int customerId; #sql nameiter={SELECT lastname, customerid FROM Customer}; The iterator object nameiter executes the SQL select statement and constructs an object that contains the results. As long as there are rows being returned to the program, perform a loop that calls the methods to return the values lastName and customerId to the variables that were declared above. This is shown here: while (nameiter.next()) {lastName = nameiter.LastName(); customerId = nameiter.CustomerId(); } Putting all of this code together looks like this: {#sql public iterator ByNameCust(String LastName, Int CustomerId); ByNameCust nameiter; String lastName; Int customerId; #sql nameiter={SELECT LASTNAME, CUSTOMERID FROM CUSTOMER}; while (nameiter.next()) {lastName = nameiter.LastName(); customerId = nameiter.CustomerId(); } } ---------------------------------------------- SQLJ ITERATORS FOR UPDATES AND DELETES--PART 1 OF 2 SQLJ-positioned updates and deletes differ from other iterators in that they need two Java source files. In the first file, the iterator is declared. The implements clause is used with sqlj.runtime.ForUpdate and the iterator is declared as Public. Shown here is an example of a declaration of iterator DelNameCust with string column CustName: #sql public iterator DelNameCust implements sqlj.runtime.ForUpdate with(updateColumns="CustName"); You then use the iterator in a different source file. It can also be used by more than one source file. In our next tip, we'll look at using this declared iterator in the second file. ---------------------------------------------- SQLJ ITERATORS FOR UPDATES AND DELETES--PART 2 OF 2 Once an Insert and Update iterator has been created in the first file, it can be used in subsequent files. As a reminder, the declaration of iterator DelnameCust is shown here: #sql public iterator DelNameCust implements sqlj.runtime.ForUpdate with(updateColumns="CustName"); To use this iterator in a subsequent file, perform the following steps: Import the class, declare an instance of the class, perform a select statement for a positioned update or delete, and execute the update or delete using the iterator. An example of this follows. Import the named iterator class DelNameCust, which was created by the iterator declaration in a separate file (shown above): import DelNameCust; Declare an instance of the DelNameCust class: DelNameCust delCustIter; Declare the string variable for column custname and perform a select statement for a positioned update or delete. The iterator object is assigned to delCustIter. String custname; #sql delCustIter = { SELECT CustName FROM Customer WHERE CustomerId = 1234567}; Position iterator delCustIter to the next row to be deleted and retrieve the result values into custname: while (delCustIter.next()) { custname = delCustIter.CustName(); Perform the delete where the cursor (iterator) is positioned: #sql { DELETE WHERE CURRENT OF :delCustIter }; } Putting it all together, the code should look like this: In filea.sqlj: #sql public iterator DelNameCust implements sqlj.runtime.ForUpdate with(updateColumns="CustName"); In fileb.sqlj: import DelNameCust; { DelNameCust delCustIter; String custname; #sql delCustIter = { SELECT CustName FROM Customer WHERE CustomerId = 1234567}; while (delCustIter.next()) { custname = delCustIter.CustName(); #sql { DELETE WHERE CURRENT OF :delCustIter }; } } ---------------------------------------------- SQLJ--ESTABLISHING A CONNECTION SQLJ statements are associated with a connection context. The context type identifies the database connection and schema. This allows the proper tables, views, synonyms, and privileges to be selected. In the following example, the database connection is chosen and then a SQL statement is executed using that database connection. #sql context Customer; Customer cust = newCustomer("jdbc:odbc:custdb.db"); int varCustomerId; #sql [cust] { insert into Customer values (:varCustomerId)}; ---------------------------------------------- SQLJ AND STORED PROCEDURES Calling stored procedures in SQLJ is simple and can be performed in two ways. The first is with a call construct, as shown here: #sql (call CustNewName(:CustId, ;CustName) }; The second way to execute a stored procedure is with the values construct: #sql custName = { values ( CustRetrieveName(:CustId)) }; ---------------------------------------------- DATABASE LOCKING AND ISOLATION LEVELS RDBMSs perform locking on database objects to provide a "guaranteed" level of data integrity when many users access the database at the same time. The database locking features spare the programmer from having to code this into their application programs. Although this locking is performed automatically by the DBMS, several installation and transaction options can be set to the database locking to be performed in a manner that is optimal for the program being run. For example, a standalone batch job running at night has different locking requirements than a multiuser online transaction or a background task that runs concurrently with online users. Also, read-only (DSS) users have different requirements than OLTP (online transaction processing) users who are updating the databases. Some things that we need to consider when trying to determine our locking levels are - lock type - locking granularity - lock escalation - isolation levels Remember that each lock held has overhead associated with it and that locking conflicts can severely slow down a system if locking has not been implemented properly. Our goal is to implement a locking approach that is not intrusive but delivers data integrity to the applications acting on the data. ---------------------------------------------- LOCK TYPES Common lock types found in today's popular RDBMSs are - Share (S) lock--Acquired when one user reads data and allows other users to acquire S and U locks on the same data. - Update (U) lock--Acquired when a Select...for update statement is executed. This signals that a program intends to update the given object in a future statement. It allows other S locks on the object but not other U or X locks. - Exclusive (X) lock--Acquired when an object has been updated through an insert, update, delete, or Lock in Exclusive Mode operation. This does not allow any other statement to be performed on the object by another user session until this lock has been released. These locks may be applied to any object, such as a row, block/page, table/index, partition, or tablespace. In general, update locks are applied at the row or block level, while share or exclusive locks may be applied at all three levels. A special type of lock can be applied at the table, partition, and tablespace level. This is an Intent Share (IS) and Intent Exclusive (IX) lock. These are applied when a row or block/page lock is incurred and state at the higher level that this activity is occurring. It also prevents subsequent Share (in the case of IX) or Exclusive locks on the table/index, partition, or tablespace that has had the IS or IX lock obtained on it. ---------------------------------------------- LOCK GRANULARITY RDBMSs provide many levels of lock granularity to support different application requirements. The term granularity in this context refers to the amount of data being locked, which is different than the type of lock that may be applied. Lock types can be Share, Update, or Exclusive. Different databases have different granulariy options, but in general the common options are: - Row level--Individual rows are locked. This provides the lowest level of granularity and allows for the greatest level of concurrency. It is the most popular option for multiuser OLTP (online transaction processing) applications. It also has the most overhead and should be used only where needed. - Block/page level--This hybrid approach locks entire blocks (in some cases, pages). This option has much less overhead than row-level locking but results in more locking conflicts. If there are few rows per block or if the chances of the same blocks being accessed at the same time is low, this is a viable option. Many popular relational databases provided only this option as their lowest level of granularity until recently. The major factor in forcing DBMSs to provide row-level locking came from SAP's requirement for it before an RDBMS could become SAP-certified. In our next tip, we will look at partition, table, and tablespace locks. ---------------------------------------------- LOCK GRANULARITY: TABLE, PARTITION AND TABLESPACE LOCKS After row-level and block/page-level locking, the next most common levels of lock granularity are: - Table level--This could also be referred to as an object- or segment-level lock, since it may be a lock for an entire table or index. It uses a small amount of locking overhead but provides for little concurrency, since a single user can lock the entire table. This may be useful in a DSS (Decision Support System, such as a data warehouse) if a table is locked in share mode by all users. If this lock level is used, it may need to be changed when the data needs to be updated or loaded--unless the data is loaded sequentially without any other concurrent users. This approach is not reasonable for a multiuser OLTP system. - Partition level--This is similar to a table-level lock, except that only an individual partition of a table (or index) is locked. This option may be a good approach for DSS systems that are read-only and where the data loads can be run in parallel by partition. If this is the case, the locking overhead would be low, and the level of concurrency would be adequate. This option is not a reasonable approach for a multiuser OLTP system. - Tablespace level--This is similar to table- and partition-level locks, except that all tables, indexes, and partitions in a tablespace are locked. This option is used by some database utilities and may be chosen in rare cases for DSS systems. The overhead of these locks is minimal, but the concurrency provided is also minimal. This choice is not a reasonable approach for a multiuser OLTP system.
---------------------------------------------- THE LOCKING HIERARCHY Table Locks and Row/Block/Page locks are used together. A lock of type S on a Row may be combined with a lock of type IS on a Table (or tablespace). In the same manner, a lock of type X at the Row or Block/Page level will be combined with a lock type of IX at the Table level. The IX and IS table(space) level locks are Intent Exclusive and Intent Share locks, respectively. This means that a program is attempting to obtain a lower level lock. The advantage of placing the Intent lock at the table(space) level is that it can be quickly checked by a competing program. If the other program wants to obtain an X (exclusive) lock at the Table or tablespace level and an IX or IS lock level (or S or X lock for that matter) already exists, the program will need to wait until the lock is released, or it will time out or try to obtain the lock again. This hierarchy of locking levels simplifies the locking process and delivers performance benefits at the same time. ---------------------------------------------- LOCK ESCALATION AND LOCK PROMOTION Some databases (DB2) provide the features of Lock Promotion and Lock Escalation. Lock Promotion occurs when the DBMS optimizer notices that a program's lock isolation and data access characteristics are such that it will promote a lock from a low-granularity lock such as Row/Page/Block to a high-level lock such as to a Table/Partition/Tablespace lock. Lock escalation is similar, except that it occurs at runtime as a program accesses rows. If the number of low-level locks (Row/Page/Block locks) increases at a high rate to the point where the number of concurrent locks exceeds a threshold, the low-level lock can be escalated dynamically to a higher level (Table/Partition/Tablespace) lock. This can improve the program's performance, but it may severely restrict concurrent access to the same table. The lock escalation rules are DBMS-dependent but usually related to three things: database initialization parameters, Create Table or Create Tablespace parameters, and user directives when a given program is compiled. It is important that you become familiar with your database's lock escalation features when designing high-performance transactions. ---------------------------------------------- FOUR LOCKING PROBLEMS There are four common locking problems that can be resolved by database locking options: - Phantom Reads--A situation where phantom rows are created is where one transaction performs a query such as a count(*) and is expecting a similar result the second time it runs the query. If another transaction inserts a row and commits the change in between this transaction's two queries, this transaction will see different results from the two queries. - NonRepeatable Reads--This problem occurs when one transaction (Trx X) processes a table, and as it processes a set of rows--say by performing a function such as average or sum--another transaction updates some rows that were already processed and some others before Trx X gets to them, and then the second transaction commits. In this case, Trx X operates on data that is inconsistent, since part of it was correct before the commit and the rest is correct after the commit. This can return incorrect results. - Dirty Reads--No, this is not what you think. It refers to the ability to read data that has been updated by another transaction that is not yet committed. The problem with this is that the data may be invalid and rolled back by the other transaction. Meanwhile, this transaction operates on that invalid data. - Lost Updates--A lost update can occur when one transaction updates data, and before it commits the value, another transaction updates the same data and commits the change. The first transaction then commits its change but when the data is re-queried, the change is not present--but the change from the second transaction can be seen. In our next tip, we will look at some locking isolation options that solve these problems. ---------------------------------------------- LOCK ISOLATION Databases can provide four isolation levels of locking. The isolation level determines the level at which a given session can coexist with another. - Serializable - Isolation Level 3--The highest form of locking, where a transaction is completely isolated from other transactions. There are no possible phantom rows, nonrepeatable reads, dirty reads, or lost updates with this option. This guarantees that within a single commit point, the rows that have been read or updated will be locked for the entire duration of the transaction. So, if the same query is run twice by the same transaction, it will always deliver the same results. - Repeatable Read - Isolation Level 2--This type of transaction can read the same data more than once. If another transaction changes a row (insert or update) and commits it, and the first transaction repeats the read, it may return different rows. Phantom rows are possible, but lost updates, dirty reads, and nonrepeatable reads cannot occur. * Read Committed - Isolation Level 1--If another transaction changes a row (update) and commits it, and the first transaction repeats the read, it will return the same rows. However, some may have different values. This is known as a nonrepeatable read. This prevents lost updates and dirty reads, but it does allow nonrepeatable reads and phantom rows. - Read Uncommitted - Isolation Level 0--This transaction can read data changes made by other transactions that have not been committed. This allows phantom rows, nonrepeatable reads, dirty reads, and lost updates. The four possible locking isolation levels may not all be provided by your DBMS, since most provide a subset of these. These can be set in transactions through Locking Hints, the Set Transaction isolation level option, or by other means, as provided by your DBMS provider. ---------------------------------------------- WHEN TO USE REPEATABLE READ LOCKING Type 2 Repeatable Read locking will allow a program to reread data without changes. If, however, a new row is inserted within the range of the data already read and the query is reissued, the new row will be picked up and the results of the second query will differ from the first. This type of locking can be restrictive to other updating transactions. However, if locks are held for a short duration and if new inserts are made outside of the range of queries (for example, if inserts are always performed using an ascending key that is higher than the query high-key), this can be a useful locking technique. It removes the burden of checking if rows have changed from the application programmer, but it has the added potential of increased locking problems. ---------------------------------------------- WHEN TO USE READ COMMITTED LOCKING Type 1 Read Committed locking ensures that data from other transactions must be committed before they can be seen by this transaction. However, once committed, the change will be picked up by a subsequent query. This type of locking provides a limited amount of overhead, but it does not allow reads to be repeatable. This type of locking is valuable when: - The database is large and transactions are performed randomly against different parts of the database. In other words, the chance of competing transactions accessing the same rows is small. - The DBMS provides versioning support with low overhead. - The programs do not repeat reads. If programs do perform repeated reads and the DBMS does not support Versioning, then programs will need to check that repeated reads return the same results. If they do not, they will need to have restart logic built in. ---------------------------------------------- WHEN TO USE READ UNCOMMITTED LOCKING Database locks are not honored using this approach. Any change--whether committed or uncommitted--performed by another transaction will be read using this locking strategy. This approach is not recommended for production systems. One place where this may be useful is where test queries need to be performed in a nonintrusive manner without affecting other systems. Even in this case, however, the person invoking the query cannot and should not rely on the data integrity of the rows returned. ---------------------------------------------- LOCKS VERSUS LATCHES You may have heard the term latch in the same context as a lock and wondered what the difference is. Latches are similar to locks except that rather than preventing different programs from changing the same data (row), they prevent different programs from changing the same memory areas (blocks). These are extremely quick, and latch waits should be almost nonexistent in a well-tuned database. If a latch attempt fails, the attempt will be retried many times until a specified threshold has been reached. This process is referred to as a spin lock. Once this threshold is reached, the process will sleep for a specified period of time, and the latch attempt will then be retried after the session starts up again. Interestingly enough, it was a spin lock problem that caused performance issues in Sybase's System 10 when it scaled large systems in large SMP servers. I am pleased to report that this problem was corrected with Sybase's release 11. ---------------------------------------------- OPTIMISTIC LOCKING STRATEGY Optimistic and pessimistic locking strategies are based on the profile of the system being built. In cases where there is little chance of two programs or sessions accessing and updating data, an optimistic locking strategy may be employed. Under this approach, the program will fetch a row without holding a lock. The subsequent update statement will simply check that the row being updated has not yet been changed. The check can be performed on a Timestamp'd column or on columns that were part of the original selection criteria that should not have changed. This approach is useful if the percentage of updated rows to rows fetched is small, so that we do not want to incur a high degree of update lock on rows that are only being fetched. Next time, we'll look at pessimistic locking. ---------------------------------------------- PESSIMISTIC LOCKING STRATEGY The term Pessimistic Locking is used in cases where application designers feel that there is a high probability that two concurrent programs will need to Fetch and Update the same rows. Under this "pessimism," the Fetch'd row will be Locked so that the subsequent Update statement can be performed with integrity. The concurrent program will be Serialized and will wait for the first Fetch, Update, and Commit/Rollback transactions to be performed before it can proceed. Of course, we need to ensure that excessive queuing does not occur under this approach. The Select...for Update statement can be performed to accomplish this. ---------------------------------------------- SELECT FOR UPDATE When a Cursor is being created to Fetch rows that will likely be updated in a subsequent statement, the For Update Of clause may be used. This will place a U (Update) lock on the row/columns accessed that allows other programs to read the data but prevents other U and X (Exclusive) locks on the same data. The subsequent Update will use the Where Current Of Cursor clause. When using Select...for Update combined with Update...Where Current Of, watch for the following things: - The For Update Of clause should return only columns being updated for performance reasons. - The For Update Of clause must include all columns to be Updated or the subsequent Update...Where Current Of statement will fail. - Watch for any statement that will cause a cursor to be nonupdatable. Examples of these are any statements such as order by, group by, distinct, or union that cause a Sort operation. Other examples include table joins and some built-in functions. Under these cases, these statements will not be valid. The combination of Select...For Update with Update...Where Current Of is powerful when used properly. ---------------------------------------------- WHERE CURRENT OF CURSOR When a row is Fetched from an Open Cursor, it can be Updated or Deleted using a Where Current Of Cursor statement. This statement is very efficient in that it can access a row directly without a Where clause. The Update/Delete...Where Current Of Cursor can be used when a single row is being changed. An example of a Delete statement using this is shown here: Delete Customer Where current of customer_cursor; It is important to note that the Current Of Cursor positioning may be lost if the original cursor contained a statement that invoked a Sort operation (such as order by, group by, or distinct) or a statement that is nonupdatable (such as a join operation). ---------------------------------------------- STAR SCHEMAS, FACT AND DIMENSION TABLES We have discussed one popular form of data modeling and physical database design called Normalization. In the next few tips, we will discuss some popular data designs and SQL tips for data warehousing. Star schemas are a physical database design option commonly used in data warehouses. This design is called a star schema because it has a large, central fact table that is related to many (approximately three to 10) dimension tables directly, so that the model is similar in appearance to a star. It's important with this design that the dimension tables are all related directly to the single fact table. Fact tables are transactional and are where the numerical information of your business is stored. These may be thought of as Events. Examples of fact tables are invoices, sales, postings, and shipments. These tables are generally characterized as being long and narrow. In other words, they have many rows (millions to billions) and few, narrow columns. Dimension tables are where the descriptions of the business and the fact tables are described. These may be thought of as reference tables. Examples of dimension tables are customer, employee, time, and product. These tables are characterized as being short and wide. In other words, they have relatively few rows (compared to fact tables), but many columns that are relatively wide. Star schemas are a very popular data warehousing and data mart implementation approach. ---------------------------------------------- STAR SCHEMA OPTIMIZATION The major DBMSs have recognized the popularity of star schemas and have implemented star schema optimization. This optimization recognizes that there is a central fact table and that all dimension tables are directly related to the fact table. Two common approaches are used to perform star queries. The first approach assumes that the dimension tables are considerably smaller than the fact table. Under this approach, the dimension tables are accessed and filtered through the Where clause of the query. The dimension tables are then joined through a cartesion product--where every qualifying dimension row is joined to every other qualifying dimension row--and the result of this is sorted into a temporary table. The resulting temporary table of qualifying dimension rows is subsequently joined to the fact table very quickly through extremely fast sequential processing techniques. The second common approach toward star-query optimization is where the optimizer pushes all dimension table access down to separate subqueries. The dimension rows are filtered using the criteria in the query Where clause, and the result sets from these subqueries are in effect pushed up and joined to the fact table. Other database optimization techniques are available and provide high star-query performance. It is important to note that the industry has recognized the star-schema design to the point where it has implemented technology that enables it further. ---------------------------------------------- THE SNOWFLAKE SCHEMA The snowflake schema is similar to the star schema in that there is a central fact table that is surrounded by dimension tables. However, in the star schema, all dimension tables are directly related to the fact table--so, the data model resembles a star. In the snowflake schema, some dimension tables are related directly to the fact table, while others are only indirectly related to the fact table through the directly related dimensions. In other words, this model resembles a snowflake. The advantage of this design is that a Normalized model can be made to represent it with only small changes. It allows for more simple database updating and a higher degree of data integrity. Major disadvantages with this design is that it is a hybrid approach, which has neither the complete clarity of a Normalized model nor the simplicity and performance benefits of a star schema. Although a few databases provide special optimization for snowflake schemas, the optimization and performance potential of star schemas make it a more popular modeling choice for data marts given today's technology. ---------------------------------------------- THE HISTORY OF SQL--THE 1970S SQL was developed by Dr. E. F. Codd while he was with IBM, and his paper "A Relational Model of Data for Large Shared Data Banks" was published in 1970. His model become the definitive relational model that modern RDBMSs (and for a large part, ORDBMSs) are based on. IBM undertook a research project where the first relational database, called System R, was developed and along with this, the database access language Structured English Query Language, called Sequel. A revised version of Sequel--Sequel/2--was developed by 1977 and this was later named SQL. Though never commercialized, System R proved that relational databases were a viable product. It was implemented internally and at some customer sites as part of a joint research project. IBM developed the first relational database, but it was another company, Relational Software Inc., that developed the first commercial one. This included an implementation of SQL and was introduced in 1979. Relational Software later became Oracle Corporation. ---------------------------------------------- THE HISTORY OF SQL--THE 1980S The 1980s started with a few commercial implementations of relational databases and with major debates occurring between supporters of hierarchical and network modeled databases stating that relational databases could never perform to the level needed by commercial users. While this debate raged, many companies were busy developing commercial relational databases. Some of the major players included Oracle (Relational Software Inc.), Relational Technology Inc. (which became Ingres) in 1981, IBM with SQL/DS in 1982 and DB2 in 1983, DG/SQL (Data General) in 1984, and Sybase in 1986. Other popular names of the time included the Ask Group, Briton-Lee, Sharebase, Teradata, Supra, and Gupta. Until later in the decade, RDBMSs generally were used for read-only access. Data was often moved from operational databases to the RDBMSs--though the term data warehouse was not yet introduced--and these RDBMSs ran on large mainframe or minicomputers. In the late 1980s, this perception changed with IBM's version 1.3 of DB2. With this release, companies began to use RDBMSs for mid-sized to large OLTP (operational) systems. IBM and DB2 provided RDBMSs with much-needed credibility. Helping this along, the SQL language became an ANSI standard in 1987. The client/server architecture was introduced by Sybase with Microsoft, and Ashton-Tate developed SQL-Server. This architecture's best days were yet to come. ---------------------------------------------- THE HISTORY OF SQL--THE 1990'S The 1990s began with RDBMSs being the accepted popular databases of choice. They showed phenomenal growth and helped the computer industry develop more of a data-centered focus. The face of the industry, however, was changing. Sybase introduced the client-server architecture and many other companies followed very quickly, most notably Oracle, Sybase, Microsoft (together with Sybase), and Informix. One obvious absence in the race for client-server marketshare was IBM, which was very slow to respond to this changing and growing paradigm. That decision hurt IBM for a time, but the company is rebounding. Sybase was also responsible for the introduction of stored procedures and triggers in a commercial database. Throughout the 1990s, developers and analysts argued over the future of RDBMSs with SQL and objects, as well as the place of RDBMSs in the world of data warehousing, data marts, and multidimensional databases. The first of these is sorting itself out with the SQL99 standard and introduction of ORDBMSs. Postgres, whose founder, Michael Stonebraker, had founded Ingres in the early 1980s, developed the first of these databases. Stonebraker later founded Illustra, which was purchased by Informix and introduced ORDBMSs to the mainstream. Oracle and IBM now have commercial offerings that compete with Informix. ORDBMSs have proven themselves as data warehousing and data mart products in the last few years. This marketplace is still growing rapidly--although the revenue growth has not rivaled that of the early part of the decade. The introduction of VLDBs and appliance-sized databases (VSDBs), combined with replication features, Internet needs, the Java programming language, and ORDBMS capabilities, are helping to make SQL-based databases ubiquitous. ---------------------------------------------- LEVELS OF SQL STANDARDS SQL STANDARDS The American National Standards Institute made SQL an official standard in 1986 with ANSI X3.135-1986, Database Language SQL. This became known as SQL-86 and was adopted internationally through the International Organization for Standards in 1987. This was updated by ANSI X3.168-1989, Database Language--Embedded SQL, and ANSI X3.135-1989, Database Language--SQL with Integrity Enhancement, and became known as SQL-89. In 1991, SQL-92 (also known as SQL-2) was adopted by the U.S. government's Federal Information Processing Standard organization. SQL-92 defines three levels of compliance: Entry, Intermediate, and Full. Many popular commercial RDBMSs conform to Entry-level compliance, and have many features that conform to Intermediate or Full-level compliance. So, when your favorite DBMS vendor states that it is ANSI-SQL compliant, ask it what it means by this. In 1999, a major enhancement to SQL--the SQL3 standard (aka SQL-99)--was introduced. More on this in our next tip. ---------------------------------------------- THE SQL3/SQL99 STANDARD In 1999, SQL99--also known as SQL3--was adopted and contains the following eight points: - The SQL/Framework--SQL basic concepts and general requirements. - SQL/Foundation--The syntax and SQL operations that are the basis for the language. - SQL/Call Level Interface (CLI--An API for SQL. This is similar to ODBC. - SQL/Persistent Stored Modules (PSM)--Defines the rules for developing SQL routines, modules, and functions such as those used by stored procedures and triggers. This is implemented in many major RDBMSs through proprietary, nonportable languages, but for the first time we have a standard for writing procedural code that is transportable across databases. - SQL/Host Language Bindings--Define ways to code embedded SQL in standard programming languages. This simplifies the approach used by CLIs and provides performance enhancements. - SQL Transactions--Transactional support for RDBMSs. - SQL Temporal objects--Deal with Time-based data. - SQL Objects--The new Object-Relational features, which represent the largest and most important enhancements to this new standard. SQL3 has two levels of conformance: core SQL support and enhanced SQL support. ---------------------------------------------- OBJECT-RELATIONAL DATABASES These new databases combine the features of relational databases with the best features of object databases. Some of the critical features of relational databases that object databases had difficulty implementing include high-volume transaction processing for OLTP and DSS, strong data concurrency, advanced storage management, recovery and security management, high performance, strong application support, ANSI-SQL92 compliance, data sharing support, persistence, integrity, and logical and physical independence. Adding the concept of objects to this increases the capability of relational databases to include any type of data object. The object model is included with the relational model, but made persistent in a relational database. SQL needed to be enhanced greatly to name an object and use its methods. This allows many new data types to be included in ORDBMSs, such as audio, video, text (large documents), and image and user-defined (abstract) data types. We will cover more on these new features in upcoming tips. ---------------------------------------------- SQL3 NEW DATA TYPES With SQL3, tables can store standard data types as before, but the SQL3 standard has added to this capability. Tables may be implemented as irregular, denormalized structures. For example, a single column may be implemented with rows, objects, or abstract data types (ADTs) in it. ADTs may be the most significant change seen in SQL3. These have two parts: an attribute specification and an associated function specification. ADT behavior is encapsulated, since all of its behavior is defined within the context of the ADT. Other new data types included in SQL3 are enumeration, Boolean, character large object, binary large object, distinct types, row types, set types, multiset types, and list types. ---------------------------------------------- SQL3 ADTS WITH FUNCTIONS, PROCEDURES, AND MODULES Abstract data types (ADTs) have been enhanced through the implementation of functions, procedures, and modules in SQL3. An ADT always has a Constructor routine defined to it, which is the logic needed to create a new instance (such as a new row) of the ADT. Each ADT program is called a method, and these methods can be implemented using procedures. A given procedure can have input parameters passed to it, output parameters returned from the procedure, and in-out parameters that are used to both receive and return values from the procedure. Functions may also be written, which can receive a single value and return a single result, as do functions in most common programming languages. ---------------------------------------------- PERSISTENT STORED MODULES SQL3 adds procedural syntax to the SQL standard. You may have seen this already through proprietary implementations such as Transact-SQL and PL/SQL. SQL3 has become a full-functioning programming language by adding the following constructs: Begin...End; While; If...then..else; If..ELSEIF...ENDIF; Repeat; Flow; Loop... End Loop; Case...When...When...Else...End Case; For...Do...End For Persistent stored modules (PSMs) can contain functions and procedures, as well as shared declarations that can be used for cursors, variables, and tables. These modules can be implemented as persistent objects in the database. These PSMs can be called by any SQL statement, function, procedure, or module. Exceptions can be handled in the PSMs, and they have a SQLSTATE (similar to SQLCODE's) associated with them. In the next tip, we'll look at the impact of SQLSTATE and SQLCODE on the new SQL standard. ---------------------------------------------- ERROR MESSAGES: SQLSTATE AND SQLCODE Popular relational DBMSs support SQLCODE as well as SQLSTATE. Most applications in production use SQLCODE checking, which has been discontinued with SQL3 and replaced by SQLSTATE. SQLSTATE defines a five-character code: The first two characters are the Class or Category of the message, and the next three are the Subclass. These messages are consistent across databases and will provide more portability of error checking. All exceptions should therefore be explicitly defined and have a SQLSTATE associated with them. Any new applications that are being coded with a database that supports SQLSTATE should use this standard, rather than those of SQLCODE that have been previously used. ---------------------------------------------- THE TRIM FUNCTION Sometimes you might like to remove leading or trailing characters from a string. This can be performed using the TRIM function, which can take two forms: - TRIM ('character string') Removes leading and trailing blanks. An example of this is: TRIM (' ABCDE ') returns 'ABCDE' - TRIM (operation 'characters' FROM 'character string') In this situation, the operation may be BOTH, LEADING, or TRAILING. An example of this is: TRIM (LEADING '*' FROM '**ABCDE****') returns 'ABCDE****' Different RDBMSs may have their own proprietary implementations, such as RTRIM and LTRIM for Right Trim and Left Trim. ---------------------------------------------- MICROSOFT SQL SERVER BUG Microsoft has discovered a bug in its Microsoft SQL Server OLAP Services version 7.0 that may cause you problems. When a cube or dimension is processed using the DTS Addin, and the processing fails for some reason, the error message that displays is incorrect. An operating system error message displays instead of the Decision Support Object (DSO) error message. For example, if the program fails with the 206 error "Process operation failed," the following operating system 206 error displays instead: "The filename or extension is too long." Similarly, the 207 "Internal error" displays as "The ring 2 stack is in use." Microsoft has confirmed this problem in SQL Server OLAP Services version 7.0 and is working on a fix. ---------------------------------------------- SQL3 FUNCTIONS AND ABSTRACT DATA TYPES SQL3 functions may be associated with abstract data types (ADTs) and can compute single-value items. These can have parameters passed to them, but this is optional. Under SQL3, these functions may be completely written in SQL and stored as persistent stored modules (PSMs). Other options for creating functions include encapsulating them in an ADT, writing them in external host languages or as destructors (remember that constructors can create an ADT instance and destructors perform the opposite function), and creating them as actors on an ADT (Select, Update, Create). A SQL3 function may therefore be created and used in almost every conceivable way. ---------------------------------------------- TABLE EXPRESSIONS One feature relatively new to SQL is the concept of table expressions or inline views. Consider this join: select c.customer_name from customer c, customer_address a where c.customer_id = a.customer_id and a.postal_code = 'A1A1A1'; This could be rewritten by coding an inline view as follows: select c.customer_name from customer c, (select * from customer_address where postal_code = 'A1A1A1') as a where c.customer_id = a.customer_id; Notice that the inline view uses a correlation name of a, which represents a result set of rows from customer_address where postal_code = 'A1A1A1'; This may not seem like a large savings, but the concept of the inline view can prevent creating database views and simplify coding logic in cases where complex processing is performed on multiple tables. ---------------------------------------------- SQL3 COMMON TABLE EXPRESSIONS SQL3 common table expressions are similar to table expressions in that the expression is defined when needed and not stored as a persistent object in the database. If the common table expression is to be reused in a program, the advantage is that the expression does not need to be reevaluated each time it is used. The expression is given a name in the program (which can be written completely in SQL) and the name is referenced in the program, as shown here: WITH Customer_Addresses AS (select * from customer_address where postal_code = 'A1A1A1') Select customer_id, address_line1, address_line2, city from Selected_Addresses; Notice that the common table expression is defined by WITH and subsequently used in another query. With this approach, it is not required to drop the temporary table or view at the end of the program. ---------------------------------------------- THE CAST FUNCTION The Cast function can be used in SQL to convert one data type to another. This action takes the form of CAST (expression AS (datatype or domain)) This function can be helpful in converting the precision of a numeric data type or converting a character data type to a date format or a number. An example using CAST is shown here: Select CAST (SUM (total_invoice_value) as decimal(12,2)) From Table_name; ---------------------------------------------- THE SUBSTRING FUNCTION It's possible to extract a portion of a string of characters using the SQL Substring function. Here is the form: Substring (string FROM starting_character FOR number_of_characters) An example of this is Substring {'ABCDE' FROM 2 FOR 2) would return 'BC'. As always, this syntax varies slightly from database to database. ---------------------------------------------- POSITION FUNCTION The Position function may be used to return the starting position of a string within another string. This information can then be used to extract data from that substring. It can also be used to determine the existence of the string. An example of Position being used with binary data is Select Position(B'001' IN B'11110011') FROM Table; This would return the value 5. If the string were not found, Position returns a 0. ---------------------------------------------- CONCATENATING VALUES Two strings can be joined together using the concatenation operator ||. The strings must both be of the same data type, and this can be accomplished by using the CAST command. Strings may be concatenated as follows: Select address_line_1 || ', ' || city >From Address; This, for example, would return the following: 123 Cherry Lane, Maintown 456 First Street, Newtown ---------------------------------------------- TRANSLATE FUNCTION One character string may be converted character by character to a corresponding equivalent character string using a predefined translation mapping (which is used for data conversion reasons). An example of this statement is Translate('ABCDE' USING new.translation_rtn) The translation routine (new.translation_rtn) can be created using the Create Translation statement, which defines a user-defined translation. This translates the source and target character sets as well as the source of the translation. The format for this is Create Translation <translation_name> For source <character set name> To target <character set name> >From <translation source> The translation source is a predefined routine. ---------------------------------------------- SECURITY ROLES Database security can be difficult to manage and very granular in nature. Consider that each user may need to have security privileges granted to each table that he or she accesses, and that this will include select, insert, update, and delete authority. Users will also need to have security access to each object that they need to access, as well as whatever overall system privileges they may need. After multiplying these privileges by all users, there may be thousands of security grants that have been issued. On top of this, dropping and re-creating an object will cascade to remove any security that was previously granted to that object, creating a huge maintenance complication. How do we handle this potential security maintenance nightmare? SQL3 has introduced the concept of security roles, which is a way of grouping security and attaching users to that group. The role is created, security is granted to the role, and the role is then granted to a user. This allows easier administration of security grants and users. An example of the create role statement is Create Role role_name with admin current_user; This statement creates a role and gives it admin privileges for the current user. Database security can be difficult to manage and very granular in nature. Consider that each user may need to have security privileges granted to each table that he or she accesses, and that this will include select, insert, update, and delete authority. Users will also need to have security access to each object that they need to access, as well as whatever overall system privileges they may need. After multiplying these privileges by all users, there may be thousands of security grants that have been issued. On top of this, dropping and re-creating an object will cascade to remove any security that was previously granted to that object, creating a huge maintenance complication. How do we handle this potential security maintenance nightmare? SQL3 has introduced the concept of security roles, which is a way of grouping security and attaching users to that group. The role is created, security is granted to the role, and the role is then granted to a user. This allows easier administration of security grants and users. An example of the create role statement is Create Role role_name with admin current_user; This statement creates a role and gives it admin privileges for the current user. Database security can be difficult to manage and very granular in nature. Consider that each user may need to have security privileges granted to each table that he or she accesses, and that this will include select, insert, update, and delete authority. Users will also need to have security access to each object that they need to access, as well as whatever overall system privileges they may need. After multiplying these privileges by all users, there may be thousands of security grants that have been issued. On top of this, dropping and re-creating an object will cascade to remove any security that was previously granted to that object, creating a huge maintenance complication. How do we handle this potential security maintenance nightmare? SQL3 has introduced the concept of security roles, which is a way of grouping security and attaching users to that group. The role is created, security is granted to the role, and the role is then granted to a user. This allows easier administration of security grants and users. An example of the create role statement is Create Role role_name with admin current_user; This statement creates a role and gives it admin privileges for the current user. ---------------------------------------------- UPPERCASE AND LOWERCASE TRANSLATION When comparing values, you may be concerned that the data in two strings is not of the same case. In other words, one string may have some lowercase values that are capped in the other string, which would cause the strings to be unequal. For instance, A is not equal to a. The way to correct this is to use the Upper or Lower functions to ensure that the two strings are the same. An example of this is to convert a column that may be stored in cap and cap/lowercase to compare it to a string: Select last_name >From customer Where upper(last_name) = 'JONES'; You will need to be careful about index usage in this case, since an index may not be used for the function as demonstrated here--but this will return all Customers named 'JONES'. ---------------------------------------------- DATE AND TIME DATATYPES IN SQL SQL3 supports three date and time data types--date, time, and timestamp. The date data type stores four-digit years in the form of YYYY, a two-digit month, and a two-digit day. The time data type includes hour, which ranges from 00 to 23; two-digit minutes from 00 to 59; and seconds in the form of 00.000. Timestamp combines both date and time in the same data type. Most databases implement these differently from this as well as differently from each other. However, they all have useful functions to help handle date arithmetic. We will look at this in our next tip. ---------------------------------------------- DATE AND TIME ARITHMETIC Intervals may be used to add and subtract date/time columns. The interval may be in the form of Year, Month, Day, Year to Month, Month to Day, Hour to Minute, and so on. Note that when specifying multiple intervals such as Hour to Minute, the first must always be the larger time interval than the second one. The example Interval '01:24' HOUR TO MINUTE is the same as saying "1 hour 24 minutes." An example of date arithmetic is shown here: Select current_date + Interval '31' DAY >From Customer Where City = 'New York'; This example yields a new date column and does not require the programmer to perform date arithmetic. Here is another example: Select Timestamp_column + INTERVAL '8' HOUR >From table_name; This example returns the Timestamp_column with 8 hours added to it. If the date changes, it will be handled by date/time/timestamp arithmetic. Select Date_column + INTERVAL '2' MONTH from table_name; adds two months to the value in date_column. Select (DATE '1999-01-31' - DATE '1999-02-28') YEAR From... returns INTERVAL '00' YEAR. ---------------------------------------------- DATE, TIME, AND TIMESTAMP EXPRESSIONS AND RESULTS When you are trying to determine the results of date/time/timestamp arithmetic, this may help unravel some of the mystery. DATE/TIME/TIMESTAMP DATETYPE EXPRESSION RETURNED RESULT date/time/timestamp +- interval = date/time/timestamp date/time/timestamp - date/time/timestamp = interval interval +- interval = interval interval */ number = interval Therefore, we see the following examples: DATE '1999-01-30' + INTERVAL '1' DAY yields '1999-01-31' DATE '1999-01-30 - DATE '1999-01-28' DAY yields INTERVAL '02' DAY ---------------------------------------------- DATE AND TIME BUILT-IN FUNCTIONS The following datetime functions are supported in SQL3 and do not necessarily behave in the same manner as they did in SQL-92. All databases implement these functions differently, and it's important for you to understand the implementation in your DBMS as well as the future direction of the vendor. If they have promised to be SQL3-compliant, then this information will be important to you. - Set Time Zone: Allows for setting a time zone in this SQL session. If this is not set, the default time zone will be the one set in the database initialization parameters. This setting of time zones is not in itself a SQL function, but it is an important statement used to distinguish between some of the functions listed here. - Current_Date: Returns Current Date--that is, it returns today's date. - Current_Time: Returns the time to fractions of a second, with a time zone offset to account for potential time zone differences in the local session. This returns a value of Time With Time Zone. - Current Timestamp: Returns the timestamp (date and time) to fractions of a second, with a time zone offset in the same manner as Current Time. This returns Timestamp With Time Zone. - Localtime: Returns the time "here and now"--that is, it is the time as set in the local session without requiring a time zone offset. - Localtimestamp: As with Localtime, this returns the timestamp, taking into account the local session's time zone without requiring the time zone offset. In SQL-92, Current Time and Current Timestamp operated in the same manner as LocalTime and Localtimestamp operate in SQL3, respectively. ---------------------------------------------- WORKING WITH OVERLAPPING DATES If you've ever had to work with two sets of dates to determine whether they overlapped, you've probably had to think about this a little before coming up with a relatively simple statement, as shown in our example. This example determines whether two dates (start_date1 and end_date1 from TableA) overlap with another pair of dates (start_date2 and end_date2 from TableB). All that is required to determine this is Select a.* >From Tablea A, Tableb B Where B.start_date2 <= A.end_date1 and B.end_date2 >= A.start_date1; This code, though simple, can be confusing and may not perform well. ANSI SQL-92 and SQL3 have given us an Overlaps function, which does this for us. It determines whether the date/time data types overlap, and it also works with intervals and returns a True or False result. An example of the Overlaps function is Select table_id >From Table where (start_date, end_date) Overlaps (Date '1999-06-30',INTERVAL '10' MONTH); ---------------------------------------------- DEFAULT CLAUSE WHEN CREATING AND ALTERING TABLES When creating or altering a table, columns may be given default values so that if no value is provided by an insert statement, the default will be used. The following are valid defaults as defined in SQL3: literal string; current_date, current_time, current_timestamp, localtime, localtimestamp; user, current_user, system_user, session_user; current_path; array; null. Note that these are set to literals or pseudo-columns, and that table lookups cannot be performed to set the default value. The only way to accomplish that is through triggers. ---------------------------------------------- DROP COLUMN STATEMENT A worthwhile statement, the drop column statement may give you the flexibility to drop a column without needing to unload data, drop a table, re-create the table (and all dependent objects), and reload the data. The Alter Table... Drop Column column_name command will alter the table's definition and remove the column if no dependent objects (views, triggers, PSMs, constraints, or assertions) are on it. The drop may be performed and dependent objects removed if the Alter Table... Drop Column column_name CASCADE;' clause is used. ---------------------------------------------- UPDATEABLE VIEWS Contrary to some common thinking, database views can be updated, inserted to, and deleted from. However, the rules for changing data through views need to be fully understood. These rules include: - The updating user or program must have Insert, Update, or Delete authority on the underlying view. - The userid used to create the view (that is, the view creator or owner) must have the required Insert, Update, or Delete authority on the underlying table. - The view must access a single table in the From clause. - Only column names can be listed in the Create View statement select list (literals and functions are not allowed). - No grouping can be performed, so the view cannot include Group By, Having, or Distinct clauses. - On an Insert statement, any columns not explicitly updated must be defined as Not Null or they must have predefined default values. - The view being updated must be a subset of a single table's rows and columns. - If a view is created on another view, the above rules must be followed through the entire chain of views to the table.
---------------------------------------------- UPDATING VIEWS WITH CHECK OPTION When a view is created, rows in the underlying table may be eliminated through the Create View Where clause, as in Create View good_customers as Select * from customers Where total_sales > 1000000; In the case of the above view, any customers with total_sales <= 1000000 will not be selected. However, it may surprise you to learn that a row can be updated to contain a value that is not in the view range. For example, a row in the good_customers view with total_sales > 1000000 can be updated to contain a value <= 1000000. In other words, the following statement is valid: Update good_customers set total_sales = 900000; The irony of this is that these rows can never be retrieved again through that view. To prohibit this capability, we can use With Check Option to only allow changes to be made using the same rules that allow rows to be subsequently selected. That being the case, the following statement will make the previously shown Update statement invalid: Create View good_customers as Select * from customers Where total_sales > 1000000 With Check Option; ---------------------------------------------- USING THE INSTEAD OF TRIGGERS IN ORACLE Oracle has a proprietary feature that allows views that violate normal update rules for views with their implementation of Instead Of triggers. These triggers take the Insert, Update, or Delete statement and rewrite it so that each table in the view is updated individually. If for example, the following view is created... Create View View1 as Select a.*, b.* >From Table1 a, Table2 b Where a.id = b.id; ...the following SQL Insert statement is invalid because it attempts to insert into a joined view: Insert into View1 Values (:variable_1, :variable_2, ... :variable_n); However, this rule can be bypassed and the insert successfully completed if the following Instead Of trigger is created on View1: Create Trigger testtrigger Instead Of Insert on View1 For Each Row Begin Insert into Table 1 Values (:new.value_list...); Insert into Table2 Values (:new.value_list...); End This is a proprietary implementation and cannot be transported in this manner to other databases. ---------------------------------------------- HOW DO RDBMSS PROCESS VIEWS Views are processed in one of two ways: They are processed as an inline view in a similar fashion to the way a macro may be processed by a compiler, or the view may be transformed or materialized. Today, we will talk about the first type--the macro or inline view. This is a very straightforward view that may be translated before runtime by directly translating the Create View statement into a corresponding Table statement. For example, you might use a Create View statement like the following: Create View View1 as Select column1, column2 >From Table1 Where Name between 'A ' and 'CZZ' ; In this case, a SQL statement on this view may look like Select column1 From View1 Where Name = 'ABC'; Now, this can be transformed in a simple manner to the following statement as the statement is being prepared: Select column1 From Table1 Where Name = 'ABC'; Unfortunately, all statements against views are not this simple, and the view processing can become rather complex. But we'll discuss that in our next tip. ---------------------------------------------- MATERIALIZED VIEWS In our previous tip, we discussed simple view transformation by the RDBMS. Today we'll look at more complex view processing where the view must be "materialized" into a temporary table at runtime. Some view processing cannot be performed at runtime because the view definition requires the actual data be processed before the result set can be queried by the Select statement. An example of this is a view that uses a Group By function, as in Create View View1 (column1, column2) as Select column1, sum(column2) >From Table1 Group By column1; The following Select statement may then be issued on the view: Select * From View1 Where column2 > 50000; To process this, the database must sort (to group) the underlying table, then it will create a temporary internal table that is transparent to the end user and will deliver the final result-set from the materialized temporary table. Once complete, the temporary table goes away. This clearly requires more overhead processing since a temporary table needs to be created, inserted into, and subsequently selected from. Good DBMSs, however, have quick methods for dealing with temporary tables. This view materialization is also only used when absolutely required to process a view's rows. ---------------------------------------------- AND VOTTABYTE IT IS Looking for a quick way to impress your friends and show how cool you are? Listed below are some terms that can be used to define storage space requirements. You're probably familiar with TeraBytes and may even have heard of PetaBytes (no, a PetaByte is not illegal), but what about the other terms? Here they are: Storage Space Terms Kilobyte (K) = thousand = 10**3 = 1,000 Megabyte (M) = million = 10**6 Gigabyte (G) = billion = 10**9 TeraByte (T) = Trillion = 10** 12 PetaByte (P) = Quadrillion = 10**15 ExaByte (E) = Quintillion = 10**18 ZettaByte (Z) = Sextillion = 10**21 YottaByte (Y) = Septillion = 10**24 (** indicates exponent, or "to the power of") So, whatever happened to my favorites: Zillion and Gazillion? ---------------------------------------------- ASSERTIONS An assertion is similar to a check constraint in that it restricts the set of valid values that may be entered into a table or column. The main difference between an assertion and a constraint is that the assertion may be written on a single table or on multiple tables. Any time that a check constraint must be written using SQL, you should consider using an assertion. An assertion is created with the Create Assertion statement, as in Create Assertion new_assertion Check (Exists(Select sum(invoice_amount) from tableX) > 10); ---------------------------------------------- BOYCE-CODD NORMAL FORM You've heard of Data Normalization and 3rd Normal Form (3NF), but what is Boyce-Codd Normal form (BCNF)? This is a subtle variation and enhancement to 3NF--and it's a good check to perform on a table that you thought was normalized but perhaps just doesn't look quite right. By the way, a table is not in BCNF when you have a concatenated "superkey" containing columns that could be removed and still provide uniqueness. If this is the case, the table may need to be restructured so that the Key (superkey above) contains only those columns needed to provide uniqueness. This may involve splitting up the table. Here's an example: A student basketball player with a grade point average (GPA) of 4.0 who averages 20 points per game (PPG) gets one week of vacation as an incentive. Another with a 3.5 GPA and 20 PPG gets three days off. A table with relationships is shown below. Student Basketball Player GPA PPG Reward 4.0 20 1 week off 3.5 20 3 days off 3.0 15 2 days off 2.5 15 2 days off The key for the above could be a superkey of GPA and PPG. The relationships, however, are not exactly correct. To add a reward, we need both GPA and PPG. We cannot add GPA with a reward without PPG, and we cannot add PPG and reward without GPA. We can create two tables out of this: GPA_Reward GPA REWARD 4.0 1 week off 3.5 3 days off 3.0 2 days off 2.5 2 days off Reward_PPG REWARD PPG 1 week off 20 3 days off 20 2 days off 15 The above allows us to add "GPA and reward" or "reward and PPG". It also allows us to remove any of these entries without losing any information. If we join the tables together by the Reward columns, we get the same results as in the first diagram above. The initial table may still need to be created to resolve the many-to-many relationship between Reward_PPg and GPA_Reward. ---------------------------------------------- CLUSTERS A SQL Cluster is a little-known feature that includes a group of catalogs available to a SQL session at a point in time. It is a concept that can be thought of as a container or logical grouping of all the data you may want to access through a given server. The Cluster consists of zero or more Database catalogs. If the database object hierarchy were analyzed, Clusters come at the top of the list. The hierarchy contains: 1) Cluster 2) Database Catalog 3) Schema 4) Domains, Tables, Views 5) Columns The Cluster name implicitly qualifies the names of the objects that belong to it and cannot be explicitly stated. Three statements relate to a Cluster: Connect, Set Connection, and Disconnect. Your database may not contain this implementation of Clusters--and that isn't important--but it is a SQL-92 and SQL3 (SQL99) standard. ---------------------------------------------- COALESCE FUNCTION Coalesce is a SQL function that scans a list from left to right and returns the first not-null value. The best way to explain this is with an example. Consider the following SQL statement: Select Coalesce(last_name, 'Not Known') From Customers; This statement will return column last_name if it has a non-null value; otherwise, it will return 'Not Known'. This statement can contain more than the two values shown in the list. Coalesce is useful in situations where nulls cannot be returned by a query. This function is similar to Oracle's null-value-logic (nvl). ---------------------------------------------- COLUMN CHECK CONSTRAINTS Column check constraints can be set to limit the values to which a column can be set. The column's valid values can be restricted by the constraints to which the column is set, or to which a domain used by that column is set. These provide a simple way to validate data without affecting your application code. Here's an example of column constraints: Create Table new_table (new_column char(1), constraint new_constraint check(new_column in ('A','B','C')); not deferrable; A check constraint can only be set to values that may be determined at creation and compile time. In other words, the value of the check constraint cannot change at runtime. This prohibits the use of SQL calls to other tables or pseudo columns (niladic functions), such as current_date or current_user, which can change at runtime. ---------------------------------------------- COLUMN DEFAULTS Tables may be created or altered to allow default values to be set for columns. These default values will be used to update columns in cases where a new row is being created and a value is not set in the Insert statement or Database Load job. Here's an example of a default value being set for a column: Create Table new_table (new_column char(2) Default 'ON'); The ability to set a default can come in handy, but be careful to only do so in cases where the column can be safely set to a specific value and in cases where it is not explicitly set by an insert. Defaults can be set to literal values or to pseudo-columns (niladic functions) such as current_time, current_user, or user. However, a default cannot be set to a value in another table. In other words, the default cannot be a Select statement that returns a value. (To do this, you could create a trigger instead.) A default can be created on the column or on a domain that the column is based on. We will have more on domains in our next tip. ---------------------------------------------- CONSTRAINT DEFERRABILITY A constraint may be created as either Deferrable or Non-Deferrable. If the constraint is created as Deferrable, it will be checked at the end of every transaction. In other words, it will be checked only after all SQL statements have been completed in a Begin Transaction...Commit Transaction (or Rollback) block. The Not Deferrable constraint, on the other hand, will be checked immediately after the execution of every SQL statement. An example of a Not Deferrable constraint is Create Table new_table (new_column char(1), constraint new_constraint check(new_column in ('A','B','C')) not deferrable; It's recommended you use Not Deferrable constraints since you will be warned about constraint violations immediately after execution and can do something to correct the situation. If Deferrable constraints are used and a constraint is violated, the entire transaction will be rolled back and troubleshooting becomes more difficult. ---------------------------------------------- DISPLAYING INTERVALS It's possible to display intervals in SQL along with the results from the intervals. Take for example a situation where you want to find the number of customers who purchase a number of different items in intervals (or ranges) of 100 items. One approach is to hardcode ranges and use a Union statement to deal with each range interval. An example of this is Select '0 to 100' "Interval", count(*) "Customers" >From Customers Where items Between 0 and 100 UNION Select '101 to 200' "Interval", count(*) "Customers" >From Customers Where items Between 101 AND 200 UNION Select '201 to 300' "Interval", count(*) "Customers" >From Customers Where items Between 201 AND 300 UNION Select '301 and Over' "Interval", count(*) "Customers" >From Customers Where items > 301; The above query will deliver a result similar to that shown below: Interval Customers -------- --------- 1 to 100 22567 101 to 200 10952 201 to 300 6390 301 and over 710 The above query works but is limiting because we must hardcode each interval range and must leave an open-ended value to complete the query. There is, however, an alternative to this that is more flexible--and we will look at it in our next tip. ---------------------------------------------- DISPLAYING INTERVALS FLEXIBLY In the previous tip, we looked at one approach for performing queries on intervals. Today we will use the same example but will provide a more flexible query--and a more accurate result. Once again, consider the situation where you want to find the number of customers who purchase a number of different items in intervals (or ranges) of 100 items. Rather than hardcoding ranges, we can generate the range in this way: Select '0 to 100' "Interval", count(*) "Customers" >From Customers Where items Between 0 and 100 Select 100*TRUNC (items/100) + 1 || ' TO ' || 100*(TRUNC (items/100) + 1) "Interval", count(*) "Customers" >From Customers Group by TRUNC(items/100); Interval Customers -------- ---------- 1 to 100 22567 101 to 200 10952 201 to 300 6390 301 to 400 605 401 to 500 105 This approach is flexible and allows all ranges to be covered without previous knowledge of the data. In order to perform this query, your database will need to support a TRUNC or FLOOR function. What's more, you may use this approach but modify the query by using a Rounding function. ---------------------------------------------- DOMAINS IN THE DATABASE Domains can be created in a database to standardize the data types and properties that may be used for a specific column. This allows the same definition to be used for more than one column rather than having to implement the properties at the column level--which may potentially result in different, inconsistent data types. Here's an example of defining a domain: Create Domain new_domain AS char(2) Default 'ON'; Note that the above domain contains both a data type and a default value of 'ON'. In this way, any column created with this domain, and without a value, will automatically be given a value of 'ON'. The above domain can then be used in the Create Table or Alter Table statement as follows: Create Table new_table (column1 new_domain); ---------------------------------------------- DROPPING CONSTRAINTS Table and column constraints may be dropped simply through the Drop Constraint clause of Alter Table and Alter Domain. Assertions are dropped using the Alter Assertion...Drop Assertion... statement. Because of this, it is easier to drop constraints and assertions if we know their names, and well-chosen constraint names should be selected to facilitate database maintenance on the constraints. ---------------------------------------------- EVEN FASTER THAN INTERNET TIME Did you think that a millisecond was a millionth of a second? You may hear people at work saying things like, "That online transaction takes 500 milliseconds." What does this mean exactly? It means that the transaction takes 0.5 seconds. (This, by the way, could be considered a long time for an online transaction.) Here, then, is a list of times that are used to measure I/O, CPU, memory, and transaction speeds: Timing Rates Millisecond = thousandth = .001 = 10**-3 Microsecond = millionth = .000001 = 10**-6 Nanosecond = billionth = 10**-9 Picosecond = trillionth = 10**-12 Femtosecond = quadrillionth = 10**-15 Attosecond = quintillionth = 10**-18 Zeposecond = sextillionth = 10**-21 Yoctosecond = septillionth = 10**-21 = .000000000000000000001 (** indicates exponent, or "to the power of") Strangely enough, there may well come a time when we'll be expected to know each of these terms. ---------------------------------------------- FOREIGN KEY CONSTRAINTS Foreign key constraints are created to constrain (to the referenced Primary key values) the column values of a Foreign key. A Foreign key has referencing columns that in effect point to the referenced or Primary keys values, which may be thought of as the referenced columns. A table may have at most one Primary key, but it may have zero or more Foreign keys whose columns are referencing another table's Primary key, or perhaps even the Primary key of the same table (in the case of a self-referencing relationship). The goal of Foreign key constraints is to ensure that no dangling constraints exist in the database--in other words, to ensure that there are no Foreign key values without corresponding Parent key values. Foreign key columns cannot be of data types Blob, NClob, Clob, or Array. We will discuss some features of Foreign key constraints in upcoming tips. ---------------------------------------------- FOREIGN KEY MATCH FULL, SIMPLE AND PARTIAL The Match clause of a Foreign key is used to specify the degree of the required match between Foreign keys and the referenced key. The Match clause may take three forms: Match Simple, Match Partial, or Match Full. Match Simple is the default. These options are valid in cases where the Foreign key has null values. Match Simple makes a Foreign key valid if all columns are Not Null and match the referenced (Primary) key exactly, or if one of the Foreign key values is Null. Match Full makes a Foreign key valid if every Foreign key column is Null, or if all Foreign key columns are Not Null and match the referenced table. Match Partial is valid if at least one Foreign key column is Null and the rest of the Not Null columns match those of the referenced tables, or all Foreign key columns are Not Null and match the referenced table. To increase data integrity, consider using Match Full and making all Foreign key columns Not Null. In this case, all Foreign key columns must match the Primary key. ---------------------------------------------- FOREIGN KEY REFERENTIAL ACTION The referential action of a Foreign key defines the action to be taken on an update or delete of the referenced (that is, the Primary key) table. These are created for On Update and On Delete. Each tells the DBMS what to do in case a Referenced (Primary) column value is either updated or deleted. The referential actions that can be defined are Restrict, Cascade, Set Null, and Set Default: - Restrict does not allow for the deletion of a Primary key (referenced) row if a referencing (Foreign key) row exists with the same values. - Cascade allows a delete to the referenced (Primary key) column's rows and cascades the delete down to the referencing (Foreign key) rows, which are also deleted. - Set Null, like Cascade, allows a delete to the referenced (Primary key) column's rows and cascades the delete down to the referencing (Foreign key) rows, but rather than deleting those rows, it sets the Foreign key values to null. - Set Default acts such that DBMS sets every column in the referencing (Foreign key) column values to the column's default values. An example of this is Create table referencing_table ( column1 char(10) constraint new_constraint references referenced_table on Delete Restrict ); ---------------------------------------------- GLOBAL TEMPORARY TABLES Global Temporary tables are similar to the Local Temporary tables that we discussed in our previous tip. However, with Global Temporary tables we use the Global clause rather than the Module clause so that the table may be seen and used by more than one module. As with Local Temporary tables, we can specify the 'on commit preserve rows' clause or the 'on commit delete rows' clause. The first keeps rows across commits for a single SQL session, while the second removes all rows from the table at commit time. Both options will remove all rows at the end of the SQL session or program/module execution. The Global Temporary table can be used to store intermediate rows that can be used to simplify processing but that do not need to be kept in a persistent manner. These have less overhead than persistent tables and provide performance advantages when this type of processing is required. Here's an example: Create Global Temporary Table new_temp_table ( column1 char(10)) On Commit Preserve Rows; ---------------------------------------------- LIKE PERFORMANCE TIPS The Like statement can be used with wildcards to find data that is similar to a matching pattern. There are two forms of wildcard: '_' and '%'. These are different and their subtleties should be understood. The '_' matches a single character, while the '%' matches many characters. The '_' is faster than '%' since only one character needs to be evaluated and '%' must perform lookahead parsing. The "_" also works a little differently than '%' since it requires that a match exist, whereas the '%' does not have this requirement. Consider the following statement: Select * from table_name where 'NAMES' like 'Name_'; will match; but we won't get a match with the following: Select * from table_name where 'NAME' like 'Name_'; We would, however, get a match if the previous statement was coded as follows: Select * from table_name where 'NAME' like 'Name%'; Note: There is a tendency to use '%' with all Like statements. When you know the exact number of characters you're matching, it is more efficient to use '_'. ---------------------------------------------- LOCAL TEMPORARY TABLES Local Temporary tables may be used to create materialized tables within a program that may be used by--and only by--that program (that is, module). The Declare Table statement names a newly declared local temporary Base table. This can be a useful construct for storing intermediate data needed only by the module. Data is not persistent outside a single execution of the module. In other words, once a routine has completed, the data that was in the Temporary table will be deleted. A Temporary table can be used only in the SQL Module statement and is not visible outside the module. Here's an example: Create Local Temporary Table Module.new_table ( column1 char(10), column2 integer) on commit preserve rows; The expression 'preserve rows' allows for rows to be kept in the Temporary table across 'commit' points--but these will be deleted at the end of module execution. The 'on commit delete rows' option will empty the Temporary table at each Commit point within a Module execution. ---------------------------------------------- NOT NULL CONSTRAINTS Table columns can be defined as Null or Not Null. This constraint determines whether it is possible to allow null values in the columns, or if a valid value must be entered. It is important to note that in cases where a column is defined as Not Null, an application Insert statement or Database Load operation does not need to contain values for the column (in cases where a valid default has been created for the column). When determining whether a column can be defined as Null or Not Null, the following should be considered: - If a column is really mandatory, define it as Not Null. The opposite is true if the column is optional. - If the column is part of a Primary key, it should be mandatory and Not Null. - Null values are never equal to anything including other null values. - Be careful of using values in 'non null' columns, which are inferred to be the same as a Null. An example of this is to default a numeric column to zero, assuming that this means the lack of a value. This may confuse the meaning of the data and may incorrectly skew queries that use functions such as 'avg' (average). This type of problem occurs more often with numeric columns. - Some databases do not create index entries for rows that contain Null values. To access these, the table may be scanned. Check with your DBMS to see whether this is the case and test it as follows: Create Table new_table (new_column char(10) constraint new_constraint Not Null); ---------------------------------------------- OVERLAPPING DATETIMES The Overlap operator determines whether two sets of datetime values overlap. This is used to determine whether one interval overlaps another, and if they overlap, it evaluates to True. There are two possible formats: start_date (and/or time/datetime), duration start_date (and/or time/datetime), end_date (and/or time/datetime) Here's an example using both formats: Select e1.employee_name, e2.employee_name From Employee E1, Employee E2 Where (e1.start_date, Interval '14' days) Overlaps (e2.start_date, c2.termination_date); This will return those employees who had an overlap of start and termination dates. This can be used for the Date, Time, and Datetime datatypes. ---------------------------------------------- POSITION FUNCTION To determine where a particular substring begins within a string, you can use the position function. It can be used in conjunction with the length functions to determine the starting position of a substring in a string as well as the length of the substring. You can use the results of these functions as input to the substring command. The syntax is Position(search_string IN string) or Position(bit_argument_1 IN bit_argument_2) For example: Position(B'010' IN B'11100010') returns 6 select position('B' IN 'ABC') From table_name 2 ---------------------------------------------- RECURSIVE QUERY EXAMPLE Today, let's look at an example of a recursive query where we want to see all of the parts that make up a car. We will be examining the Car_Parts table, which consists of Parts and Sub_Parts, as shown below. Car_Parts PART SUBPART Car Engine Car Trunk Car Exhaust System Trunk Lock Engine Carburetor Carburetor Valve Exhaust System Muffler Based on the above Car_Parts table, find all subparts of a Car. To do this, we can use the following recursive subquery: With Recursive Car_Temp (part, subpart) AS (Select part, subpart From Car_Parts UNION Select inner.part, outer.subpart From car_parts outer, car_temp inner Where outer.part = inner.subpart) Select * from Car_Temp Where part='CAR'; To process the query, we need to first determine a starting point and let the results unravel from there. This query will perform the following steps: 1) Based on the last statement in the query: Select * from Car_Temp Where part='CAR'; and the first part of Car_temp: (Select part, subpart From Car_Parts, the following temporary table will be created: Car_Temp (aka "inner") PART SUBPART Car Engine Car Trunk Car Exhaust System 2) The inner query will then be joined to Car_Parts (aka "outer") by the following: UNION Select inner.part, outer.subpart From car_parts outer, car_temp inner Where outer.part = inner.subpart) Select * from Car_Temp Where part='CAR'; The Union removes duplicates, and the inner subpart is joined to the outer part. The temp table car_temp keeps getting added to as the subquery is processed; this returns the following result: RESULT SET PART SUBPART Car Engine Car Trunk Car Exhaust System Car Carburetor Car Lock Car Valve Car Muffler As you can see, this is a relatively simple example of recursion, but it still takes some time to analyze before you can understand what is going on. This is not an intuitive statement. Think of what it must be like to support a complicated recursive query that another person has previously written--which leads to this final tip: When adding recursive SQL to a program, document it thoroughly so that another person (or perhaps even you) can understand exactly what the query is doing. ---------------------------------------------- RECURSIVE SQL Recursive SQL is new to SQL3 and a great relief to many SQL programmers who have tried to resolve the famous "bill of materials" problem. This is the problem created by attempting to perform a parts explosion to an uneven hierarchy of parts where the number of levels is unknown. These self-referencing tables or self-referencing expressions are referred to as recursive queries. Problems with these queries in the past have resulted in poor program functionality, poor system design, and poor performance. It is important to note that recursion still has some problems of which SQL developers should be aware: - Negation crosses recursive queries and causes problems with the query. - Aggregation and grouping can cause problems if a value is changed during the query that affects the aggregates. - A statement that can cross the recursion, such as a Full-outer-join, should be carefully examined and tested to ensure that recursion is not compromised. These problems can be resolved by creating temporary, static tables--and in some cases by using common table expressions. The key is to remember the importance of understanding the exact steps being performed by the recursive query and to carefully test the query. In our next tip, we'll look at an example of recursive SQL. ---------------------------------------------- RETURNING LENGTHS OF STRINGS Three functions can be used to return the length of strings, which may be useful, for example, in a case where you want to determine the number of characters to return in a substring function. The three functions are: - char_Length(string)--Returns the length of a character string (when it is stored as a Character) as an integer value. - bit_length(string)--Returns the length of a character string in bits as an integer value. - octet_length(string)--Returns the length of a character string in octets as an integer value in octets. Here are some examples: Select char_length('A') From table_name; Returns 1 Select bit_length('A') From table_name; Returns 8 Select octet_length('A') From table_name; Returns 1 ---------------------------------------------- SCHEMA A schema can be thought of as the data model (the diagram) of tables and columns that group together to form a single logical database. In other words, it's the overall design of the complete database. A schema is optional and not necessary to create a table in the physical database--but it does help define the rules under which database objects can operate and interact (for example, the relationship rules and reverential integrity rules). The schema concept can be used by database administrators (DBAs) to help determine logical groupings of tables that can be used to back up and recover databases. In Oracle, a schema can be thought of as an Ownerid, while in DB2 and SQL Server, a schema can be thought of as a database (though the implementations of databases are very different from each other). The command to create a schema is Create Schema new_schema; ---------------------------------------------- The substring function returns a portion of a string from a specified starting point for a given length of values. We can, in other words, ask for n characters starting at position x of a string. An example of this is the case where you want the first three positions of a person's last name for a report. This would be done as follows: Select substring(last_name From 1 For 3) from Customers; The substring function is commonly used and supported by most DBMSs (the syntax may vary). ----------------------------------------------
TEMPORARY TABLES The Create Table statement defines a base table that is stored in a persistent manner. In this case, there is a new type of table that is introduced as a Temporary table. There are two types of Temporary tables: Local and Global. Temporary tables are not created until they are used in a SQL session. At this point, they can only be used by the SQL Session--or module--that created it. In other words, other users cannot use the table. Table names do not need to be specified in the 'Create Local Temporary Table' or the 'Create Global Temporary Table' statements. The Global table name will default to the Database schema name and session id, while the local table will default to the schema name, session id, and module name. ---------------------------------------------- THE MATCH PREDICATE The Match predicate is used to test reverential integrity (RI) before storing values in a table. This is done in SQL rather than through DDL. Here's an example: Select employee_name From Employee Where department_name Match (Select department_name From Department); The above ensures that an employee is part of an existing Department. This query may be run before a row is inserted into the table. Note that it is more efficient to implement RI on the tables and to simply perform the insert. The Match predicate comes in two flavors: Match Full and Match Partial. Match Full returns a True result if every value in the match is Null or if all values match. Match Partial is true if every value in the match is Null or if all non-null values match. The Unique and Match predicates can be combined to ensure that one, and only one, matching value is found: Select employee_name From Employee Where Unique department_name Match (Select department_name From Department); ---------------------------------------------- TRIM FUNCTION The trim function is used to remove the leading or trailing characters (or both) from a string. The default character string to be trimmed is the blank, unless otherwise specified. This function can remove any specified character. The syntax is TRIM(trim_specification FROM column_name) trim_specification = BOTH | LEADING | TRAILING Some examples using trim are: To remove leading and trailing blanks: Select trim(Both ' ' FROM city_name) from Address; To remove leading 0's: Select trim(Leading '0' FROM item_value) from Address; To remove trailing '*': Select trim(Trailing '*' FROM column_name) from table_name; ---------------------------------------------- UNIQUE CONSTRAINTS VS. UNIQUE INDEXES There is a subtle, but real, difference between unique constraints and unique indexes. A unique index can be created on a column or set of columns on a table. This will ensure uniqueness for those columns, but foreign keys cannot be created on unique indexes alone. A unique constraint can be created on a column, or set of columns, and a foreign key can be created that points to the unique constraint for reverential checking. The unique constraint will be implemented through the implementation of a unique index. This is done either explicitly through the create unique index statement or implicitly when the unique constraint is created. The unique constraint is used by data modelers since unique indexes can be shown on the relational model. ---------------------------------------------- UNIQUE PREDICATES There is a feature in SQL-92 that allows for a query predicate to determine whether a value in a subquery is unique. This is a different feature from unique indexes and unique constraints. If the result of a unique subquery returns duplicate values, the subquery will be evaluated as False; otherwise, it is True. Be aware that Nulls always will be evaluated to a True value with the unique constraint, which differs from other uses of Null. Here's an example: Select first_name, last_name From Customer C Where UNIQUE (Select last_name From Customer CU Where C.last_name = CU.last_name); The above is a correlated subquery that will return last_names in cases where that value is unique. You should note that this is not a widely implemented feature in current DBMSs. ---------------------------------------------- WHAT CAN GO INTO A SQL DATABASE With SQL3, Object-Relational database extensions, Persistent Stored Modules (PSM), and Used-Defined DataTypes (UDTs), relational databases have made great strides toward being able to store most forms of data. What's more, databases have become extensible and can now include various types of objects, such as XML objects, large text documents, video, audio, and image. Objects of these data types are being integrated with the standard structured data in most popular databases. With new Java support, wherein most databases allow PSMs to be written in Java, we're now seeing more data being loaded from traditional systems into the database. Additionally, new e-commerce uses (and other new data requirements) are being fed to large DBMSs with data generated from PDAs, cell phones, two-way pagers, and laptops. Lightweight databases combined with database replication and wireless communications are allowing this data to be moved to central sources quickly. Indeed, most major database vendors now have DBMSs that can run on Windows CE or Palm OS devices complete with database replication! All this is to say, most data can and will be supported by databases--which also means database sizes will continue to increase significantly. ---------------------------------------------- WORKING WITH CHARACTER DATA Character data comes in two forms: Character and Varchar. The Character (Char) datatype is fixed length while the Varchar datatype varies in length to match the length of the input data. In general, you may want to use Char datatypes in cases where the data length is relatively short or where a large percentage of the column will always be populated--in other words, where there is not a great savings of space when using Varchar. The Varchar datatype has an overhead of two to four bytes (to maintain the length of the stored column) for each column stored. It's important when working with character datatypes to know which type you're using. For example, there are cases where two columns with similar data are compared to each other but do not match. This is because Char data is padded with blanks while Varchar columns are not. In other words, take the case of a Char(10) or Varchar(10) column storing the string 'ABC'. The Char column will store it as 'ABC ' (that is, 'ABC' followed by seven spaces) while the Varchar column stores it as 'ABC'. In some cases, joins between these columns will not return a match. You'll also need to know whether Varchar columns are stored in your database as varying- or full-length columns. Different DBMSs vary in their approach. The rules of thumb for when to use Varchar also tend to vary by DBMS implementation. DB2 for OS/390 designers tends to lean toward using Char datatypes as an implementation option while Oracle designers favor Varchar (actually Varchar2) in that database. ---------------------------------------------- WHY YOU SHOULD CARE ABOUT SQLSTATE The SQLSTATE feature was added with SQL-92 to provide standard messages that can be used across DBMSs and to enhance error diagnostic information. A SQL status can come in the form of either SQLCODE or SQLSTATE. In the case of SQLCODE, a 0 indicates statement success, a negative value indicates that an error has occurred, and a positive integer provides information about a message or warning. The SQLCODE of +100 indicates that no (more) data was found. Besides these points, SQLCODEs differ greatly between DBMS implementations. The SQLSTATE is a 5-digit code. The 5-digit code is broken into two parts. The first two specify the class of the error, and the next three specify the subclass, or detailed error. Class 00 is Successful, Class 01 is a Warning, and Class 02 tells us that there is no (more) data. Some sample SQLSTATEs are shown here using three Class 08 SQLSTATEs. Class 08 messages are Connection Exceptions; some examples are 08000 connection exception 08001 connection exception - sql-client unable to establish sql-connection 08002 connection exception - connection name in use You can see that all 08nnn SQLSTATEs will relate to the same type of problem. SQL-3 has added condition handling to define how exceptions are defined. They include a SQLSTATE code and are also associated with a Signal or Resignal keyword. Consider using SQLSTATE in your code because it will provide more portability across different DBMS implementations. -------------------------------------------------------------------------------- WHEN TO DENORMALIZE One of my personal pet peeves is the tendency among programmers to quickly choose to denormalize data because it may improve database performance. This is not a good enough reason to denormalize, and database designers are moving toward the stance that they won't denormalize unless there's a good reason to do so. Here are some appropriate reasons: The system being implemented won't meet performance objectives. This is different from saying that system performance may improve. I'm willing to keep a normalized model with slightly degraded performance if the benefits of normalization can be achieved with acceptable performance. Other systems will be adversely affected and won't meet performance objectives. The system has become too difficult to work with because of the large number of normalized tables. This is a good reason to denormalize, but be sure that your denormalization doesn't introduce problems that also increase system complexity. -------------------------------------------------------------------------------- WHEN NOT TO INDEX There are times when you may want to create a new index but should avoid doing so. Some of these reasons are: Indexed columns are frequently inserted, updated, or deleted. The table is small and a full scan will be as quick as any index access. The number of indexes on the table creates an index population issue, and this index is not frequently used. An alternate index covers the same columns as the one you want to create. -------------------------------------------------------------------------------- WHAT'S A SQLCA When writing embedded SQL, you will need to code a SQLCA as shown below: main() { exec sql include sqlca; ... } SQLCA is short for SQL Communication Area. This is an included structure that is expanded and used to perform communication between your program and the database, stating the success and failure of each embedded SQL statement. Some sample values included in the SQLCA are shown below: SQLCODE: The return code passed from the database to the program. SQLERRM: A message relating to SQLCODE that has more information about the error. This is a group field that includes the message length and text as separate components. SQLWARNn: Some SQL warning messages. SQLERRD: Fields that contain error information SQLSTATE: The ANSI-compliant SQL return code, which may differ from the proprietary DBMS-dependent SQLCODE. -------------------------------------------------------------------------------- WHAT'S A SQLCA When writing embedded SQL, you will need to code a SQLCA as shown below: main() { exec sql include sqlca; ... } SQLCA is short for SQL Communication Area. This is an included structure that is expanded and used to perform communication between your program and the database, stating the success and failure of each embedded SQL statement. Some sample values included in the SQLCA are shown below: SQLCODE: The return code passed from the database to the program. SQLERRM: A message relating to SQLCODE that has more information about the error. This is a group field that includes the message length and text as separate components. SQLWARNn: Some SQL warning messages. SQLERRD: Fields that contain error information SQLSTATE: The ANSI-compliant SQL return code, which may differ from the proprietary DBMS-dependent SQLCODE. -------------------------------------------------------------------------------- VIEWS IN DB2 We've been discussing how to compile, or bind, programs in Oracle and SQL Server without qualifying the table with an owner ID. We've already looked at using the proprietary features of public synonyms and the dbo user. In order to allow a program with unqualified SQL statements to perform a bind (compile) in DB2 when the owner ID differs from the owner ID of the table, you're required to use views. Accordingly, if a table is called Customers, the owner ID is owner1 and the program binding ID is owner2, you would need to grant access to owner2 on owner1.Customers and would need to create views as user owner2 on table Owner1.Customers. This approach may seem cumbersome, but it's a simple thing to generate these views and it removes the need to qualify SQL statements with the table owner IDs. This approach is also a standard one that can be used with most other relational databases. -------------------------------------------------------------------------------- USING WHENEVER IN EMBEDDED SQL--PART 1 OF 2 A simple way to deal with SQL errors is to use the Whenever feature. The syntax is as follows: Whenever <condition> {GOTO | GO TO} <called process> Here's an example that uses this syntax: exec sql whenever sqlerror goto error_rtn; ...more statements... error_rtn: ... error processing ... exit(); In the next tip, I'll discuss some things to keep in mind as you use Whenever statements. -------------------------------------------------------------------------------- USING WHENEVER IN EMBEDDED SQL--PART 2 OF 2 In our previous tip, I showed you a simple way to deal with SQL errors by using a Whenever statement. This is a simple approach that can be used in development or in creating simple programs. However, you should remember the following when coding Whenever statements: More detailed and different instructions will likely be needed for different classes or errors. The called statement and routine should both be "in scope" (in the proper program block) so that processing isn't prematurely terminated. Whenever uses a GO TO or GOTO construct that can result in problems. The precompiler takes the Whenever statement and places an if...goto series of statements after every SQL statement. In some cases, rather than using Whenever, you might consider checking for errors after each SQL statement--or better still, create a common, intelligent routine that can perform the proper actions when different types of errors occur. -------------------------------------------------------------------------------- USING GROUP BY AND ORDER BY TOGETHER You may have occasion to group data to perform a function, but want data to be ordered in a way different from the grouped column. To do this, you can use Group By and Order By together to ensure that your data is ordered properly. Here's an example: Select postal_code, customer_id, count(*) From Sales Group by postal_code, customer_id Order by customer_id, postal_code; The above query groups the data on Input to perform the count(*) and then orders the data differently on the output to display the data by customer_id and postal_code. You may have run across cases where you've used Group By without Order By and the data is not displayed in the same order in which the columns appear in the Group By clause. To ensure that data is returned in the order you want, use Order By with Group By. (Remember, Order By sorts the output data.) -------------------------------------------------------------------------------- USING COUNT(COLUMN_NAME) As you saw in a previous tip, performing a count(column_name) does not return the count of columns that contain a Null value. It is important to make this distinction since this can be useful when you really do want to count values from a column that are not null. An example of this is a case where you want to see how many orders have been paid, as shown below: Select count(paid) From Orders; The above query can have also be written as follows: Select count(*) From Orders Where paid is not null; Even though column_name can be used, I almost always code this query the second way, using count(*). -------------------------------------------------------------------------------- USING COUNT() To count the number of rows returned by a query, use count(*). This gives the optimizer the opportunity to choose how to perform the count, and it can lead to great optimizations if indexes can be used and if certain columns are already not null. If you choose a column that contains null values, a count(column_name_containing_nulls) query will return only those columns that are not null. In other words, count(*) counts the number of rows returned from the result set, while count(column_name) returns a count of not null values for a specific column. Count(*) may also be more efficient. -------------------------------------------------------------------------------- USING A FUNCTION IN SQL Some databases allow you to use a function inside a SQL statement. This can be a handy feature in some cases where the function exists and where it is complicated to duplicate the logic in the SQL statement. An example of a SQL statement that uses a function or package is shown below (this is taken from an Oracle example): SELECT * FROM Customer WHERE customer.state = state_package.active_state; In the above SQL statement, the state column on the Customer table is compared to the result of the value of active_state in the state_package. In this example, the use of the function is not obvious. The expression state_package.active_state looks like a table and column name rather than a function call. In fact, this is a call to a package rather than a function, but the result is the same. There are function calls that will look more obvious, such as the one shown below where the value in the program variable active_state_variable is passed to the state_function. The result of this call is then compared to the value in column customer.state. SELECT * FROM Customer WHERE customer.state = state_function(active_state_variable); An alternate approach to this would be to move the state_package.active_state value to a variable and use that variable in your program. One thing to remember when using this approach is that the function must be called for every row retrieved in the table. So, if this is a situation where the result of the function call will always be the same, move the result to a variable and compare the variable to the column in the SQL statement's Where clause. -------------------------------------------------------------------------------- USE DBMS FEATURES TO GENERATE KEYS When possible, use DBMS features to generate keys. These features may be identity columns or sequences, or may use pseudo-columns. The advantage of this is that any program that accesses the database will have access to the feature or algorithm when inserting data to the database. In the past, this was not done in some instances, such as mainframe systems that used CICS Temporary Storage Queues (tsq) to generate keys. While the results were extremely fast, it was limited to CICS programs. Being a DBA, I always try to use database features when working in the database, unless I know the approach cannot work or will cause future problems. -------------------------------------------------------------------------------- UNIQUEIDENTIFIER IN SQL SERVER The Identity keyword shown in our previous tip allows us to create a unique key in a table. SQL Server lets us create a key that is unique globally with the Uniqueidentifier keyword. As you might expect, this is larger than Identity and is a 16-byte column (it stores 32 hex characters). When displayed, this key has four dashes inserted into it. Only one of these can be created in a table and could be done as follows: Create Table Table1 ( Table1_id uniqueidentifier, Table1_Desc char(50)); An insert of a new row uses the NEW() function to set this new column. -------------------------------------------------------------------------------- TRUNCATING A TABLE WITH CONSTRAINTS Truncating a table with referential constraints cannot be performed until the constraints are either dropped or disabled. The reason for this is that Truncate is a table-level statement, while the referential constraints are at the row level. The statement Alter Table table_name Disable Constraints referential_constraint_name must be performed before truncating the table. Another option is to drop the constraints, but I prefer disabling and re-enabling them since the constraints and constraint rules are still stored in the database catalog and the re-enabling of the constraints can be re-generated from the catalog. -------------------------------------------------------------------------------- TRIGGERS PROS AND CONS Triggers can be used to implement Business Rules and Referential Integrity (R.I.). A trigger is attached to a table and can be "fired" (or executed) upon an Insert, Update, or Delete to the table. This is determined by the Create Trigger definition. Triggers can be used to enhance R.I. as implemented through DDL. They can be used to complement DDL-implemented R.I.--they can replace them or they can be used to implement other Business Rules. Triggers can also call Stored Procedures or User Defined Functions (UDFs) to enhance their functionality. Triggers need be written and implemented only once, after which they are attached to a table and shared by all applications. As you can see, they are a powerful and flexible feature, but there are some disadvantages to triggers: Triggers add complexity to the database. The database setup becomes more complex. Business Rules become hidden. You may not see the Business Rules inside a trigger when analyzing a data model, nor will you have access to all of the processing being performed when you analyze application programs. Triggers may become disabled and may not be executing when you thought they were. This is a subtle and insidious problem that may not be detected for some time, unless you monitor the status of your triggers. Triggers can adversely impact performance. Before implementing triggers, be sure to keep these pros and cons in mind. -------------------------------------------------------------------------------- TRAVERSING A FAMILY TREE IN ORACLE Oracle has a proprietary family tree feature that allows you to traverse a hierarchy and a family-tree type of report with a single SQL query. This uses the connect by, start with, and prior features. Consider an explosion of a family tree from the root to the leaf levels: Our table will contain the columns parent_column and child_column. These will follow the relationship from great-grandparent to grandparent to parent to child, assuming that the same person can exist as both a parent and a child and that every person exists as a child (though not necessarily as a parent). We will start to build the tree from the oldest person, Adam, and will continue through the succeeding leaf levels in the following manner: Select parent_column, child_column >From Family_Tree start with child_column = 'Adam' connect by parent_column = PRIOR child_column; If our table contains the following: Parent_column child_column Unknown Adam Adam Steve Adam Bill Bill Mary Steve Helen Mary Ann the result below will be returned, indented to show the effect of the hierarchy: Parent_column child_column Unknown Adam Adam Steve Steve Helen Adam Bill Bill Mary Mary Ann This family tree does not need to be balanced. Notice the keyword Prior, which connects a column in a new row to a column in the previously selected row. -------------------------------------------------------------------------------- TRANSACTIONS AND COMMITS What exactly is a transaction? You can think of it as a single "Logical Unit of Work" where all of the data performed in the transaction must be committed or rolled back together. The atomic transaction is the smallest unit of work that must be committed together. Take the case of a single Insert and two Updates that must be committed or rolled back as a single business transaction; the pseudo-code (or logic) might look like this: Begin Transaction Insert into ... Update ... Update Commit Transaction End Transaction In other words, the Insert cannot be applied without both of the Updates also successfully completing. You cannot perform more commits than this, but you can perform less commits to improve performance. For example, when processing a large number of input transactions as a background task or batch process, you could execute a loop and commit every nth transaction (rather than every single transaction). You must, however, perform these commits around the same transaction boundaries. An example of this is shown below, where a commit is performed every 1000th transaction: Start Program Begin loop Insert into ... Update ... Update ... Set counter = counter + 1; If counter = 1000 Then Commit Transaction Counter = 0 End If End loop Commit End Program Do not perform the Commit between the Insert and Updates--this would violate your business integrity rules. -------------------------------------------------------------------------------- T-JOINS--PART 1 OF 2 T-Join is another term for a Theta-join, or a "best fit" join. The T-join was developed by Dr. E. F. Codd in 1990 and attempts to fit one list into another in a manner that most closely fits the related values in cases where there is no exact match. The following example borrows from Dr. Codd and tries to find the correct classroom to use for a given number of people, given the following Class and Rooms tables: CLASSES ROOMS class_key class_size room_number room_size C6 40 R6 60 C5 50 R5 55 C4 55 R4 45 Given the above tables, a T-join will find the Room in which each Class should reside to achieve a "best fit" so that class_size is less than room_size. There is not a standard T-Join function in SQL, so we'll need to create our own query that does this for us. In our next tip, we'll look at an example of SQL that will help us do this. The source for this tip is Joe Celko's excellent book, "SQL for Smarties: Advanced SQL Programming," 2nd ed., Morgan Kaufmann Publishers, San Francisco, California, 2000. -------------------------------------------------------------------------------- T-JOINS--PART 2 OF 2 Remember that the Theta-join (T-join) attempts to correlate two lists of values so that they attain a best fit when compared to each other. We'll look at an example that compares Class sizes with Room sizes and returns a list of the rooms that can be used for specific classes. The input Classes and Rooms tables are shown below: CLASSES ROOMS class_key class_size room_number room_size C4 55 R4 45 C5 50 R5 55 C6 40 R6 60 R1 42 R7 70 Using the above as input, the result from the T-Join is a single set showing in which rooms each class shall reside: class_key class_size room_number room_size C6 40 R1 42 C5 50 R5 55 C4 55 R6 60 You can see that the above list shows the most efficient room that can be used for a class. For example, class C4, with 55 people, used room R7, which seats 60, rather than Room R7, which seats 70. The SQL that creates this result is broken into two steps: Step 1: Create a temporary table and insert a working set of data. Insert into temp_table Select class_key, class_size, min(room_size) From Rooms R, Classes C Where c.class_size <= r.room_size Group By class_key, class_size; Step 2: Join the temporary table created above with the Rooms table. Select c.class_key, c.class_size, r.room_number, r.room_size From Temp_Table C, Rooms R Where c.room_size = r.room_size; This is only one of many ways to write T-joins. If you write a different one, make sure that the result set is one that is a best fit. The source for this tip is Joe Celko's excellent book, "SQL for Smarties: Advanced SQL Programming," 2nd ed., Morgan Kaufmann Publishers, San Francisco, California, 2000. -------------------------------------------------------------------------------- THE PROS AND CONS OF GENERATING ASCENDING KEYS There are different ways to generate new keys for a table. One of the most common ways is to generate an ascending surrogate key. This is sometimes referred to as a monotonically increasing key because it often increases by one, though it doesn't need to and can skip numbers. Some benefits of these ascending keys are: They are easy to generate. Databases often supply built-in methods for doing this, such as Oracle's sequences and SQL Servers Identity keyword. Developers don't need to worry about collisions since uniqueness is guaranteed. A disadvantage is that hot spots can be created in tables and indexes since the same part of the database may be accessed frequently. This can cause locking problems, and as a result, this solution does not scale well. This approach should be used for systems with low to medium transaction volumes. -------------------------------------------------------------------------------- THE DIFFICULTY OF IMPLEMENTING ONE-TO-ONE RELATIONSHIPS As a database designer, you'll inevitably come across a case of a one-to-one (1:1) relationship in the logical model where the relationship is correct and where it makes sense to implement the entities as separate tables. Some reasons why you may want to implement these as separate tables are: Each table has different relationships to and from them. The tables have entirely different uses and meanings. They each contain a lot of rows and columns. Each table in the 1:1 relationship has mutually exclusive mandatory attributes (columns). The relational model doesn't allow for a nice implementation of a 1:1 relationship. In addition, case tools don't support these relationships in an elegant manner. You must be careful to keep track of the way the relationship was implemented so that subsequent forward-engineering of the tables and relationship won't introduce new inconsistent relationships. -------------------------------------------------------------------------------- THE CALENDAR TABLE The Calendar table is a simple and powerful design option for your system. This table is designed to store all the important dates your system will need to know. The Calendar table will include such columns as Calendar_Date--The actual date--for example, 2000-01-31 Month--A two-digit month number from 01 to 12 Day--A number from 1 to 31 Year--A four-digit number Year_Month--A six-digit number, such as 200012 Year_Quarter--A number such as 200001 for Q1 year 2000 Day_Name; Month_Name Day_Of_Year; Day_Number_Of_Week; Week_Number_Of_Month; Week_Number_Of_Year Holiday_Flag; Weekday_Flag; Event_Name; Month_End_Flag; Financial_Month_End_Close_Flag; Fiscal_Period and Fiscal_Year You may have many more columns to add to this table. The Calendar table will contain one row for each day. It provides a simple way to query data for special dates, such as the financial_month_end_close_flag. This is especially valuable for querying datawarehouses and data marts. This Calendar table is an implementation of a deviation of the Time Dimension, as designed by Ralph Kimball. In our next tip, we'll discuss design options for the primary key of the Calendar table. -------------------------------------------------------------------------------- THE BOOLEAN DATATYPE SQL-3 has introduced a true Boolean datatype that has a True or False value. This will allow you to use Boolean functions (rather than predicates) and helps support user-defined operations. Here's an example that uses the Boolean datatype: Create Table Table1 ( column1 char(10), column2 boolean); Select * From Table1 Where column2 = TRUE; Notice that the value True can be used without including a flag value of 0/1 or Y/N. -------------------------------------------------------------------------------- TEXT VS. NUMBER KEYS When determining primary keys for databases and when using surrogate keys, we have the option of using text or numeric values. Typically, numeric values are used to take advantage of database-specific features, such as Oracle's Sequences and SQL Server's Identity. The advantage of numeric keys is that they are easy to increment and they don't take a large amount of space. There is one specific case, however, in which there's an advantage in creating surrogate keys using alphanumeric columns: datawarehouses. This allows us to use numbers (and Oracle's Sequences) and to cast the numeric type to the alphanumeric type. It also allows us to use the incoming system's primary key when populating the datawarehouse, thereby avoiding key translation in many cases. To anticipate different incoming system key types, we'd need to make the key a large Char or Varchar value to accommodate the many format differences that will be encountered. This approach is not often used, since it requires more space, so the space requirements must be offset against the simplicity of avoiding key translation when populating a datawarehouse. Please note that key translation can be avoided only if a single system is the source for a datawarehouse table. -------------------------------------------------------------------------------- TEXT SEARCHES WITH FREETEXTTABLE IN SQL SERVER Similar to Freetext, which can be used to return values from text searches in SQL Server, Freetexttable can be used to return a set of keys that match a text search. This set of keys can be used to perform a join to the table that is being searched: Select b.* From Books b JOIN FREETEXTTABLE(Books, book_name, ' "jurassic park" ') f On b.books_id = f.[key]; Freetexttable is a simple and handy feature to use in SQL Server, but it is proprietary and not included as an ANSI standard. -------------------------------------------------------------------------------- TEXT SEARCHES WITH FREETEXT IN SQL SERVER SQL Server has a proprietary function called Freetext that will send a string to a text-processing service that can compare the string and remove unnecessary characters from the conversion. We saw the Contains clause in a previous tip, but this is easier to use than Contains. This example finds an Author's name from a table of Books, using Freetext: Select author_name From Books Where FREETEXT(author_name, ' "joyce" '); The author_name will be returned for all books with a name of joyce as the author. -------------------------------------------------------------------------------- SYNONYMS, ALIASES AND VIEWS There may be some confusion over the terms synonym, alias, and view. Views are included in ANSI SQL as a way of joining data, filtering data, performing functions, simplifying table access, or delivering security. The other two options are not ANSI SQL features but are implemented in many databases. Synonyms are implemented in DB2 and Oracle. A synonym can be used only by the synonym creator--except in the case of public synonyms when used in Oracle. This is a useful feature in a development environment when used to redirect a user's query to another user's table. (Note that this feature can also be performed with views.) DB2 provides another object, called an alias, which is used mostly for distributed processing. An alias can be created in DB2 on remote tables to give tables shorter names since the location information is no longer needed. This alias is created with Create Alias... syntax. It differs from another form of alias used in SQL statements, as shown below: Select · from employee emp Where emp.employee_id = 12345; In this case, the alias emp for the employee table is created only for this SQL statement. This form of an alias is ANSI compliant and implemented in all major RDBMSs. -------------------------------------------------------------------------------- SURROGATE KEYS When designing your system, always use surrogate keys for the primary key. These are meaningless, dataless keys and may be thought of as being similar to object identifiers (OIDs) in the Object-Oriented Model. Here are the advantages: The ID always uniquely identifies a row, no matter which column value changes on the row. When updating a column value, the update does not need to propagate to other tables to keep foreign keys in place. Every row has a unique identity. When other columns are used to identify a row, they may change and may result in duplicate values. This is not to say that a row cannot have alternate keys that can be used to find a row. However, these should not be used as the primary key. An example of a surrogate key is the case of creating a Customer_Id column to uniquely identify a row on the Customer table rather than using the Customer_Name column. -------------------------------------------------------------------------------- SUMMING AND GROUPING You can perform functions on sets of data by using the Group By feature. This allows you to perform such functions as Sum, Count, and Avg and to aggregate your results at the grouping level. To illustrate, let's look at an example where we query a Sales table, as follows: Select region, territory, sales_rep_id, sales_dollars From sales group by region, territory; The results may look like this: REGION TERRITORY SALES_REP_ID TOTAL_SALES ------ --------- ------------ ------------ EAST 1 50 750.00 EAST 1 55 750.00 EAST 2 20 1100.00 EAST 2 27 900.00 WEST 3 33 1700.00 WEST 3 91 1300.00 WEST 4 11 325.00 WEST 4 23 175.00 If you decide that you want to analyze the sales by a grouping where the sales figures are summed by Region and Territory, you could rewrite the above query as follows: Select region, territory, sum(sales_dollars) TOTAL_SALES From sales group by region, territory; The results of this query would be: REGION TERRITORY TOTAL_SALES ------ --------- ------------ EAST 1 1500.00 EAST 2 2000.00 WEST 3 3000.00 WEST 4 500.00 In the next few tips, we'll enhance this by summarizing the above to additional levels. -------------------------------------------------------------------------------- SQL TO PERFORM MEDIAN A median function doesn't exist in SQL, but you can create your own in several ways. This isn't a simple task, but there are many different solutions from SQL experts--C. Date and J. Celko, among others. This tip will include a solution of Date's second median based on Celko's first median. Given the following Emp_Salary table, find the median salary: Person_Id Salary 5 $1,000,000 6 $ 100,000 7 $ 60,000 8 $ 40,000 Select avg(DISTINCT E.salary) AS median From Emp_Salary E Where E.salary in (Select MIN(e.salary) FROM emp_salary e -- NOTE 1 Where e.salary in (Select E2.salary FROM Emp_Salary E1, Emp_Salary E2 WHERE E2.salary <= E1.salary ) GROUP BY E2.salary HAVING count(*) <= -- NOTE 2 (Select CEILING(count(*) /2) FROM Emp_Salary ) ) UNION (Select MAX(e.salary) FROM emp_salary e -- NOTE 3 Where e.salary in (Select E2.salary FROM Emp_Salary E1, Emp_Salary E2 WHERE E2.salary >= E1.salary ) GROUP BY E2.salary HAVING count(*) <= -- NOTE 4 (Select CEILING(count(*) /2) FROM Emp_Salary ) ); NOTE 1) Find the lowest value in the top half of the result rows. NOTE 2) Get rows that make up the top half of the result set values. NOTE 3) Find the highest value in the bottom half of the result rows. NOTE 4) Get rows that make up the bottom half of the result set values. If the values contain nulls, the nulls should be dealt with to ensure accurate results. The source for this tip is Joe Celko's highly recommended book, "SQL for Smarties: Advanced SQL Programming." -------------------------------------------------------------------------------- SQL SERVER WITH NULLS IN ROLLUP AND CUBE As with Oracle, SQL Server provides the capability to distinguish the difference between a Null value that is the result of a Rollup taking place because of a break in the grouping, and a natural Null value that is returned from the queried table. SQL Server uses the grouping value of 1 to designate a Null value that results from a break. You can use this in conjunction with the Case statement to write a query as shown below: *Select CASE WHEN (grouping(region) = 1) THEN 'TOTAL COMPANY' ELSE ISNULL(region, '?') END as 'REGION', CASE WHEN (grouping(territory) = 1) THEN 'Total Region') ELSE ISNULL(territory, '?') END as 'TERRITORY', sum(sales_dollars) as 'TOTAL_SALES' From sales Group By region, territory with Rollup; The results of this query would be: REGION TERRITORY TOTAL_SALES ------ --------- ------------ EAST 1 1500.00 EAST 2 2000.00 EAST Total Region 3500.00 WEST 3 3000.00 WEST 4 500.00 WEST Total Region 3500.00 TOTAL COMPANY 7000.00 -------------------------------------------------------------------------------- SORTING WITH NON-DEFAULT COLLATION SQL-92 only allows us to sort data using the default collation that is defined for a specific character set. Therefore, all character data will be sorted in the same way for the character set that was implemented. The same is true for numeric data or any other native character set implemented by the DBMS. This has changed in SQL3 with the introduction of User-Defined collating schemes. Accordingly, the following is valid in SQL3 (but not in SQL-92): Select column_name1, column_name2 From Table_Name1 Order By schema_name1.new_collation; Tim Quinlan is a database consultant in the Toronto area specializing in relational data warehousing and OLTP architecture, design, and implementation. -------------------------------------------------------------------------------- SORTING NULL VALUES When sorting data, where do nulls sort? Well, it depends on the DBMS and the character set that you've chosen in your implementation. The implementation of nulls will, however, be consistent. In other words, if they are to be greater than non-null values, they'll sort to the end when ascending, and they'll sort to the beginning when descending. This kind of makes sense, doesn't it? Tim Quinlan is a database consultant in the Toronto area specializing in relational data warehousing and OLTP architecture, design, and implementation. -------------------------------------------------------------------------------- SORTING BY AN UNNAMED VALUE EXPRESSION In SQL-92, sorts had to be done on named expressions. In SQL3, a sort can be performed on an unnamed value expression, as long as it is in the query's Select list. An example of this is shown below: Select inv_amt * 1.15 From Invoices Order By inv_amt * 1.15; This statement could not have been written using SQL-92 syntax and would have had to be rewritten using a column alias. Tim Quinlan is a database consultant in the Toronto area specializing in relational data warehousing and OLTP architecture, design, and implementation. -------------------------------------------------------------------------------- SORTING ASCENDING AND DESCENDING The default method for sorting data is to sort in ascending order. If the ordering is not listed, the column will always be sorted in ascending order. Take a look at the following statement: Select column1, column2, column3, column4 From Table1 Order by column1 desc, column2, column3 desc, column4; In the above example, both column1 and column3 will be sorted in descending sequence while column2 and column4 will be sorted in ascending sequence. Notice that specifying Desc in column1 does not automatically become the default ordering for the next column. Tim Quinlan is a database consultant in the Toronto area specializing in relational data warehousing and OLTP architecture, design, and implementation. -------------------------------------------------------------------------------- SOME REASONS TO INDEX When choosing whether to create an index for a given set of columns, you should keep a few points in mind. While not complete, here's a quick checklist: Index column(s) that support a primary or unique constraint. Index columns used frequently in Where clauses as predicates. Create an index when it can be used to limit the number of rows to a small percentage of a table's size. A very rough estimate for this is that the index should narrow the search to less than 20 percent of the table size. Create indexes when the table is relatively large. Indexes can be used to order data while avoiding sorts. Indexes can be used to cover all columns of a search so that table data is not needed. In our next tip, we'll look at times when you don't want to create an index. -------------------------------------------------------------------------------- SLOW-RUNNING QUERIES Say you've been running the same query or program for months. Suddenly, it runs very slowly, and you haven't changed the SQL or the program. You haven't even recompiled or re-optimized the queries. What has caused this remarkable phenomenon? Some possible causes are: An index has been dropped. An index has been added and the SQL was dynamically recompiled. Statistics were run on the affected tables or indexes. Statistics were dropped from the affected tables or indexes. Either a patch or new release of the database was implemented. Environment variables may have changed. (Perhaps buffer/cache sizes have been decreased, for example.) These types of problems are insidious and difficult to debug, but they occur more frequently than you might expect. Don't be surprised--just take your time and look for changes that may have taken place. The above list isn't complete, but it gives you an idea of what kinds of things to look for. -------------------------------------------------------------------------------- SINGLETON SELECTS IN EMBEDDED SQL In our previous tip's sample program, we had an example of a singleton Select statement. This is a statement that returns no rows, or one row. It cannot return more than one row or a SQL error will occur, since the host variables in the program will not be able to deal with the multiple rows. The sample code is main() { exec sql include sqlca; exec sql begin declare section; int vCustomerNo; char vCustomerName[30]; exec sql end declare section; whenever sqlerror goto error_rtn; vCustomerNo = 1; exec sql select customer_name from Customer where customer_number = :vCustomerNo into :vCustomerName; error_rtn: printf("SQL Error"); exit(); } Notice that the Select statement is the same as one you would code in your favorite SQL editor with the exception of the exec sql and into keywords. Since customer_name is returned by the SQL statement into a single variable, vCustomerName, only one row can be returned. This means that the Where clause must use a unique index. In this case, a unique or primary constraint should be created on the customer_number column. -------------------------------------------------------------------------------- SINGLETON SELECT VS. CURSOR If a SQL statement in your system will return only one row--and if you're certain that only one row or no rows will be returned because you're accessing the data using a column that is unique in the where clause--then a cursor and a singleton select will both work. The single select will almost always outperform the cursor since the cursor requires an open, fetch, and close while the select statement requires only the single select clause. Therefore, you should favor the singleton select. The one exception to this is the case where the row selected must later be updated. In these cases, use a cursor with the for update of clause to ensure that the row is not changed between the select and the subsequent update. Of course, if lock time is a factor, you may not use the for update of clause. But if you take this approach, make sure that the row was not changed by another transaction between the select and the update statements. -------------------------------------------------------------------------------- SENSITIVE CURSORS Sensitive cursors are a new feature in SQL-3 and are defined with the Sensitive keyword. A cursor defined using this option can access the actual query result rather than a copy of the data that may have been created by the cursor statement. In other words, this operates on the actual table rather than on a copy of the table. This means that updates made in the same transaction can be viewed through subsequent cursor fetches. Asensitive cursors are the default case and may or may not take a copy of the result table. Whether changes to the result table will be visible to the cursor will depend on the DBMS and the implementation. Here's an example: Declare Customer_Cursor CURSOR For Sensitive Select * from Customer; -------------------------------------------------------------------------------- SELECT TOP IN SQL SERVER SQL Server has added a proprietary feature that allows you to restrict the number of rows that your query returns: the Top keyword in a query. You can specify an absolute number of rows with Select Top n. Or you can specify a percentage of rows from a table, or result set, by using Select Top n Percent. An example: Select TOP 20 * From Customers; Note that this does not imply any ordering and gets only the first 20 rows seen by the query. You would need to add an Order By clause to get a specified number of rows in an ordered fashion. -------------------------------------------------------------------------------- SCALAR FUNCTIONS You often see in database and SQL documentation various functions that are supported by your DBMS and by the SQL standard. In this documentation, you may have run across the expression scalar functions and wondered, "What's that?" A scalar function is one that is applied to a single column (or expression) and that operates on a single value. These return a row with one field/column. Examples of scalar functions are char, concat, substr, round, rtrim, and ltrim. Scalar functions differ from column functions that operate on a set of data. -------------------------------------------------------------------------------- RUTHLESSLY DROPPING COLUMNS After reading the previous two tips, you've seen that it is possible to drop columns from tables and that you can control this function so it only occurs when no other database objects exist on the table. Now, what if you want to drop the column regardless of whether other objects use that column? This is common in a development environment, or in controlled situations where you know exactly what will be the impact of the drop action. This function can be performed using the Cascade keyword with the drop column statement. Here's an example: Alter Table Product drop column product_short_name Cascade; This will drop or disable any object that references the dropped column. -------------------------------------------------------------------------------- ROWID AND RID IN DB2 In our previous tip, you saw that the Rowid datatype can be used to create a column on a DB2 Table. You may have also heard your DBA refer to a DB2 Rid and wondered if they're the same thing. Though they may conceptually be considered similar, they are in fact different things. A Rowid is used to create a column with a direct pointer to a row using the Rowid datatype, whereas the Rid is a DB2 internal record ID that is a 4-byte identifier showing the offset within a page where a row can be found (this is 5 bytes for partitioned tablespaces). Rids are used internally in DB2 structures such as the DB2 log. -------------------------------------------------------------------------------- RENAMING A TABLE IN DB2 The ANSI SQL3 (SQL99) standard does not include the capability to rename a table, but IBM's DB2 database does provide this capability. The syntax is Rename Table source_table to new_table; Rename Table Cust to Customer; The source_name can include the schema name while the target_name does not. This is a simple and quick way to rename a table in the DB2 database. (A previous tip discussed the capability of SQL Server to rename a Table using the sp_rename stored procedure.) -------------------------------------------------------------------------------- RENAMING A TABLE The ANSI SQL3 standard does not provide syntax to rename a table. Some databases, such as Sybase's Adaptive Server and Microsoft's SQL Server, have proprietary features that perform table renames. The feature they both use is the stored procedure sp_rename. Here's an example of a table rename using this procedure: exec sp_rename Product, Item In other databases you may need to use an option such as create table...as select....from table... and then drop the original table. In some cases, a view, synonym, or alias may also solve your problem. -------------------------------------------------------------------------------- RENAMING A TABLE In Microsoft SQL Server, you have the option of renaming a table with a simple SQL statement. This is performed using the supplied stored procedure sp_rename, as shown here: exec sp_rename Customer, Customers This can also be easily done in SQL Servers Enterprise Manager. Most other databases do not provide this capability since it is not an ANSI standard--in which case either synonyms, views, or aliases must be used to simulate the rename. To do the rename properly, the table would need to be created with the new name, and the data would then have to be moved therein. -------------------------------------------------------------------------------- RENAMING A COLUMN You may be surprised to learn that, in general, DBMSs do not allow us to rename columns. The column rename is not included in the SQL3 standard and for this reason is not included in most major DBMSs. It should be noted, however, that this is an easy limitation to work around. To rename a column, a new column (with the new name) must be added to the table, the data copied from the old column to the new column, and the old column then dropped. Another approach is to create a view on top of the table and deal with the column rename by dropping and re-creating the view with the new column name. -------------------------------------------------------------------------------- REMOVING FOREIGN KEY CONSTRAINTS To truncate or delete many rows from a table, without cascading the deletes, foreign keys need to be removed. To do this, simply alter the foreign key table(s) and disable the constraints (if your DBMS provides this feature). If it does not provide the alter...disable constraint feature, you will need to drop the referential constraint. Check that the table(s) with the foreign key constraints are empty or at the very least are not affected by the operation you are about to perform. After doing this, truncate the other table and then re-enable the constraints. I prefer disabling the constraints over dropping them because the constraint information will be retained in the database catalog and not lost. These features may not all be implemented in your DBMS of choice, but similar approaches should be used. When doing these types of operations that affect a large part of the database, you should consider backing up the database--or at least the portion of the database being affected--before beginning. -------------------------------------------------------------------------------- REMOVING DUPLICATES In the past, I've discussed an approach to removing duplicate rows from a table using subqueries and deleting rows using a rowid. Today, I'll discuss another approach that uses a series of queries to get rid of duplicate rows. Let's assume that we have duplicates in the Customer table. We could perform the steps shown below: Create a new temporary table Create Table Customer2 (customer_id integer, customer_name char(50)); Insert unique rows into the new temporary table Insert into Customer2 Select Distinct customer_id, customer_name from Customer; Remove the data from the original table Delete from Customer; Move the data from the temporary table to Customer Insert into Customer select * from Customer2; Get rid of the temporary table Drop table Customer2; The above solution is a very rudimentary one that does not take into consideration referential integrity. It is, however, simple and works well in certain conditions--including development and test environments where you're constantly looking for quick ways of doing things. -------------------------------------------------------------------------------- RECURSIVE SQL IN DB2 AND SUMMARIZED EXPLOSION In our previous tip, we looked at an example of a bill-of-materials implementation using common table expressions. In the example, we showed a single-level explosion, where a part and subpart were shown only once. In today's tip, I will again borrow from an example in IBM's SQL Reference "Appendix M. Recursion Example": We will look at a summarized explosion where we want to know how many of each subpart are required to build the highest-level part, which is part '01'. To do this, we need to multiply the quantity per parent in the subquery of the RPL (recursive partlist) common table expression. We then sum the parts in the second Select statement to get the total of a subpart for the given part. The table, common table expression, and final Select statement are shown below: The table is defined as follows: CREATE TABLE PARTLIST (PART VARCHAR(8), SUBPART VARCHAR(8), QUANTITY INTEGER); The common table expression is: WITH RPL (PART, SUBPART, QUANTITY) AS ( SELECT ROOT.PART, ROOT.SUBPART, ROOT.QUANTITY FROM PARTLIST ROOT WHERE ROOT.PART = '01' UNION ALL SELECT PARENT.PART, CHILD.SUBPART, PARENT.QUANTITY*CHILD.QUANTITY FROM RPL PARENT, PARTLIST CHILD WHERE PARENT.SUBPART = CHILD.PART ) SELECT PART, SUBPART, sum(QUANTITY) FROM RPL GROUP BY PART, SUBPART ORDER BY PART, SUBPART; -------------------------------------------------------------------------------- RECURSIVE SQL IN DB2 AND SINGLE-LEVEL EXPLOSION DB2 supports recursive common table expressions that can be used to implement a classic bill-of-materials parts explosion. I like using DB2 examples because they provide some insight into the ANSI SQL standards. To illustrate this, I will borrow an example from IBM's SQL Reference "Appendix M. Recursion Example." This uses the example of a Partlist table shown below: CREATE TABLE PARTLIST (PART VARCHAR(8), SUBPART VARCHAR(8), QUANTITY INTEGER); To explode this table into the parts and subparts, the following query will be used: WITH RPL (PART, SUBPART, QUANTITY) AS ( SELECT ROOT.PART, ROOT.SUBPART, ROOT.QUANTITY FROM PARTLIST ROOT WHERE ROOT.PART = '01' UNION ALL SELECT CHILD.PART, CHILD.SUBPART, CHILD.QUANTITY FROM RPL PARENT, PARTLIST CHILD WHERE PARENT.SUBPART = CHILD.PART ) SELECT DISTINCT PART, SUBPART, QUANTITY FROM RPL ORDER BY PART, SUBPART, QUANTITY; The above query uses a common table expression, called RPL (recursive partlist) in this case. The first part of the query retrieves the direct children of Part '01'. (In effect, when performing a recursive query, you need to start somewhere, and this is where we will begin this one.) The remainder of the query in the subselect joins the subselect rows to those already selected by referring to the query name RPL and using that as the parent that is then joined to the child. In other words, once a row is returned in RPL, it is used as a parent and then used to join to the child row. This recursive query can then be used in a Select statement by its name, RPL, as shown by the second SQL statement above. This type of statement is very difficult to understand, and may need to be studied and used to fully appreciate how it is working. It is certainly not one of SQL's more intuitive statements. The Select Distinct in the second query ensures that the same part and subpart is listed only once. -------------------------------------------------------------------------------- RANKING VALUES IN A LIST Let's say you have a list of values that you want to rank from 1:n in ascending order. What's the best way to do this? Let's look at a modified example from Joe Celko. This example takes data from this Sales table: SALES sales_rep territory total_sales Jones 1 345 Smith 1 345 Lee 1 200 Simpson 1 990 Given the above list, we want to achieve the following result: sales_rep territory total_sales rank Simpson 1 990 1 Jones 1 345 2 Smith 1 345 2 Lee 1 200 3 The SQL we can use to perform this is Select s1.sales_rep, s1.territory, s1.total_sales, (Select count (Distinct total_sales) From Sales as S2 Where s2.total_sales >= s1.total_sales AND s2.territory = s1.territory) as rank From Sales as s1; Note the use of a Select statement embedded in the query to generate the Rank column. The source for this tip is Joe Celko's excellent book, "SQL for Smarties: Advanced SQL Programming," 2nd ed., Morgan Kaufmann Publishers, San Francisco, California, 2000. -------------------------------------------------------------------------------- QUOTES INSIDE STRINGS It can be tricky dealing with quotes that are embedded within strings. For example, how would you code "Dave's here" in a SQL statement? Different DBMSs have their own solutions using double quotes, multiple quotes, and escape clauses. The ANSI standard is to use single quotes and a double quote for the "real" quote in the string. For example, the SQL statement with embedded quotes for the above string could be written as follows: Select column_name From Table_Name Where column_name = 'Dave''s here'; -------------------------------------------------------------------------------- QUERY TOO COMPLICATED? CONSIDER VIEWS In our previous tip, we looked at using temporary tables to simplify complicated queries. Another approach is to create views to resolve one part of the query and to use that view within the query. In some extreme cases, you can also create views on views. Some things to consider when doing this are: You can join tables and views within each view and can also apply functions in them. This approach can simplify complex queries. Only selects can be performed through a join view. There are some situations where nested functions (e.g. Select function1(function2(...))) cause problems that can be resolved by performing one function in a view to create a virtual column and having the second function created on the view column. The entire query, including the views query, will be resolved as one by the optimizer. The above can be a useful approach in many cases, but be aware that view maintenance in the database can become an issue if this approach is overused. Also, keep in mind that this approach isn't necessary in many cases. It's easier to understand a query if all of the logic exists in a query (as opposed to being hidden in views). -------------------------------------------------------------------------------- QUERY TOO COMPLICATED? CONSIDER TEMPORARY TABLES You may run across situations where you have to figure out a way to write a very complicated query. In some cases, resolving part of the query and writing data to a temporary table (or tables) and then joining that table to other tables will solve your problem. When using this approach, keep the following points in mind: Overhead is involved with creating and storing temporary data. This isn't a good approach for high transaction volume OLTP (online transaction processing) systems. This is useful for one-time SQL or for background tasks that aren't frequently executed. When creating temporary tables, use the Temporary option to reduce the overhead incurred when creating normal (non-temporary) tables. This is a very simple solution that is perhaps not the most efficient but that can simplify logic for infrequently run queries. This is a useful approach to use when multiple, different groupings of data need to be applied. -------------------------------------------------------------------------------- QUALIFYING TABLES IN SQL SERVER A fully qualified table name in SQL Server is a three-part name that includes dbname + owner + table_name Some examples are Test_db.user1.customer Test_db.dbo.customer Dev_db.dbo.customer If you log into Test_db as user1 and perform the SQL statement Select * from Customer; you will access the User1.Customer table. If, however, you log in as user2 and perform the same SQL statement, you will access the dbo.customer_table (if you had been granted access to it). Therefore, an approach that allows SQL Server to access tables in a more portable manner would include implementing the tables with an owner ID like dbo thereby having your program pick up this table by default. -------------------------------------------------------------------------------- PUBLIC SYNONYMS IN ORACLE Tables in Oracle are qualified by the Oracle system ID (SID) and the table owner. Consider the following tables in the 'Prod' Oracle SID: Owner1.customer Test1.customer If you are logged in as Owner1 and perform the SQL statement Select * from Customer; you will retrieve rows from table Owner1.Customer. Similarly, if you do the same with user Test1, you will retrieve rows from Test1.Customer. If, however, you log in as Userx and perform the same SQL statement, you will not access any tables unless Userx has been granted access and a public synonym or private synonym, or a view has been created on the table. The advantage of creating a public synonym is that any user that is granted access to a table (regardless of the owner ID) can see the table. For example, in the following situation: Create Public Synonym Customers For table Owner1.Customers; Grant Select on Owner1.Customers to Public; any user selecting data from Customer will be able to access this table. -------------------------------------------------------------------------------- PROS AND CONS OF GENERATING RANDOM UNIQUE KEYS In our previous tip, you saw how to generate random unique keys. In today's tip, we'll look into the pros and cons of using this approach. The benefits of random unique keys are: Data can be distributed relatively evenly across a table. Data contention and locking is reduced. The performance of a function that accesses pseudo-columns in memory can be better than one that must access data. These functions are relatively simple to write. Disadvantages of random unique keys are: The key size can become quite large. The algorithm must be a good one and must be well tested, or it may not perform well or may create system problems. These keys don't guarantee uniqueness. All programs need access to the algorithm. Use randomly generated keys when the table using this key is large, the transaction volumes are high, there are many concurrent users, the database doesn't support an efficient key sequence, and the database supports a simple and efficient pseudo-column with which to generate the new key. -------------------------------------------------------------------------------- PROCESSING A SQL STATEMENT When processing a SQL statement, the database engine goes through several steps. Consider the following statement: Select * from Customers; To process this statement, the DBMS executes the following steps: 1) Parse the statement, breaking it into its various parts and checking syntax. 2) Validate the statement and ensure that all objects exist and are valid. 3) Optimize the statement and determine the best access path. 4) Generate an application plan. This is in effect the code that executes the query. 5) Execute the application plan. As you can see, that simple SQL statement you're running is going through a lot of processing before any data is ever accessed. -------------------------------------------------------------------------------- PROBLEMS WITH TRIGGERS Database triggers are a very handy feature that allow us to easily perform activities in a manner that is transparent to the application system. Along with other things, they allow us to Implement referential integrity constraints that are not supported by native DBMS declarative R.I. Provide logging or audit capabilities for a table Implement edit rules before changing data Triggers are applied at the table level and cannot be bypassed by application logic. As you can see, they have tremendous advantages. However, some potential problems can occur as a result of triggers. The most serious is when trigger code is not activated because the trigger has become inactive. In other words, a trigger may become disabled or may be inadvertently dropped. These problems are insidious because the system may continue to operate without the trigger code being executed. It is important, therefore, to ensure through regular system checks that your triggers are all valid and active. -------------------------------------------------------------------------------- PROBLEMS WITH CONCATENATED KEYS It's possible to implement your database with surrogate keys and for the keys to include more than one column. An example of this is a situation where we have a Party table with a primary key of Party_Id (which is a meaningless number), or a surrogate key. The Customer table is related to Party. Customer has a Customer_Id column and Party_Id as the foreign key. We could implement the Customer primary key to be Customer_Id alone, or we could choose to make the key a concatenated key of Customer_Id and Party_Id, according to the relationship to Party. Let's assume we've chosen the latter. This means that both columns will be needed to make a row unique. There is no single identifier to the Customer table. The effect of this becomes more profound as we go down one level. Let's assume that Customer has a one:many relationship to the Customer_Address table. The foreign key to Customer_Address does not include Customer_Id and Party_Id; these will be concatenated to Customer_Address_Id to become the new primary key for Customer_Address. Some negative impacts of this include: The Customer_Address key includes three columns, thereby taking more space. Uniquely identifying a row on Customer_Address requires three columns. Relationships from Customer_Address will carry a three-column overhead to the next table and result in a four-column key. Index maintenance will become more complex and will incur greater overhead. The SQL required to join tables can become more complex since more columns are required. Despite the above problems, in some situations concatenated keys are a benefit. We'll see why next time. -------------------------------------------------------------------------------- PROBLEMS DISPLAYING DATA Have you run across situations where you're querying data and getting no results--even though you're certain that there is data available? To make sure that you aren't going crazy (naturally, that's always one possibility), you can perform a series of troubleshooting steps similar to those I recently had to perform with a SQL statement similar to the following: Select column_name from table where table_id = 12345; This returned 1 row and APPEARED to return a blank column_name. To make sure that this was what I had received, I then ran the following query: Select count(*) From table_name Where table_id = 12345 and column_name = ' '; This returned no rows, so I then assumed that the column_name must be null. I tested this hypothesis with the following: Select count(*) From table_name Where table_id = 12345 and column_name is null; Again I got no rows. The column is neither blank nor is it null. I had to make sure that this was true so I ran this statement: Select count(*) From table_name Where table_id = 12345 and column_name is not null; Of course, this returned a single row. To further analyze the row, I issued a statement to see how long the column column_name was. (I wanted to make sure that there weren't multiple characters that were causing me problems.) Select length(colname) From table_name Where table_id = 12345; This returned a single value of 1. To finish this off, I finally chose to look at the ASCII value of the column and ran the following SQL: Select ascii(colname) From table_name Where table_id = 12345; In my case, it returned a value of 0, which is not a null value in my database but may have been in a tool or system that was accessing it. Of course, these steps weren't all necessary, but when you're getting results that you can't believe, you sometimes need to circle in on the problem. The SQL features outlined above may help you out of a similar jam. -------------------------------------------------------------------------------- PRIMARY KEY OF THE CALENDAR TABLE There are two options for the Calendar table primary key. The first is that Calendar_Id will be a surrogate (meaningless) key with a numeric value. This has the advantage of taking up less space when implemented as a foreign key on large fact (transactional) tables. The problem with this approach is that a join needs to be performed to the Calendar table every time a date is needed. The second approach is to implement the key as a Date column so that Calendar_Id is the actual value of the date. This may take more space on a large fact, but it may also mean that (since the date already exists on the table) a join does not need to be performed unless special information is needed. I've seen the second option successfully implemented. If space is not an issue for you, use the second option. Otherwise, look into the first, but weigh the space issue against the cost of additional joins. -------------------------------------------------------------------------------- PREPARING A TABLE TO LOAD When loading a large table, you need to prepare for the load. You want to make sure that the load is being performed in the most efficient way possible. This means performing a Direct Load in Oracle, using Log No in DB2, and fast bcp with SQL Server/Adaptive Server (exec sp_dboption 'select into/bulkcopy', TRUE). Indexes on the table and triggers may further affect your decision. You'll want to disable triggers to speed performance, since in some cases they will either not be executed or may slow performance. Indexes may also slow the load, so you may want to drop the indexes and re-create them using a parallel index rebuild after the table has been loaded. This is not an exact science, and you should test various approaches before settling on one. I have found cases where it was faster to keep indexes on the table and run the "slower" (non-direct, non-fast) load than it was to run the fast load and be faced with re-creating the indexes. Primary factors in this decision include the amount of memory available, the number of CPUs on the system, and the physical disk drives that affected the amount of parallelism that could be achieved in data loads and index creation. Test your approach and remember that one solution will not suit all situations. -------------------------------------------------------------------------------- PERFORMING NAME SEARCHES--PART 1 OF 3: THE SOUNDEX ALGORITHM You'll come across many situations where you'd like to perform name searches but don't know the exact spelling. There's a way to do this in SQL using a Soundex algorithm. This method takes a person's name as input and produces a resulting value that can be matched with other names. The Soundex algorithm isn't found in ANSI-SQL but is implemented in some databases. Variations on this algorithm are available in Oracle and SQL Anywhere, among others. Interestingly, the basic algorithm was developed before computers, by O'Dell and Russell in 1918. In our next tip, you'll see the original Soundex algorithm--which you can use to develop your own function in case your favorite DBMS does not have one. -------------------------------------------------------------------------------- PERFORMING NAME SEARCHES--PART 2 OF 3: THE ORIGINAL SOUNDEX ALGORITHM As you saw in our previous tip, you can pass a name to the Soundex function and have a resulting value returned that allows phonetically similar names to be compared. The original algorithm (O'Dell and Russell, 1918) includes steps similar to those shown below. Let's use the name Quinlan as an example: 1) Capitalize all letters in the name. = QUINLAN 2) Keep the first letter. = QUINLAN 3) Drop all of the following letters after the first letter: A,E,H,I,O,U,W,Y = QNLN 4) Change all letters from the following sets into numbers: 1 = B,F,P,V 2 = C,G,J,K,Q,S,X,Z 3 = D,T 4 = L 5 = M,N 6 = R = Q545 5) Get rid of all consecutive pairs of duplicate digits from the string that resulted from step 4. = Q545 6) Pad the string from step 5 with trailing zeros and return only the first four positions. = Q545 You can then compare the value Q545 to all other names that have the value Q545. The source for this tip is Joe Celko's excellent book, "SQL for Smarties: Advanced SQL Programming," 2nd ed., Morgan Kaufmann Publishers, San Francisco, California, 2000. -------------------------------------------------------------------------------- PERFORMING NAME SEARCHES--PART 3 OF 3: AN IMPROVED SOUNDEX ALGORITHM In our previous tip, you saw the original Soundex tip. Today, let's look at an improved one from Joe Celko's excellent book, "SQL for Smarties: Advanced SQL Programming." Use the name Knudsen as an example and perform the following steps: 1) Capitalize all letters in the word. = KNUDSEN 2) Replace all non-leading vowels with A. = KNADSAN 3) Change the prefixes as shown below: FROM--TO MAC--MCC KN--NN K--C PF--FF SCH--SSS PF--FF = NNADSAN 4) Change letters after the first as shown below: FROM--TO DG--GG CAAN--TAAN D--T NST--NSS AV--AF Q--G Z--S M--N KN--NN K--C = NNATSAN Replace H with A unless A comes before and after H. Change AW to A. Change PF to FF. Change SCH to SSS. = NNATSAN 5) Remove all trailing A and S characters and pad with blanks. Replace trailing NT with TT. Remove all A characters except for a leading A. Pad with trailing blanks. = NNTSN Remove all but the first of repeating character substrings: = NTSN 6) Take the first four characters of the result string. = NTNS You can use this to help develop your own Soundex algorithm in case your database doesn't provide one. Many such algorithms are available, and if you do need to create one, I encourage you to research and test out the different approaches. -------------------------------------------------------------------------------- OPERATOR ORDER OF PRECEDENCE When you're analyzing SQL, it is important to understand the order of precedence of both Boolean and mathematical operations. For example, the statement 3 + 2 * 2 results in 7 (rather than 10) since multiplication is performed before addition. The order of precedence in SQL is as follows: 1) Operations contained in parenthesis: () 2) Multiplication, Division operators: *,/,% 3) Other mathematical operators: +,- 4) Boolean operators: AND,NOT,OR Even when they're not strictly necessary, it's always a good idea to use parentheses to make your operations easier to understand--and to ensure that they do what you want them to. For example, I would code the statement 3 + 2 · 2 as follows: (3 + (2 * 2)) This is simple math but becomes increasingly important as your SQL statements increase in complexity. -------------------------------------------------------------------------------- ONE-TO-ONE RELATIONSHIPS--WHAT YOU DON'T WANT When implementing a 1:1 relationship between two tables, you don't want to implement a foreign key on each table along with the primary key. In the case of Party and Customer with a 1:1 relationship and Party_Id and Customer_Id are the primary keys on each respective table, you don't want to add a Customer_Id foreign key to Party AND a Party_Id foreign key to Customer. This adds unnecessary system overhead and makes programming difficult. I've seen this implemented by case tools where the foreign key was added to one table on one forward-engineer of the tables and the opposite foreign key was added in a later forward-engineer of the same tables. When doing this work, know where you want the foreign key to be implemented and check the results after forward-engineering your database from the logical to the physical models. -------------------------------------------------------------------------------- ONE-TO-ONE RELATIONSHIPS--AN EXAMPLE When implementing a one-to-one (1:1) relationship between two tables, you're forced into one of four implementation options (unless you want to get really creative and add more). Taking the example of a 1:1 relationship between Party and Customer, where one Party may be 0 or one Customer and one Customer must be a Party, the options are as follows: Add a Party_Id to the Customer table as a foreign key to Party and add a unique index to this column to ensure that you don't get duplicates. This makes sense since Customer must be associated to a Party but a Party doesn't need to be a Customer. Add Customer_Id to the Party table as a foreign key to Party and add a unique index on this column to ensure that you don't get duplicates. This does not make as much sense since Party doesn't need a Customer but Customer needs Party. Some DBMSs will allow the unique index on the null entries and will allow multiple null values while others may not. Oracle will allow this if the entire foreign key is null since it won't index null values; however, if one column in the foreign key is not null, in the case of a multicolumn key, you'll be allowed only a single null value. I wouldn't put the foreign key on Party unless the Party must have a Customer. Make the Party_ID the primary key on Customer. The relationship and primary keys on both tables will enforce the integrity of these columns. This is the cleanest implementation with the least amount of overhead. One consideration to this is that your naming standards for foreign key and primary key columns may need to be compromised. Convince your data modeler that this is a worthwhile compromise. Create a new table to resolve the 1:1 and call it Party_Customer. This will consist of two columns--Party_Id and Customer_Id. Each column will have a unique index on it and will be a foreign key to the Party and Customer tables, respectively. You may need to do this if the relationships are optional, you aren't able to create unique indexes on the null entries, and you're uncomfortable implementing integrity with triggers. This should be used as a last resort since the database overhead is relatively high and the programming required to maintain data integrity is onerous. -------------------------------------------------------------------------------- NAMED SAVEPOINTS Within a transaction, you can mark certain places in your code to be a Savepoint. You can create Savepoints with a Name and use them to indicate a rollback position. This feature therefore allows us to roll back a part of a transaction without losing all of our work. These Savepoints are given a Name, which is used in the Rollback statements to distinguish the appropriate rollback position. This provides the flexibility to begin a transaction and rollback to different Savepoints. For example, given this following pseudo-code: Begin Transaction application code part 1 Savepoint save1 application code part 2 End Transaction you now have the option of committing the entire transaction, rolling back the entire transaction, or rolling back part of the transaction to Savepoint save1. The named Savepoint can also be released. To do this, you can issue the statement Release Savepoint savepoint_name; Of course, once you do, the program won't be able to roll back to that Savepoint. -------------------------------------------------------------------------------- MULTIPLE INDEXES VERSUS SINGLE MULTICOLUMN INDEXES In some situations, you'll want many columns on a table to be indexed. How do you decide whether to build a single multicolumn index or to create many single-column indexes? When faced with this decision, keep in mind the following: Does your DBMS support multi-index joins (index merge)? In other words, can a single table be accessed by merging more than one index in the same query? If so, then you have many options to consider when building your indexes. If not, then you may need to build one or more multicolumn indexes to satisfy the access paths of your queries. Does the access to the data vary significantly between queries? If not, place the columns that are always used at the beginning of the index, and favor a single index with multiple columns. If the answer to the above question was yes and if your DBMS supports index merging, consider building multiple single-column indexes on the table. Once you've gone through the above analysis, build your tables and indexes and test your queries against your sample table/indexes to ensure that your analysis and assumptions were correct. -------------------------------------------------------------------------------- MORE ROBUST WAYS TO COPY DATA In the past several tips, we've discussed ways to copy data from one table to another using SQL, replication, and triggers. These approaches accommodate a lot of your needs, but your options do not end here. There are situations where those approaches will not suit your needs because of performance requirements, networking issues, or scheduling requirements. In these situations, the available database options include database snapshots, as well as unload and load utilities. Snapshots are a form of replication where an entire table is copied. The snapshot can be scheduled and is a powerful technique that can be used in a production environment to move data to alternate locations. Database utilities such as unload and load, export and import, or BCP (Adaptive Server and SQL Server) offer quick ways to unload and load data from a database. These can be very efficient since the unload and load utilities can use very efficient code paths that do not log data and use internal routines that do not incur the overhead and memory structures allocated to traditional SQL. Database loads are the quickest way to put a large volume of data into a database. -------------------------------------------------------------------------------- MORE ON NAME SEARCHES: FUNCTION INDEXES Searches can be simplified and improved in Oracle's DBMS using a feature called function-based indexes. This feature allows us to perform a function on a column and have that value stored in the index to facilitate searches. An example of this is the index created below: Create Index Customer_Last_Name_Idx on Customer (Upper(last_Name)); This allows us to perform uppercase searches on columns that are mixed-case. To do this in Oracle requires proper system privileges of 'Query Rewrite' and the system parameter of 'Query_Rewrite_Enabled' being set. -------------------------------------------------------------------------------- MORE ON GENERATING KEYS: USING TIMESTAMPS Many systems try to randomize inserts using algorithms that force subsequent inserts to be performed to different locations in a table. By location, I mean any or all of the following: table partition, file, disk volume, server, database, instance. One very simple approach is to use the last digits of a datatime or timestamp column, reversed and combined with information from a userid, process id, and/or session id to provide uniqueness. This can work well if your timestamp data goes to the millisecond level. You should note that with this approach it is still possible to get duplicate values, so you should code your system accordingly. Because of the strong implementation of identity columns in DB2 and SQL Server and sequences in Oracle, you need only use this approach in extreme cases where sequential inserts will cause a system hot-spot and will fail to perform. -------------------------------------------------------------------------------- MORE INFORMATION WITH GET DIAGNOSTICS When building error checking into your programs, consider using Diagnostics. They can give you more detailed information on the type of error that was encountered. There are two forms of Diagnostics. One supplies information about the overall processing of the immediately processed SQL statement. The second gives more specific information about the errors. The syntax for Get Diagnostics is as follows: GET Diagnostics {<statement information> | <condition information>} Examples of this are GET Diagnostics :smallint_var = TRANSACTIONS_COMMITTED; GET Diagnostics EXCEPTION 1 :char_var = TRIGGER_NAME; This can be used in both Embedded and Dynamic SQL. There are many more conditions that can be checked, so consult your DBMS documentation to see what is supported in your case. -------------------------------------------------------------------------------- MIXING DDL AND DML IN TRANSACTIONS In cases where you must mix Data Definition Language (DDL) statements such as Create Table with Data Manipulation Language (DML) statements such as Insert, Update, and Delete, you must be careful where you place the DDL statements. Keep in mind that the DDL statement will perform an implicit commit and that this may occur in a place you didn't expect. Take a look the following series of steps that are performed (in this order) in a single transaction: Begin Transaction Insert into ... Update ... Create Table table_name ... Update ... Rollback Transaction End Transaction Given the above steps, it appears as though the Rollback Transaction will undo all work back to the Begin Transaction--including the Insert and two Update statements. This, however, is not what will happen. The DDL statement Create Table table_name... will perform an implicit commit, which will commit the Insert and Update statement before it. The Rollback statement will undo only the last Update statement. Your best bet is to avoid mixing DDL and DML in a single transaction. -------------------------------------------------------------------------------- MEDIANS IN SQL A common function used to determine trends in SQL is to determine the average of a set of values, as shown by: Select avg(number_column1) from table1; A better way to see a normal tendency is to find the median of a set of values. There are two categories of median. The statistical median, which has as many values above as below the median, and the financial median, where all values are split into two groups and you take the average of the lowest value from the highest group and the highest value from the lowest group. Confused yet? Take a look at an example of a list of four salaries: Person_Id Salary 1 $1,000,000 2 $ 100,000 3 $ 60,000 4 $ 40,000 The average salary from the list above is $1.2M/4 = $300,000--which is misleading since there's a single high value of $1 million. The median can be determined by splitting the above list into two groups and taking the lowest value of the top group ($100,000) and the highest value of the bottom group ($60,000) and taking the average of these two--which is $80,000. This is a better picture of what "normal" salaries are, given this list. In our next tip, you'll see an example of this. The source for this tip is Joe Celko's excellent book, "SQL for Smarties: Advanced SQL Programming," 2nd ed., Morgan Kaufmann Publishers, San Francisco, California, 2000. -------------------------------------------------------------------------------- MATERIALIZED VIEWS Oracle has a new proprietary feature that can be used largely to support data warehousing: Materialized Views. A Materialized View is similar to a snapshot in that data from a table or multiple tables can be queried, joined together, and potentially aggregated. The data is stored persistently in a separate table (called a Material View), and this data can be refreshed at predefined intervals (as defined by the Create Materialized View statement). The major benefit is that a query written against the original tables (the tables that are the source for the Materialized View) can be "rewritten" by the Database Optimizer at runtime--if it determines that the query can be satisfied by the Materialized View and if it will be more efficient than executing against the original tables. An example: Create Materialized View Sales_Cube Refresh Start with sysdate next sysdate + 1 as Select decode(grouping(region),1,'TOTAL COMPANY') REGION, decode(grouping(territory),1, 'Total Region') TERRITORY, sum(sales_dollars) TOTAL_SALES From sales Group By Cube (region, territory); The Refresh Start clause indicates the initial creation of the Materialized View, and the Next parameter states the subsequent refresh intervals. The Enable Query Rewrite clause must be used to create or alter the view, and Grant Rewrite (or Grant Global Rewrite) privilege must be applied. This is a proprietary Oracle feature, so you may need to look in your DBMS of choice for a similar option. Materialized Views can be a great benefit for data warehousing queries. -------------------------------------------------------------------------------- MANAGING PASSWORDS IN ORACLE Last time we looked at the sp_configure stored procedure, used to force users to change their passwords on a regular basis. Oracle provides a similar facility through profiles. A user profile can be created and attached to a user. For example: Create Profile New_Password_Profile Limit password_life_time 60 days; Create User test_user identified by new_password profile New_Password_Profile; The above will create a new password that forces users to change their password every 60 days. The second SQL statement then creates a new user and assigns the profile to that user. Once the password has expired, users will need to change their password using the following statement: alter user user_name identified by new_password; You can simplify this procedure for the user by providing an online transaction that performs this statement. -------------------------------------------------------------------------------- MAKING YOUR SQL PORTABLE When you're writing SQL in programs, it's important to write the SQL so that you don't need to change your code when migrating your system between environments. In other words, the move of a program from development to test to production, or even between different development environments, should not require you to change the program. Here are some things you can do to increase portability: Remove the table (or view) owner from all SQL statements. The proper table/view should be determined by the userid that was used to bind or compile the program. Set up the DB Connection ID to be entered at login time, or put it into a place that requires only a single change to the system as a program is migrated. Do not hard-code passwords into programs. Passwords should be entered by users. When designing your programs, think about what must be done to move this program to your test and production environments and how you can do this without additional code changes. -------------------------------------------------------------------------------- JOINING A TABLE TO ITSELF There are situations where a table is joined to itself rather than to another table. An example of this may be joining Employees to other Employees who are their Managers so we get both the Employees and the Managers names in a list. This could be coded as follows: Select mgr.empl_id mgr_id, mgr.name mgr_name, empl.empl_id, empl.name empl_name From Employees Mgr, Employees Empl Where Empl.manager_id = mgr.empl_id; Notice that we are joining Employees to itself by joining the Employees manager_id column (which is a foreign key) to the Employees empl_id column (the primary key). To do this requires that we create aliases, as shown in the From clause, where we use 'mgr' and 'empl' to alias the single Employees table into two logical tables. -------------------------------------------------------------------------------- JAVA SQLJ VENDOR-SPECIFIC CUSTOMIZATIONS In our previous tip, we discussed how DBMS vendors are adopting and implementing portable Java code with SQLJ. We discussed how we can use some standard, common features of multiple databases to increase portability across those databases with our application Java and SQLJ code. Proprietary features can also be included, using a feature called SQLJ Profiles. A Java binary can contain profiles that describe a vendor's customizations. You can also implement multiple profiles for multiple vendors. Using this feature, a binary that uses proprietary vendor-specific customizations can be created to run in multiple environments. This leads to increased portability across environments. -------------------------------------------------------------------------------- JAVA PORTABILITY WITH SQLJ The SQLJ standard for implementing embedded SQL in Java programs has been jointly authored and implemented by many leading DBMS vendors. This list of vendors includes IBM, Informix, Oracle, Sybase, and others. The standard allows programs to be vendor-independent, as long as the SQL-92 entry-level standard is used. This allows Java source code and binaries to be moved between platforms. As always, there are many exceptions to these rules, so always test the portability of your code before automatically assuming that this will work. This option does, however, provide database developers with a great alternative for developing applications over previously used proprietary database access languages. -------------------------------------------------------------------------------- INVERTED LIST DATABASE You may have run across the phrase "inverted-list database" and asked yourself "What is that?" You've heard of hierarchical, network, and relational databases, but what is an inverted list? An inverted-list database can be thought of as a relational database in which, in effect, each column becomes a file. This means that each column is indexed. It provides the advantages of very fast access, but this is balanced by slow updates due to the large number of indexes. These databases were very popular in the mid-1980s, when people were looking for a new alternative just before relational technology proved itself. These databases can support SQL, but they are not relational in structure. Model 204 and ADABAS are two examples of inverted-list databases. -------------------------------------------------------------------------------- INSERTS RUNNING SLOW If you are entering new rows in a table with SQL inserts and they are running slowly--and seem to be getting slower as you insert more rows--there are a few simple things that you can do to improve the situation: Consider reorganizing the indexes so there is more free space and less fragmentation. If this does not alleviate the problem, reorganize the table to remove row chaining and table reorganization. As a last step, drop all indexes that are not required for the insert operations. (Examples include non-primary key indexes and non-unique indexes.) Once the inserts have completed, re-create the indexes that were dropped. Index maintenance for inserts and deletes (and updates that affect the indexed columns) is critical. And when insert operations are running slowly, this is a good place to start. -------------------------------------------------------------------------------- INDEXING TO SUPPORT STAR SCHEMA Star schema designs are implemented in large data warehouses and data marts and have the profile of a large fact table (a transactional table) with millions--or billions--of rows, surrounded by several dimension, or lookup, tables that are all smaller and that are directly related to the fact table. In other words, the dimension tables all have corresponding keys to the fact table. Since the dimension tables are small relative to the fact table, the cardinality (or degree of uniqueness) of the dimension table foreign keys on the fact table is quite low. A reasonable indexing approach for this type of design is to implement foreign keys indexes on the fact table using bitmap indexes. These will be small, they can use compression, and they're very efficient. -------------------------------------------------------------------------------- INDEXING FOREIGN KEYS Many database designers implement indexes on foreign keys unnecessarily. When creating indexes for foreign keys, it's important to know that foreign key columns don't need to be indexed. These indexes are usually implemented to improve performance, but if they won't be used, they don't need to be created (unlike primary key or unique key indexes, which require indexes). In some cases, already implemented indexes can be used or altered to help improve performance of foreign key access. The index can include columns that make up the foreign key as well as other non-foreign key columns. -------------------------------------------------------------------------------- IDENTITY IN SQL SERVER A primary key or unique key can very simply be created in SQL Server using the Identity property for a table. This works with all numeric datatypes and allows you to set the key's initial value and incrementing value. Only one identity is allowed per table, and it must be Not Null. This will ensure that unique keys always exist for a table. Here is an example: Create Table Table1 ( Table1_Id Int Identity (125,2) not null, Table1_Desc char(50)); The Table1_Id column above is of datatype Int (Integer). The first row inserted will have a value of 125 and the incrementing will be by two. -------------------------------------------------------------------------------- HOLDING A CURSOR Two new extensions have been added to cursors with SQL-3: Hold and Sensitive. Hold may have already existed in your DBMS, and in fact, DB2 has supported this for some time now. A cursor that uses the With Hold option can be kept open after a commit operation has been performed, and the position of the cursor can be held so that the fetches can continue. Without the Hold option, you may have had to reopen and reposition the cursor to the point of the last fetch. This provides performance advantages by allowing simple commits and avoiding repositioning. It also simplifies coding logic. Here's an example: Exec SQL Declare Customer_Cursor Cursor With Hold For Select * from Customer; Oracle users may feel that this is nothing special since it's already default behavior. (Within Oracle, you can open a cursor once, commit transactions, and then continue to fetch data from the cursor.) However, anyone who has done this and experienced the snapshot too old error condition will realize that this feature is not perfect and that With Hold can be a benefit if implemented properly. -------------------------------------------------------------------------------- HISTOGRAMS DEFINED Many popular database optimizers can use histogram statistics to evaluate the most efficient query optimization. A histogram can be used on a column to specify how data is skewed: The histogram will contain a number of buckets with counts of the number of rows that exist for that bucket; or with column values that exist in a certain section of the table. For example, if there are ten buckets and 100,000 rows in a table, these could each represent 10 percent of the table. The first bucket could state how many rows are in the first 10 percent of the key range, while the second could state the number of rows in the second 10 percent of the key range, and so on. Another approach would be to state column_value at the 10 percent mark, then the column value at the 20 percent mark, and so on. Whichever approach you use, the histogram is a very useful tool for determining "data skewing" and should be considered for indexed columns that do not have an even distribution. Look for this feature in your database and remember that the greater the number of buckets in your histogram, the greater the degree of accuracy of your database optimizer. -------------------------------------------------------------------------------- HINTS IN SQL SERVER SQL Server provides the capability of telling the DBMS the type of lock you want when running a query. If no lock hint is specified, the Select statement will get a share lock on each row that's read. To avoid this locking, add the Nolock hint as follows: Select col01, col02,.... From Table1 With (NOLOCK); Some other types of lock Hints are: Updtlock: Puts an update, or U, lock on each row. Tablock: Places a lock on the entire table. Paglock: Places a lock on the data page (i.e., Block). This is not a complete list, so look up the SQL Server lock hint to learn more about locking. -------------------------------------------------------------------------------- HINTS IN ORACLE Oracle allows hints to be added to SQL statements by placing the hint after the Select statement within comments that begin with a plus (+) sign, as follows: Select /*+ hint1, hint2,... */ col01, col02 From table1, table2,... Where.... Here's an example of a hint that requests a join to be performed in the same order as the tables are listed in a FROM clause in a query: Select /*+ ORDERED */ cl01, col02,.... From Table1, Table2, Table3 Where .... This hint will try to influence the optimizer to join Table1 to Table2, and then join the result to Table3. Here's a list of five worthwhile hints: 1) /*+ INDEX (table index1, ...) */ requests a specific index be used 2) /*+ INDEX_FFS (table index) */ requests that a query be satisfied by an index-only scan without accessing the database 3) /* USE_MERGE (table1) */ requests the use of a merge scan join 4) /* USE_NL(table1) */ requests the use of a nested loop join 5) /* PARALLEL (table1, degree) */ requests that a table be accessed in parallel Oracle has many other worthwhile hints, but we mentioned these five to illustrate the types of things you can do with an Oracle hint. -------------------------------------------------------------------------------- HELP WITH NAME SEARCHES Some simple design options can be added to your system to support name searches. A simple approach is to store a separate column on which searches can be performed. (This column can be referred to as search_name.) In other words, the Customer table can have last_name and first_name columns that allow upper- and lowercase characters. To search on last_name, we'd need to perform a query such as this one: Select * from Customer Where Upper(last_name) = 'JONES'; This use of the Upper function may prevent the database from using an index. This can be resolved by creating a separate column with the name stored as uppercase characters. This would allow us to write the above query like this: Select * from Customer Where search_name = 'JONES'; If an index exists on search_name, it can be used efficiently. Search_name could also include values from multiple columns and could remove special characters, such as punctuation and spaces. Therefore, the search on the name 'St. Germain' (which might otherwise be complicated by the '.' and the extra space) could become 'STGERMAIN' in the search_name column. There is overhead to this approach since two columns need to be maintained, but if your system is a query-intensive one, this approach may be worthwhile. -------------------------------------------------------------------------------- HASHING FUNCTIONS A hashing function provides an extremely quick way to get one row of data from a database. This function operates by taking a column value (usually the key) and applying an algorithm to this value. The algorithm produces a result that maps the row of data to a logical (and ultimately, physical) address where the row can be found. This is subtly different from the hash-join that we discussed in a previous tip, but it has the same high-performance capability of mapping a key to a physical address. -------------------------------------------------------------------------------- GROUP BY VERSUS ORDER BY Though the Group By and Order By statements seem similar, it is important to understand the differences between them. With Group By, grouping is done on the tables in the From clause (on the input); Order By is done on the columns in the select list (on the output). This leads to the following: Order By columns need to be in the select list while Group By columns do not. Order By can use an integer and expression while Group By cannot. While these differences are subtle, they may explain incompatibilities you've noticed between the two expressions. -------------------------------------------------------------------------------- GROUP BY GROUPING SETS There is a new SQL3 Standard that allows grouping within groups. This syntax provides the Group By Rollup and Group By Cube features that we've already seen, along with a new feature called Group By Grouping Sets. Group By Grouping Sets can generate many aggregated groups in one pass. For example: Select r.Region, p.product, sum(p.sales_dollars) TOTAL_SALES From Region_Table r, Product_Sales p Where r.region_id = p.region_id Group By Grouping Sets (r.Region,(r.region,p.product)); The above provides a grouping by Region and then, within that, performs a grouping by Region and Product. -------------------------------------------------------------------------------- GETTING THE LAST IDENTITY VALUE USED Many times, you'll want to know the last identity (key) value that was used in an insert. The biggest reason for this is so that the same value can be reused when inserting a foreign key. This is done differently between SQL Server and DB2. In DB2, the identity values can be picked up by the application and reused using and Identity_Val_function(), which returns the most recently assigned value for an identity column. In SQL Server, the last identity value used in an insert can be retrieved with the @@identity function. -------------------------------------------------------------------------------- GENERATING RANDOM UNIQUE KEYS Unique and primary keys can be generated in random order using a function rather than by using ascending keys. An example of this type of function is one that combines two elements, such as a timestamp and userid or process ID information, to try to improve the chances of uniqueness in the generated key. The trick to doing this is to have a timestamp that has a very high precision, such as DB2's, which goes to the microsecond. Reverse the digits to avoid hotspots. When doing this, the value 20001231083045123456 would become 65432154038013210002. You can then concatenate the userid, process id, or some other easily obtained value that would help to avoid key collisions. Believe it or not, the above algorithm doesn't guarantee uniqueness, but it's still a decent algorithm. -------------------------------------------------------------------------------- FULL-TEXT QUERIES WITH CONTAINSTABLE In our previous tip, we looked at the Contains clause in SQL Server. The clause is a powerful and simple way to perform partial and exact matches and to rank the results. The ContainsTable function works in the same manner as Contains and uses the same search criteria, with one additional entry: the Table name. The result is a table with the primary key of the table being searched, along with the ranking. Here's an example of ContainsTable: Select * From ContainsTable(Customer, *, 'customername'); The result is KEY RANK --- ---- 1 98 2 96 9 57 The key value above can then be joined to the Customer table to retrieve the necessary row from Customer. This differs from the Contains clause, which would have immediately retrieved the rows from the Customer table and would have been written as follows: Select * From Customer Where Contains(*,'customername'); -------------------------------------------------------------------------------- FULL-TEXT QUERIES IN SQL SERVER WITH CONTAINS SQL Server provides the Contains clause for performing a variety of text searches. Using the Contains clause provides search results and the rank of the result based on the query. To look for a word in a table by searching every column, select * from table1: where contains(*, 'word'). To look for a wildcard, use * as follows: where contains (*,'"word*"'). To find a word near another word, use NEAR: where contains (*,'word' NEAR 'another'). To get verb changes, use FORMSOF and INFLECTIONAL. For example, to search on waste and get wasting, use: where contains(*, 'FORMSOF(INFLECTIONAL, waste)'). To get hits above a certain weight--say, .7--use: where contains(*,'word' NEAR 'another' WEIGHT(.7)). As you can see, this is a very powerful clause to consider when performing text searches. It should be noted, however, that the clause is not transportable to other databases. -------------------------------------------------------------------------------- FORCING USERS TO CHANGE PASSWORDS When you're setting up your database server, it is important to force users to change their passwords on a regular basis. This can usually be done through a database-provided feature. Today we will look at the approach used by SQL Server and Adaptive Server. To set the password expiry to a set number of days, use the sp_configure stored procedure. Using this to force users to change their passwords every 60 days would be accomplished as shown in the syntax below: exec sp_configure 'systemwide password expiration', 60 go The above will force all users to change their passwords every 60 days in Sybase Adaptive Server. To change their password, each user will need to execute the sp_password procedure on their next login after 60 days. Of course, your application will have to be written to provide a screen that facilitates this. -------------------------------------------------------------------------------- FAST, DIRECT ACCESS THROUGH A ROWID IN ORACLE In Oracle, the fastest way to access a row is to use a feature called a Rowid. This is an internal identifier that can be thought of as a pseudo-column. It never actually appears as a column if you perform a describe on a table or if you query the "column" catalog table all_tab_columns. Despite this, you can retrieve the Rowid when selecting a row from a table. Here is an example: Select column1, column2, rowid from table_name where table_key = id1; The Rowid that is returned above can later be reused to select, update, or delete that same row. It is a direct pointer to a given row in a block and a file for an object. -------------------------------------------------------------------------------- EXTENDING GROUP BY As you've seen in previous tips, Group By can allow aggregates to be performed on a single column or on a set of columns. This is a powerful, but simple, feature of SQL. However, Group By alone does not provide all of the grouping information you may want. In these cases, you have access to two other features that are used to enhance Group By: Cube and Rollup. Rollup can be used to extend Group By to aggregate at many levels, including the Grand Total. Cube can further extend Rollup by calculating all possible combinations of subtotals for a Group By. In other words, it makes cross-tab reports possible for Group By columns. In the next tip, I'll show you an example of Rollup. -------------------------------------------------------------------------------- EXAMPLE OF GROUP BY, ROLLUP, AND CUBE To illustrate the differences between Group By, Group By with Rollup, and Group By with Cube, let's analyze an example of each. For starters, we query the Sales table using Group By, which provides the Total_Sales for each unique set of values as shown below: Select region, territory, sum(sales_dollars) TOTAL_SALES From sales Group By region, territory; The results of this query would be: REGION TERRITORY TOTAL_SALES ------ --------- ------------ EAST 1 1500.00 EAST 2 2000.00 WEST 1 3000.00 WEST 2 500.00 Next, we extend the above query with the Rollup clause, which provides totals by Region and an overall total for the entire result set. Select region, territory, sum(sales_dollars) TOTAL_SALES From sales group by region, territory with rollup; The results of this query would be: REGION TERRITORY TOTAL_SALES ------ --------- ------------ EAST 1 1500.00 EAST 2 2000.00 EAST NULL 3500.00 WEST 1 3000.00 WEST 2 500.00 WEST NULL 3500.00 NULL NULL 7000.00 Finally, we extend the query one last time with Cube, which provides totals by Territory, regardless of Region: Select region, territory, sum(sales_dollars) TOTAL_SALES From sales Group By region, territory with Cube; The results of this query would be: REGION TERRITORY TOTAL_SALES ------ --------- ------------ EAST 1 1500.00 EAST 2 2000.00 EAST NULL 3500.00 WEST 1 3000.00 WEST 2 500.00 WEST NULL 3500.00 NULL 1 4500.00 NULL 2 2500.00 NULL NULL 7000.00 -------------------------------------------------------------------------------- ESCAPE CHARACTERS Sometimes in SQL you want to perform an operation such as Like but want to remove the meaning of a special character. For example, you may want to perform the following search: Select * from Customers Where customer_name like 'Smit_'; The above will return all rows that have a customer_name that is five characters long where the first four characters are Smit and the fifth character can be anything as indicated by the single character _ used with Like. Suppose however, that you wan to find all Customers with a seven-character customer_name where the first six characters are Smit_x Clearly, you would need to remove the special meaning of this _ when used with the Like clause. You do this with the Escape clause, by coding the keyword Escape along with an Escape character at the end of an SQL statement. You then add this Escape character immediately before the special character that you want to disable. In other words, to find a seven-character customer_name where the first six are Smit_x you would code the following: Select * from Customers Where customer_name like 'Smit$_x_' ESCAPE '$'; When doing this, be careful that the Escape character is not an actual data value. -------------------------------------------------------------------------------- ENHANCING SECURITY WITH ORACLE In our previous tip, we discussed user profiles, which can be used to force password changes after a set number of days. Now, let's look at ways in which profiles can be used to enhance security in Oracle. You can add many password features to profiles in Oracle, including: Failed_Login_Attempts: Number of consecutive failed logins that will an account to be locked. Password_Reuse_Time: Number of days before a password can be reused. Password_Reuse_Max: Number of times a password must be changed before it can be reused. Password_Lock_Time: Number of days an account will be locked after failed_login_attempts is reached. Password_Grace_Time: Number of days that a password can be used after the Password_life_Time has been reached. Password_Verify_Function: A function that enforces a level of complexity for a password (for example, a password must be at least six characters long and must be a mixture of alpha and numeric characters). Password_Life_Time: Number of days after which a user's password must be changed. Once a profile has been set up in Oracle, its settings can then be added to other user profiles, or can be set up as the Default profile, which will automatically be picked up by all users in the system. -------------------------------------------------------------------------------- DROPPING COLUMNS CAREFULLY DROPPING COLUMNS CAREFULLY In our previous tip, you saw that DBMSs provide the ability to drop a column. So, what happens if another object uses this column? For example, what if a view has a column created on the table column to be dropped? The SQL3 standard states that the Restrict keyword can be added to the drop column option to allow a column to be dropped only if there are no other objects dependent on that column. In other words, if views, stored procedures, triggers, or any other database object was created on that column--and if the Restrict keyword was specified--the column will not be dropped. Here's an example of this syntax: Alter Table Product drop column product_short_name Restrict; This is SQL3 syntax and its implementation will vary from database to database. -------------------------------------------------------------------------------- DROPPING A TABLE WITH CONSTRAINTS Similar to truncating a table, a table with referential constraints (as a primary key to another table) cannot simply be dropped. Here, then, are the ways you can drop the table: Disable constraints with the Alter Table table_name disable constraint referential_constraint_name; statement, drop the table, and re-enable the constraints. Drop the constraints and then drop the table. Perform a Drop Table table_name Cascade; operation. This will drop the named table, as well as all related tables. I prefer the first option since all constraints remain defined in the database catalog. The Drop Table table_name Cascade operation may perform a drop of a table that you wanted to keep. I recommend using this only in a development environment. -------------------------------------------------------------------------------- DROPPING A COLUMN Many databases now provide the capability to drop a column. Previous to this, you had to drop a table and re-create it without the column being dropped. This required a Table Unload and Load to be performed. Obviously a simple SQL command is easier and quicker. Here's an example of this operation in SQL Server: Alter table Product drop column product_short_name; One thing to keep in mind: The space used by the dropped column may not be released at this time. Although the column is not visible or useable by your users, the internal space used by the column data may remain until the table is eventually reorganized. Different DBMSs handle this differently. -------------------------------------------------------------------------------- DOUBLE QUOTES AND SINGLE QUOTES When should you use single or double quotes? Depending on your DBMS and the way it's implemented, it's possible that both may work. The ANSI standard, however, is to use single quotes for string literals and to use double quotes for table and column names. The following is an example of the correct use of literals in ANSI SQL: Select "column1", "column2" From "Table1" Where Column1 = 'abcdef'; You can change the title of a column to be used when the result set is displayed, using double quotes as follows: Select column1, column2 "New Column Name" From Table1 Where Column1 = 'abcdef'; You usually do not need to use double quotes for column or table names, but if there is an unusual character in the name, you may need to use the quotes. I've seen this occur when a Microsoft Access Table was exported to Oracle using lowercase table and column names--in which case I had to use double quotes to drop the table. In general, the single quotes should be honored for literals, thereby complying with the ANSI standard. -------------------------------------------------------------------------------- DIRECT ACCESS THROUGH A ROWID IN DB2 Similar to Oracle's implementation, IBM's DB2 has an internally generated unique row identifier called a Rowid. The implementation of this, however, is different than Oracle's. To use a Rowid in DB2, a column must be created on a table using the Rowid datatype. This column can then be retrieved in a Select and stored in a program's host variable. It can later be used to directly retrieve the row in the table to perform a Select, Update, or Delete. When using a Rowid in this way, a row is directly accessed without performing a Table scan and without needing to use an Index. Here is an example: Select * from table_name where rowid_column = :hostVarRowid; Notice that in DB2, Rowid is actually a datatype of a column. -------------------------------------------------------------------------------- DENORMALIZATION: VERTICAL PARTITIONING Vertical partitioning is the splitting of a table into two (or more) tables, each a subset of the columns. All rows will exist in all of the tables, but a subset of the columns will be implemented. This is not a common splitting technique but could be used to implement mutually exclusive subtypes where there is no overlap between each subtype. In other words, a table may have three columns that make up Subtype 1 and totally separate four columns that make up Subtype 2. If these columns will never be populated together, it makes sense to split the two subtypes into separate tables. Doing so will save space and deliver tables that clearly implement as intended. -------------------------------------------------------------------------------- DENORMALIZATION: HORIZONTAL PARTITIONING Tables can be split into two or more partitions for a number of reasons, including To separate data between many applications To separate data geographically To make the tables smaller To tune each table differently for diverse uses For security reasons and for manageability One way to split tables is range values so that one set of ranges that includes all of the columns and a subset of rows is put into one table, and another set of rows into the other table. This is referred to as horizontal partitioning of a table. This type of partitioning can also be performed using DBMS partitioning features, where a single table is created and separate physical partitions are created in a manner that's transparent to end users and developers. This provides the advantages of partitioning along with the benefits of having all the data in a single table. -------------------------------------------------------------------------------- DENORMALIZATION: COPYING COLUMNS One simple way to achieve performance gains is to copy columns to tables where they're needed. This is useful if you can avoid reading another row and if it doesn't create any update and maintenance problems. An example of this is the case of a parent table and a child table--which we'll call Customer and Customer_Address. If the only column needed on Customer is Customer_Name and we always need to access Customer_Address, we could copy (or move) the Customer_Name column to the Customer_Address table to save I/O to Customer. This should be performed only when you absolutely need the performance gains that result from the denormalization. The disadvantage of this approach is that you need to keep multiple copies of the Customer_Name column, and therefore, data maintenance requires more overhead and data integrity is compromised. So, do this ONLY when absolutely necessary. -------------------------------------------------------------------------------- DEFINING IDENTITY COLUMNS IN DB2 Identity columns can be used in DB2 version 6 and provide a way to generate sequential, unique, primary keys. When defining the column, use the following syntax (paraphrased from the IBM redbook titled "IBM UDB Server for OS/390 Version 6 Technical Update"): column_name COLUMN OPTIONS Generated {always} {by default} as identity start with value_1 increment by value_2 cache value_3... Specify Generated if the column is an identity column or if the column's data type is ROWID. Always means that DB2 always generates a value for the column when a row is inserted, while by default means that DB2 will generate a value only when a row is inserted without a value specified. It is recommended to avoid using Overriding User Value--which allows the application to use its own value--because this is invalid when using the generated by default approach. Note that in cases where you unload and reload data, you will want to use generated by default. The cache parameter specifies the number of values of the identity column that DB2 will keep in memory. Use cache 20, since the IBM redbook performance tests did not show any major performance differences between cache 20 and cache 200. Identity columns perform very well. The one thing to consider is that in cases of extremely active indexes you may hit the maximum insert rate. An example is shown below: Create table Customer ( cust_no integer generated by default as identity (start with 1, increment by 1, cache 20), cust_name char(18)); Insert into Customer (cust_no, cust_name) values (default, :v_cust_name); Tim Quinlan is a database consultant in the Toronto area specializing in relational data warehousing and OLTP architecture, design, and implementation. -------------------------------------------------------------------------------- DECIPHERING AN ORACLE ROWID An Oracle Rowid needs to be deciphered using Oracle built-in functions. Without these functions, the Rowid is unintelligible (unless the Rowid is in the older, restricted format). To get the row number, block number, and file number, use the DBMS_ROWID package as shown below: Select DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid), DBMS_ROWID.ROWID_TO_ABSOLUTE_FNO(rowid), DBMS_ROWID.ROWID_ROW_NUMBER(rowid), from table_name where table_key = id1; This will return information similar to the following: BLOCK FILE ROW 321 52 4 -------------------------------------------------------------------------------- DECIDING BETWEEN EMBEDDED SQL AND CLI When deciding which implementation of SQL (CLI or Embedded) to choose, consider the following: Embedded SQL is simpler to write and understand. Embedded implementations are much different from one DBMS to another, and may ultimately result in a CLI implementation after the embedded SQL has gone through the precompiler. CLI implementations are becoming more popular than are embedded implementations. Once again, this differs from one DBMS to another. When choosing between approaches, be sure you understand the performance impacts of the different implementations and the approach favored by your DBMS of choice. -------------------------------------------------------------------------------- DEALING WITH NULLS IN EMBEDDED SQL When executing embedded SQL in a program, you will encounter situations where you need to deal with null values. You will either select data from columns where you are receiving null values or you will want to set some columns to null values. Since host languages often do not deal with null values, this is performed in embedded SQL with indicator variables. To perform SQL in this way, use a host-variable as normal, but follow this with another numeric variable (usually a 32-bit integer) that indicates whether the value is null. A negative value in the indicator variable means that the value is null, while a zero or positive number in the indicator variable means that the value is not null. An example of an Insert of a not null value is shown below: vCustNo = 12345; vCustNoIndicatorVar = 0; exec SQL Insert into Customer values(:vCustNo :vCustNoIndicatorVar); This can also be done with the optional Indicator keyword, as follows: exec SQL Insert into Customer values(:vCustNo INDICATOR :vCustNoIndicatorVar); -------------------------------------------------------------------------------- DEALING WITH NULLS IN CUBE AND ROLLUP You may have noticed with Cube and Rollup that you check for Null values to determine whether a higher level summary (rollup) has been reached. But what if the value returned by a higher level (such as Territory in our example) actually is Null? How do you distinguish whether the Null is returned by the query or by a Rollup? Rollup and Cube provide a facility where a value of 1 is returned if the Null is a result of Cube or Rollup, and 0 is returned if it is a natural result. To take advantage of this in Oracle, write the query as follows: Select decode(grouping(region),1,'TOTAL COMPANY') REGION, decode(grouping(territory),1, 'Total Region') TERRITORY, sum(sales_dollars) TOTAL_SALES From sales group by rollup (region, territory); The results of this query would be: REGION TERRITORY TOTAL_SALES ------ --------- ------------ EAST 1 1500.00 EAST 2 2000.00 EAST Total Region 3500.00 WEST 3 3000.00 WEST 4 500.00 WEST Total Region 3500.00 TOTAL COMPANY 7000.00 The decode function is an Oracle translation that changes the grouping indicator of 1 into TOTAL COMPANY or Total Region. Next time, we'll see SQL Server's solution. -------------------------------------------------------------------------------- DEALING WITH FOREIGN KEYS In some situations, you may want to truncate (delete all the rows within) a table but can't because you receive a message stating that referentially related rows exist and the table cannot be truncated. In other words, this table has a primary key and one or more foreign key relationships. You may know that the other table with the foreign key relationships is empty, so it doesn't seem to make sense that you received this message--until you realize that the truncate operation is a table-level one. In other words, each row is not checked and the table itself is simply flagged as empty. This is the reason why truncate operations are so quick. Individual rows are not dealt with or logged. Only the truncate operation is logged. It is not possible to roll back a table truncate. Though the message states that rows may exist that point to the rows to be deleted, this may not actually be the case. In our next tip, we'll look at a simple way to deal with this situation. -------------------------------------------------------------------------------- DATABASE REPLICATION VERSUS TRIGGERS Both database replication and triggers can be used to copy data from one table to another. However, they are very different procedures. Database replication is a powerful and complicated capability that provides features such as snapshots and scheduled replication of tables, rows, and transactional data. The scheduling can allow a near-real-time capability. This replication is done in an asynchronous manner that delivers high performance at the source site by queuing the data to be replicated locally and moving the data from the queue to the target table as a second transaction in a separate commit scope. Replication can be performed between many tables and in each replication group, a table can be both the source (master) of data as well as the target. This approach is known as a store-and-forward architecture and is similar to message-oriented middleware (MOM) such as MQ Series. Replication environments are very tuneable, scaleable, flexible, and complicated. Database replication can be implemented using tools such as Oracle's Advanced Replication, Sybase's Replication Server, and Microsoft SQL Server Replication. Triggers, on the other hand, can take a row that has changed through an Insert, Update, and Delete and can copy that data to another table. The difference between this and replication is that triggers perform all of their work in a synchronous, consistent manner. In other words, the trigger copies the data to the other table in the same commit scope as the originating transaction. The trigger and transaction must all succeed or fail at the same time. This is a simple procedure (compared to replication) but can be a useful way to copy data in the right situations. Before plunging into one of these options, discuss your requirements with your database administrator and get some expert advise on which to use. -------------------------------------------------------------------------------- DATABASE HINTS There is a mechanism to modify the access path chosen by the database optimizer in some DBMSs, including Oracle and SQL Server. This feature is known as a hint. A hint is useful if you know the characteristics of your data and want to influence the execution plan. The hint does not guarantee that the optimizer will use that access path--it only attempts to influence the optimizer to use it. When using a hint, be sure you understand the SQL statement, the data on which the SQL statement is operating, and the future changes that may occur to this data. Keep in mind, too, that you're trying to outguess the optimizer. As your data characteristics change over time, your assumptions may become obsolete and the access path may become inefficient. Consider placing hints in views rather than in programs, thereby allowing quick, dynamic changes to be made if a problem is encountered with the hint. In upcoming tips, we'll see how hints are implemented in different databases. -------------------------------------------------------------------------------- CURSOR TYPES IN SQL SERVER Microsoft SQL Server provides three types of cursors: Static, Keyset, and Dynamic. Static cursors copy the result set to a temporary table in tempdb and do not change for the duration of the query. In other words, subsequent updates to the table do not affect it. Keyset cursors copy the result set keys from the underlying tables to a temporary table in tempdb. The keys in the keyset will not be changed while the cursor is open. Inserted rows will not insert into the keyset, and deletes will not remove keys from the keyset. Also, attempting to Fetch a row after updating it may return an error. Dynamic cursors access the underlying tables directly without copying data to tempdb. Consider these options carefully when deciding which to use in your programs. -------------------------------------------------------------------------------- CUBES IN SQL SERVER AND ORACLE Even though the Cube feature is supported by both Oracle and SQL Server, and even though their results are similar, the two syntax are very different. In fact, I've discussed this in passing recently, but it bears repeating. Let's take a look at the difference in syntax one more time. The Oracle syntax to produce a Cube (using the Decode feature to deal with Null values) is shown below: Select decode(grouping(region),1, 'TOTAL COMPANY') REGION, decode(grouping(territory),1, 'Total Region') TERRITORY, sum(sales_dollars) TOTAL_SALES From sales Group By Cube (region, territory); The results of this query would be: REGION TERRITORY TOTAL_SALES ------ --------- ------------ EAST 1 1500.00 EAST 2 2000.00 EAST Total Region 3500.00 WEST 1 3000.00 WEST 2 500.00 WEST Total Region 3500.00 TOTAL CO. 1 4500.00 TOTAL CO. 2 2500.00 TOTAL CO. Total Region 7000.00 In this case, the decode function is an Oracle translation that changes the grouping indicator of 1 into another value of TOTAL COMPANY or Total Region. SQL Server syntax for Group By Cube uses the grouping value of 1 to distinguish that a Null is the result of a break. In SQL, you also use the Case feature as shown below: Select CASE WHEN (grouping(region) = 1) THEN 'TOTAL COMPANY' ELSE ISNULL(region, '?') END as 'REGION', CASE WHEN (grouping(territory) = 1) THEN 'Total Region') ELSE ISNULL(territory, '?') END as 'TERRITORY', sum(sales_dollars) as 'TOTAL_SALES' From sales Group By region, territory with Cube; The results of this query would be: REGION TERRITORY TOTAL_SALES ------ --------- ------------ EAST 1 1500.00 EAST 2 2000.00 EAST Total Region 3500.00 WEST 3 3000.00 WEST 4 500.00 WEST Total Region 3500.00 TOTAL CO. 1 4500.00 TOTAL CO. 2 2500.00 TOTAL CO. Total Region 7000.00 -------------------------------------------------------------------------------- CUBE DEFINED Group By syntax can be enhanced with the Cube feature, which can provide totals and subtotals for different combinations of columns and categories in a query. For example, say you're querying a Sales table and grouping Total_Sales by Region and Territory. In this case, the Group By would return totals for each unique combination of Region and Territory. Using the same example, Rollup will add to the above by providing subtotals for Regions and a total for all Regions combined. Cube takes this one step further and returns totals for each Territory, regardless of Region. In other words, Cube gives you totals for all combinations of columns chosen in the Group By clause. (This, by the way, is a trademark of OLAP Services used in data warehousing.) -------------------------------------------------------------------------------- COUNTING DISTINCT VALUES There are some cases where you can have more than one unique value for a column in a result set. You can easily get the total number of values with the following query: Select count(*) From Table Where column_name is not null; This, however, only gives the number of not null values but does not indicate how many of these are unique. To do this, you combine functions as follows: Select count(distinct column_name) From Table Where column_name is not null; -------------------------------------------------------------------------------- CORRELATED UPDATES IN ORACLE It is possible to update one column in Oracle with the value from another column. In order to update column1 in table1 with the value of column2 in table2 where the keys match, the following statement should be used: Update table1 a Set a.column1 = (select b.column2 from table2 b where b.table2_id = 1234567) Where a.table1_id = (select column2 from table2 where table2_id = a.table1_id); The above updates table1 and sets column1 to a value using a correlated subselect. The row that is updated is determined by a where clause, which positions the row based on another subselect. The two subselects may seem unnecessary, but one is used to populate the column value, while the other is used to set the position of the row in the table to be populated. -------------------------------------------------------------------------------- COPYING DATA FROM TABLE TO TABLE Are you interested in copying data from one table to another? Where do you begin with all the possible approaches? Well, for starters, each of the following procedures would do the trick: Create a table, insert into a table, create a snapshot, use database replication, use snapshots, and unload/load utilities. They all have their place--as well as their advantages and disadvantages. Let's begin with a simple move of data from one table to another, where performance is not a large concern and where this is going to be performed only once (or at least not on a regular basis): In Oracle you can create a new table by selecting data from another, as shown in the statement below: Create table new_table as select * from old_table You can do this using the nologging option to make it more efficient. In SQL Server, the above is accomplished with a different syntax: Select * into new_table from old_table To do the above, you need the select into/bulkcopy option set to true in the SQL Server database. Neither of the statements above copies the referential integrity constraints. You will get only the base table and data. Another approach is to copy data from a table into an existing table using the following syntax (in most databases): Insert into new_table Select * from old_table; However, this can be a relatively slow process since each row is logged in the database--but it is simple and worthwhile for one-time use or small loads. -------------------------------------------------------------------------------- COMPARING MULTIPLE COLUMNS WITH IN In some cases, you may improve performance by using the In clause (rather than Exists) in a subquery. An example: Select o.* From Outer_Table O Where o.column1 in (Select i.column1 From table2 I); Of course, the above could be written as a Join, but it still illustrates the point. If you need to compare multiple columns in the subquery, you can simply concatenate them as follows: Select o.* From Outer_Table O Where o.column1 || o.effective_date in (Select i.column1 || max(i.effective_date) From table2 I Group By i.column1); This will return the most recent row as determined by the max(effective_date) in the subquery. In the example above, you may need to cast the effective_date to a character data type to perform the concatenation since each column needs to be of the same datatype. -------------------------------------------------------------------------------- COLUMNS LISTED IN ORDER BY STATEMENTS In a previous tip, I stated that columns listed in the Order By clause of a SQL statement must exist in the Select column listing. Since many readers commented on the fact that many DBMS haven't enforced this for some time, I thought it best to clarify the point: SQL-92 (which many DBMSs claim to implement) requires that a sort key (i.e., an Order By column) exist in the Select list. This has changed in SQL3 (aka SQL-99). Therefore, the following statement--which was illegal in SQL-92--is valid in SQL3: Select column1, column2 From table_name Order By column3; Your DBMS of choice may already support this syntax. -------------------------------------------------------------------------------- COLUMN FUNCTIONS In our previous tip, we discussed scalar functions that operate on a single column or expression. Column functions differ from scalar functions in that column functions operate on a set of data. In other words, a column function may operate on a group of rows. Examples of column functions are avg, count, max, min, and sum. The differences between column functions and scalar functions may appear subtle, but it is important to distinguish between them when you're writing SQL. -------------------------------------------------------------------------------- CODING SQL STATEMENTS Database optimizers use a cost-based algorithm, where costs are applied to different access paths for a query and the path with the lowest cost is used. Given this information, we are told not to worry about how we code our SQL statements. This is generally true, but I will let you in on a secret: The way we code our queries can in fact influence the optimizer to favor a given access path. Therefore, in those cases where you know how data should be accessed, order the tables in the from clause of your SQL statement in the order they should be accessed and place your where clause predicates in the order that they should be applied--which will be consistent with the From clause ordering. This may make a difference in the access path that is chosen by your DBMS's optimizer. -------------------------------------------------------------------------------- CHANGING COLUMN PROPERTIES WITH CONSTRAINTS You may encounter a situation where you want to change a column's properties (datatype, length, precision, or nullability) and encounter a message that states that the change cannot be performed because the new properties are not consistent with a related (primary or foreign key) column. Upon review of the related column, you may find that the new properties are in fact the same as those of the related column. The way to resolve this is to disable the constraint, alter the column, and enable the constraint afterwards. To disable a constraint, use Alter Table syntax, as shown below: Alter Table table_name disable constraint referential_constraint_name; -------------------------------------------------------------------------------- CHANGING COLUMN LENGTH AND TYPE Changing a column's properties (such as length, precision, nullability, and datatype) is a feature that is implemented in most major DBMSs. In SQL Server, the Alter Table...alter column... syntax is used, as shown below: Alter Table Product alter product_name char(30) null; Oracle uses a similar approach with the Alter Table... modify... syntax. When performing a change to a column's properties, you must observe the following rules: The datatypes must be compatible, unless there is no data in this column for all rows in the table. These rules are defined by the DBMS you are using. The new column length and precision must be at least as large as the previous column, unless all values for this column are empty. The column cannot be altered to be not null, unless the table is empty or all values are already not null" Referential constraints may disallow the changing of a column's properties. Of course, any change to a column's properties will need to be analyzed to ensure that all procedures using the columns will continue to operate as expected. -------------------------------------------------------------------------------- CASE TOOLS Case tools can be used to design or model your database. They can be modeled using standard E-R (entity-relationship) modeling or O-O (Object-Oriented) techniques. With these tools, your system can be logically modeled to capture all business rules that need to be implemented in the database. The model can then be forward-engineered to the physical data model. At this stage, the DBA can denormalize the model to implement physical design improvements to improve performance, or simplicity, of the system. Physical options can be added to the model, including the following: Views aliases and indexes can be created. Physical table objects such as tablespaces, datafiles, and space estimates can be added to the model. Triggers used to implement business rules can be managed in a case tool. Column and table names can be changed to support site standards or DBMS limitations. Denormalization, such as column movement, table splits, or merges, can be performed. Case tools should be used to help design and manage your database. The models (pictures) of the database are powerful tools to help you understand the database. When using these tools, keep in mind that they're quality-improvement rather than productivity tools. In other words, it may take a great deal of time to keep a case tool up-to-date, but the quality improvements that result from performing this work are worthwhile. Some examples of popular case/modeling tools are CAOs ERWin, Oracle Designer, and Rational Rose (for object modeling). -------------------------------------------------------------------------------- BLOBS--PART 3 OF 12: CASTING
The Cast operation is valuable when used with BLOBs. As a reminder, Cast is used to convert a given value of one datatype to a different type. It converts a source datatype to a target datatype, in other words.
When used with BLOBs, this operation can cast one BLOB object (the source) to another BLOB datatype (the target), or more commonly to a User-Defined Type (UDT). When converting to a UDT, use a UDT Cast function since a standard Cast function will not exist in your database. To do this, the DBMS uses the source BLOB as the argument to the user-defined Cast function and returns the new UDT BLOB result.
----------------------------------------------
BLOBS--PART 4 OF 12: CONCATENATING
Is one BLOB not enough? Do you need to combine more than one? The good news is that BLOBs can be combined through concatenation using the concatenation operator ||. Each value in this operation must evaluate to a BLOB.
For example:
1) X'1234' || X'5678' = X'12345678' 2) blob_column1 || X'1234' = blob_column1 with X'12345678' appended 3) X'1234' || blob_column2 = X'1234' appended by blob_column2 4) blob_column1 || blob_column2 = a single BLOB column of blob_column1 followed by blob_column2.
Keep in mind that the length of the new column will be set as the sum of the length of the two combined columns.
---------------------------------------------- BLOBS--PART 5 OF 12: ASSIGNING VALUES The assignment of a BLOB to a string is done one octet at a time, from left to right. The size of the final string will be either the maximum possible size of the string or the size of the BLOB itself, whichever is smaller. If the string is too short for the BLOB, the assignment will fail and an SQLSTATE error of 22001 will be issued--due to the fact that the right part of the BLOB would be truncated. If, however, this BLOB was too large for the string but the values that would be truncated are all 0's, then the assignment will take place and the 0's will be truncated. Here are some examples where a source BLOB is moved to a target defined as BLOB(4): 1) X'12345678' assignment to the target gives 12345678 2) X'123456789' assignment to the target fails with a truncation error 3) X'123456780' assignment to the target gives 12345678 -------------------------------------------------------------------------------- BLOBS--PART 6 OF 12: COMPARING When two BLOB values are compared, they're evaluated one octet (set of 8 bits) at a time from left to right. Since only equality is supported through native database services, only the equal and not equal conditions (+ and <>) are supported. Two BLOB values are considered to be equal if each BLOB is the same length and the corresponding octets are all equal. This may sound limiting, but it actually makes a great deal of sense when you consider what can be entered in a BLOB. For example, what would make one image BLOB greater-than another? What makes one video Like another one? There may be criteria that you want to apply to these, but it would be difficult to generalize them across all BLOBs, which may be complex objects of different types. There is, however, a way around this limitation, since you are able to define other operators through User-Defined Types (UDTs). Operators can be created and/or overloaded to develop new ones that support the specific needs of a UDT. -------------------------------------------------------------------------------- BLOBS--PART 7 OF 12: SUBSTRINGS As with other datatypes, the Substring function is supported for BLOBs. It takes the form of Substring(argument FROM start_position [ FOR substring_length]) Given the above, argument can be a column_name, a variable, or a hard-coded string in Hexadecimal format. The start-position is the first octet (8 bits) that you want to get, and the FOR argument is the number of octets that you want to extract. The FOR argument is optional and when omitted, the Substring goes to the end of the BLOB. Some examples: substring(column1 From 51 For 4) => returns octets 51-54 from column1 substring(X'123456' From 2 For 1) => returns X'34' substring(X'123456' From 2) => returns X'3456' -------------------------------------------------------------------------------- BLOBS--PART 8 OF 12: OVERLAYING There is a feature supported for BLOBs that allows a substring of one BLOB column to be removed and another BLOB to be moved into its place. This is called an Overlay function. The syntax for this function is overlay (blob_string1 placing blob_string2 from start_position [ FOR substring_length]) In effect, this combines the Substring function with the Overlay feature. The entire blob_string2 string is placed into blob_string1. -------------------------------------------------------------------------------- BLOBS--PART 9 OF 12: TRIMMING Similar to the trim function on other non-BLOB datatypes, you can remove either trailing or leading characters from BLOBs (or you can do both). To do this, define a hexadecimal character using X before the string that is to be trimmed. If you don't specify the characters to be trimmed, the default characters will be X'00's. The syntax for performing this is: trim(leading|trailing|both blob_string1 from blob_string2) In the above, leading, trailing, and both are optional, as is blob_string1 Here's an example of the trim function: trim(both X'00' from X'0000001234000000') => returns X'1234' -------------------------------------------------------------------------------- BLOBS--PART 10 OF 12: POSITIONING AND LENGTH The placement of a character or string in a BLOB can be determined using the Position feature. The syntax for this is position (blob_string1 in blob_string2) The above will return the starting position of blob_string1 in blob_string2. A similar feature that may be used along with Position is the bit_length operator, which returns the length of a BLOB in bits. For example: bit_length(X'AB') = 16 Similar operators to this for BLOBs are char_length and octet_length, which return the length of a CLOB (Character Large Object) in characters and octets respectively. For example: char_length(X'AB') = 2 octet_length(X'AB') = 2 These features can help parse and substring BLOBs. They can be used to reduce the amount of data that is transferred from the database server to the application. -------------------------------------------------------------------------------- BLOBS--PART 11 OF 12: LIKE AND ESCAPE As with other columns, you can use the Like operator with BLOBs, albeit in a different fashion. This difference is due to dealing with hexadecimal characters and octets (sets of 8 bits). You can use Like and Not Like, but rather than using the _ character to wildcard a single character and the % character to wildcard 0 or more characters, you use the values X'5F' to represent a single wildcard character and X'25' to represent a wildcard of 0 or more characters. In other words, X'5F' is the same as _ and X'25' is the same as %. An example of this is the case where you search for a string of X'1234' anywhere in a BLOB column called blob_column1. To do this, code the following: Select * From table1 Where blob_column1 like X'25123425'; If you need to find 1234 as well as the characters 25, you need to use the Escape clause. With Escape, you can define an octet to tell the compiler to treat the next character as itself rather than as a wildcard. For example, to find a BLOB with X'123425' using the Escape characters X'EE', code the following: Select * From table1 Where blob_column1 like X'1234EE25' escape X'EE'; -------------------------------------------------------------------------------- BLOBS--PART 12 OF 12: CLOB AND NCLOB BLOBs are used to store Binary Large Objects in hexadecimal format, while CLOBs are used to hold Character Large Objects and NCLOBs hold National Character Large Objects. Until now, I've concentrated on BLOBs, but the three are similar in nature and all members of the LOB (Large Object) family. CLOBs and NCLOBs have the following features: They have a definable character set. They are useful for storing large text. As with BLOB, a lot of operators are not available, such as >, <, primary key, foreign key, order by. Also, as with BLOBs, CLOBs use the LOB locators to define their locations in the database. (They are used on the database side to limit the amount of data moved between the database and the client/user application, allowing us to manipulate the CLOB in chunks.) In short, if you know how to work with one LOB type, you'll be able to work with them all. -------------------------------------------------------------------------------- BEWARE OF KEYWORDS Be careful not to use SQL keywords in any of your table, column, or statement names. They can cause compile, or even worse, runtime errors with current or future versions of your DBMS. You'll need to know those reserved words that are supported by your DBMS as well as those supported by SQL-3. The latter is critical, because if your DBMS allows those reserved words now, it probably won't in a future version. Examples of reserved words are Add, Date, If, and Start. If you don't have the full set of SQL-3 or DBMS reserved words memorized when generating DDL or writing programs, use your best judgement. Keep in mind that certain names, such as datatype names, cannot be used. Using two- or three-part names separated by underscores is usually safe. -------------------------------------------------------------------------------- BENEFITS OF IDENTITY COLUMNS IN DB2 As with SQL Server, DB2 for OS/390 version 6 implements identity columns to generate ascending keys for tables. The version 6 implementation has the following advantages (paraphrased from the IBM redbook titled "IBM UDB Server for OS/390 Version 6 Technical Update"): There is improved concurrence and performance over a next-number table, which can turn into a bottleneck. Uniqueness is always guaranteed. Recoverability in the event of failure is guaranteed by reconstructing data from the log. If a member in a DB2 data sharing group fails, other members can continue to generate new unique keys. This is a simple and flexible implementation. Add to this the fact that DB2 type 2 indexes--the only type allowed in version 6--do not incur locking, so this improves concurrence. DB2 UDB for Intel and UNIX platforms implement identity columns in version 7. Tim Quinlan is a database consultant in the Toronto area specializing in relational data warehousing and OLTP architecture, design, and implementation. -------------------------------------------------------------------------------- BE CAREFUL WITH ARITHMETIC PRECISION When performing arithmetic operations on two columns containing different datatypes or different scale and precision, the DBMS will generally use the type with the greatest precision. In other words, if an integer column is used with a decimal datatype, the properties of the decimal column will generally be used. As a rule, when you know columns will be used together in an arithmetic operation, define them the same way. If this is not possible, test the possible values that will result from different, extreme arithmetic operations to ensure that your system handles them properly. -------------------------------------------------------------------------------- BE CAREFUL USING INTEGERS WITH ORDER BY A commonly used feature for sorting results was to use integers rather than the column name to determine sort order. Doing this required that you list in the Select list the number of the column by which the data should be ordered. Accordingly, the following two statements deliver the same result: Select column1, column2, column3 From Table_name Order by 2; Select column1, column2, column3 From Table_name Order by column2; Note that column2 is second in the Select list, so in the first example it was indicated by the integer 2. This is a great example of shorthand that can be used for writing quick and dirty SQL statements but should not be implemented in programs. In this case, here's why: The order of columns in the Select list may change, which would necessitate a change in the Order By clause. This syntax, though valid in SQL-92, is no longer valid in SQL3. Again, do NOT use the integer shortcut in your programs. -------------------------------------------------------------------------------- APPLICATION GENERATED KEYS THROUGH A TABLE It is possible to generate keys using a single-column and single-row table with the last key used. Any application needing a new key would simply use the next value from the table. It would then increment the key value, update the key, and perform a commit. The business transaction could then begin using the key that was read. This quick commit is important to speed the transaction and to improve concurrence. Without it, the single row would become a bottleneck as other transactions queued up waiting for locks to be released. Also, your system should allow for gaps in key ranges since some business transactions will not be completed and the keys read will not be used. As it is, the single-column/single-row table will become a bottleneck for high-transaction volumes, and where possible, DBMS-implemented features should be used instead. -------------------------------------------------------------------------------- API VS. EMBEDDED SQL Embedded SQL is an approach where SQL is embedded directly into the source code of a program. The program goes through a precompile stage where the SQL and program are separated and ultimately transformed into executable modules that must be executed together. Embedded SQL is an ANSI standard and provides a great deal of control and tunability for developers and DBAs. This is not, however, as flexible as a SQL API, popularized by DB2 and now implemented in all major DBMSs. An example of an embedded SQL is a C program with embedded SQL or Java with SQLJ. Note that SQLJ does not always provide higher performance since application code may be turned into JDBC by the precompiler. A Call Level Interface (CLI), also referred to as an API (application programming interface), provides a dynamically callable interface and middleware that accepts a SQL statement coded as an API and communicated with the database, which returns the result rows to the program. There is no precompile step since the SQL is passed as is to the CLI middleware. This was initially implemented by Sybase/SQL Server and is seen in such popular tools as ODBC, JDBC, and dblib. This approach is very flexible for developers but does not perform to the same level as embedded SQL. -------------------------------------------------------------------------------- ANY AND SOME A subquery may be evaluated using the Any or Some operation--both terms are synonymous. Here's an example: Select * From Invoice I Where invoice_value > ANY (Select total_amount From purchase_order); If invoice_value is greater than at least one value in the subquery, it is true. The term Some could have been used in place of Any in this query. This is straightforward enough when used as above, but it can be a difficult statement to understand when used as a negation. For example, how would you interpret the following statement? Select * From Invoice I Where invoice_value <> ANY (Select total_amount From purchase_order); This is deceiving, but it is really the same as saying that if at least one row in the subquery is false, the result is false. This can be rewritten as Select * From Invoice I Where invoice_value <> SOME (Select total_amount From purchase_order); Select * From Invoice I Where invoice_value = ALL (Select total_amount From purchase_order); All three examples are the same, but the last two are easier to interpret from a logic and language point of view. -------------------------------------------------------------------------------- ALTERING A TABLE TO SET DEFAULTS After a table has been created, default values for columns can be changed. This will force a column to have the value specified by default if a value was not included in an Insert statement. The syntax is as follows: Alter Table table_name alter column_name Set Default value; Alternatively, a default can be dropped using the following syntax: Alter Table table_name alter column_name Drop Default; Here is an example: Alter Table Product alter product_category_cd Set Default 123; This will cause all new Insert statements that do not specify a value to be set to 123. It will not set any rows that already exist in the database. -------------------------------------------------------------------------------- ADDING A COLUMN TO A TABLE Columns can be added to existing tables using the alter table ... add... syntax. The columns are added to the end of the table and are immediately useable. When adding columns to tables that already contain data, the new columns must allow nulls or must have a default value. Once they have been populated with data, the new column can be modified to not null if you require that constraint. Here is an example of a new column being added to the Customer table: Alter table Customer add address_line_1 varchar(80) null -------------------------------------------------------------------------------- A SIMPLE PROGRAM WITH EMBEDDED SQL SQL is written in a host language, such as C or C++. The SQL statements are included with the other source program statements. Program variables can be used in the SQL statements. For example, they can be used as variables in Where clauses. The output from SQL statements can also be moved to variables for further processing in the program. A simple example of embedded SQL in a C program is shown below: main() { exec sql include sqlca; exec sql begin declare section; int vCustomerNo; char vCustomerName[30]; exec sql end declare section; whenever sqlerror goto error_rtn; vCustomerNo = 1; exec sql select customer_name from Customer where customer_number = :vCustomerNo into :vCustomerName; error_rtn: printf("SQL Error"); exit(); } This shows how a variable vCustomerNo is set and used in the Where clause, and how another variable, vCustomerName, is used to accept the output from the statement. When a SQL error is encountered, the procedure error_rtn is called and can then perform whatever error processing is required. This program is run through a precompiler to prepare the SQL executable code, and then through a compiler to prepare the executable module. The precompiler looks for the tag exec sql...; in order to find the SQL statements. -------------------------------------------------------------------------------- A SAVEPOINT EXAMPLE As we discussed in our previous tip, a transaction Rollback can be performed to a Savepoint with a statement similar to this: Rollback to Savepoint save1; The pseudo-code for this may look like Begin Transaction application code part 1 Savepoint save1 application code part 2 IF (condition1 = TRUE) THEN Rollback to Savepoint save1 ELSE perform_more_actions; Commit; End Transaction Given the above pseudo-code, if Condition1 is True, then the transaction is rolled back to save1 and the rest of the transaction (before save1) is committed. If, however, condition1 is False, more actions are performed and the entire transaction is committed. There are times when you'll want to roll back a few statements while preserving the rest of your uncommitted work--and it is in these situations that you'll use the Savepoint feature. -------------------------------------------------------------------------------- A PLACE FOR CONCATENATED KEYS In our previous tip, we discussed many disadvantages of concatenated keys. We used the example of a relationship from a Party table to Customer to Customer_Address, where the Customer_Address primary key was the concatenation of the Party_Id, Customer_Id, and Customer_Address_Id columns. One advantage of this approach becomes apparent in the case where we want to join Customer_Address directly to the Party table, or where we want to join it to another table that's related to Party. Having Party_Id on Customer_Address allows us to avoid at least one read-through the Customer table to get to Party. We could, of course, avoid this by denormalizing Party_Id to Customer_Address. However, doing so has its own integrity issues. The concatenated key can be handy when traversing complex models with SQL, but this design option should be used sparingly and only after the disadvantages have been weighed and discounted. -------------------------------------------------------------------------------- A PERSONAL PET PEEVE I continue to be amazed at the way people are using the Microsoft Access database to build relatively large systems. I have seen many multiuser systems that run in a production mode being implemented in Access. Don't get me wrong--Access is a great tool, easy to use and very powerful. It's just that when choosing a tool for a job, it's important to understand its strengths and limitations. My rule of thumb is that Access should be used for very small or personal systems. Any data that is production data and that supports more than one person, however, should be implemented in a more robust, multiuser DBMS. In this case, you'll need a database that scales well, provides hot and cold backup as well as point-in-time recovery, and has advanced security features and an advanced locking implementation. Examples of databases that include these features are Sybase (Adaptive Server), Oracle, DB2, Informix, and Microsoft's SQL Server. -------------------------------------------------------------------------------- TRANSACT-SQL PROCEDURE TO EXPAND HIERARCHY Transact-SQL supports recursion, but the procedure outlined below is more efficient since it uses a temporary table as a stack and uses it to keep track of items that are being processed. Once an item has been completely processed, it is removed from the stack and another item is added. As with the approaches discussed in previous tips, a starting place must first be established. In this case, a variable, @current, is used to establish the starting place and keep track of the current positioning in the stack. The @level variable tracks the current level in the hierarchy, and @line creates an indented line. The level variable in the #stack table allows the same item to appear at many levels. An example of this procedure (borrowed from the SQL Server documentation) is shown below. To view this procedure, go to the online SQL Server documentation and search using Hierarchy. From the returned list, choose Expanding Hierarchies. As shown in our previous tip, here is the hierarchical result we seek with the expand procedure shown below: World North America Canada United States Washington Redmond New York New York City Europe France Paris Here, then, is the procedure: CREATE PROCEDURE expand (@current char(20)) as SET NOCOUNT ON -- do not show rowcount messages after each select DECLARE @level int, @line char(20) CREATE TABLE #stack (item char(20), level int) INSERT INTO #stack VALUES (@current, 1) SELECT @level = 1 WHILE @level > 0 BEGIN IF EXISTS (SELECT * FROM #stack WHERE level = @level) BEGIN SELECT @current = item FROM #stack WHERE level = @level SELECT @line = space(@level - 1) + @current PRINT @line DELETE FROM #stack WHERE level = @level AND item = @current INSERT #stack SELECT child, @level + 1 FROM hierarchy WHERE parent = @current IF @@ROWCOUNT > 0 SELECT @level = @level + 1 END ELSE SELECT @level = @level - 1 END -- WHILE The SQL Server documentation can best describe the procedure: "When @level is greater than 0, the procedure follows several steps. If there are any items in the stack at the current level (@level), the procedure chooses one and calls it @current. It indents the item @level spaces, and then prints the item. The procedure then deletes the item from the stack so it won't be processed again, and then adds all its child items to the stack at the next level (@level + 1). This is the only place where the hierarchy table (#stack) is used. If there are child items (IF @@ROWCOUNT > 0), descends one level to process them (@level = @level + 1); otherwise, continues processing at the current level. Finally, if there are no items on the stack awaiting processing at the current level, goes back up one level to see if there are any awaiting processing at the previous level (@level = @level - 1). When there is no previous level, the expansion is complete." -------------------------------------------------------------------------------- STORAGE ISSUES: RAID DEFINED When designing and implementing a database, you will be faced with a choice of storage implementations and will often choose a level of RAID that supports your system's requirements. Because of this, it is important to understand RAID and the different popular RAID types. RAID was first introduced through a paper at the University of California at Berkeley in 1987, and the acronym stands for Redundant Array of Inexpensive Disks. This was an architecture that was designed to decrease the price of disk storage; increase load balancing and concurrence through the striping of drives; and improve availability through redundancy. In the next several tips, we'll examine the most popular RAID architectures: RAID-0, 1, 3, 5, S, 7 and 0+1. -------------------------------------------------------------------------------- RECURSIVE SQL IN DB2 AND CONTROLLING DEPTH Continuing with our series on recursive queries in DB2 (and continuing with an example from IBM's SQL Reference "Appendix M. Recursion Example"), let's look at how to display a specific number of levels in a recursive query parts explosion (in other words, how to display all subparts of a part up to and including the second level of depth). To do this, you add a Level column to the RPL common table expression that will add 1 with each subsequent level returned. An example of this is shown below: The table is defined as follows: CREATE TABLE PARTLIST (PART VARCHAR(8), SUBPART VARCHAR(8), QUANTITY INTEGER); The common table expression is WITH RPL (Level, PART, SUBPART, QUANTITY) AS ( SELECT 1, ROOT.PART, ROOT.SUBPART, ROOT.QUANTITY FROM PARTLIST ROOT WHERE ROOT.PART = '01' UNION ALL SELECT PARENT.LEVEL + 1, CHILD.PART, CHILD.SUBPART, CHILD.QUANTITY FROM RPL PARENT, PARTLIST CHILD WHERE PARENT.SUBPART = CHILD.PART AND PARENT.LEVEL < 2 ) SELECT PART, LEVEL, SUBPART, QUANTITY FROM RPL; -------------------------------------------------------------------------------- RAID-0 It can be argued that RAID-0 is not really RAID, since it does not honor the first letter of the acronym, which stands for Redundant. This is the only level that offers pure striping without redundancy. This striping can greatly increase performance, but it reduces the mean time before failure (MTBF) since a block of data is striped (spread) across many drives and the failure of any one drive causes the system to fail. There are few cases where you will want to implement this on its own. -------------------------------------------------------------------------------- HIERARCHIES VIA TRANSACT-SQL: INTRODUCTION In the past few tips, we've discussed the implementation of hierarchies in DB2 and Oracle. The navigation of hierarchies can also be accomplished through Transact-SQL. In this tip and the next, I will use an example taken from Microsoft's SQL Server version 7 SQL documentation. The procedure takes a list of items and creates a hierarchy of them, as shown below. For this list of items: PARENT CHILD World Europe World North America Europe France France Paris North America United States North America Canada United States New York United States Washington New York New York City Washington Redmond Transact-SDL can create this hierarchy: World North America Canada United States Washington Redmond New York New York City Europe France Paris In our next tip, we'll discuss the Transact-SQL code that creates this hierarchy. -------------------------------------------------------------------------------- RAID-1 With RAID-1, all data is written to two disks. This RAID level is also known as mirroring, or shadowing. Advantages include increased MTBF (mean time between failure) and fast data reads. MTBF increases since the failure of one disk does not stop a system from running--the system can continue to operate with the alternate, mirrored disk. Fast reads are also a benefit because the read can be performed from either disk and the least busy one will be used. Disadvantages include the increased write time (since each write must be duplicated to two disks) and increased cost of purchasing the doubled amount of storage. -------------------------------------------------------------------------------- RAID-3 RAID-3 provides striping across several disks and adds an extra disk where all parity information is written. The loss of any one disk can be tolerated, because all data can be rebuilt from the remaining disks. The loss of two disks, however, cannot be tolerated. This RAID level provides a decent MTBF (mean time between failure). The data striping can create a performance advantage, but there is a hot spot with this RAID level, because one drive provides all of the parity information. This RAID level is usually passed over for RAID-5, discussed in our next tip. -------------------------------------------------------------------------------- RAID-5 This level of RAID differs from mirroring and is similar to RAID-3 in that it computes and writes parity information that can be used to rebuild data in the event of a loss of one drive in the array. The difference between this and RAID-3 is that the parity information may be written to any drive in the array, removing the single-drive hot spot that is encountered with RAID-3. With RAID-5, the data is striped across the drives and the parity information for the strips is written to an additional drive. There is, however, one large disadvantage to this, referred to as the write-penalty: Whenever a write occurs, the parity data needs to be read, re-computed, and written with the related data while the data and parity drives are locked. RAID-5 is excellent for read-only systems, such as data warehouses, and are useable for OLTP systems that do not incur a high transaction volume. RAID-5 uses only approximately 20 percent overhead (this percentage varies depending on the configuration of the drives) rather than the 100 percent overhead that mirroring incurs. Hardware-implemented RAID-5 performs much better than software implementations, and write-cache can decrease that pesky write-penalty. As you'll see in our next tips, implementations of RAID-7 and RAID-S can improve upon RAID-5. -------------------------------------------------------------------------------- RAID-S This is a modified version of RAID-5. RAID-S is different from RAID-5 in that it stripes the parity, not the data, and it implements a write-cache to remove the RAID-5 write penalty. To implement data striping, third-party software needs to be implemented. Without this extra software, you may experience bottlenecks when data is being written. The write-cache improves performance by deferring writes and performing parity calculations at a time when the system is less busy. -------------------------------------------------------------------------------- STORAGE TRENDS: SAN There are some interesting new storage trends on the horizon, the first of which is called Storage Area Networks (SAN). SAN is a network of storage devices that manage and store data without tying up servers. A centralized SAN allows you to tie multiple servers to a single storage system. This arrangement provides you with greater flexibility in managing your corporate storage. Cabling distances allow for campus-wide as well as metropolitan areas. Networking speed and overhead is critical to the success of this type of implementation. SAN technology is still relatively immature but is expected to be a $12 billion industry by 2002. In the meantime, SANs are relatively expensive and have a few issues, such as the need for serverless backup (i.e., a direct backup from storage to tape without need for an intervening server) and secure access that allows cross-platform servers to recognize space allocated to them (to prevent problems like Solaris data overwriting Windows NT data). Some advantages of SANs are that they perform at 100 Mbytes per second without message traffic, offer centralized management, and are highly scaleable. The long-term upside for SANs is high. -------------------------------------------------------------------------------- STORAGE TRENDS: NAS Storage Area Networks (described in our previous tip) can be compared to Network-Attached Storage (NAS) boxes, which are storage servers that attach to a LAN. NASs are connected to the network in the same manner as file servers, and adding or removing a NAS system is similar to adding and removing a network node. The positives of NAS are that the technology is here today, they deliver direct backup, and they may perform fine on gigabit LANs. The downside of NAS is that scalability is accomplished by adding more servers to your environment--which incurs the overhead of passing data, messaging, and Ethernet traffic across your LAN. This is expected to be a $6 billion industry by 2002. -------------------------------------------------------------------------------- SQL SERVER: UNIQUE IDS In SQL Server, unique IDs for a table can be implemented with a column's Identity property. There is, however, another option, in the form of the uniqueidentifier datatype using a ROWGUIDCOL property. You can have only one of these on a table (as with Identity), and this is a globally unique identifier (GUID). The column, which takes up 32 hexadecimal characters, is 16 bytes wide. When displayed, it is separated by four dashes. To insert a new row, use the NEWID() function and SQL Server will automatically generate a new unique ID, as shown here: Create table Party (party_id uniqueidentifier, party_name char(50)); Insert into Party values NEWID(), 'new party'); Uniqueidentifier allows null values, so this should be implemented as Not Null in order to create a primary key on the uniqueidentifier column. -------------------------------------------------------------------------------- RAID-7 Another modified version of RAID-5, this one differs by implementing a write-cache. With this implementation, data is striped and the parity data is not. The parity data is held on one or more drives. RAID-7 and RAID-S are just two examples of enhanced performance being applied to the RAID-5 architectures. Most RAID-5 implementations are now delivering write-cache features. -------------------------------------------------------------------------------- RAID-0+1 This is also referred to as RAID-10. It includes both data striping (RAID-0) and mirroring (RAID-1). It is an expensive implementation due to the number of drives that must be purchased to support disk striping and the mirroring of every drive. It does, however, deliver the highest degree of performance and reliability. It is the recommended approach for implementing very high-end OLTP systems. -------------------------------------------------------------------------------- ORACLE: SHOWING HIERARCHIES WITH LPAD AND LEVEL In a previous tip, we showed you how to traverse a hierarchy in Oracle using the connect by and prior keywords. Here's the example we used: Select parent_column, child_column From Family_Tree start with child_column = 'Adam' connect by parent_column = PRIOR child_column; If our table contains the following:
Parent_column child_column Unknown Adam Adam Steve Adam Bill Bill Mary Steve Helen Mary Ann The result below will be returned, indented to show the effect of the hierarchy: Parent_column child_column Unknown Adam Adam Steve Steve Helen Adam Bill Bill Mary Mary Ann Now for the new stuff: This hierarchy can be indented in SQL using the Level and lpad keywords. Accordingly, the query that will visually represent a hierarchy using proprietary Oracle features becomes: Select parent_column, LPAD(' ',4*(LEVEL-1)) || child_column From Family_Tree start with child_column = 'Adam' connect by parent_column = PRIOR child_column; In this case, the statement LPAD(' ',4*(LEVEL-1)) will pad the child_column (to its left) with four spaces for every level of hierarchy. Oracle programmers commonly use this trick to display data in a more understandable manner. -------------------------------------------------------------------------------- NAS AND SAN: THE DIFFERENCES The terminology for Network-Attached Storage (NAS) and Storage Area Networks (SAN) can be confusing. Each is often compared to the other, but both fall under the "storage-network" umbrella. The most important difference between them is that a SAN is channel attached while a NAS is network attached. To database designers, this means that a SAN has the potential to achieve higher performance, particularly when data from a database is transactional and being retrieved at the block level. This unique use of storage by databases is different from other software (such as a word processor) in that it does not necessarily deal with a complete file of data at any given time. -------------------------------------------------------------------------------- ESTIMATING TABLE SIZE: THE ARITHMETIC Estimating the size of a table is half science and half art. The scientific part is very precise. You first need to determine the number of transactions that will insert, update, and delete rows on the table. This information can be retrieved through already-running systems that are being replaced by your new table(s) or from information gathered from your most knowledgeable business users and data modelers. Once you have this information, perform the following tasks: Determine the growth rate. For example: The table will grow at 10,000 rows per day. This should be done for a reasonable length of time, such as one year (if that is your budget cycle). You now know the number of rows that your table will need to accommodate. Determine the length of every row: Take the data type of every column and add them up. (Include null-indicator variables.) Estimate the average length of every varchar column and be aware of the manner in which your DBMS stores null values. For example, if a column is null, is the entire space for the column stored, or is a single-byte null indicator variable stored? Add this to the row overhead. Find out how row updates will increase row-length. This will impact the amount of space you will want to leave free (pctfree) to allow a row to increase in size without being moved to another block. Determine the number of rows that will fit into a block. To do this, take the block size (or page size) and subtract the block overhead. Once you've done this, subtract the amount of space free (pctfree from Step 3) that you want to maintain. For example, if your block size is 4096, block overhead is 96, and percent free is 10, then your usable space is 4000 * 0.9 = 3600 bytes. Divide this by row-length (from Step 2), which also includes row overhead. If row-length plus row-overhead is 100, then 36 rows can be stored on a block. Make sure that this number is less than the total number of rows that can be stored in a block. Take the lower of the two numbers. Divide the total number of rows (from Step 1) by the rows per block to find out how many blocks of data will be required. Once you've gone through all this, you'll have a pretty good idea of the amount of space your table will need. -------------------------------------------------------------------------------- ESTIMATING TABLE SIZE: DOUBLE UP In our previous tip, I outlined the steps to determine the amount of space that a table will take up. Armed with this information, you now need to decide how much space you should budget for. This is where you need to be creative. You now need to factor in your level of confidence on the number of rows and the growth predictions you were given. You also need to determine the best way to size varying length columns. Should you take the maximum size or the estimated average size? Will your percent-free estimates be sufficient? If not, your table may require frequent reorgs or space may be wasted. In either case, system performance will suffer. How big is the table? If it is very small, you can easily double or triple your predictions. When performing space estimates, play it safe and consider doubling your estimates. Remember that disk space is cheap, right? Well, you may find that not to be the case--but still, the additional space will be needed (and consumed) in short order, so don't shortchange yourself when it comes to disk space. -------------------------------------------------------------------------------- ESTIMATING INDEX SIZE The process of determining index size is very similar to that for table size: Determine the number of rows to be held by the index. Calculate the index length, including index overhead. Bear in mind that some databases store varying length columns at their full size in an index. Also, duplicate index entries may be stored differently than unique values--so you'll need to know how your database does this and estimate the number of duplicate values by the appropriate calculation. Unique indexes may also have different space requirements. Determine the number of index entries that will fit into a block. Divide the total number of rows (from Step 1) by the rows per block to find out how many blocks will be required for a specific index. Of course, you'll want to follow the calculations outlined for your DBMS, since they all differ, but these basic steps apply in all cases. -------------------------------------------------------------------------------- SQL-99: INCONSISTENCY IN ALL AND DISTINCT You've seen the use of the All and Distinct options in operations such as Union and Except. When using these operations, the default is Distinct. Therefore, a statement such as Select column_list from Table1 Union Select column_list from Table2; will return only distinct (i.e., unique) values. To return duplicates, code the All option as follows: Select column_list from Table1 Union All Select column_list from Table2; This differs from the All|Distinct option at the beginning of a Select statement where the default is All. Therefore, coding the following: Select column_list from Table1; is the same as Select All column_list from Table1; Bear in mind as you write SQL statements that these defaults are inconsistent with one another. -------------------------------------------------------------------------------- EXCEPT OPERATOR To find rows in one set that do not exist in another set, use the except operator (as defined in SQL-92 and SQL-99). For example, here's how you find column1 from Table1 that does not exist in column2 of Table2: Select column1 from Table1 Except Select column2 from Table2; The except operator will remove duplicates, and a single null value will be returned in the case of multiple null values. To return duplicates, use except all. Keep in mind, of course, that other proprietary implementations (such as Minus in Oracle) exist. -------------------------------------------------------------------------------- ESTIMATING INDEX SIZE: SPECIAL CONSIDERATIONS Though the steps for determining index size are similar to those for determining table size, you should be aware of some index peculiarities. For starters, some DBMSs (like Oracle) do not reuse index space for deleted rows--though they reuse data space for the same rows. So, do not assume that 100 row Deletes followed by 100 row Inserts will use that same total amount of space--the space will actually increase by the 100 new inserted entries. Check with your DBMS to see how it handles space from deleted rows. Also, index entries must be stored in logical order, but rows do not. Accordingly, frequent index splits may occur--thereby increasing space requirements and forcing the reorganization of indexes more frequently than table data. A final, and more important, consideration is that you may need to add new indexes (or add columns to existing indexes) once the system is in production. Therefore, take into consideration your growth factor for the table/index as well as your degree of confidence in the index entries you have designed and created. -------------------------------------------------------------------------------- DATA INTEGRITY: REFERENTIAL INTEGRITY In our previous tip, you saw that data integrity is defined in the relational model through two rules. The first is entity integrity, and the second is referential integrity. Referential integrity states that every value of a foreign key must match a value in the primary key--or it must be null. A foreign key may also match the value in a unique key. In this case, the not null columns of the foreign key must match the not null columns of the primary key. -------------------------------------------------------------------------------- DATA INTEGRITY: ENTITY INTEGRITY Data integrity in the relational model is defined by two rules, the first of which is entity integrity. Entity integrity requires that every value in a primary key of a table must be unique and not-null. This allows a case of a multicolumn primary key, but none of the primary-key columns may be null (keep in mind that different vendors implement this restriction in different ways). Your applications may implement many business rules that you consider to be under the envelope of data integrity, and you would be correct. However, this is included outside the relational model. In our next tip: the second rule. -------------------------------------------------------------------------------- R.I: DEFERRABLE AND NOT DEFERRABLE When defining a constraint through declarative R.I., you may define constraint attributes as Deferrable or Not Deferrable. With Deferrable, the R.I. constraint checking is not performed at the time of an action, such as an Update or Delete, but rather is deferred until the end of the transaction--or commit time. The Not Deferrable action will perform the checking at the time of the action. You should be extremely careful when implementing the Deferrable option, since all checking is performed at transaction commit time. If a constraint check fails, the entire transaction is rolled back--even though you have requested a commit transaction--which is very misleading. Also, the Not Deferrable clause allows you to take corrective action at the time of the statement and either correct the problem, or perform an explicit Rollback of the transaction, allowing you to properly handle errors and exceptions. -------------------------------------------------------------------------------- R.I.: SET DEFAULT ON DELETE When deleting a parent row with a primary key, you may want to set the foreign key columns to their default values, rather than to null. To do this, use the set default action and be sure to define default values for the foreign key columns. Here is an example: Create table Table1 (column1 smallint not null, column2 smallint) Constraint Table1_P Primary Key (column1); Create table Table2 (column_A smallint, column_B smallint Default 0); Alter table Table_2 add constraint Table1_Table2_fk Foreign Key (column_B) references Table1(column1) on Delete Set Default; -------------------------------------------------------------------------------- R.I.: RULES AND UPDATE Previously, we've seen the database-implemented R.I. actions for Delete operations on a parent table. It is important to know that, although many DBMS vendors do not implement it, you can also implement Update R.I. rules as specified by the SQL standard. For example, we've seen the following Delete actions for a foreign key definition: Create Foreign Key ......... On Delete <action>...; where <action> is one of the following: no action, cascade, restrict, set null, set default. According to the SQL-99 standard, you can use all the above actions with an Update statement, so the update of a unique key value can apply the same rules: no action, cascade, restrict, set null, set default. -------------------------------------------------------------------------------- R.I.: DELETE RESTRICT ACTION There are cases where you do not want to delete a (parent) row if a child row exists with a foreign key that points to it. In this case, you want to disallow the Delete using the Delete Restrict action. To delete the row, you must first delete all foreign keys that point to the primary key row; once they have been deleted, you can delete the primary key row. Here is an example of a Delete restrict that creates a primary key on the parent table and a foreign key on the child table: Create table Table1 (column1 smallint not null, column2 smallint) Constraint Table1_P Primary Key (column1); Create table Table2 (column_A smallint, column_B smallint); Alter table Table_2 add constraint Table1_Table2_fk Foreign Key (column_A) references Table1(column1) on Delete Restrict; -------------------------------------------------------------------------------- R.I. RESTRICTIONS Database-implemented R.I. is restrictive in that it can enforce only certain relationships. It can implement the rule that a foreign key must have a matching primary key, for example, and it can implement Delete and Update rules on the primary key. If, however, you have a fully mandatory relationship where the primary key must have at least one foreign key (and vice versa), declarative R.I. cannot help. Instead, you will need a trigger to enforce this type of relationship. This will also not help stop you from deleting the last foreign key (child) row in a fully mandatory relationship. Fortunately, this type of relationship is not as common as a relationship that is only mandatory on the foreign key side. -------------------------------------------------------------------------------- R.I. DELETE SET NULL ACTION A foreign key may be created on a child table using the set null action. When this is implemented, a Delete of a parent row (the row containing the primary key) will set all foreign key columns to null. In other words, the child row is not deleted but still exists with a null foreign key. This can be used to implement an optional relationship where the child row may have one or no parent row. Here is an example of a Delete set null that creates a primary key on the parent table and a foreign key on the child table: Create table Table1 (column1 smallint not null, column2 smallint) Constraint Table1_P Primary Key (column1); Create table Table2 (column_A smallint, column_B smallint); Alter table Table_2 add constraint Table1_Table2_fk Foreign Key (column_A) references Table1(column1) on Delete set null; -------------------------------------------------------------------------------- R.I. AND INDEXING The implementation of primary and unique key constraints enforces the use of indexes to implement the constraint. You may, however, implement foreign key constraints without creating indexes to cover the foreign keys. For example, I have seen cases where Delete operations appear to run for a long time--which can happen if the foreign keys are not indexed. If a Delete is performed on a primary key and a lookup is performed to the foreign key, a table scan will be performed on the child (foreign key) table unless an index was created on the foreign key columns. This index should contain one foreign key column--preferably the most restrictive column--at the beginning of the create index column list. This index will help cascade Deletes and Updates as well as Joins statements. -------------------------------------------------------------------------------- DECLARATIVE R.I. AND DELETE CASCADE When defining declarative database R.I., one option is to implement the rule as Delete Cascade. With this option, a delete of a parent row (the row containing the primary key) will also delete all foreign key rows. Here is an example of a Delete Cascade that creates a primary key on the parent table and a foreign key on the child table: Create table Table1 (column1 smallint, column2 smallint) Constraint Table1_P Primary Key (column1); Create table Table2 (column_A smallint, column_B smallint); Alter table Table_2 add constraint Table1_Table2_fk Foreign Key (column_A) references Table1(column1) on Delete Cascade; When implementing this option, it's important that a multilevel Delete has consistent R.I. actions. If you have a three-level hierarchy and the first level implements a Delete Cascade and the second level implements Delete Restrict, the Delete will be rolled back--which may not be your desired result. -------------------------------------------------------------------------------- DATABASE IMPLEMENTED R.I. Referential integrity (R.I.) may be implemented in one of three ways: In the database using declarative R.I. rules In the database using triggers In application code Database implemented R.I. through declarative (DDL) rules is the preferred approach, since it is implemented in the database schema and the rules are consistent and well understood. There are cases, however, where the required rules are more complex than declarative R.I. will allow and need to be implemented in triggers--as are there cases where declarative R.I. does not perform well. Keep in mind that mixing declarative R.I. with R.I. implemented in triggers can be dangerous since the order in which both rules may be executed needs to be clearly understood. It's important to know exactly when a declarative R.I. rule is being performed in relation to your trigger. You may also incur a performance penalty since your trigger and declarative R.I. may both be accessing the same data. If even one trigger is required to implement declarative R.I., consider implementing all R.I. in the triggers. The final option is to implement R.I. in your application code. This is a last resort since the R.I. is not built into the database. It is often practiced by software vendors who need to implement R.I. in a consistent way across diverse DBMS types. It was also practiced in the past by DBMSs that did not support triggers (like older versions of DB2). -------------------------------------------------------------------------------- APPLICATION IMPLEMENTED R.I. Referential integrity (R.I.) can be implemented without using either database triggers or declarative R.I--as is the case when referential constraints are implemented by application code. This is usually performed because the database features do not deliver the capabilities required by a system, or because the system needs to be portable across different database platforms. When you are implementing R.I. in the application, it's important to be aware that the R.I. rules can be bypassed if the proper application code is not used to make database changes. Because of this, the following points should be considered when using application R.I.: R.I. rules need to be well documented. Rules should be implemented in a consistent manner. Database security should be implemented to ensure that database access is performed only through the proper application code. Implement quality assurance checks that can be run on a regular basis to ensure that the database integrity is intact. Though this is not a preventive check, it can help warn you of potential problems in a relatively timely manner. In short, application-implemented R.I. should be performed only as a last resort. -------------------------------------------------------------------------------- WEB RESOURCES One of the great advantages of the Internet is the amount of freely accessible information we have at our fingertips. Most database vendors, for example, provide full sets of manuals online. I use these regularly, and generally the response time is reasonable with a high-speed connection. (These are, after all, database vendors--and if their Web sites are slow, you should feel free to complain to them.) When I want to learn more about a database's features, I go right to the source. Here, then, are some popular vendor sites that include online manuals, white papers, and other goodies: IBM Software: DB2 Product Family http://www.ibm.com/db2 Oracle Technology Network http://technet.oracle.com/ Oracle Corporation http://www.oracle.com/ Microsoft TechNet: SQL Server Technology Center http://www.microsoft.com/technet/sql/ Sybase Inc. http://www.sybase.com/ Informix Corporation http://www.informix.com/ -------------------------------------------------------------------------------- TWO-PHASE COMMIT In previous SQL tips, you've seen the terms "transaction" and "commit" used to represent the implementation of a unit-of-work. You may have heard about another, more robust type of commit, but maybe you are not clear as to what it is: the two-phase commit. This commit is used when more than one system is involved in the coordinated commit, rollback, and recovery of a transaction. In the case of two-phase commit, one system is the transaction manager and coordinates the commits across more than one system to ensure that the transaction is successful everywhere before it is finally committed. Examples of where this is used include: Distributed database transactions TP-monitors or transaction managers, such as MTS, CICS, Encina, Tuxedo, and Top End TP-monitors used with databases and other software, such as queuing systems like IBM's MQ-Series -------------------------------------------------------------------------------- TRANSACTION TIPS Below, you will find a handful of performance-improving suggestions to keep in mind as you design your transactions: Commit or roll back as soon as possible after inserts, updates, and deletes. This will release locks and flush logs sooner, improving overall system performance. Place inserts, updates, and delete statements as close to one another as possible to reduce locking time. Commit or roll back slowly after selects since required items may be released after the commit. (Open cursors, for example, are among those resources that may be released.) This assumes that your select statement is not locking other transactions. If you have the choice, perform selects before updates. The reason for this is that selects after updates may be slower--particularly if features such as Oracle rollback segments are used. Use temporary tables and tablespaces for temporary data. These will not require logging and are extremely efficient. Always perform explicit commit or rollback operations. Do not count on the termination of a program to implicitly perform these operations for you. -------------------------------------------------------------------------------- TRANSACTION DEFINED One of the most important concepts in the database arena is that of the transaction. To explain precisely what a transaction is, I'll quote the SQL standard: "An SQL-transaction (transaction) is a sequence of executions of SQL-statements that is atomic with respect to recovery. That is to say: either the transaction result is completely successful, or it has no effect on any SQL-schemas or SQL data." From: SQL-99 Complete, Really, Gulatzan & Pelzer, CMP Books, 1999 -------------------------------------------------------------------------------- THE PHASES IN TWO-PHASE COMMIT Believe it or not, there are two phases in two-phase commit. (Imagine that!) The first phase is the preparation phase--when the transaction coordinator polls all the systems asking them if they are ready to commit. The participating nodes are told to be ready to commit using the presumed-abort protocol so that the nodes will be ready to roll back if there is a problem. The systems will all come back and respond yes, no, or that they are in read-only mode and do not need to participate. If the answer is no, the transaction does not continue. If they all respond yes or read-only, then all systems have sent the prepare message to their log and are ready to commit. This ends phase one. The distributed transaction then enters phase two--when the final, global commit occurs. This assumes success since all systems are ready to commit. This second phase is a series of communications between the coordinator and the committing systems that must perform their own commits and finally inform the transaction coordinator that the transaction is complete. -------------------------------------------------------------------------------- MIXING ACID AND DATABASES In your database travels, you may have come across the acronym ACID or the phrase "ACID requirements." This refers to the need for a database system to meet all of the ACID properties, which are: A is for Atomic, which means that every atomic or low-level transaction must be completely consistent within itself. A transaction cannot be broken up in a manner that allows part of a logically consistent transaction to be committed and another part to be rolled back. C represents the Consistent nature of a transaction; data must be consistent at the beginning of a transaction, at the end of a transaction, and during a transaction. I means Isolated, which describes the fact that each and every transaction must be isolated, should not have an effect on other transactions, and should not be affected by other transactions. D stands for Durable, meaning that once a change is made it is permanent or persistent (not temporary) in nature. Durable may not be the best word to describe this, but it makes for a nice acronym. -------------------------------------------------------------------------------- MANUALLY RESOLVING IN-DOUBT TRANSACTIONS A transaction that is participating in a two-phase commit that has passed phase one but not phase two is considered to be in doubt. This transaction has been "prepared" at every site participating in the two-phase commit but has not been committed. At this point, we will assume that the transaction is going to take place and succeed. What happens to our database if we have a network error that prohibits messages from being sent to or received from the coordinator at this point? The transactions that should be committed are not, and database data will remain locked. This could cause unnecessary database problems. In certain cases, you may be able to resolve this manually. One example uses this DB2 command: Recover Indoubt Action(Commit) ID(1031) This command will commit the Indoubt transaction and allow the database to continue processing. This manual intervention should be performed only in extreme cases where you are absolutely certain that a transaction is causing system problems and where you are certain that intervention will not create any data integrity issues. -------------------------------------------------------------------------------- LOB STORAGE IN ORACLE Oracle supports four types of Large Objects (LOBs). These are: BLOB (Binary Large Object): Can store binary data in the database up to 4 GB in length. It can be used to store an image, for example. CLOB (Character Large Object): Can store data in the database up to 4 GB. It might be used to store a document, for example. BFILE: A binary file that is read-only and stored outside the database. A table in the database will have a pointer to the externally stored BFILE. This is useful if the data is already stored elsewhere and if you want to share that external storage with another system. NCLOB: A CLOB that supports a multibyte character set. These are not to be confused with Oracle's older, and still supported, implementation of Long, Raw, and Long Raw--which are limited in their functionality and allow only one long column per table, whereas multiple LOB columns can be stored in a table. Among other things, these data types are useful in the implementation of user-defined types. -------------------------------------------------------------------------------- LOB STORAGE IN DB2 DB2 supports three types of Large Objects (LOBs) in a manner similar to Oracle: BLOBs, CLOBs, and DBCLOBs. BLOB (Binary Large Object): Can store up to 2 GB. It can be used to store data such as images, audio, and mixed-media types. CLOB (Character Large Object): Can store up to 2 GB of data. It can be used for single-byte or multibyte character strings that store documents written in a single-character set. DBCLOB (Double Byte CLOB): Stores up to 1 GB of double-byte character data. It can be used to store graphics. These data types are quite a bit more powerful than DB2's Long Varchar and Long Vargraphic, which can hold only up to 32 KB and 16 KB, respectively. -------------------------------------------------------------------------------- LOB ME ONE OF THOSE LARGE OBJECTS Traditional, older versions of databases stored structured data types that were small and clearly defined. Examples of these data types include integer, smallint, number, decimal, varchar, character, date, time, and timestamp. These databases, of course, still support those data types, but they have also added Large Object (LOB) support, which allows you to store larger, more unstructured data types. LOBs provide the capability to store data types such as images, audio, video, documents, geo-spatial data, and user-defined data types. The user-defined data types allow for the definition of data types and rules to implement such complex data types as fingerprints and geo-spatial data. LOBs can be characterized as Binary Large Objects (BLOBs) or Character Large Objects (CLOBs). The leading databases have different implementations of LOBs, and we'll take a look at some of them in upcoming tips. -------------------------------------------------------------------------------- LOBS IN SYBASE AND MICROSOFT SQL SERVER Sybase and Microsoft have similar Large Object (LOB) implementations. Sybase has two large object data types. The first is TEXT and stores up to 2 GB of character or text data. A second, the IMAGE data type, can also store up to 2 GB of binary data. These data types vary in length and are similar to Oracle and DB2's CLOB and BLOB data types. Microsoft SQL Server adds a third data type, NTEXT, a double-byte large text object that can store up to 1 GB of data. It is similar to Oracle and DB2's NCLOB and DBCLOB data types, respectively. -------------------------------------------------------------------------------- LOBS IN SQL-99 SQL-99 (formerly SQL-3) defines the SQL standard to support two Large Object (LOB) data types. These are Character Large Objects (CLOBs) and Binary Large Objects (BLOBs). Data stored as a BLOB or a CLOB is to be stored in the database rather than externally. These data types are "special" and do not need to support all of the standard functionality of other, structured data types. For example, comparison operators that are built into the database (like =, <, >) are not supported. Instead, you are given the capability to extend these data types by providing your own functions, comparison operators, and optimization routines. These data types do, however, come with extra functionality that the smaller, non-LOB data types do not have: the ability to manipulate the BLOBs and CLOBs with functions that allow us to retrieve, replace, concatenate, and perform like operations. As you can see from this and previous tips, the major database vendors have implemented these standards and extended the SQL-99 LOB functionality. -------------------------------------------------------------------------------- BLOB STORAGE: IN-LINE AND OUT-OF-LINE BLOB, CLOB, and NCLOB segments can be stored in-line or out-of-line--meaning that they can be stored in the same tablespace as the table where the LOB is defined or in a separate tablespace. In-line storage means that the LOB column is stored in a row in the same segment as the rest of the table. Out-of-line means that the table row contains a small pointer to the LOB data that is stored in another container or segment on a separate tablespace. Storing a LOB in-line should be performed only in very extreme situations where the LOB is small and can't grow, and where the increased row size won't severely affect performance. Remember that storing the LOB in-line will require that table scans read many more blocks of data than they would if the data were stored out-of-line. Also, in-line storage usually has very strict limitations on how large the LOB data can be. The out-of-line option gives you more scalability and tuning options. -------------------------------------------------------------------------------- WHEN NULL VALUES ARE EQUAL Consider the following three statements: If either value in a comparison (=) is Null, the result is unknown. A null is never equal to another null value. A null is neither equal to nor is it not equal to another value. The above statements are all true. There is, however, one exception to the above: duplicates. When you use the Distinct function, two null values will be grouped as one. Also, a unique index on a null column will allow only one null value (if it is a single-column index, or if other uniquely indexed columns from this row match the values of the same columns on another row), and the second will be considered a duplicate. Keep this in mind when grouping data and creating unique indexes. -------------------------------------------------------------------------------- WHEN A DATABASE IS NOT A DATABASE The next several tips will look at differences between Microsoft's SQL Server and Oracle. In today's installment, you'll see that, amazingly, the term "database" does not mean the same thing for both DBMSs. An Oracle database includes all components of a single server installation. In other words, the processes, database and log buffers, SGA, System, rollback segment, temporary, tools and user tablespaces, online and archive logs, and other related files. An Oracle database has many different "owners" that can be used to separate logical system or application components. An SQL Server database, on the other hand, is part of a single SQL Server install. In other words, an SQL Server install includes the Master database as well as other user databases. An implementation of an Oracle database is similar to the implementation of a single SQL Server Engine. A SQL Server database can be thought of as being similar to an Oracle Owner or Schema. When the term "database" means two different things for two of the major DBMSs, you know that other large differences must exist--and we'll take a look at some of these in upcoming tips. Some of this information was gleaned from the following white paper: Migrating Oracle Applications to SQL Server by Miaco Corporation, The Database Experts, Denver Co. http://www.microsoft.com/TechNet/sql/Tools/Sqldevkt/ORCL2SQL.asp -------------------------------------------------------------------------------- WATCH OUT FOR AUTO-COMMITS You've read about transactions in previous SQL tips and how you should commit or roll back all SQL statements related to a single, atomic business transaction. There are, however, some dangerous examples where commits are performed implicitly without your being aware of it. One such example is the performing of DDL statements that automatically commit once complete. Another, more insidious one is the default in ODBC that performs an automatic commit (auto-commit) after every SQL statement. This is a violation of both the SQL standard and common sense. Please be aware of this behavior when using ODBC, and change the default. To perform manual commits, explicitly start a transaction and issue the commit or rollback. -------------------------------------------------------------------------------- TRUNCATE TABLE IN ORACLE AND SQL SERVER There is one statement that is similar (though not exactly the same) between Oracle and SQL Server: The truncate table statement is similar for both DBMSs and allows storage to be reclaimed in tables and indexes. The syntax is shown below: Truncate Table table_name; Oracle provides the seldom-used option to either "drop storage" (the default) or "reuse storage." Note that delete statements do not drop the storage used by a table. -------------------------------------------------------------------------------- TRANSACTIONS IN ORACLE AND SQL SERVER An Oracle transaction begins when the first insert, update, or delete is encountered and ends at commit or rollback time. If a commit is not performed, the transaction will implicitly roll back. SQL Server operates in a different manner: By default it issues a commit after every insert, update, or delete. This can be changed by setting 'implicit' or 'explicit' transaction modes. Using the implicit mode sets SQL Server to operate in the same manner as Oracle and is set by "set implicit_transactions on". Explicit transactions in SQL Server are groupings of statements bounded by begin transaction on the one end and commit, rollback, or save transaction on the other. -------------------------------------------------------------------------------- THE COMPUTE CLAUSE IN SQL SERVER The compute clause is used in SQL Server to perform aggregation functions such as sum, avg, count, max, and min. This allows you to view both the detail values as well as the aggregate values in a single query. This essentially adds totals and subtotals to queries. Here is an example: Select cust_name, cust_sales From Customer Compute sum(cust_sales); The above query delivers a result similar to the following: CUST_NAME CUST_SALES IBM 1.23 ORACLE 2.01 MicsoSoft 3.00 SUM 6.24 -------------------------------------------------------------------------------- TEMPORARY TABLES IN SQL SERVER SQL Server has very strong temporary table support and supports three types of temporary tables: If the table is created in tempdb as tempdb..table_name, the table persists until it is dropped and users must be granted access. A global temporary table exists for the duration of the session of the procedure that created it, and this type of table is called ##table_name. The data in a global temporary table can be shared by multiple users for the duration of the session. A local temporary table exists for the same length of time as a global temporary table, but it is useable and visible only by the user that creates it. -------------------------------------------------------------------------------- TEMPORARY TABLES IN ORACLE Oracle and SQL Server both support temporary tables. Oracle, however, supports Global Temporary Tables. In this case, the table is temporary and its definition is visible to all oracle sessions, BUT the data in the table is visible only to the session that inserts the data. The data in this table persists only for the session or the transaction. The "on commit" keyword states whether the data is transaction- or session-specific. "On Commit Delete Rows" states that this is transaction-specific and that the table will be truncated at commit time. "On Commit Preserve Rows" is session-specific, and the table will be truncated when the session ends. Here is an example from Oracle's documentation: CREATE GLOBAL TEMPORARY TABLE work_area (startdate DATE, enddate DATE, class CHAR(20)) ON COMMIT DELETE ROWS; -------------------------------------------------------------------------------- SQL'S 3-VALUE LOGIC There has been a great deal written over the past 15 years about the difficulty of understanding the meaning of null values in columns. The null could mean two things: It could mean that data is not applicable in this situation, or it could mean that the value of the data is unknown. The presence of a null value does not tell us why the value is null. Also, with programming languages, we usually deal with simple, two-value logic: The result of a function is usually true or false. The introduction of nulls means that we must deal with three-value logic. A comparison can return true, false, or unknown. This three-value logic must be understood and dealt with in our SQL and programs. -------------------------------------------------------------------------------- SELECT STATEMENTS AS TABLES IN ORACLE AND SQL SERVER One handy facility supported by both DBMSs is the potential to use select statements as tables. Here is an example: Select c.custname, c.purchase_ttl, purchase_avg From customer c, (select avg(purchase_ttl) purchase_avg from customer); -------------------------------------------------------------------------------- SAME FEATURES, DIFFERENT NAMES SQL Server and Oracle both have many similar functions and comparison operators that perform the same but have different names or different syntax. These are too numerous to mention them all here, but here are a few of the most relevant examples: ORACLE SQL SERVER CEIL CEILING LN LOG LOG(10) LOG10 INSTR CHARINDEX RPAD Replicate or Space USER USER_NAME The important thing to know here is that the same functionality can often be achieved in both databases--though the function names and syntax may differ. If you're looking for a function in one database that you think must exist in the other, use your imagination and look for synonyms or different spellings. Keep in mind, though, that like-named features may offer different functionality. In other words, take nothing for granted. -------------------------------------------------------------------------------- PL/SQL AND TRANSACT/SQL CONVERSION STRATEGIES The white paper "Migrating Oracle Applications to SQL Server" from Microsoft TechNet recommends a conversion strategy when you convert PL/SQL code to Transact/SQL. The steps below borrow from this list and add a few other points to consider. Remove unnecessary code. Change outer joins to the ansi outer join standard. Replace and rewrite functions. Check all operators and replace those that are different. Replace concatenation operators ( || in Oracle and + in SQL Server). Replace pl/sql cursors with transact/sql cursors. Replace all procedures, packages, functions, and triggers with the corresponding syntax. Test the system functionality and performance. Keep in mind that syntax varies significantly between PL/SQL and Transact/SQL, so this code conversion should not be taken lightly. Migrating Oracle Applications to SQL Server by Miaco Corporation, The Database Experts, Denver Co. http://www.microsoft.com/TechNet/sql/Tools/Sqldevkt/ORCL2SQL.asp -------------------------------------------------------------------------------- ORACLE AND SQL SERVER TIMESTAMP SQL Server supports the timestamp data type, but this is misleading since it has nothing to do with date or time, but rather is a unique value that can be used to determine whether a row has changed. Applications use this to implement optimistic locking by verifying that a timestamp column at update time is the same as the one that was selected. This reduces locking overhead and increases concurrence and system efficiency. Do not assume that this is the same as an Oracle date data type--these are two very different things. -------------------------------------------------------------------------------- LOCK ESCALATION IN SQL SERVER SQL Server will perform page-level locks by default. When a query scans rows from a large number of pages, the page locks will turn into a table lock. This is done to increase efficiency, since each lock requires overhead and a large percentage of the table is going to be locked anyway. The default lock escalation threshold is 200 pages, but this value can be changed and set to be a percentage of the table size. When share page locks are escalated, the query obtains a share lock on the table, while exclusive page locks will escalate to an exclusive table lock. Understanding the lock escalation patterns of your system is critical to your overall design and the levels of concurrence that your system will be able to achieve. -------------------------------------------------------------------------------- LOBS IN INFORMIX The Informix Database provides support for the following four types of Large Objects (LOBs): BLOB: A smart binary large object that can store up to 4 TB (terabytes) of binary data. CLOB: A smart character large object that can store up to 4 TB of character data. TEXT: A relatively simple data type that can store up to 1 GB of text data. It is not nearly as powerful as a CLOB data type. BYTE: A relatively simple data type that can store up to 1 GB of binary data. It is also not as powerful as a BLOB data type. Yes, you read the above correctly. Just when you were getting used to the idea of a terabyte database, you now need to start thinking about the possibility of storing a four-terabyte column! -------------------------------------------------------------------------------- LIKE STATEMENT IN ORACLE AND SQL SERVER Both Oracle and SQL Server, of course, support the like statement. SQL Server provides a couple of capabilities to search within ranges. To search for a certain position within a range, use [range1-rangen] or [^range1-rangen]. For example, consider the case where you want to find all customers with a customer_name starting with 'S' and having a second character of 'A' or 'B' and not having a third character of 'C' or 'D.' With SQL Server, your code would look like this: Select customer_name from Customer Where customer_name like 'S[AB][^C]%'; In Oracle, the code would look like this: Select customer_name from Customer Where (customer_name like 'SA%' and customer_name not like 'SAC%') or (customer_name like 'SB%' and customer_name not like 'SBC%') -------------------------------------------------------------------------------- JOINS AND OUTER JOINS Oracle and SQL Server both support outer joins--but each in their own way. Oracle uses syntax that places a '+' next to the table that will be the inner table (table with fewer values and perhaps some null values) of the outer join. SQL Server supports its own proprietary syntax (*= and =*) as well as ansi syntax. Use the ansi standard whenever possible. In Oracle, for example, you would show all departments and their employees, where they exist: SELECT empid, deptid FROM emp, dept WHERE dept.deptno = emp.deptno (+); DNAME ENAME 5 1 5 1 6 In SQL Server, this is written using the ansi standard: Select empid, deptid FROM dept left outer join emp on dept.deptno = emp.deptno; -------------------------------------------------------------------------------- GETTING THE FIRST ROWS USING A SUBSELECT Neither DB2 nor Oracle provides a feature that allows us to get the top (or first n) rows, as SQL Server does with its Select Top n From Table feature. This can be overcome, to a certain extent, with the use of a subselect. Consider the case where you want to retrieve the first 20 rows from a table, in which case you could perform the following subselect: Select * >From Customer outcust Where 20 > (Select count(*) From Customer incust Where outcust.customer_id < incust.customer_id); -------------------------------------------------------------------------------- GENERATING UNIQUE KEYS IN ORACLE AND SQL SERVER Generating unique keys in the two databases is simple, though both have a different implementation. Oracle, for starters, uses sequences. A sequence is created as an independent object from any table that can be used to generate unique keys. In SQL Server, an identity column can be used to do the same thing. Identity differs from a sequence in that it is part of the table and column. Both of these allow a starting (ident_seed) value as well as an incrementing value (ident_incr). The implementation and use of these are both simple and very efficient. -------------------------------------------------------------------------------- DATES IN ORACLE Oracle users who want to use the Date column to store the Date without an implied Time should first set the Time portion of the Date column using the "trunc" function. For example, using the trunc function shown below will set the time to 00:00:00: Select trunc(sysdate) from Dual; In this case, if trunc were not used, the current system time would be used. Trunc can also be used to round dates. The format for this is: Trunc(date,format) Without a specified format, the time will be truncated. Setting the format sets the date to the beginning of that time period. For example, consider the following truncation to the year: Select trunc(to_date('2001-FEB-22','YYYY-MON-DD'),'YEAR') from Dual; returns '2001-JAN-01' -------------------------------------------------------------------------------- DATES IN ORACLE When using dates in Oracle, you need to be aware that the time is always stored in the data column. In cases where you can see only the date part of the column, you could display a hidden time value by altering your session as follows: alter session set NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS'; ` When you store date values in Oracle, use the absolute date for logging and audit purposes, and truncate a date (this sets the time value to 00:00:00) when it is being used to represent a day--for example, in a Calendar table--as opposed to a time. -------------------------------------------------------------------------------- CREATE TABLES WITH SELECT STATEMENTS Both Oracle and SQL Server allow you to create tables by selecting rows from other tables. The syntax, however, is different: Here is the sample code in Oracle: Create table new_table_name as select * from old_table where... ; Here is the sample code in SQL Server: Select * into new_table_name from old_table where ... ; Note that referential constraints are not created on the new table and that SQL Server requires the "select into/bulkcopy" option be set to true. -------------------------------------------------------------------------------- COMPARING ORACLE AND SQL SERVER DATA TYPES Data types between Oracle and SQL Server are not exactly the same, but many are similar. For structured (non-blob) data types, some data type comparisons are shown below: Oracle and SQL Server both use Char data types. For varying character data types, use varchar2 in Oracle and varchar in SQL Server. Oracle's raw and long raw can be converted to SQL Server's varbinary and image. These are the data types that are roughly similar between the two databases. Some data types that are quite different include numbers, dates, and identity types. Oracle uses only Number, while SQL Server supports tinyint, smallint, int, numeric (this is the closest to Oracle's Number), float, or real. Oracle's Date is its only date data type, which is roughly similar to SQL Server's datetime. Oracle's Sysdate is similar to SQL Server's Getdate(). The Oracle currval and nextval functions are similar to the identity data type and its related functions. As you can see, there are some similarities in data types, but none are exactly the same. -------------------------------------------------------------------------------- CODING OUTER JOINS When writing outer joins--be they left-outer joins or right-outer joins--consider writing them as left-outer joins. This will provide consistency when writing and reading your code. Some systems convert outer-joins from right to left in Explains and in optimization. Writing the outer join as a left-outer will allow you to more easily understand the statement and Explain information. -------------------------------------------------------------------------------- BLOBS AND LOGGINGDatabases provide logging capabilities that are used to support database backup and recovery. When a change is made to a table row, a "before" and "after" image of the changed data is written to the log. An insert operation will write only an after image of changed data, and a delete operation will write only a before image to the log. An update has both before and after information written to the log. In the case of LOB data, you need to consider whether to turn on logging. This requires analyzing your database performance requirements and weighing them against data availability and reliability. One of the great advantages of storing LOBs in databases is that it allows you to potentially manage all of your data in the database and to take advantage of the DBMS's backup and recovery mechanisms. This allows you to relate the LOB data to other business data and to provide consistent backup and recovery of business data and LOB data. There is, however, a large performance penalty that must be paid when logging this much data--and if you have another way of restoring it (say, from a file backup), then you should turn off logging for the LOB columns. Please keep in mind that I hate giving this type of advice (i.e., turning off logging), but it is a performance reality given the current state-of-the-art technology. I assume this will change before too long.
SQL: SINGLE QUOTES AND NUMBERSA user recently sent me a snippet of code that was giving him a problem in SQL Server. He was trying to insert a record using the Execute method of the Connection object, and his statement looked something like this: INSERT INTO tblData (Field1, Field2) VALUES ('123', '456') While this looked fine at first glance, closer inspection revealed that Field1 and Field2 were defined as numeric data types. In this case, the single quotes were not required--and, in fact, were not allowed to be around the data values. Once the single quotes were removed, the code worked fine. When writing SQL statements, be sure to keep an eye on the correct use of single quotes.
|
Just Check out some of our sponsors |
|
COPYRIGHT 1998 - 2009 All names used are Trademarks of the respective companies Send mail to
CompanyWebmaster with
questions or comments about this web site.
|