SQL and SQL only: 2009

Monday, December 28, 2009

Alt + F1

This post is by far the silliest post by me but again very useful and handy :)
This one has been the most popular tip among by developers and no indexing/query tuning tip impressed them this much. So whats that brilliant(!) tip ?

On a Query Analyzer/Management Studio (SSMS), If you want to find the properties of the table just select the table name and press ALT + F1. Provides the Columns in the table,constraints,indexes,Identity column details etc. Actually it provides the result of sp_help. Grab a look at the picture provided below.

Just a word of caution. Length column expressed by sp_help is size of a column in bytes.So for a nvarchar column length 100 means that column can hold a max of 50 characters only. If you are used to seeing column properties from design view in Enterprise manager/SSMS then can fall to this trap.Though all of us know about the nchar/nvarchar's 2 byte per character storage its just one of those things which one needs to bear in mind while using this shortcut key.

Other Shortcuts available by default are

* ctrl + 1 -> sp_who
* ctrl + 2 -> sp_lock

You can customize it, as it is available under Tools -> Customize in Query Analyzer and Tools->options in SSMS.Picture provided below.

Note that you can even add queries. I have added a look up on sysprocesses to have a quick look on whats happeining on my server and sp_spaceused for approximate row counts :)

Monday, December 21, 2009

Table Variable -> Recompilations,Statistics and Parallel Plans

My last post on Table variable vs Temp table ended with comparisons of their performance at different scenarios. This one not essentially a part -3, but will deal with few characteristics of table variable.

By definition, Table variables are never recompiled. Just to provide little intro about Recompilations,Recompilation reloads a query plan into the cache. That is , the current plan prepared for the statement is marked as invalid and a new plan is loaded into the cache and this process is briefly put as recompilation. Recompilations happen when statistics are updated to an object,when the definition of an object changes and a few other scenarios. As stats are never maintained and the structure of a declared table variable can never change, table variables never cause a recompilation. Though Recompilations come at a cost, they are useful in maintaining the correct stats for our objects and hence its a trade off between Recompilation cost vs Updated stats.

Consider the following query:

DECLARE  @dt DATETIME

SELECT @dt = Getdate()

DECLARE  @tb  TABLE(

                    row_number  INT    PRIMARY KEY,

                    supplier_id INT,

                    product_id  INT,

                    trans_city  VARCHAR(100),

                    trans_date  DATETIME,

                    qty         INT ,

                       UNIQUE (supplier_id,row_number ))

INSERT INTO @tb

SELECT Row_number()

         OVER(ORDER BY supplier_xid),

       supplier_xid,

       product_xid,

       trans_city,

       trans_date,

       qty

FROM   supplier_product_customer

SELECT *

FROM   @tb

WHERE  row_number = 100

SELECT *

FROM   @tb

WHERE  supplier_id = 1

SELECT Datediff(ms,@dt,Getdate())

The table contains 110k rows.

As you may notice, I am having a table variable, a Primary Key on the column row_number which implies, it would have a clustered index on it and a Unique key on Supplier_id,row_number which will create a non clustered index on it.There is a myth going around saying that you cant have Non clustered indexes on table variable. It is true that you cant directly create NC index but you can do so indirectly by making unique keys. Though, the NC index on table variable is not all that useful, but just wanted to remind that it is possible to create one.ok.Now for the query plan..

As you may notice the highlighted section, which shows a clear differnce between Estimated rows and actual rows.When you use table variable the estimated count is always 1 as it has no stats for judge the number of rows.

In comparison, consider the same code against a temporary table

SET statistics  io  ON

DECLARE  @dt DATETIME

SELECT @dt = Getdate()

CREATE TABLE #tb (

  row_number  INT    PRIMARY KEY,

  supplier_id INT,

  product_id  INT,

  trans_city  VARCHAR(100),

  trans_date  DATETIME,

  qty         INT)

CREATE INDEX [CIX_supplier_product_customer] ON #tb (

      supplier_id)

WITH (statistics_norecompute = OFF,

      sort_in_tempdb = OFF,

      ignore_dup_key = OFF,

      drop_existing = OFF,

      online = OFF,

      allow_row_locks = ON,

      allow_page_locks = OFF)

ON [PRIMARY]

INSERT INTO #tb

SELECT Row_number()

         OVER(ORDER BY supplier_xid),

       supplier_xid,

       product_xid,

       trans_city,

       trans_date,

       qty

FROM   supplier_product_customer

SELECT *

FROM   #tb

WHERE  row_number = 100

SELECT *

FROM   #tb

WHERE  supplier_id = 1

SELECT Datediff(ms,@dt,Getdate())

one small difference. I am creating NC index directly on supplier_id instead of a unique key,as we can directly create a NC index in temp tables. Query plan for the same provided below.

Few things to infer from the query plan. The Actual rows and estimated rows are perfect, as stats are present. Second inference is that Query uses the non clustered index. Table variable doesnt use the NC index created and uses a cluster scan instead. NC index usage would have been effective as the Number of IOs for executing the query 'Select * from #tb where supplier_id = 1' was 38 and the same query against the table variable was 631.

The query 'Select * from #tb where supplier_id = 1' was repeated at 100/1000/10k/100k rows and at all cases NC index on table variable was never used.I observed that NC index created on table var got used only when the NC index covered the query.ie. When the query was changed to 'Select supplier_id,row_number from #tb where supplier_id = 1'. So, the NCIndex on a table variable is not a great idea as the storage engine finds it effective to do a scan most of the times.
On the other hand, the Temp table uses the NC index and provides results at lesser number of IOs.But still, time taken still marginally favours table variables and so whether indexes are getting used or not, table variables win when the queries are fairly simple.

Now for another aspect of table variables:

BOL states that table variable doesn't use parallel plans for updates and deletes.
Table variables don't use parallel plans as effectively as temp tables do.If your 'select' query would gain from Parallel plans then you are better off going for temp tables. Just a example query:

SELECT   t1.supplier_id,
         t1.product_id,
         Count(1)
FROM     #tb t1,
         #tb t2
WHERE    t1.product_id = t2.product_id
         AND t1.trans_city = t2.trans_city
         AND t1.supplier_id != t2.supplier_id
GROUP BY t1.supplier_id,
         t1.product_id
HAVING   Count(1) > 100
ORDER BY t1.supplier_id,
         t1.product_id

Query plan for the same is provided below.

The query finished in about 5 seconds with a parallel plan . The table had 50,000 rows.

The query plan for the same query when table variable is used is provided below.

Query plan doesnt use a Parallel plan ( Degree of parallelism = 1 ). The query took 16 minutes to complete.5 seconds against 16 minutes makes a lot of difference. So, a Parallel plan for complex queries can provide a huge improvement on performance. So, if the query is likely to use parallel plans and its likely to gain out of it ( there are few exceptions where parallelism can slowdown as well which we will discuss on another day ).

Though the table variable doesnt use stats/parallel plans, for simpler queries with relatively small row counts table variables appear to be a fair choice. For Decision making/Datawarehouse type of applications which involve complex calulations with huge rows the balance shifts sharply towards temp tables.

Monday, December 14, 2009

Error - Invalid length parameter passed to the SUBSTRING function.

This is my 25th post. kind of special. :)
So, I am starting a new cateogory called 'Errors'. Under Errors we will deal with those sudden errors which pop up out of the blue. A bit of ( sometimes a lot )
of investigation clears the air and provides a good story to tell.I have one such story to tell today.

I had one developer who came to my desk and said 'Suddenly my stored procedure is failing with the following error

'Invalid length parameter passed to the SUBSTRING function.'

The developer told me that the Stored procedure was running absolutely fine for the
last 2 years in production and there has been no change on it.( This is somethng all
developers say :) ). True, there was no change on the sp. With bit of debugging, i found that the following query was failing.

SELECT *

FROM   dbo.authors_test

WHERE  Charindex(' ',city) - 1 > 0

       AND auto_id IN (SELECT auto_id from authors_test

                       WHERE  Left(city,Charindex(' ',city) - 1) = 'San')

Its obvious that the query was failing because of the function
left(city,charindex(' ',city) -1 ).Function left(city,charindex(' ',city) -1 ) would get a negative 2nd parameter when there is no space in the column city. So the query would fail when it applies the function on any row which doesnt contain ' ' on the column city. But, however if thats the case then the query should have failed long back and not all of a sudden. And also, the developer claimed that as he has done the checking charindex(' ',city) -1 > 0 on outer query to filter out all the rows without the space are filtered out first.Let us check the actual reason the query failed.

The table had a primary key clustered index in auto_id. No other indexes.

SELECT *

FROM   dbo.authors_test

WHERE  Charindex(' ',city) - 1 > 0

       AND auto_id IN (SELECT auto_id from authors_test

                       WHERE  Left(city,Charindex(' ',city) - 1) = 'San')

When I 1st ran the query the table contained little over 90 records and it failed with the error
'Invalid length parameter passed to the SUBSTRING function.'

To check the claim of the application team that the query ran perfectly few days before,I took backup of the database that was few days old and ran the same query. It worked perfectly. The table contained around 50 odd rows and the query worked fine. When checked with application team on the row count, they replied saying the increase in the row count was normal and the data didnt have any dirty/junk/Null values on city column.

So difference in row counts gave me a bit of clue and I started comparing the query plans of the old and new table. First the old one had a query plan which is shown below.

As you can understand, the query plan works in the following way

Step 1 : filter out the rows using the condition
charindex(' ',city) -1 > 0.
uses a Clustered index scan as there are no useful index to use.
Output out of the Clustered index scan operator is auto_id as it
will be used to join with the sub query.

Step 2 : Use a Nested loop join operator to join with the sub query.
Step 3 : Uses a Clustered index seek operator with auto_id as input from
outer query.Clustered index seek operator uses the Clusered
index to directly jump to the row with auto_id it is interested
in. After finding the row directly it applies the conditon
left(city,charindex(' ',city) -1 ) only against the
auto_id selected.

Now let us look at the same query that fails on the latest database.

The estimated query plan( only estimated plan possible as the query fails and hence full plan is not available) shows two differences.

1) The Clustered index seek operator is replaced by Clustered index scan operator at the inner loop ( sub query ).
2) Nested loop join is replaced by Merge join.

The query plan clearly provides the reason why the query failed. As in the second case the Clustered index scan operator on the subquery would scan all the rows in the table for the condition left(city,charindex(' ',city) -1 ) and then match it with the input auto_id. So , in that case though the sub query will read all the values in the table which means charindex(' ',city) -1 ) > 0 condition
specified at the outer query is in vain. The query took a clustered index scan instead of seek as there were more rows in the table than before. The optimizer swithces from a seek to scan when it feels that the cost of seek( mostly IOs ) would be more than that of a scan. How it decides is a potential post on its own, which I will perhaps cover later.As of now, understand that a switch from seek to scan can occur beacuse of data distribution/missing statistics/fragmented indexes etc.

So, a change in query plan can not only slow down your system, but also cause runtime errors in scenarios like this. This also proves the importance of reindexing, as a seek can become a scan even when a reindexing
job fails.So, lessons learnt are

> Always ensure you read the query plan before you deploy
any code to production
> Try your best to visualize the table load and perform data
validations in a fool proof way.
> Make sure reindex jobs are running at the correctly, so
that the queries pick the correct plan.

Noticed that for such queries SQL 2k5 takes a cluster scan and hence errors are indicated early. But still, as any RDBMS changes query plans with respect to the factors explained above, its good to be wary of such scenarios.

Monday, December 7, 2009

Temp table vs Table Variable - Part ii

Before reading this one,please read my previous post here which speaks about basic differences in temp table and table variable.

As mentioned last week, let me run thro the tests I conducted for understanding the performance of temp table against table variable.

Consider the following code:

DBCC freeproccache
DBCC dropcleanbuffers
SET statistics  io  ON
DECLARE  @dt DATETIME
SELECT @dt = Getdate()
CREATE TABLE #tb (
  supplier_id INT,
  product_id  INT,
  trans_city  VARCHAR(100),
  trans_date  DATETIME,
  qty         INT)
INSERT INTO #tb
SELECT DISTINCT supplier_xid,
                product_xid,
                trans_city,
                trans_date,
                qty
FROM   supplier_product_customer
CREATE CLUSTERED INDEX [CIX_supplier_product_customer] ON #tb (
      supplier_id,
      trans_city)
WITH (statistics_norecompute = OFF,
      sort_in_tempdb = OFF,
      ignore_dup_key = OFF,
      drop_existing = OFF,
      online = OFF,
      allow_row_locks = ON,
      allow_page_locks = OFF)
ON [PRIMARY]
SELECT *
FROM   #tb
WHERE  supplier_id = 1721
       AND trans_city = 'Mumbai'
SELECT Datediff(ms,@dt,Getdate())

As you may note, I am insterting a few rows on the temp table. And then I create a clustered index. Fire a straight forward select query.What matters in the code is the section after the insert into the temp table. ie is Creation of the index and the select query.I have noted down a few relevant parmeters for different row counts on the temp table. The query plan obtained is provided below.

Its as expected, a Clustered Index seek which means that the index is getting used.

Let me execute a similar query against table variable. Code posted below.

DBCC freeproccache
DBCC dropcleanbuffers
SET statistics  io  ON
DECLARE  @dt DATETIME
SELECT @dt = Getdate()
DECLARE  @tb  TABLE(
                    supplier_id INT,
                    product_id  INT,
                    trans_city  VARCHAR(100),
                    trans_date  DATETIME,
                    qty         INT
                    )
INSERT INTO @tb
SELECT supplier_xid,
       product_xid,
       trans_city,
       trans_date,
       qty
FROM   supplier_product_customer
SELECT *
FROM   @tb
WHERE  supplier_id = 1721
       AND trans_city = 'Mumbai'
SELECT Datediff(ms,@dt,Getdate())

Query plan obtained for the same is provided below. A straight table scan on table variable as there was no index.

Querycost/IO/time taken for 100/1000/10000/20k/100k rows have been recorded and tabulated below.Please click on the image to view results clearly.

The table compares Query cost, IO, time taken parameters of temp table and table variable.While calulating IOs, one need to take into account the IOs incurred by creating the index as well.So IOs for temp table are represented as 1 + 2, where the number before the '+' sign indicates the IOs incurred by creating the index.

For creating the index, one needs to scan the entire table once. So the above code on using temp table would incur IOs for scanning the table once + IOs for fetching the required row(s). On the other hand, as the query is simple,a table variable can obtain the required row(s) by scanning the table once. The above code using table variable would always incurr the IOs required to scan the entire table once,iresspective of number of in the table.So on IOs front, for a query which as simple as the above one, table variables would always consume less when compared to Indexed temp table.

For time consumed, table variable is faster even at 100k rows. Query cost,which is a number telling you how hard your machine needs to work to execute the query also favours table variables in this context.So when the query is simple and striaghtforward, table variables provide better performance.

Now for the another scenario where a slightly complex query is executed against temp table / table variable.The query attempts to find the city which obtained maximum sales for each supplier.The query against temp table is provided below.

DBCC freeproccache
DBCC dropcleanbuffers
SET statistics  io  ON
DECLARE  @dt DATETIME
SELECT @dt = Getdate()
CREATE TABLE #tb (
  supplier_id INT,
  product_id  INT,
  trans_city  VARCHAR(100),
  trans_date  DATETIME,
  qty         INT)
INSERT INTO #tb
SELECT supplier_xid,
       product_xid,
       trans_city,
       trans_date,
       qty
FROM   supplier_product_customer
CREATE CLUSTERED INDEX [CIX_supplier_product_customer] ON #tb (
      supplier_id,
      trans_city)
WITH (statistics_norecompute = OFF,
      sort_in_tempdb = OFF,
      ignore_dup_key = OFF,
      drop_existing = OFF,
      online = OFF,
      allow_row_locks = ON,
      allow_page_locks = OFF)
ON [PRIMARY]
SELECT   supplier_id,
         trans_city,
         Sum(qty)
FROM     #tb outer_qry
GROUP BY supplier_id,
         trans_city
HAVING   Sum(qty) = (SELECT Max(sum_of_qty)
                     FROM   (SELECT Sum(qty) AS sum_of_qty
                             FROM   #tb sub_query
                             WHERE  sub_query.supplier_id = outer_qry.supplier_id
                                    AND sub_query.trans_city = outer_qry.trans_city) x)
SELECT Datediff(ms,@dt,Getdate())
DROP TABLE #tb

The query plan for the same query is provided. The query plan uses a Nested loop join to achieve the resultset.For each supplierId, the clustered index seek is used to fetch the corressponding rows from the inner loop.Clustered index scan is used at outer loop, so that Supplier IDs enter the inner loop in sorted order.So, in
short the clustered index is used effectively.

Similarly the query against table variable is also provided below.

DBCC freeproccache
DBCC dropcleanbuffers
SET statistics  io  ON
DECLARE  @dt DATETIME
SELECT @dt = Getdate()
DECLARE  @tb  TABLE(
                    supplier_id INT,
                    product_id  INT,
                    trans_city  VARCHAR(100),
                    trans_date  DATETIME,
                    qty         INT
                    )
INSERT INTO @tb
SELECT supplier_xid,
       product_xid,
       trans_city,
       trans_date,
       qty
FROM   supplier_product_customer
SELECT   supplier_id,
         trans_city,
         Sum(qty)
FROM     @tb outer_qry
GROUP BY supplier_id,
         trans_city
HAVING   Sum(qty) = (SELECT Max(sum_of_qty)
                     FROM   (SELECT Sum(qty) sum_of_qty
                             FROM   @tb sub_query
                             WHERE  sub_query.supplier_id = outer_qry.supplier_id
                                    AND sub_query.trans_city = outer_qry.trans_city) x)
SELECT Datediff(ms,@dt,Getdate())
SELECT Getdate()

The query plan is also shown below.Query plan is similar to the one generated for temp table but for few differences.One difference is a table scan at both inner and outer sections of Nested loop operator.Another major difference is additional sort operator required have the rows sorted before passing them to the
inner loop. sort operator can be extremely expensive when run against huge tables.

The results for both the scenarios are tabulated.The results now heavily favour temp tables.Please click on the image to view the results clearly.

On the IOs front temp table always requires lesser IOs as the index proves to be handy. The table variable can only scan and hence the number of IOs increases when the query engine has to read the table multiple times. There is not much of a difference in time taken between the two till 10k rows. But once the
row count reaches 20k the performance of table variable dips badly. Observe the results marked in Red, which show a major performance difference when the table grows bigger. At 1 million rows temp table takes a little less than
3 seconds but table variable takes over 75 seconds which is 25 times slower.

To Summarize, one can go for a table variable when

* Table is really small with a max of 1000 odd rows.
* No complex queries are involved and the table needs
to be read only once/twice.

A temp table is a ideal choice when

* Number of rows are more then 10k
* Queries are complicated and would require multiple passes.
* Query looks like an ideal candidate for indexing.

Anyways, one is strongly encouraged to test with both the options when scenario is not as clear as the examples posted above.

There are few more things to be written. Will do a part-3 as well.

Monday, November 30, 2009

Temp table vs Table variable

Ok. Another done to death topic. I will try my best to say something that has not been said.( or atleast said little ).

Before we get started , couple of popular myths to be clarified.

* Both temp table and table variable use tempdb.
The difference is table variable's data is initally stored in memory
and when the data grows bigger it is pushed to tempdb.
The object definition of table variable when it is created is stored in tempdb.

* Performance wise both have their advantages and disadvantages. There are a few
scenarios where temp table outperforms a table variable which we will attempt to
see.

Few pointers about them are given below:

* Table variable can have constraints but not indexes. This means that one can
create primary key( creates cluster index ) on table variable or a unique
constraint ( creates non cluster ) but not non unique clustered or non
clustered indexes.However, temp table can have constraints and indexes.
* Statistics are not maintained for table variable but yes for temp table
* Table variables are not recompiled but temp table's are recompiled
* Scope of a table variable is limited to the batch/section but temp table is
available thro out the session.
* changes made to a table variable cant be rollbacked but on a temp table
rollback is possible.

The points 1,2,3 have serious implications on the performance of your code and becomes a major factor in deciding your temp object type.

This post is just a starter placing a few facts on this topic. Next few posts,
I will come up withn few scenarios and examples to compare the performance of
both the objects.So please hang on for few more days for in depth analysis :)

Sunday, November 22, 2009

Covering Index - Include NonKey columns

To help my impatient readers reaching here thro google search, let me come to the point right away.

What is Include Non Key columns in SQL 2005?

Include Non Key Columns allows one to add columns only to the leaf of a non clustered index. Include Non Key columns option is applicable only for Non Clustered indexes.

Whats the use?

* SQL Server has a restriction that the Key columns of an index ( the columns that are present in the non leaf levels of an index) can have a maximum size
of 900 bytes. Also, Number of columns within an index cant exceed 16. Non key
columns are not counted for these restrictions. So if you are likely to execeed
900 bytes Include non key columns.
* Including Non key columns increases your chances of covering your query and hence boosting performance.

How to include non key columns?

SSMS->User Table->Indexes->Index properties->Included Columns.
Refer Screenshot below.

/*****************************************************************************************/

Now for my usual lengthy write up.

Covered indexes have a few drawbacks as stated above. 'Include Non Key columns' is wonderful especially in the following scenario.

DECLARE  @dt DATETIME

SET @dt = Getdate()

SELECT supplier_xid,

       customer_xid,

       product_xid,

       trans_city,

       trans_date,

       full_payment_receieved,

       payment_mode

FROM   dbo.supplier_product_customer

WHERE  trans_city = 'Moscow'

       AND payment_mode = 'Amex'

SELECT Datediff(ms,@dt,Getdate())

Supplier_xid,Customer_xid,Product_xid,Trans_city,Trans_date Columns from the Primary key.Trans_city column has a nonclustered index.The stats obtained by running the query are provided below.The table contains 112155 rows.

Time taken : 343 ms

IO stats : Table 'Supplier_Product_Customer'. Scan count 1, logical
reads 402, physical reads 3, read-ahead reads 3, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Overall query cost : 0.32

Query Plan is shown below:

It shows Key lookup which means that the query engine refered to Non Clustered index
key and jumped to the table to get some data.This doesnt come as a surprise as the Clustered index and Non clustered index doesnt contain the columns full_payment_receieved, Payment_mode columns, as they are not covered by the Non clustered index.
So ,How can we improve it?
One way is to add these two columns to the Nonclustered index on the trans_city column. Thats not a great idea because

* Already there are 5 key columns that are part of the index ( 4 Cluster
+ 1 Non cluster ). Adding two more would make the index's non leaf
pages huge.

* Payment_mode,full_payment_receieved dont have great selectivity. In
other words,Payment_mode column can have a maximum of 6 disitnct
values. ie.Amex / master/cheque /Cash/Visa/Nets. So its not going to
filter out most of the rows.Trans_city column which has reasonable
selectivity ( greater filtering ability/column not having many
duplicate values ) filters most of the rows and only the rest are
filtered by Payment_mode column(filters 1/6th). This makes Payment_mode
a bad candidate for indexing.'full_payment_receieved' has only Yes/No
values and is not part of the filtering criteria itself, so it
doesnt make great sense to store them at non leaf levels of
the index.

Just to stress a basic point in indexes, its the non leaf levels ( higher/upper
levels of the B+ tree ) which direct the way to search for the required row. More
selective ( filering out more rows) the columns in non leaf, shorter the distance to reach the required row(s) and hence better the performance.

So whats the next option? Including these two columns at leaf level alone using
SQL 2k5's Include non key columns.

CREATE NONCLUSTERED INDEX [NCIX_Supplier_Product_Customer_Trans_City] ON [dbo].[Supplier_Product_Customer] (

      [Trans_city] ASC)

INCLUDE ([Payment_mode],[full_payment_receieved])

WITH (pad_index = OFF,

      statistics_norecompute = OFF,

      sort_in_tempdb = OFF,

      ignore_dup_key = OFF,

      drop_existing = OFF,

      online = OFF,

      allow_row_locks = ON,

      allow_page_locks = ON)

ON [PRIMARY]

So, now let us run the same query.

DBCC FREEPROCCACHE

DBCC DROPCLEANBUFFERS

       --To Clear the cache

SET STATISTICS  io  ON

--To capture IO stats

DECLARE  @dt DATETIME

SET @dt = Getdate()

SELECT supplier_xid,

       customer_xid,

       product_xid,

       trans_city,

       trans_date,

       full_payment_receieved,

       payment_mode

FROM   dbo.supplier_product_customer

WHERE  trans_city = 'Moscow'

       AND payment_mode = 'Amex'

SELECT Datediff(ms,@dt,Getdate())

Query plan screen shot is provided above. Query plan doesnt have any Key lookup and Shows NonClustered index seek which implies the query was fully covered with in the index.

Other stats:

Time taken :170 ms

IO stats : Table 'Supplier_Product_Customer'. Scan count 1,
logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Query cost :0.003

You can see the difference. 170 ms against 340 ms which is almost 50% improvement.
IO stats indicate a tremendous improvement. 5 reads against 402 reads.
Query cost has improved 100 times from 0.32 to 0.003.

So to sum up, Include Nonkey column is ideal when

* Most of the columns have been covered and 1/2 columns have been left out.

* Left out columns are not a part of filtering criteria.

* Even if the left out column is a part of filtering criteria, if it is not
a major filtering condition and has a bad selectivity factor.

* Like for any indexed column, included columns shouldnt be frequently
updated.

Sunday, November 15, 2009

Covering Index

Before we get started with this post please read the last one here.Even if you dont, just know that the leaf of a non clustered index contains Clustered index key.

So what are we upto now?
Covered indexes.

Covered indexes arguablly gets the best out of an index. When all the columns that
are required by the query ( Both Select columns - Projection and where clause columns ( selection criteria ) )are present in an index, then the index is said to be covered index.Covered index performs much better than Non Clustered index + Row look up as the query need not go( rather jump ) to the Data page to fetch the data.
All data access starts and ends within the index itself. The number of data pages in a table are always greater ( theortically greater than equal to ) than the leaf pages of an index, as data pages need to hold all the columns in the table whereas the index leaf pages hold only the indexed columns. Let us see an example..

Consider the following table

CREATE TABLE [dbo].[Supplier_Product_Customer] (

  [Supplier_xid]  [INT]    NOT NULL,

  [Customer_xid]  [INT]    NOT NULL,

  [Product_xid]   [INT]    NOT NULL,

  [trans_date]    [DATETIME]    NOT NULL,

  [Qty]           [INT]    NULL,

  [Trans_city]    [VARCHAR](100)    NULL,

  [Note]          [VARCHAR](500)    NULL,

  [Rate_per_unit] FLOAT,

  [comments]      VARCHAR(500),

  [balance_left]  INT,

  [Payment_mode]  VARCHAR(50)

  CONSTRAINT [PK_Supplier_Product_Customer] PRIMARY KEY CLUSTERED ( [Supplier_xid] ASC,[Customer_xid] ASC,[Product_xid] ASC,[trans_date] ASC ) WITH ( pad_index = OFF,statistics_norecompute = OFF,ignore_dup_key = OFF,allow_row_locks = on,allow_page_locks = on,FILLFACTOR = 90 ) ON [PRIMARY])

ON [PRIMARY]

A typical transaction table with a primary key on Supplier id ( Supplier_xid) , Customer id ([Customer_xid] ),Product ID ([Product_xid]) and transaction date ( [trans_date] ). So We have clustered index key on our primary key columns.

The table has 112055 rows. Total number of pages is 1020. Consider the following query.

SET STATISTICS  IO  ON

--To get IO stats

SELECT supplier_xid,

       customer_xid,

       product_xid,

       trans_city,

       trans_date

FROM   dbo.supplier_product_customer

WHERE  trans_city = 'Perth'

The query returns 22316 rows.

IO Statistics are provided below.

Table 'Supplier_Product_Customer'. Scan count 1, logical reads 1029, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Grab a look at the query plan.

Query Plan shows a clustered index scan.

Points to note are

* Query plan uses Clustered index scan
* Total Logical Reads : 1029
* Query cost: 0.881357

From IO stats and query plan its obvious that the entire table was read/scanned.

To make this query use a covered index, let us create a nonclustered index on Trans_city column.

CREATE NONCLUSTERED INDEX [NCIX_Supplier_Product_Customer_Trans_City] ON [dbo].[Supplier_Product_Customer] (

      [Trans_city] ASC)

WITH (pad_index = OFF,

      statistics_norecompute = OFF,

      sort_in_tempdb = OFF,

      ignore_dup_key = OFF,

      drop_existing = OFF,

      online = OFF,

      allow_row_locks = ON,

      allow_page_locks = ON)

ON [PRIMARY]

Let us run the same query again:

SET STATISTICS  IO  ON

--To get IO stats

SELECT supplier_xid,

       customer_xid,

       product_xid,

       trans_city,

       trans_date

FROM   dbo.supplier_product_customer

WHERE  trans_city = 'Perth'

The query returns the same 22316 rows.

Query plan provided below.

IO Stats are provided below.

Table 'Supplier_Product_Customer'. Scan count 1, logical reads 100, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Points to note are

* Query plan uses Non Clustered index seek
* Total Logical Reads : 100
* Query cost: 0.105275

Query plan indicates a non clustered index seek using the new created NCIX_Supplier_Product_Customer_Trans_City index instead of the clustered index scan.
The number of IOs are also very very less and its down to 100 compared to 1029 in clustered index scan. The cost of the query shows great improvement from 0.88 to 0.105. Hence, there is a significant improvement in the performance as well.

The reason as explained, the query engine accesses only NCIX_Supplier_Product_Customer_Trans_City index and not the table at all as all the columns are present with in the index itself. The out of the columns selected Supplier_xid,Customer_xid,Product_xid,Trans_city,Trans_date make the clustered index which is present in the leaf nodes of Non clustered index. Trans_city column is the non clustered index key and so its also present in the non clustered index. Non Clustered index is definitely smaller than the entire table and hence less number of reads.. So better performance.Had it been any version before SQL 2k then using non clustered index would have needed 22316 IOS + cost of traversing index non leaf pages , which lot more expensive than table scan and hence it would have perfomed a table scan again.But with SQL 2k and after, the unwanted table scan was prevented.

Covered index in SQL 2k has a drawback.SQL 2005 has gone one more step ahead and improved this performance even more by making an enhancement on covered indexes.We will look into the drawback and the improvement done on SQL 2k5 in the next
post.