Wednesday, January 13, 2016

SQL Server Statistics - Histograms

Most of us are aware of Statistics in SQL Server and its purpose. But still, just to get started, statistics store information on the distribution of data in a column(s), which helps the query optimizer pick the right query plans. Effectively, it helps to answer questions like "how many rows with value 25 on column "Age" or how many more than 15 etc..

This post will get little in depth on how statistics are actually stored in the database. Statistics are actually stored as "histograms" ( One may take a look at histogram used in statistics by expanding a commonly queried table in your database, expand statistics -> properties -> Click on "Details" tab.

Interesting part of the image above is the histogram steps. One would notice that histogram steps contain entries for selected values in the column. Basically what histogram steps intend to convey is the number of rows that exist for a particular or range of values.
For example, for a column containing
marks, it attempts to say
0 - 10   there are 15 rows
10 - 30 - there are 45 rows
50 - 80 - there are 60 rows
80 - 100 - there are 20 rows and so on..
Observe RANGE_HI_KEY column. It indicates the highest value in the range. Each row represents the range starting from the ending of previous row ( Previous row RANGE HI KEY value )  to the value in RANGE_HIGH_KEY. For example the second row "" indicates distribution between x to y
  1. RANGE_ROWS Column indicate the total number of rows in the range ( Previous hi key to current hi key )
  2. DISTINCT_RANGE_ROWS indicate how many among them are distinct
  3. AVG_RANGE_ROWS indicate the average row returned per distinct value
  4. EQ_ROWS indicate total the number of rows for the range hi key value alone.
The most important part is number of steps in the histogram are fixed to 200. Meaning the histogram will have maximum of 200 entries / rows irrespective of number of rows the table. Even if there are 1 million rows, 200 entries will describe the column's data distribution.
So, why this post on statistics? :) Armed with this knowledge, let us dive into deeper concepts like cardinality estimator and how it has changed in SQL 2014 in upcoming posts.

Wednesday, November 18, 2015

In-Memory OLTP Table Checkpoint Processes Performance Comparison

Checkpoint processes and other internal processes have undergone major changes for In Memory OLTP tables and transactions. The following article of mine compares the performance of checkpoint processes in disk based ( traditional ) tables against In memory tables. I understand that, I have not written about checkpoint processes of In Memory tables here. Will be doing soon. Anyways, sharing the article below. Read n comment if interested

Friday, November 6, 2015

Hekaton Part 11 :- Finding row count in memory table

Finding a row count in a "in memory" Hekaton table can be a tricky task as the regular commands and DMVs don't give the desired result.

  • sp_spaceused - doesn't work and returns a blank result
  • sys.dm_db_partition_stats - doesn't track in memory tables.

So the  option we are left with is
     [Schema Name],

      OBJECT_NAME(Ind.OBJECT_ID) As [Table Name],

      ISNULL([rows],0) as [Total Records]
      FROM sys.dm_db_xtp_index_stats AS Ind
      CROSS APPLY sys.dm_db_stats_properties
      WHERE Ind.index_id =2
However, please note that the above query will give correct row counts if and only if statistics are updated for the table. For in memory tables, statistics are not automatically updated manually and hence will have to be updated  before one attempts to find the row count. So, please refer to the query below, to update the statistics for in memory table alone.
DECLARE @id int,@rowcnt int
DECLARE @in_mem_tbl_name varchar(500)
SET @id = 1
CREATE TABLE #in_memory_tbl(id int identity(1,1),in_mem_tbl_name varchar(500))
INSERT INTO #in_memory_tbl(in_mem_tbl_name)
SELECT object_name(object_id) FROM sys.dm_db_xtp_table_memory_stats
WHERE object_id > 0
SET @rowcnt = @@ROWCOUNT
WHILE @id <= @rowcnt
SELECT @in_mem_tbl_name = in_mem_tbl_name
FROM #in_memory_tbl
WHERE id = @id

SET @id = @id + 1
DROP TABLE #in_memory_tbl

Please ensure to run the statistics section above, before attempting to find the row counts on "in memory tables"




Monday, November 2, 2015

Hekaton Part 10: Things Hash Indexes Can do

Hash Indexes on "In memory" tables are best suited for equality searches. Reason is obvious, as they are based on hash algorithm which doesn't sort data, hash indexes aren't useful for sorting or non equality searches ( Refer here). But, hashing being a fast and a effective technique, hash indexes are perhaps the best bet for handling equality searches ( where conditions using "=" ). Note that, equality searches are by far the most common filter condition you would see in queries. So, it is one of those things Microsoft has done to address the obvious with best possible solution ignoring the not so common problems :) Besides equality filters other areas where hash indexes can be useful are

Group by clause:

Consider the following query

SELECT count(*),data
FROM [HK_tbl]

Query plan indicates a index scan but no additional sorting or grouping required, as even though data is not sorted by hash index, it is grouped / categorized and hence it is very effective for processing "group by" queries.

Inner Join:

HK_tbl2 is another "in memory" table with hash index on data column.

ON =

Outer table is scanned as there is no filtering condition while the inner table makes use of the "Hash Index" . Query plan shown below


Left Outer Join:

ON =

Semi equijoin condition like left outer join also benefits from hash index


Sunday, October 25, 2015

Hekaton Part 9: Things Hash Indexes can't do

Last few posts have explored hash indexes in depth. This one will deal with the limitations on Hash Indexes. What are things hash indexes can't do?
Hash indexes are based on hash functions, which doesn't sort the data at all like conventional (binary tree indexes). Hash Indexes just categorize the data into hash buckets which help in fetch the required row faster.

 "HK_tbl" is a in memory table, with column "data" having a hash index.

Order by:

Consider the following query using Order by:

ORDER BY data desc;

Query plan shown below:

Observe the "Sort" operator. Sort operator indicates that the data was sorted and index didn't avoid sort operation.

Pattern match using like and '%'

WHERE data like 'C%'

Observe the filter operator. Filter operator indicates that the predicate ( or the filtering condition ) was applied after the index was scanned. The number of rows which came out of index scan operator would also confirm that index scan operator read the entire table.

Similarly, non equality wouldn't be using hash indexes. For handling range scans, sort operations, the other index type called range index ( also called as in memory non clustered index ) in "in memory" tables would help. 

On next post, we will see the scenarios, which explore scenarios where hash indexes are most useful.

Sunday, October 11, 2015

Hekaton Part 8 - Hash Collisions - Hash_Index_Stats DMV

Continuing from previous post , this post will discuss a few scenarios on hash collisions.
The following "in memory" table with a hash index is created. Please note that the hash bucket count has been set to 1024. Hash bucket count 1024 implies that hash collisions are certain after row count increases 1024.

CREATE TABLE dbo.hash_collision
[Data] uniqueidentifier DEFAULT newsequentialid() ,
Step 1: Load 500 rows
 Insert into hash_collision(dt) select getdate()
GO 500
dm_db_xtp_hash_index_stats dmv can be used to check statistics on hash indexes n hash collisions. At 500 rows, the hash index stats indicates that the buckets are partially filled in.

WHERE object_name(object_id) = 'hash_collision'

  •  "Total Bucket count" indicates the number of buckets in the hash index which is fixed.
  • "Empty Bucket count" indicates the number of buckets that are empty. In this case, close to 50% are empty as we have inserted only 500 rows.
  • "Average Chain Length" indicates average length of a hash chain. In other words, average number of hops one may need to take to find a row.
Step 2: Add another 500 rows
Let us add another 500 rows and check the index status from dm_db_xtp_hash_index_stats
Insert into hash_collision(dt) select getdate()
go 500
select * FROM sys.dm_db_xtp_hash_index_stats WHERE object_name(object_id) = 'hash_collision'

 dm_db_xtp_hash_index_stats indicates that 80% of the buckets are full but the average hash chain length is still at 1 as there are still a few empty hash buckets.

Step 3: Add 9000 rows

Let us add few more thousands - say 9000 rows. Now table contains 10,000 rows but only 1024 hash buckets

Insert into hash_collision(dt) select getdate()
go 9000

SELECT * FROM sys.dm_db_xtp_hash_index_stats
WHERE object_name(object_id) = 'hash_collision'

  Notice a sharp increase in "Average Chain Length" as there more values than the number of buckets. The number of empty buckets is obviously zero.

So does longer hash chain affect the performance? Longer hash chains cause reads or the index scan to be slower. So, it is important to pick the hash bucket count carefully. General recommendation is at least 2 times the number of distinct values in the table. Also, it is always better to over size the hash bucket count instead of under sizing it.