Sunday, October 25, 2015

Hekaton Part 9: Things Hash Indexes can't do

The last few posts explored hash indexes in depth. This one covers their limitations: what can hash indexes not do?
Hash indexes are based on hash functions, which do not keep data in sorted order the way conventional (B-tree) indexes do. A hash index simply distributes rows into hash buckets, which makes fetching a row by an exact key value fast.

 "HK_tbl" is a in memory table, with column "data" having a hash index.

Order by:

Consider the following query using Order by:

SELECT TOP 1000 [ID]
,[Data]
,[dt]
FROM [HKDB].[dbo].[HK_tbl] WITH (SNAPSHOT)
ORDER BY data desc;

Query plan shown below:


Observe the "Sort" operator. Sort operator indicates that the data was sorted and index didn't avoid sort operation.

Pattern match using LIKE and '%'

SELECT TOP 1000 [ID]
,[Data]
,[dt]
FROM [HKDB].[dbo].[HK_tbl] WITH (SNAPSHOT)
WHERE data like 'C%'
 
Observe the Filter operator. It indicates that the predicate (the filtering condition) was applied only after the index was scanned. The number of rows coming out of the index scan operator also confirms that the scan read the entire table; the hash index cannot seek on a LIKE 'C%' pattern.
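For contrast, an equality predicate on the full hash index key can be resolved as an index seek against a single hash bucket. A minimal sketch is below; the literal value is purely illustrative and assumes "Data" is a character column.

SELECT TOP 1000 [ID]
,[Data]
,[dt]
FROM [HKDB].[dbo].[HK_tbl] WITH (SNAPSHOT)
WHERE data = 'CAFE-0001';  -- exact match on the whole key value uses the hash index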

Similarly, inequality and range predicates will not use a hash index. For range scans and sort operations, the other index type available on in-memory tables, the range index (also called the in-memory nonclustered index), is the right choice.
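A minimal sketch of such a range index is below, assuming SQL Server 2014 syntax, where indexes must be declared when the table is created (the table and index names here are made up for illustration; 2014 also requires a BIN2 collation on character index keys).

CREATE TABLE dbo.HK_tbl_range
(
    [ID]   int IDENTITY(1,1) NOT NULL
           PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 131072),
    [Data] varchar(50) COLLATE Latin1_General_100_BIN2 NOT NULL,
    [dt]   datetime NOT NULL,
    -- Range (nonclustered) index: keeps Data in sorted order, so it can
    -- support ORDER BY, LIKE 'C%' prefixes and range predicates
    INDEX ix_data_range NONCLUSTERED ([Data])
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);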

The next post will explore the scenarios where hash indexes are most useful.
 

Sunday, October 11, 2015

Hekaton Part 8 - Hash Collisions - Hash_Index_Stats DMV

Continuing from the previous post, this one walks through a few hash collision scenarios.
 
The following "in memory" table with a hash index is created. Please note that the hash bucket count has been set to 1024. Hash bucket count 1024 implies that hash collisions are certain after row count increases 1024.

CREATE TABLE dbo.hash_collision
(
    [ID] int IDENTITY(1,1) NOT NULL
         PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1024),
    [Data] uniqueidentifier DEFAULT newsequentialid(),
    [dt] datetime NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
 
Step 1: Load 500 rows
 Insert into hash_collision(dt) select getdate()
GO 500
 
The dm_db_xtp_hash_index_stats DMV can be used to check statistics on hash indexes and hash collisions. At 500 rows, the hash index stats indicate that the buckets are only partially filled.
 
SELECT *
FROM sys.dm_db_xtp_hash_index_stats
WHERE object_name(object_id) = 'hash_collision'

 
 
  • "Total Bucket Count" (total_bucket_count) indicates the number of buckets in the hash index, which is fixed when the index is created.
  • "Empty Bucket Count" (empty_bucket_count) indicates the number of buckets that contain no rows. In this case close to 50% are empty, as only 500 rows have been inserted into 1,024 buckets.
  • "Average Chain Length" (avg_chain_length) indicates the average length of a hash chain - in other words, the average number of hops needed to walk to a row within a bucket.
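These columns can also be pulled into a single query along with an empty-bucket percentage. A minimal sketch, assuming the documented columns of sys.dm_db_xtp_hash_index_stats and a join to sys.indexes to show the index name:

SELECT object_name(hs.object_id) AS table_name,
       i.name AS index_name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       CAST(100.0 * hs.empty_bucket_count / hs.total_bucket_count AS decimal(5,2)) AS empty_bucket_pct,
       hs.avg_chain_length,
       hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats AS hs
JOIN sys.indexes AS i
  ON i.object_id = hs.object_id AND i.index_id = hs.index_id
WHERE object_name(hs.object_id) = 'hash_collision';
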
Step 2: Add another 500 rows
 
Let us add another 500 rows and check the index status from dm_db_xtp_hash_index_stats
 
Insert into hash_collision(dt) select getdate()
go 500
 
select * FROM sys.dm_db_xtp_hash_index_stats WHERE object_name(object_id) = 'hash_collision'


dm_db_xtp_hash_index_stats now indicates that about 80% of the buckets are full, but the average hash chain length is still 1, since the number of rows (1,000) has not yet overtaken the number of buckets (1,024).

Step 3: Add 9000 rows

Let us add a few thousand more rows, say 9,000. The table now contains 10,000 rows but still has only 1,024 hash buckets.

Insert into hash_collision(dt) select getdate()
go 9000

SELECT * FROM sys.dm_db_xtp_hash_index_stats
WHERE object_name(object_id) = 'hash_collision'



Notice the sharp increase in "Average Chain Length": with 10,000 rows spread over 1,024 buckets, each bucket now holds roughly 10 rows on average. The number of empty buckets is, unsurprisingly, zero.

So do longer hash chains affect performance? Yes: longer chains make point lookups and index scans slower, because more rows have to be walked within each bucket. It is therefore important to pick the bucket count carefully. The general recommendation is at least two times the number of distinct key values expected in the table, and it is always better to oversize the bucket count than to undersize it.
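On SQL Server 2016 and later the bucket count can be changed in place; on SQL Server 2014 the table has to be dropped and recreated with a new bucket count. A sketch assuming SQL Server 2016 or later; the index name below is made up, so look up the actual auto-generated primary key index name in sys.indexes first.

-- Assumption: SQL Server 2016+; 'PK_hash_collision' is a placeholder index name
-- BUCKET_COUNT is rounded up internally to the next power of two
ALTER TABLE dbo.hash_collision
    ALTER INDEX PK_hash_collision
    REBUILD WITH (BUCKET_COUNT = 32768);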