Tuesday, September 15, 2015

Hekaton Part 7 - Hash Collisions & Hash Buckets

As explained in the earlier post, a hash index places pointers to rows into hash buckets, based on the value returned by the hash function. The hash bucket count for an index is fixed at creation time and doesn't grow as the data grows. So, what happens when the index column contains duplicates that hash to the same hash bucket? Also, two values, even though they are different, may hash to the same hash bucket. Such a scenario is termed a hash collision.

Hash indexes on Hekaton in-memory tables handle hash collisions in the following way:
1) When two values hash to the same hash bucket (h1), the pointer to the latest row inserted (R2) is stored in the hash bucket.
2) The latest inserted row (R2) contains a pointer to the row (R1) that was previously referenced by hash bucket h1.

Refer to the picture below.

A hash index is created on column c1. The row with the value "Alice" in column c1 resides at address 2000. Hash("Alice") maps to hash bucket 1, and hence hash bucket 1 contains the address 2000.

Refer to the picture below to see what happens when another row with the value "Alice" in c1 is inserted.

Let's say the new row is inserted at address 3000. Hash bucket h1 now contains the address 3000, and the new row's metadata (index pointer) at address 3000 contains a pointer to the previous row at address 2000.
Please note 2 important points here.
  1. In in-memory tables, rows contain index pointers, unlike disk-based tables where indexes contain pointers / row locators to the table.
  2. These index pointers are what actually link the table together. This is the reason for the "at least one index" rule on any in-memory table.
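The bucket-plus-chain behaviour described above can be sketched in a few lines of Python. This is a conceptual toy, not SQL Server's actual implementation; the bucket count, class names and Python's built-in hash are assumptions for illustration only.

```python
# Toy model of a hash index: each bucket holds the address of the most
# recently inserted row, and each row keeps an index pointer to the row
# that previously headed the same bucket, forming a chain.
BUCKET_COUNT = 8  # assumed tiny bucket count, just for the sketch

class Row:
    def __init__(self, key, next_row=None):
        self.key = key            # indexed column value, e.g. "Alice"
        self.next_row = next_row  # index pointer to the previous chain head

buckets = [None] * BUCKET_COUNT

def insert(key):
    b = hash(key) % BUCKET_COUNT
    # The new row becomes the bucket head; the old head hangs off its pointer.
    buckets[b] = Row(key, next_row=buckets[b])

def lookup(key):
    row = buckets[hash(key) % BUCKET_COUNT]
    matches = []
    while row is not None:        # walk the chain, skipping collisions
        if row.key == key:
            matches.append(row)
        row = row.next_row
    return matches

insert("Alice")
insert("Alice")                   # duplicate: chains behind the new head
print(len(lookup("Alice")))       # 2
```

Both duplicates and genuine collisions (different keys in the same bucket) end up on the same chain, which is why longer chains mean slower lookups.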
In the next post, we will see a few demos that take a closer look at hash collisions and index chains.

Tuesday, September 8, 2015

Hekaton Part 6:- Hash Indexes - Intro

With SQL Server 2014, a new type of index called the hash index was introduced. A little introduction to hash functions will help in understanding hash indexes better.

When a key value is passed to the hash function, the key value is placed in the hash bucket corresponding to the result the hash function returns.

For example, let us assume modulo ten (% 10) is the hash function. If the key value 1525 is passed to the hash function, it will be placed in hash bucket 5, as 1525 % 10 = 5. Similarly, 537 would map to bucket 7, 2982 would map to bucket 2, and so on.
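The modulo-ten mapping above can be sketched in plain Python (the function name and bucket count are assumptions for illustration):

```python
def hash_bucket(key, bucket_count=10):
    # Toy hash function: modulo of the key value gives the bucket number.
    return key % bucket_count

for key in (1525, 537, 2982):
    print(key, "maps to bucket", hash_bucket(key))
# 1525 maps to bucket 5, 537 to bucket 7, 2982 to bucket 2
```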

Similarly, in hash indexes, the indexed column value is passed to the hash function and, depending upon the result, a pointer to the actual row in the table is stored in a hash bucket. So, to find a row or to process a WHERE clause, all that the SQL engine needs to do is:

  1. Apply the hash function to the search parameter
  2. Go to the hash bucket where the row pointer is stored
  3. Fetch the result
This simplicity, with no tree traversals, makes it one of the fastest methods of access.
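The three lookup steps can be sketched in Python. This is a toy model under the modulo-ten hash from above; the bucket contents and row addresses are made up for illustration.

```python
# Each bucket maps to a list of (key, row_address) entries.
buckets = {5: [(1525, "row @ 2000")], 7: [(537, "row @ 2100")]}

def find(search_key):
    b = search_key % 10                       # 1. apply the hash function
    entries = buckets.get(b, [])              # 2. go to that hash bucket
    return [addr for key, addr in entries
            if key == search_key]             # 3. fetch the matching row(s)

print(find(1525))  # ['row @ 2000']
```

Note that step 3 still compares keys, because different values can share a bucket; the hash only narrows the search to one bucket in O(1).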

A sample hash index definition / syntax is provided below.

CREATE TABLE dbo.HK_tbl
(
[ID] Int identity(1,1) Not null PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
[Data] char(32) COLLATE Latin1_General_100_BIN2 null,
[dt] datetime not null
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)

As indexes can't be added after table creation on in-memory tables in SQL Server 2014, one has to state the index definition inline. SQL Server 2016 supports adding indexes after table creation, but it is still an offline operation.

The bucket count specifies the number of buckets to be used for the index. The bucket count is fixed and is always rounded up to the next power of 2 (1024, 2048, 4096, etc.).
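The rounding can be sketched in Python (the function name is made up; the 100000 figure comes from the sample definition above):

```python
def effective_bucket_count(requested):
    # SQL Server rounds the requested BUCKET_COUNT up to the next power of two.
    n = 1
    while n < requested:
        n *= 2
    return n

print(effective_bucket_count(100000))  # 131072
print(effective_bucket_count(1024))    # 1024 (already a power of two)
```

So the BUCKET_COUNT = 100000 in the sample definition actually yields 131072 buckets.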

In subsequent posts, we will get into more details on hash indexes: their pros and cons, hash bucket counts, and much more.


Friday, September 4, 2015

Hekaton Part 5: Introduction to Indexes in "In Memory Tables"

After a short break in the Hekaton series, we are now back with the 5th post.

Two new index types have been introduced for in-memory tables. They are:

1) Hash index - based on a hashing algorithm - useful for equality searches
2) Range index - sometimes referred to as the nonclustered index on in-memory tables - similar to traditional nonclustered indexes and useful for range query searches

Indexes on in-memory tables are never physically stored on disk. They are purely in-memory structures. Indexes are recreated and loaded into memory every time the database is restarted or restored.

Besides providing the traditional performance improvements, indexes on in-memory tables play another major role. Indexes are actually the chain, or link, which holds the entire table together. In other words, each row in the table is linked to the next row using indexes. So, indexes are a must for in-memory tables. Just to add on:

1) On a schema-and-data in-memory table (a table whose data persists on disk), a primary key is a must. The primary key can be enforced by a hash or range index.

2) On a schema-only table (a table that doesn't retain its data after restarts; only the structure is persisted), there should be at least one index.

Please note that the existing concept of a clustered index is not present in in-memory tables. Also, as they are stored purely in memory, the impact of fragmentation and page splits is minimal or non-existent in in-memory tables.

Intro to indexing done. Next post will deal with hash indexes in detail.

Monday, August 31, 2015

At SQL Server Geeks 2015 - Pre Conference Session - Benjamin Nevarez

Attended the SQL Server Geeks Annual Summit 2015 last Wednesday (26/8) at Bangalore, India. Due to a few urgent responsibilities at work, I could attend only the pre-conference event on Wednesday, skipping the events on Thursday, Friday and Saturday.

The pre-conference event I attended was "SQL Server Optimization and Tuning" by Benjamin Nevarez :) No prizes for guessing why this session... of course, for Mr. Optimizer Benjamin Nevarez!!! I have been a reader of the SQL Server Internals series of books since SQL Server 2005, and I am currently reading his SQL Server 2014 query tuning book, so I was obviously thrilled to meet him and learn more from him. And yes, it did happen!!!

So, now getting to the actual session. The session was scheduled to start at 9 AM, and Benjamin sir and most of the attendees were in by 8:45 itself. The session was off to a flying start, with Benjamin sir explaining the importance of the optimizer right from the early days of relational databases. After the gentle start, Benjamin sir wasted no time in getting into the depths of the optimizer. The scripts and examples showing things like transformation rules, experiments playing with transformation rules (turning a few of them off and asking "what happens now"), and the usage of undocumented trace flags were mind-blowing. It gave the attendees deep insight and a little tour inside the query optimizer itself. Genuine level-500 stuff which only Benjamin sir and probably just a handful of others in the SQL community are capable of delivering.

In the post-lunch session, we were scheduled to look at an array of topics: indexes, statistics, plan reuse and parameter sniffing, the cardinality estimator, Hekaton, and columnstore indexes. That's quite a lot for half a day (4 hours), and hence perhaps Benjamin sir didn't go into as much detail after lunch. I would have wished for a few more level-500 topics, but due to time constraints, and to reach the entire spectrum of the audience, I guess it was designed to be a level-300 session. Nevertheless, it helped me clarify those small doubts one has, even on the topics one knows best.

Overall, the pre-conference session was a great learning experience, not just on the technical front but also on delivering technical presentations. The way Benjamin sir was prepared for the sessions, his time management, the scripts, the demos, and the organization of the presentation slides were simply perfect. And needless to say, meeting a legend of the SQL community like Mr. Benjamin Nevarez is absolutely priceless. So, overall, an event which I will remember for long. Special thanks to Amit Bansal and his team for making this happen.

Other arrangements, like food and beverages, organizers on hand for any help, registration, and directions to the hall, were neatly handled. Overall a well-managed event; kudos to the entire SQL Server Geeks team for their efforts!!! A few pictures taken by the event photographers are shared below.


Tuesday, August 25, 2015

Query cost and Query Execution time - Difference

This post is my 100th post :) I was inactive in blogging for a few years in between, but am now back in full flow. Hope the 200th doesn't take this long!!!

When analysing query plans, one would have noticed the query cost percentage (which appears on top of each query in estimated and actual query plans). Query cost percentages indicate the resources consumed by each query relative to the batch of queries that was executed.

The point to note is that the highest query cost doesn't mean that the query took the longest time to run; it just means that the query consumed the highest amount of resources (CPU, I/O or memory). The query with the highest cost may not always be the longest-running query. Please refer to this example. Run the following script with the option "Include Actual Execution Plan".


Declare @dt Datetime
SET @dt = getdate()

WAITFOR DELAY '00:00:10'
select * from sys.databases where database_id = 5

Print 'time spent in First Query - '
Print Datediff(ms,@dt,getdate())
Print char(10)

/*************** End of First Batch ******************/

SET @dt = getdate()

select syscolumns.name, sysobjects.name, syscolumns.*
from syscolumns, sysobjects
where syscolumns.id = sysobjects.id
and sysobjects.xtype = 'u'
order by sysobjects.crdate

Print 'time spent in Second Query - '
Print Datediff(ms,@dt,getdate())

We will use the query plans to compare the cost of the first and second queries. The variable @dt used in the script tracks the time taken to run each query.

The first set of queries has a delay of 10 seconds and finishes in a little over 10 seconds, while the second query finishes in about 140 milliseconds. Refer to the text results below.

time spent in First Query -


time spent in Second Query -


Comparing the query plans, the first query consumes just 12% of the total cost of the batch, while the second query, being a little more complex, consumes 88% of the total batch cost, in spite of the first query taking much longer to complete.


Friday, August 14, 2015

Hekaton Part 4: Simple Performance comparison - In Memory vs Disk table

Continuing the Hekaton series, this post attempts a very basic performance comparison of an in-memory table and a disk-based table.

Table Creation and Data Loading

CREATE TABLE dbo.Disk_tbl
(
[ID] Int identity(1,1) Not null PRIMARY KEY CLUSTERED,
[Data] char(32) COLLATE Latin1_General_100_BIN2 null,
[dt] datetime not null
)
GO
Insert into Disk_tbl(data,dt) Select 'xyz',getdate()
GO 10000
Insert into Disk_tbl(data,dt)
Select 'xyz',getdate() from Disk_tbl
GO 10

A similar script for the Hekaton table is provided below. Note that the clustered primary key is replaced with a hash index, as clustered indexes are not supported on in-memory tables.

CREATE TABLE dbo.HK_tbl
(
[ID] Int identity(1,1) Not null PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
[Data] char(32) COLLATE Latin1_General_100_BIN2 null,
[dt] datetime not null
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
GO
Insert into HK_tbl(data,dt) Select 'xyz',getdate()
GO 10000
Insert into HK_tbl(data,dt)
Select 'xyz',getdate() from HK_tbl
GO 10

Performance comparison:

Please note that memory needs to be cleaned to have a genuine comparison between the disk-based table and the in-memory table; "DBCC DROPCLEANBUFFERS" helps us with the same. The query just picks one row out of 10 million rows by filtering on the primary key column.

DBCC DROPCLEANBUFFERS
GO
Declare @dt Datetime
SET @dt = getdate()
Select * from HK_tbl where id = 100234
Print datediff(ms,@dt,getdate())
SET @dt = getdate()
Select * from disk_tbl where id = 100234
Print datediff(ms,@dt,getdate())

Performance comparison by time:

The result of the script is provided below.

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

(1 row(s) affected)


(1 row(s) affected)

Table 'Disk_tbl'. Scan count 0, logical reads 3, physical reads 3, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


The in-memory table takes 0 ms compared to 16 ms on the disk-based table.

Performance comparison by IO:

STATISTICS IO doesn't report reads for in-memory tables, as they don't perform physical I/O against the buffer pool. The disk-based table indicates 3 physical reads.

Performance comparison by Cost:

The query plan indicates that the query on the disk-based table had 99% of the query cost, while the in-memory table query consumed only 1% of the total resources used.

On all counts, in-memory tables outperform disk-based tables. However, the scenario considered here is a simple one; as this series progresses, we will analyse more scenarios in greater depth.