Tuesday, May 25, 2010

Performance comparison - Functions versus Joins

User defined Scalar function is often chosen as a easier alternative for joining two or multiple tables while writing queries by developers. Other reason given by 'Few' developers is that its provides better readability on a long query. Both are bad reasons for using functions as scalar functions when used inappropriately can potentially kill the performance.

Scalar function when used on a query is executed for every row of the result set, just like a cursor would do. Row by Row processing is very very expensive when compared to SET based processing provided by joins. Scalar functions also makes indexes on the joined tables less effective or even useless.Above factors make joins, in general, a better option when they can replace a scalar function.

We will see the same with an example.

Consider the following tables student , student_marks . Student table contains student details and student_marks contain marks obtained by students on 5 subjects. Table definition and random data population scripts are provided below.

/************* Table [dbo].[student_marks]******/
CREATE TABLE [dbo].[student_marks]
     [sid]          [INT] NOT NULL,
     [marks]        [INT] NULL,
     [subject_name] [VARCHAR](50) NOT NULL,
     [subject_id]   [INT] NOT NULL,
     CONSTRAINT [PK_student_marks] PRIMARY KEY CLUSTERED ( [sid] ASC,
     [subject_id] ASC )

/****** Table [dbo].[Student] ******/
CREATE TABLE [dbo].[Student]
     [sid]          [INT] NOT NULL,
     [student_name] [VARCHAR](50) NULL,

Random data population scripts

/*************Student table data population*********/
SET @id = 0
WHILE @id <= 100000
      INSERT INTO student
      SELECT @id,
             CHAR(( @id % 26 ) + 65) + '-' + CONVERT(VARCHAR, @id)

      SET @id = @id + 1

/*************Student_marks table population**********************/
INSERT INTO student_marks
       Round(Abs(( Rand(sid + x.subid) * 1000000000000 ) - Round((
                       ( Rand(sid + x.subid) *
                         1000000000000 ) / 100 ), 0) * 100), 0),
FROM   student,
       (SELECT 'math' AS subject,
               1      AS subid
        SELECT 'physics',
        SELECT 'chemistry',
        SELECT 'computer_science',
        SELECT 'tamil',
               5) AS x 

The Scripts provided would insert 100,001 rows in student table and 500,005 rows student marks table.Primary key clustered index has been created on sid column on student table and sid,subject_id columns on student_marks table.

Assume that the requirement is to provide the student names and marks of all students in the school.Solution using scalar function would involve creating a function to get student name from student table and querying the student_marks table for obtaining rest of the details. Simple function definition would look like

/****** Object:  UserDefinedFunction [dbo].[student_name]    ******/
SET ansi_nulls ON

SET quoted_identifier ON

CREATE FUNCTION [dbo].[Student_name](@sid INT)
      DECLARE @student_name VARCHAR(50)

      SELECT @student_name = student_name
      FROM   student
      WHERE  sid = @sid

      RETURN @student_name

After creating the function, the following query using scalar function is executed.

SET statistics io ON

SET @dt = Getdate()
SELECT dbo.Student_name(sid),
FROM   student_marks
WHERE  sid < 20000

SELECT Datediff(ms, @dt, Getdate()) 

Perfomance Stats are provided below:
Time taken :55640 ms ( 0:56 Seconds)
Rows Returned :100000
IO details : Table Table 'student_marks'. Scan count 1, logical reads 415, physical reads 56, read-ahead reads 412, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Query plan:

The query plan doesnt show whats exactly happening with in the function/
how many times was the function precisely called.So, to find the same, let me use the DMV dm_exec_query_Stats

SELECT Substring(TEXT, ( statement_start_offset / 2 ) + 1, ( ( CASE
              WHEN -1 THEN Datalength(TEXT)
              ELSE statement_end_offset
       - statement_start_offset ) / 2 ) + 1) AS statement_text,
       Isnull(Object_name(objectid), '-')    AS obj_name,
FROM   sys.dm_exec_query_stats
       OUTER APPLY sys.Dm_exec_sql_text(plan_handle) AS sqltext
WHERE  TEXT NOT LIKE '%dm_exec_query_Stats%'
ORDER  BY execution_count DESC 

Execution count column on dm_exec_query_stats DMV clearly shows that the query Select dbo.student_name(sid),* From student_marks where sid < 20000 in the function dbo.student_name has been executed 10000 times , which is exactly the same number of rows returned. So its basically passing a number ( sid ) and making the functions' code execute over and over which makes it painfully slow.

Let us Consider the join based solution:

SET statistics io ON

SET @dt = Getdate()
SELECT student_name,
FROM   student_marks,
WHERE  student_marks.sid = student.sid
       AND student_marks.sid < 20000
SELECT Datediff(ms, @dt, Getdate()) 

Performance Stats :
Time taken : 2876 ms ( 2.9 Seconds)
Rows Returned: 100,000
IO details : Table 'student_marks'. Scan count 1, logical reads 415, physical reads 3, read-ahead reads 412, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Student'. Scan count 1, logical reads 60, physical reads 1, read-ahead reads 58, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Query plan:

Join based solution completes the query in less than three seconds, which is 20 times faster than Scalar functions. IO also indicates lesser number of reads. The student table is also read only once against 100,000 times as in the case of scalar functions. Query Stats DMV confirm the same.

So the conclusion is, one should always avoid using scalar functions as much as possible and should be used only when we don't have any other option to replace it.Complex string manipulation requirements or date calculation requirements are perhaps, plausible excuses for Scalar functions, but still they have to be used with caution as they have been found as performance bottlenecks more often than not.

No comments: