SQL Server: Index in sql server

What is an index in sql server?

Index is a way to organize data in a table to make some operations like searching, sorting, grouping etc faster. So, in other word we need indexing when sql query has:

1. WHERE clause (That is searching)

2. ORDER BY clause (That is sorting)

3. GROUP BY clause (This is grouping) etc.

Table scan:

Let us assume we have a student table with following schema:

CREATE TABLE Student(

RollNo INT NOT NULL,

Name VARCHAR(50) NULL,

Country VARCHAR(50) NULL,

Age INT NULL

)

And its data as follow:

RollNo	Name	Country	Age
101	Greg	UK	23
102	Sachin	India	21
103	Akaram	Pakistan	22
107	Miyabi	China	18
108	Marry	Russia	27
109	Scott	USA	31
110	Benazir	Banglades	17
111	Miyabi	Japan	24
112	Rahul	India	27
113	Nicolus	France	19

Right now student table has not any index.

SELECT * FROM Student WHERE RollNo = 111

If we execute above sql query, it will go for table scan i.e. it will scan the RollNo column from beginning to end until it doesn't get the record where RollNo = 111. In this way from RollNo 101 to 111, it has to scan 8 times. If the table has too many rows, say 100K, then the number of scan will increase drastically.

Time complexity of table scan is : O(n)

If we observe estimated execution plan it will look something like:

As we observed, table scan is the least efficient way of scanning the record and hence sql server has introduced index to optimize the scanning process i.e. to decrease the number of scan and hence making the operations faster. In sql server there are two type of index:

1. Clustered index

2. Non-Clustered index
What is clustered index in sql server?

When we create a clustered index on any table, physical organization of data in the table is changed. Data in the table are stored as binary search tree(B tree). For example if we will create the clustered index named CL_Student_Roll with RollNo column of Student table as key column of clustered index, physical organization of data in the table will look something like this :

Description of above image:

It is B Tree representation where top node is called root node and bottom nodes are called leaf nodes. All the nodes in between are called intermediate nodes. B Tree can have many intermediate nodes if table has too many records. In this example there are no intermediate nodes and we have assumed size of each node is five i.e. each node can keep maximum five roll numbers. In clustered index leaf nodes keep the memory address of actual (physical) data row.

Now if we will execute following sql query:

SELECT * FROM Student WHERE RollNo = 111

Since Student table has clustered index so now it will go for index scan instead of table scan to search the RollNo = 111. Scan will start from root node in the clustered index CL_Student_Roll .

Scanning steps:

Scan 1: 107 < 111 so it will go to next element of root node.

Scan 2: 110 < 111 so it will go to the next element of root node. Since there is not any other element in root node so it will follow right most link or path.

Scan 3: In leaf node, 110 < 111 o it will go to next element of leaf node.

Scan 4: In leaf node, 111 = 111 so it will follow this path and get the actual data.

Using clustered index we can access the actual data only in 4 scanning while using table scan we had to follow 8 scanning. If table has too many rows then numbers of scan will decrease drastically as compared to table scan.

Time complexity of index scan is: O(log n)

If we observe estimated execution plan, it will look like:

Now table scan has changed to index seek. So it will scan using clustered index CL_Student_Roll

Simple syntax for creating clustered index:

CREATE UNIQUE [CLUSTERED] INDEX <IndexName>

ON <ObjectName>(

<ColumnName> [ASC | DESC ] [ ,...n ]

)

Explanation of each term:

UNIQUE: It is optional. We can specify UNIQUE only when there is no duplicate data in key columns.

<IndexName>: It can be any valid name of the index.

<ObjectName>: It is the name of object on which we are creating index. It can be either table name of view name.

<ColumnName>: It is name of key column. We can specify more than one key columns which is called composite key.

[ASC | DESC]: To specify the order of key column in ascending or descending order.

Some important points about clustered index:

1. If we create table with primary key, sql server automatically creates clustered index on that table. No need to create clustered index manually.

2. A table can have only one clustered index since physical organization of data in a table is same as clustered index and any table can have only one physical organization of data.

3. If clustered index has more than one partition then each partition has a B-tree.

Creating a simple clustered index on student table:

CREATE UNIQUE CLUSTERED INDEX CL_Student_Roll

ON Student(RollNo)

Terms related to estimated execution plan:

Table scan: It is very slow and is used only if table has no clustered index.

Index scan: It is also slow. It is used when table has clustered index and either when non-key columns are present in WHERE clause or when query has not been covered (will discuss later) or both.

Index Seek: It is very fast. Our goal is to achieve this.

If we will keep mouse on any icon in execution plan (GUI) a detailed screen will appear like this:

Terms related to detail section:

Predicate:

It is condition in WHERE clause which is either non-key column or column which has not been covered.

Object: It is name of the source from where it is getting data. It can be name of table, Clustered index or non-clustered index.

Output list: It is a list of names of the columns which it is getting from object.

Seek Predicate: It is condition in WHERE clause which is either key column or fully covered.

Our goal is to move the conditions to the predicate section to seek predicate.

Consider the sql query:

SELECT RollNo FROM Student WHERE Age = 24

If we observe estimated execution plan it will look like:

Predicate is:

[DbName].[dbo].[Student].[Age] = CONVERT _IMPLICIT ( int, [@1], 0)

Object or source is: Cl_Student_Roll that is clustered index we have created.

So it is obvious that it is scanning using clustered index and hence it is index scan. So it is less efficient since searching is being done using the column age which is non-key column of clustered index CL_Student_Roll. Its key column is RollNo. Time complexity of index scan is same as table scan i.e. O (n). As we know table can have only one clustered index, to make such type of query more efficient sql server has introduced non-clustered index.

Important points about clustered index:

1. In create table statement we can specify CLUSTERED and NONCLUSTERED keyword on only those columns which have either PRIMARY KEY or UNIQUE constraints. For example:

CREATE TABLE Emp(

ID INT PRIMARY KEY CLUSTERED,

RegNo INT UNIQUE NONCLUSTERED

)

Default value of PRIMARY KEY constraints is CLUSTERED if table has no existing clustered index or if we have not specified CLUSTERED index on any columns of UNIQUE constraint. Default value of UNIQUE constraints is NONCLUSTERED

2. Creating a primary key on a table also creates a unique non-clustered index of table if table has any existing clustered index.

3. A table or view can have only one clustered index.

4. Physical order of rows of table is same as logical order of key columns of clustered index.

5. Leaf nodes of clustered index keep actual data rows.

6. When we create a clustered index on table which has existing non-clustered indexes, all non-clustered indexes are re-built after creation of clustered index. In other words we can say initially non-clustered indexes which are based on heap are converted into clustered index

What is non clustered index in sql server?

It is logical organization of data in a table. A table can have 1023 non-clustered index. A non-clustered index can be of two types as:

1. Heap

2. Based on clustered index.

If table has clustered index then leaf node of non-clustered index keeps the key columns of clustered index. If the table has no clustered index then leaf node of non-clustered index keeps RID which is unique to each row of table. Such non-clustered index is called heap.

Non-clustered index based on clustered index:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0Zd6_uBmZs9YWgtVDkTepgSvGA5w1pjsNIYV0EVzEcvOCS7BTwQtmYHryLO2GTrEoC-ieijeCmFfMUObzShQDHzz854JVq42Zk8jjEB81rEjVrvDtyQZlZ4Y-MXQju1HXzG24VG18UNw/s640/nci2.png

Description of above image:

It is the B-tree representation of non-clustered index NCI_Student_Age where lower section of dotted line is also B-tree of clustered index CL_Student_Roll. Top node of non-clustered index is called root node and bottom nodes are called leaf node. All the nodes in between root node and leaf nodes are called intermediate nodes. A B-tree can have many intermediate nodes in case table has too many records. In this example there is no intermediate node and we have assumed size of each node is four i.e. each node can keep maximum five Age.

Important points about non-clustered index based on clustered index:

1. All the nodes keep key columns of non-clustered index and key column of unique clustered index.

2. If clustered index is not unique then instead of key column of clustered index it keeps a unique row identifier of the actual data rows of table.

3. Apart from points above, a leaf node also keeps the included columns of non-clustered index. (We will discuss about included column later)

Now if we will execute following sql query:

SELECT * FROM Student WHERE RollNo = 111

Since Student table has clustered index, now it will go for index scan instead of table scan to search the RollNo = 111. Scan will start from root node in the clustered index CL_Student_Roll.

SELECT Country

FROM Student WITH (INDEX (NCI_Student_Age))

WHERE Age = 31

Note: Here we have added WITH clause only because sql server uses the non-clustered index NCI_Student_Age, otherwise it will not use non-clustered index or will use if table has too many records. We will discuss why is it so later.

Execution plan of above query will look like this:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxTjJMKxtwrYVGIDy0cW5k0wJmUhZe7jHLd_3cXJRM_mBqspjE277oDiFY9YbTPnyWE72dVjf8L_Im2rZ5dCOjErWPfsJCjI2RnHwg4jpvvzdldUUKgthyphenhyphenWNY0kWYg1xTUryPSHUBV-y4/s640/expnci.png

Meaning of above execution plan:

Sql server will first use non-clustered index NCI_Student_Age and find out the corresponding value of key column i.e. RollNumber of clustered index (It is called key look up) using nested loop and find out the value of column Country.

Scanning steps:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbLpTgI8TACtGtGHuom-uhQaL07o0Vcs5ltvsqBTP9aV9MGodVXSEp63CxKuPp6kB6aSqdChgsNvDxO0H69tzkmetYYA_fsqmJSFHuO02v2bRzBI7wziXIUZlUrM6Li3fVa3Zoo7uj7ys/s640/nci56.png

Scan 1: 22 < 31 so it will go to next element of root node.

Scan 2: 24 < 31 so it will go to the next element of root node. Since there is not any other element in root node so, it will follow right most link or path.

Scan 3: In leaf node, 24 < 31 o it will go to next element of leaf node.

Scan 4: In leaf node, 24 < 31 o it will go to next element of leaf node.

Scan 5: In leaf node, 27 < 31 o it will go to next element of leaf node.

Scan 6: In leaf node, 27 < 31 o it will go to next element of leaf node.

Scan 7: In leaf node, 31 = 31. It will get value key column of non clustered index i.e. RollNo which it equal to 109.

Scan 9: From here it will use clustered index CL_Student_Roll to get the values of column Country name of RollNo = 31. In root node of clustered index 107 < 109 so, it will go to next element of root node.

Scan 10: 110 > 109 so it will go through this link.

Scan 11: In leaf node, 108 < 109 so, it will go to next element of leaf node.

Scan 12: In leaf node, 107 < 109 so, it will go to next element of leaf node.

Scan 13: In leaf node, 109 = 109 so it will follow this path and get the actual data that is value of column Country which is USA.

Simple syntax of creating a non-clustered index

CREATE [UNIQUE] NONCLUSTERED INDEX <IndexName>

ON <Table_or_view_name>(<ColumnName_Key>[ASC|DESC] [ ,...n ])

[INCLUDE (<ColumnName_Include> [,...n])]

<IndexName>: It is any valid name of index.

<Table_or_view_name>: It is name of the table or view on which we want to create the non-clustered index.

<ColumnName_Key>: It is name of the key columns for clustered index separated by comma. Values of these columns are present in each node of a clustered index. It has following restrictions:

1. We can specify maximum 16 column names.

2. Sum of size of the columns cannot be more than 900 bytes.

3. All columns must belong to same table.

4. Data type of columns cannot be ntext, text, varchar (max), nvarchar (max), varbinary (max), xml, or image

5. It cannot be non-deterministic computed column.

<ColumnName_Include>: Due to above restrictions on <ColumnName_Key> columns, we can include more columns of table or view in the INCLUDE section. Date of include columns are only present in the leaf node of clustered index. It has also some restrictions:

1. Data type of columns cannot be text, ntext, and image.

2. A column cannot be present in key columns as well as can not include column at the same time. Also repetition of columns are not allowed.

Good practice: If possible we should try to add all columns of WHERE clause in key columns list and all columns of SELECT clause in the INCLUDE columns of a non-clustered index.

Covering of queries:

If we include all the the columns of a SELECT statement either as key columns or INCLUDE columns in non-clustered index such queries are called covered query. For example consider on query:

SELECT Name,Country

FROM Student

WHERE RollNo = 108 AND Age = 27

Non-clustered index 1:

CREATE NONCLUSTERED INDEX NCI_1

ON Student (RollNo,Age)

This index doesn't cover the above SELECT statement.

Non-clustered index 2:

CREATE NONCLUSTERED INDEX NCI_2

ON Student (RollNo,Age,Name)

This index doesn't cover the above SELECT statement.

Non-clustered index 3:

CREATE NONCLUSTERED INDEX NCI_3

ON Student (RollNo,Age,Name,Country)

This index covers the above SELECT statement.

Non-clustered index 4:

CREATE NONCLUSTERED INDEX NCI_4

ON Student (RollNo,Age)

INCLUDE (Name,Country)

This index covers the above SELECT statement. Now consider on last index we have created that is

CREATE NONCLUSTERED INDEX NCI_Student_Age

ON Student (Age)

This index doesn't cover the sql query:

SELECT Country

FROM Student

WHERE Age = 31

Due to which sql server doesn't use the index NCI_Student_Age. Execution plan of this query will look like:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpBeKU2IQp1UNLWkTALHbHvJev_2bK9NWufX81v7R9G9U7mf48468xRrlk3syl5-D4FBIa7RSE6J2YTp266VpT51Pw5UNm79xecxmRj-covAnrd4XKLce1h0jbvClxR5X-9VBCRaMGbc4/s640/nci23.png

Sql server follows Index scan which is same as table scan and it is a slow process.

If we will use forcefully non-clustered index NCI_Student_Age as

SELECT Country

FROM Student WITH (INDEX (NCI_Student_Age))

WHERE Age = 31

Its execution plan will look like:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNlF8Fgyl1ohDpiK2ab5aUtH16BZ4U3f2uwlXDUirevY0Nbnm4F0HZPBUMNOlj_q7LJRwNVBxfVTStPta5pZbjVV_PldiG6ZXyvVmpfmNaDCq5ax-SV50s0ENlX1LHEsUYAXRcoFi2hLo/s640/expnci.png

Here problem is it is partiality index seek process that is, to get value of column Country it has to go through (key look up) the clustered index CL_Student_Roll since non-clustered index doesn't keep the value of column Country. We can optimize this query by creating a new index which covers the SELECT statement.

CREATE NONCLUSTERED INDEX NCI_Student_Country

ON Student (Age)

INCLUDE (Country)

After creating this index, execution plan will look like:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUtmvjKuYvMKUdzUj4HgtCWFBUvokV3aJ251l3YQ_RwcrXh91xG7sQNF0tEHD9PyGZw6xz3pobyzntZbm0eLJhppeNIKGjvTWOy7kDED9-lpn99dM_uba3EVX_DAXM3pYkzz7eD4Tw6Kw/s640/nci24.png

Now there is no key look up and it is fully index seek. Structures of index NCI_Student_Country will look like this:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiizso33fvfgItH9j8Nzs13tiLy0od5qJJ6ejYw1QNJ7Z-IB46cQPI23N7iexUEjtiMhtEhPU_DqNp8EPWdJs1PSOlgBmt9QWzXoHFQjwofVFZLq4-qOhnRi22nU7bwQtxsz4R2ymhIDhA/s640/covered.png

We can observe key columns of clustered index keep all the nodes while INCLUDE columns are only kept by leaf node. To perform this query:

SELECT Country

FROM Student

WHERE Age = 31

Sql server will not use clustered index since data of Country columns are present in non-clustered index NCI_Student_Country. So, there is no need of key look up.

Non-clustered index based on heap:

A table without any clustered index is called heap. It is possible to create a non-clustered index on such tables which do not have any clustered index. Main difference between non-clustered index on clustered index and non-clustered index on heap is, it keeps the RID (i.e. Row ID) in the leaf node of non-clustered index instead of key column of clustered index which is pointer to the actual data rows. Suppose we have a Student table which do not have any clustered index. Now we want to create a non-clustered index:

CREATE INDEX NCI_Heap

ON Student (Age)

If we will check the execution plan of this query:

SELECT Country

FROM Student WITH (INDEX (NCI_Heap))

WHERE Age = 31

It will look something like this:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiys61IPbmIXNM3nrp7yWnv8zYZ9_2wdPyPyeNhLbe_gfWE7seFt5YNTL9Yl79tFWgqOYK4EKSskcg7gS5YUIs0B4slcA63qG9-mjuhQCbzLKQY5ItpWSUHzZyRl7-KU8B_MN6e3TLKhTA/s640/heapexep.png

From execution plan, we can observe that it is using RID look up instead of key look up.

Note: RID of each row of a table is generated by sql server using file identifier, page number and number of the row on the page etc.

Organizations of non-clustered index on heap will look something like this:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0MBn3hJHM-Kkz2ArKbLndJtiL6TOfTmCXLoO5F_m70zJhu94SnMsDQ_IMH0uKgSpX4gYVzqfKv2MQUzPPdfgpWvWHwPRNeYbhvlCIQbiTP7Y4nsnNmHELOf6g3gM5PHsVLKr2j8QaGYI/s640/ridlookup.png

We can remove the RID look up in the execution plan by covering the query in non-clustered index on heap that is:

CREATE INDEX NCI_Heap_Country

ON Student (Age)

INCLUDE (Country)

Now execution plan of query:

SELECT Country FROM Student WHERE Age = 31

Will look like:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFAfdtPXfCCRILkFqUUM6gSqg-26vIqPDrjKvAyIqla4hRbyRhrNG_9eoUVPGfgk5njy0iPU4KjRunl4xHDYD79dsJHvU0LXnw4yMBilM7Qmd6iHLUq2rP0_Qfd02kkWsjUOQycav5SGg/s640/HEAPEXPCOUNTRY.png

SQL Server

Pages

Saturday, June 1, 2013

Index in sql server

No comments:

Post a Comment