What is a Clustered Index in SQL Server?

There are around a dozen types of index in SQL Server, but the most commonly used index type would have to be the clustered index. By default, a clustered index is automatically created when we create a primary key, and so if your tables have primary keys, then there’s a pretty good chance you’ve got clustered indexes all over your database.

But what exactly is a clustered index?

Let’s delve into what a clustered index is, how it works, its relationship with primary keys, and why it’s so important for our database performance.

Understanding Clustered Indexes and Primary Keys

A clustered index is a special type of index that reorders the way records in the table are physically stored. Unlike nonclustered indexes, which are stored separately from the data rows, a clustered index determines the physical order of data in a table.

Clustered indexes are often likened to a dictionary, where the whole dictionary is in a set order, and so we can quickly locate items because we know the order. Another way to think about it is like the table of contents at the front of the book. The table of contents shows how the information book is organized. It shows us the order in which the content is set out in the book. A book can only be arranged in one order.

The nonclustered index on the other hand, is like the index at the back of the book. It’s generated separately, and it contains words and terms arranged in alphabetical order, so that we can quickly find the word/term we’re interested in, and quickly find it’s location in the book.

Key Characteristics of Clustered Indexes

Here are the main characteristics of clustered indexes:

  • Unique per Table: Each table can have only one clustered index because the data rows themselves can be stored in only one order.
  • Physical Order: The clustered index sorts and stores the data rows in the table based on their key values.
  • Automatic Creation with Primary Keys: In SQL Server, when we define a primary key, a clustered index is automatically created on the primary key column(s) unless we specify otherwise or a clustered index already exists.
  • Leaf Level = Data Pages: In a clustered index, the leaf level of the index contains the actual data pages of the table.

Primary Keys

Before we dive deeper into clustered indexes, let’s briefly discuss primary keys, as they’re closely related:

  • A primary key is a column or set of columns that uniquely identifies each row in a table.
  • It enforces entity integrity by ensuring that no two rows can have the same key value.
  • In SQL Server, by default, creating a primary key constraint automatically creates a clustered index on the same column(s) if no clustered index already exists on the table.

How Clustered Indexes Work

When we create a clustered index on a table, SQL Server sorts the table data in order based on the clustered index key. This sorting process affects the physical storage of the data in the table’s data pages.

Here’s a step-by-step explanation of how it works:

  1. We create a clustered index on a column or set of columns (often the primary key).
  2. SQL Server sorts all the rows in the table according to the clustered index key.
  3. The sorted rows are then distributed into pages, maintaining their sorted order.
  4. SQL Server builds a B-tree structure to help quickly locate specific rows based on the clustered index key.

Benefits of Clustered Indexes

Here are some of the main benefits of having a clustered index on our tables:

  • Faster Data Retrieval: For queries that select data based on the clustered index key (often the primary key), data retrieval is typically very fast.
  • Range Queries: Clustered indexes are particularly efficient for range queries because the data is physically sorted.
  • No Additional Storage: Unlike non-clustered indexes, clustered indexes don’t require extra storage for the index structure (except for the B-tree).
  • Improved I/O Performance: Since related data is stored together, it often requires fewer I/O operations to retrieve related data.
  • Automatic Sorting: When the clustered index is on the primary key, inserting new records automatically maintains the data in sorted order.

Considerations When Using Clustered Indexes and Primary Keys

While clustered indexes offer significant benefits, we should keep a few things in mind:

  • Choose the Clustered Index Key Wisely: Since we can have only one clustered index per table, the choice of key is crucial. It’s often beneficial to use the primary key as the clustered index, but not always.
  • Avoid Frequently Changing Keys: If the clustered index key values change frequently, it can lead to page splits and fragmentation, which can degrade performance. This is one reason why surrogate keys (like identity columns) are often used for primary keys.
  • Impact on Non-Clustered Indexes: Non-clustered indexes store the clustered index key as a row locator. If our clustered index key is large, it can inflate the size of non-clustered indexes.
  • Consider Separating Primary Key and Clustered Index: In some cases, it might be beneficial to have the primary key on one column (or set of columns) and the clustered index on another. For example, if we frequently run range queries on a date column, we might make that our clustered index even if it’s not the primary key.

Example: Heap vs. Clustered Index

Let’s look at an example to illustrate the difference between a heap (a table without a clustered index) and a table with a clustered index. We’ll also look at how adding a primary key (with an associated clustered index) to the table affects the row order.

Suppose we have a simple table called Employees:

CREATE TABLE Employees (
    EmployeeID INT NOT NULL,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    HireDate DATE
);

INSERT INTO Employees VALUES
(3, 'Brad', 'Betcha', '2022-01-15'),
(1, 'Amy', 'Hock', '2021-05-20'),
(4, 'Fitch', 'Jockland', '2023-03-10'),
(2, 'Sasha', 'Pratt', '2022-11-05');

We didn’t specify a primary key for this table, or a clustered index, and so this table is a heap.

If we select from this table, the rows might be returned in the order they were inserted:

SELECT * FROM Employees;

Result (Heap):

EmployeeID  FirstName  LastName  HireDate                
---------- --------- -------- ----------
3 Brad Betcha 2022-01-15
1 Amy Hock 2021-05-20
4 Fitch Jockland 2023-03-10
2 Sasha Pratt 2022-11-05

Now let’s add a clustered index to the HireDate column:

CREATE CLUSTERED INDEX IX_Employees_HireDate ON Employees(HireDate);

Now let’s run our query again:

SELECT * FROM Employees;

Result (Clustered Index):

EmployeeID  FirstName  LastName  HireDate                
---------- --------- -------- ----------
1 Amy Hock 2021-05-20
3 Brad Betcha 2022-01-15
2 Sasha Pratt 2022-11-05
4 Fitch Jockland 2023-03-10

We can see that the table is now sorted by the HireDate column.

Let’s drop the clustered index:

DROP INDEX IX_Employees_HireDate ON Employees;

And now, let’s add a primary key on the EmployeeID column, which will automatically create a clustered index on that column:

ALTER TABLE Employees
ADD CONSTRAINT PK_Employees PRIMARY KEY (EmployeeID);

After creating the primary key (and thus the clustered index), if we run the same SELECT statement, we’ll get:

SELECT * FROM Employees;

Result (After Primary Key and Clustered Index):

EmployeeID  FirstName  LastName  HireDate                
---------- --------- -------- ------------------------
1 Amy Hock 2021-05-20T00:00:00.000Z
2 Sasha Pratt 2022-11-05T00:00:00.000Z
3 Brad Betcha 2022-01-15T00:00:00.000Z
4 Fitch Jockland 2023-03-10T00:00:00.000Z

We can see that the rows are now physically stored and returned in order of the EmployeeID, which is both our primary key and clustered index key. This demonstrates how a primary key with its associated clustered index affects the physical order of data in the table.

Had we not dropped the clustered index before creating the primary key, the primary key would’ve been created with a nonclustered index. In this case, the table would still have been sorted by the HireDate column (due to the clustered index being on that column).

Conclusion

Clustered indexes, often created in conjunction with primary keys, are a powerful feature in SQL Server that can significantly improve query performance when used appropriately. By understanding how they work and considering their impact on our database design, we can make informed decisions about when and how to use them effectively.

Remember, while it’s common and often beneficial to have the clustered index on the primary key, the trick to optimal performance lies in choosing the right column(s) for our clustered index and balancing it with our non-clustered indexes and overall database design.

More About Indexes and Primary Keys

Here are some related articles I’ve written about indexes and primary keys: