Introduction to Indexing in SQL

When working with databases, performance can be very important. This is especially true in production environments where the end users expect their queries and reports to be generated within seconds (or even milliseconds).

While blistering fast queries may be the norm with smaller datasets, as our databases grow larger and more complex, it can become much more of a challenge to keep our queries nice and snappy. When working with smaller datasets, it’s often possible to get lightning speed results even when not optimizing for speed. But as the datasets grow larger, we need more efficient tools and techniques to retrieve and manipulate data. One such tool is the index.

What is an Index?

An index is a data structure that allows us to quickly locate and access specific rows in a database table. It can help our queries run faster – much faster in some cases.

There are different types of indexes. Some can be thought of like a book’s table of contents, where it outlines the order of the book. Others are more akin to the index at the back of the book, where it helps us quickly locate certain words or terms within the book.

It’s the same concept with database indexing. We can create an index that determine the order of the rows in the table. And we can create other indexes that help the DBMS more quickly find the data we’re searching for.

When we create an index, the index is stored in the database, and then the database engine can use the index in order to provide a faster execution of the query.

How Indexing Works

When we create an index on a column or set of columns, the database management system (DBMS) creates a separate structure that stores the column values along with pointers to the corresponding table rows. This structure is typically organized in a way that allows for fast searching, such as a B-tree.

When we run a query, the DBMS works out the most efficient way to execute it, based on the tables and data in play, as well as any relevant indexes that exist. If it chooses to use an index, it will do so because it considered it to be the most efficient option (which usually means that it’s faster than doing a full table scan).

Creating an Index

To create an index in SQL, we can use the CREATE INDEX statement.

Here’s an example:

CREATE INDEX idx_movie_title ON movies(title);

This creates an index called idx_movie_title on the title column of the movies table.

Now, when we run a query like this:

SELECT * FROM movies 
WHERE title LIKE '%Quantum%';

The database can use the index to quickly locate the relevant rows, significantly reducing the query execution time. Without an index, the database would need to perform a full table scan, checking every row in the table.

Depending on your DBMS, the CREATE INDEX statement may also allow you to specify other options for the index, such as whether it’s a clustered or nonclustered index, a unique index, etc, along with other more granular options.

Some DBMSs also allow us to create indexes while creating the table. In such cases, the CREATE TABLE syntax allows for an INDEX argument to specify the desired index. For example:

CREATE TABLE books (
    book_id INT,
    title VARCHAR(255) NOT NULL,
    release_year INT
    PRIMARY KEY (book_id),
    INDEX idx_book_title(title)
);

Here, we can see that the last line creates the index, using the INDEX argument. It creates an index called idx_book_title on the title column.

Also, when we create primary keys, foreign keys, and other constraints (such as UNIQUE constraints), the DBMS may create an index behind the scenes for us. Many DBMSs do this by default. For example, SQL Server creates a clustered index for the primary key and a nonclustered index for any UNIQUE constraints, but we have the option of changing these so that the primary key is nonclustered and the UNIQUE constraint is clustered.

Composite Indexes

Indexes can be applied to multiple columns. When we create an index like this, it’s called a composite index (also known as a multi-column index or a concatenated index).

Here’s an example of creating a table with a composite index:

CREATE TABLE customers (
    customer_id INT NOT NULL,
    last_name   CHAR(30) NOT NULL,
    first_name  CHAR(30) NOT NULL,
    PRIMARY KEY (customer_id),
    INDEX idx_full_name(last_name,first_name)
);

This code creates an index called idx_full_name on the last_name and first_name columns.

The above composite index could be useful for the following queries:

SELECT * FROM customers 
WHERE last_name='Simpson';

SELECT * FROM customers 
WHERE last_name='Simpson' AND first_name='Homer';

SELECT * FROM customers
WHERE last_name='Simpson'
AND (first_name='Homer' OR first_name='Bart');

SELECT * FROM customers
WHERE last_name='Simpson'
AND first_name >='B' AND first_name < 'J';

But it wouldn’t be used if we searched for just the first name:

SELECT * FROM customers 
WHERE first_name='Homer';

That’s because composite indexes work from the left. For example, if we create the following three-column index:

CREATE INDEX idx_full_name ON customers(first_name, middle_name, last_name);

The index can be used when searching the following:

SELECT * FROM customers 
WHERE first_name='Homer';

SELECT * FROM customers 
WHERE first_name='Homer' AND middle_name='Jay';

SELECT * FROM customers 
WHERE first_name='Homer' AND middle_name='Jay' AND last_name='Simpson';

But it can’t be used on the following:

SELECT * FROM customers 
WHERE middle_name='Jay';

SELECT * FROM customers 
WHERE middle_name='Jay' AND last_name='Simpson';

That’s because those columns aren’t leftmost in the index.

As with most things in databases, indexing can be quite a complex topic loaded with exceptions and caveats. There are cases where the “leftmost” rule doesn’t apply. For example, some DBMSs can use a “skip scan” method that can allow scanning an index on keys that are not the leftmost. As always, it’s best to check your DBMS’s documentation to see exactly how indexing is implemented before deciding which approach to use.

Clustered vs Nonclustered

Not all indexes are the same. There are different types of index, and so, not only might we need to decide whether or not to create an index, we might also need to decide what type of index to create. For example, the most common types of index in SQL Server are the clustered and the nonclustered index.

Clustered indexes define how the table is sorted. They sort and store the data based on their key values (i.e. the columns included in the index definition). We can only have one clustered index per table, because the rows can only be stored in one order. Clustered indexes are often applied to the primary key (sometimes automatically by the DBMS), and so the table ends up being sorted by the primary key column/s, but this isn’t a requirement. We could apply a clustered index to another column if need be. Think of clustered indexes like the table of contents at the start of a book. The table of contents show how the book is ordered, which is great when we want to make our way through the book from start to finish. Not so great if we want to find an arbitrary word or term within the book. That’s what the index at the back of the book is for.
Nonclustered indexes are like the index at the back of the book. They are a separate structure to the table, and they are ordered by different columns to the clustered index. For example, if the clustered index is on the movie_id column, then we might create a nonclustered index on the title column in order to make it faster when we run queries that search the title column. If we ever need to search the movie_id column, the clustered index can take care of that.

In SQL Server, nonclustered indexes are the default. That means when we use CREATE INDEX without specifying its type, then it will be a nonclustered index. Oftentimes, this is exactly what we want. By default SQL Server creates a clustered index on the primary key, and so we can’t create any more clustered indexes on the table anyway (a table can only have one clustered index). Therefore, a nonclustered index is usually our preferred option anyway. That is not to say that we can’t swap the primary key’s clustered index for a nonclustered index, and then use a clustered index elsewhere, but that would only be done after careful analysis.

Other DBMSs have different terms for their various indexes, and may implement indexes differently. For example, integrated index (also called a main index or primary index) might be thought of as akin to a clustered index. And external index or secondary index might be thought of as similar to the nonclustered index.

How to Use the Index?

Once created, we don’t normally decide when to use the index. The DBMS will decide that for us. Well, some DBMSs do give us the option of specifying/suggesting that an index should be used on a given query, but in general, the DBMS will be able to decide the best course of action.

Many DBMSs have a feature called the query optimizer. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

So when we run a query, the query optimizer will decide whether or not to do a full table scan or use the index. In cases where we have a lot of data, the index could potentially improve performance quite significantly over doing a table scan.

A table scan is where each row of the table is read in a sequential order and each column is checked. These are usually much slower than using an index, mainly due to having to do a lot of I/O reads from the disk. This is usually the case even if the query only selects a few rows from the table, because all rows in the table will need to be examined.

But if the table is small, then the optimizer might decide that it’s quicker to do a table scan, rather than go over to the index. There are other occasions where the query optimizer might opt for a table scan, but generally speaking, on large tables one would normally expect it to choose the index (assuming one has been created, and it covers the search predicate used by our query).

Pros and Cons of Indexes

While it may seem that creating indexes are a no-brainer due to their advantages of speeding up queries, they aren’t without their disadvantages.

Let’s look at some of the pros and cons of indexes.

Advantages

Increased speed: Can significantly speed up query execution.
Increased efficiency: Can allow the DBMS to find the most efficient means of executing the query.
Uniqueness: Can be used to create uniqueness (to avoid duplicate records, such as with a primary key or UNIQUE index).
Flexibility: We can tailor our indexes to suit the real world queries that will be run against the table. For example, we can choose columns that are most likely to be searched on the most.

Disadvantages

Disk space: Indexes require more disk space, as they are another database object to store.
RAM: Indexes can increase our RAM requirements. The DBMS will usually try to store our indexes in RAM in order to make things even faster. If we create too many indexes, this could cause problems if we don’t have enough RAM.
Can slow down INSERT, DELETE, and UPDATE operations. Whenever we change data in the table, the DBMS also needs to change it in the index.

So we should weigh up the advantages and disadvantages each time before we decide to add another index to our database.

Best Practices for Indexing

Here are some important factors to consider regarding indexing:

Index columns used frequently in WHERE clauses, JOIN conditions, and ORDER BY clauses.
Avoid over-indexing, as it can slow down write operations and increase storage requirements.
Consider composite indexes for queries that frequently use multiple columns together.
Regularly analyze and update statistics on indexes to ensure optimal query plans.
Monitor index usage and remove unused indexes to maintain database performance.

Conclusion

Indexing can be great for improving query performance in our databases. By creating indexes on frequently used columns, we can significantly reduce query execution times and enhance overall database efficiency.

However, indexes do have their disadvantages too, so we still need to weigh up the pros and cons before adding an index to our database.

That said, any time we’re faced with a slow running query, we should consider whether an index might be helpful.