Posted in data modeling, SQL Server

How I Really Feel about Surrogate Primary Keys

When someone goes on a rant about something they truly feel strongly about – good or bad, my typical joking response is “You’re being a little wishy-washy about this. Tell me how you really feel.” I recently had a conversation with co-workers about surrogate primary keys and realized I need to use that line on myself.

So let me tell you how I really feel about surrogate primary keys. Where’s my soapbox?

soapbox

Let’s start by talking about primary keys in general. When asking a DBA about normal forms in an interview, the answer you typically want to hear is some version of “the Key, the whole Key and nothing but the Key – so help me Codd.” In other words, the primary key is what makes each record in a table unique. All of the other columns in the table help support or expand the definition of what makes each record unique.

But a primary key tends to get used for more than that. Primary keys are often clustered, meaning they are used to define the natural sorting order of the table. They are also commonly used as foreign keys. This means changing the primary key values has larger repercussions – cascades to children tables or fragmentation and page splits to the primary key, etc.*

* Note the use of the words “often” and “commonly”. Two Fun Facts: 1. Primary keys do not have to be clustered and 2. you can also use unique constraints, not just primary keys, as foreign keys.

Because we try to avoid these issues and the problems they create, we tend to use surrogate primary keys. Often the surrogate primary key is that one ID field of a given data type that we throw on all of our tables. Integer IDENTITY columns are really popular among surrogate key options because they’re small in terms of bytes (4 bytes). Some people may even choose bigint for larger tables. Making these columns IDENTITY also means you don’t have to assign the values and they are assigned in order, which means it’s a good option for clustering the primary key.**

** Of course, you don’t have to use integer IDENTITY fields as surrogate primary keys. I’ve seen guids and varchar fields used as surrogate primary keys. Which, of course, leads to other issues so we’re not always preventing the problems we’re trying to avoid by choosing a surrogate primary key. But that’s another rant for another day.

The use of a surrogate primary key has become so ubiquitous that we’ve started to use them without thinking about them. I think there are even some ORM data layers that require that you have surrogate primary keys.

So what’s the problem?

Let’s start with the word “surrogate”.

According the Merriam Webster dictionary, here’s the definition of surrogate when used as a noun:

surrogate definition

I’m most interested in the second definition: one that serves as a substitute. Using this definition, this means a surrogate primary key is a representation of what the real primary key of the table is. But I think people often focus on the primary key part. They see a clustered primary key on a table and assume this means that the table is normalized because there is something that makes each record unique.

There’s a saying: If you stand for nothing, you’ll fall for anything. I’d like to steal this and adopt it for our databases:

If your surrogate primary key stands for nothing, your data integrity will fall.

This is why I have issues with surrogate primary keys. We get lazy. Since there’s a primary key on the table, we forget about our natural or logical or alternate primary keys. And then we forget to answer the question: if you didn’t have that surrogate primary key, what makes each record in your table unique?

This is why I think it’s so important for the natural primary key to be created on the physical table, as a unique constraint or unique index or even just a plain regular index if nothing else.

For starters, the surrogate primary key is not going to be able to enforce the data integrity of the table. For example, if the Name column is the natural key of the table, the application is going to be responsible for making sure that Name is unique in the table. That’s extra work to code around. Your application and queries are not going to be as efficient as they could be because they have to take this into consideration.

Plus, the application is not going to be the only way that data gets added to tables. (Don’t believe me? Ask someone who has Management Studio and a login with write permissions to your database.) Anyone who’s accessing that table may not know which column, or column combination, needs to be unique. By making sure the natural key is defined by a unique constraint or index, we can’t accidentally create a duplicate record based on the business logic because the primary key may be different.

In addition, queries would likely need to search on natural key of the table. This makes that column or columns good candidates for indexing anyway. And query optimizer could use unique constraints or indexes to come up with better execution plans. Add performance improvements as another reason why you want to have something representing the natural key.

There are cases where a surrogate primary key may be necessary because there is no other option to make the records unique. Sometimes audit trails or logging tables may make it difficult to find that natural key. But again, there is going to be a field or combination of fields that would be searched on frequently so you do have candidates for an index or indexes that could be used for queries.

But it’s also important to understand that you may not want the unique constraint but want to have a unique clustered index not cluster the primary key itself. The flexibility is still there depending on the needs of your table.

And that’s the most important point of all. By not relying solely on a surrogate primary key, you’re paying attention to the data in your table. That’s what it all comes down to – guaranteeing data integrity of our tables. After all, isn’t protecting data integrity part of our job as data professionals?

To be clear, I’m not “anti-surrogate primary keys.” I’m just “anti-not-adding-a-unique-constraint-or-index-on-natural-primary-keys-if-you-use-surrogate-primary-keys.” I hope this is something all of us can stand for.

 

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s