Identity over-use: Surrogate vs Composite keys in SQL Server
From my archive - originally published on 13 September 2009
The SQL Server 'Identity' column is a handy way of ensuring a unique primary key, but I have noticed a tendency for some database designs to over-use it.
In particular there appears to be a preference for using an identity column as a surrogate key over a composite key made up of one or more columns. Using an identity column for tables that define entities is fair enough, but I’m not sure why you would use a surrogate key over a composite key for tables that relate entities.
Quick definitions
Firstly, it's worth pointing out that an identity column is only a type of primary key. A primary key is a set of columns that make the data in each table row unique, a composite key being a primary key that is made up of one or more columns.
Most tables will have some kind of natural key, where a set of columns provide a good candidate for defining the unique attributes of each row. However, many data designs prefer to add an extra, surrogate key into the design. The surrogate key does not occur naturally in the data - it is an artificial construct designed to provide each row with a unique value. In SQL Server, the Identity column is often used to provide this surrogate key
Why use a surrogate key?
The main problem with using a natural key is that it the data is vulnerable to changes in an organisation's business processes. Ideally you will want your primary keys to be immutable, i.e. not subject to change over the lifetime of the application.
As a simple example, if you have a department code that you are using as a primary key for a table, your database design becomes vulnerable to changes in the departmental structure. If the department code changes, you will have to change any data in related tables as well as updating your primary key.
Surrogate keys get round this problem by ensuring that business logic does not appear in the primary key.
Surrogate keys can also become useful where the natural key is inconvenient or difficult to work with, i.e. there are too many columns or the columns include large amount of text. That said, in many cases SQL Server database designers can be too quick to reach for the identity column.
The downside of identity columns
I've always been uncomfortable with the SQL Server identity column as it is a platform-specific invention that does not tally with “clean” relational data design. By using it you are adding extra, redundant information for the sake of the target platform.
Many people point to a potential performance bonus in using an identity column, but a surrogate key will not improve performance in most cases unless you are selecting records by the surrogate key alone. If you want to maximise the performance of your selects in SQL Server then you will want the fields that you will be selecting on in an index specified in the correct order, preferably in a clustered index.
A surrogate key might seem like a tidier solution than a key based on more than one column, but they do not simplify data design - using an identity column over a composite key actually complicates the data design. In order to guarantee data integrity you will have to add a constraint and an index onto the columns that would form the composite key as well as having to create an identity column and primary key.
There are also a number of practical maintenance issues with identity columns which do make life more difficult in the long term:
- SQL identity columns cannot be updated directly without resorting to the SQL IDENTITY_INSERT flag. It's pretty messy to work with.
- If you ever have to replicate databases then SQL identity columns become a nightmare to handle.
- An extra, meaningless column does not help data readability – it makes figuring out data problems in multiple, large tables that much more difficult (“a #382738 by any other name...”).
- The logical data design becomes dependant on the physical platform – this is a nightmare if you ever have to switch database platform (believe me, it does happen...).
Where I would use a composite key
There is a distinction between those tables that define entities and those that relate entities.
A table that defines entities is one that describes a single entity within your problem domain, say a person or a product. In this case, a surrogate primary key may well be more appropriate than using the natural key as it is will not be vulnerable to any changes in the characteristics of the entities over time.
However, if you want to relate people and products in a table then this is a different story. The natural key in this case would be a combination of the people and product primary keys. Adding an extra surrogate key would only serve to complicate the design and undermine data integrity without bringing any benefits in terms of performance or maintenance.
In this simple example, the table that relates entities is a good candidate for a composite primary key and the design would not be improved by reaching for a surrogate.