Terminology Confusion | Experimental Thoughts

I recently read the following article, another opinion in the long-standing surrogate key vs. natural key “debate”:

http://rapidapplicationdevelopment.blogspot.com/2007/08/in-case-youre-new-to-series-ive.html

I put “debate” in quotes because it’s filled with so much confusion, I don’t think most people understand what they’re arguing over. People use different definitions for surrogate keys, often in the same discussion, so for the purposes of this article I’ll define surrogate key to mean: “A system generated key that is not derived from the business rules, hidden from the business, but visible to the application”. The surrogate key is used to take the place of a natural key, which is derived from the business rules. In other words, a surrogate key is a pointer.

The article seems objective and reasonable at first pass, and the author seems to be honestly attempting to analyze the trade-offs. But all of the arguments the author makes are really business arguments for a generated unique identifier, not arguments for an identifier that is hidden from the business.

Making a positive match between people is not always easy, because people don’t want it to be. It’s called privacy, and it has rational benefits (it makes it harder for people to steal your identity) and psychological benefits (people naturally like privacy). So what do businesses and government agencies do in order to get the job done? They each assign individuals a generated identity specific to that organization. An average person probably has many of these identities: SSN, grocery store savings card, student identification number, driver’s license number, etc. Having so many identities helps preserve our privacy, because it’s not easy to obtain a mapping between the various identities.

I know about these many identities because they are not hidden from the business, they are explicitly laid out in the business rules by which these organizations operate. Hidden surrogate keys do not model reality, and do not help the business operate at all. You still have the same problems that you did before.

So once we realize that generated identities are a business rule, and not something to hide from the business, we see that there is really no conflict: the business uses generated identities when required, and the database uses those generated identities as keys where appropriate. But now that key is a natural key, because it models the real world.

It may be that the designing of a database is an opportune time to revisit the business rules, and perhaps add a generated identity into the business processes, and that identity may be generated by the DBMS. But it’s important to remember that new business rules are a business decision, and a database model can only be as good as the business rules that it represents. And most importantly, it should represent reality, not what the database designer thinks reality should be.

Notice that I didn’t mention performance. That’s because relational databases allow you to separate logical design and physical layout. If pointers help performance, they should be used at the physical layer (that is, not even visible to the application), but completely invisible to the logical layer.

2 thoughts on “Terminology Confusion”

JOG on January 8, 2008 at 02:40 said:

A very sharp analysis. Clarity of thought and good argumentation. Kudos indeed.
- Jeff Davis on January 8, 2008 at 03:01 said:
  
  I appreciate the compliment.
  
  I assume you’re the same JOG as in comp.databases.theory, in which case I would like to pay you the same compliment in return: you consistently inject clarity into the discussions in that group.