Database design includes a lot of concepts, and it all starts from understanding the business. If you’ve done a good job, you can usually tell a lot about the business from the schema alone — without data, and without application code or a user interface. This article is about data types, which are the “nouns” of your business, loosely speaking (the analogy is borrowed from C.J. Date’s An Introduction to Database Systems).
The first observation I’d like to make is that, when considering a relational database, we care about sets of things. If you have an attribute that holds “notes”, there’s not much you can do with a set of such values. I’m not saying that notes are unimportant, or that such fields should be avoided; but you should recognize the limitations of such attributes. All you can do is store text that a user wrote, or present a user with a specifically-requested note. The user then must actually read the note to learn anything — there’s nothing automatic about it (a database doesn’t help much with manual processes).
In the above example, the data type doesn’t matter much at all. It really makes no logical difference if you store it as text or as a PNG image of the rendered text (there may be a difference in convenience and performace, as with all format choices, but that’s beside the point). If the noun you are working with is “note”, the database system can’t help you any more if the attribute is “text” than it can if the attribute is of type “bytea” holding an image of the rendered text.
Astute readers will now be thinking: “but what about full-text search?”. Exactly. PostgreSQL has an excellent full-text search system, and powerful data types like “tsvector” and “tsquery”. But these aren’t just better data types that magically work, these represent a change in nouns that the business understands. A search engine doesn’t index a bunch of text fields, it indexes documents. You don’t submit text to the search engine, you submit a query. A document isn’t just a string of characters, it is a set of word stems with positional and weighting information. The business knows these things, and these nouns are spoken in conference rooms by executives and to customers by sales people.
I’m drawing a direct connection between the business and the data types because I sometimes think that data types are treated as an implementation detail. Let’s say you store the notes as the “text” type and are treating it as such. When you go to implement search, the natural thing to do would be something like: ...WHERE note LIKE '%foo%' OR note like '%bar%' OR ..., which is absolutely the wrong approach. The right approach is using a real query against a real document type, not trying to match substrings.
Treating the data types as an implementation detail is also related to the idea of only using primitive types, as though SQL were java with no class library. Primitive data types lead to awkward application code trying to answer questions that the database system should be able to answer. The philosophy of primitive data types is reflected in the SQL standard, which defines few types and does not seem to encourage sophisticated types like other languages do (of course, the SQL standard breaks away from the primitive data types because some types are just too important). For some reason database people have a negative reaction to any complex type, despite how many times they have proven to be important (for another great example, check out the geometry type in PostGIS, a sophisticated GIS system built entirely as a PostgreSQL plugin).
I believe a lot of this negative reaction is because of 1NF, which essentially describes what a relation is. The concern is that any type more complex than a string might be called a violation of 1NF, which I think is ridiculous. There is no useful definition of an atomic value (except maybe a boolean or bit or something): a string is a list (or array) of characters, that can be spliced, cut, and concatenated in many ways; an integer has a sign portion and a magnitude; and a timestamp is quite obviously non-atomic (and often treated as separate components). Additionally, normalization doesn’t have anything to do with data type choices; it only has to do with attributes and certain specific types of constraints (join dependencies, of which functional dependencies are a special case). So choosing a different data type carries no danger of violating a normal form. That being said, normalization is a small part of database design, and there are very legitimate reasons to be skeptical of strange data type choices.
I’m very pleased that PostgreSQL offers such a powerful type system. In many ways this is more challenging for a database system than an application language — for instance, PostgreSQL offers a variety of index access methods, execution plans, and statistics that all depend on understanding your data types and how to operate on them. In PostgreSQL, a user-defined data type is truly first-class, meaning that you can make use of all of this infrastructure in creative ways. However, creating a good data type is not a trivial process, so it’s best to approach it as a real project by itself.
I started the Temporal PostgreSQL project quite a while ago to offer a “period” data type (a time interval with an absolute beginning and end), because I think that is an important noun that is underserved by database systems. More on this later.