Choosing Data Types

Database design includes a lot of concepts, and it all starts from understanding the business. If you’ve done a good job, you can usually tell a lot about the business from the schema alone — without data, and without application code or a user interface. This article is about data types, which are the “nouns” of your business, loosely speaking (the analogy is borrowed from C.J. Date’s An Introduction to Database Systems).

The first observation I’d like to make is that, when considering a relational database, we care about sets of things. If you have an attribute that holds “notes”, there’s not much you can do with a set of such values. I’m not saying that notes are unimportant, or that such fields should be avoided; but you should recognize the limitations of such attributes. All you can do is store text that a user wrote, or present a user with a specifically-requested note. The user then must actually read the note to learn anything — there’s nothing automatic about it (a database doesn’t help much with manual processes).

In the above example, the data type doesn’t matter much at all. It really makes no logical difference if you store it as text or as a PNG image of the rendered text (there may be a difference in convenience and performace, as with all format choices, but that’s beside the point). If the noun you are working with is “note”, the database system can’t help you any more if the attribute is “text” than it can if the attribute is of type “bytea” holding an image of the rendered text.

Astute readers will now be thinking: “but what about full-text search?”. Exactly. PostgreSQL has an excellent full-text search system, and powerful data types like “tsvector” and “tsquery”. But these aren’t just better data types that magically work, these represent a change in nouns that the business understands. A search engine doesn’t index a bunch of text fields, it indexes documents. You don’t submit text to the search engine, you submit a query. A document isn’t just a string of characters, it is a set of word stems with positional and weighting information. The business knows these things, and these nouns are spoken in conference rooms by executives and to customers by sales people.

I’m drawing a direct connection between the business and the data types because I sometimes think that data types are treated as an implementation detail. Let’s say you store the notes as the “text” type and are treating it as such. When you go to implement search, the natural thing to do would be something like: ...WHERE note LIKE '%foo%' OR note like '%bar%' OR ..., which is absolutely the wrong approach. The right approach is using a real query against a real document type, not trying to match substrings.

Treating the data types as an implementation detail is also related to the idea of only using primitive types, as though SQL were java with no class library. Primitive data types lead to awkward application code trying to answer questions that the database system should be able to answer. The philosophy of primitive data types is reflected in the SQL standard, which defines few types and does not seem to encourage sophisticated types like other languages do (of course, the SQL standard breaks away from the primitive data types because some types are just too important). For some reason database people have a negative reaction to any complex type, despite how many times they have proven to be important (for another great example, check out the geometry type in PostGIS, a sophisticated GIS system built entirely as a PostgreSQL plugin).

I believe a lot of this negative reaction is because of 1NF, which essentially describes what a relation is. The concern is that any type more complex than a string might be called a violation of 1NF, which I think is ridiculous. There is no useful definition of an atomic value (except maybe a boolean or bit or something): a string is a list (or array) of characters, that can be spliced, cut, and concatenated in many ways; an integer has a sign portion and a magnitude; and a timestamp is quite obviously non-atomic (and often treated as separate components). Additionally, normalization doesn’t have anything to do with data type choices; it only has to do with attributes and certain specific types of constraints (join dependencies, of which functional dependencies are a special case). So choosing a different data type carries no danger of violating a normal form. That being said, normalization is a small part of database design, and there are very legitimate reasons to be skeptical of strange data type choices.

I’m very pleased that PostgreSQL offers such a powerful type system. In many ways this is more challenging for a database system than an application language — for instance, PostgreSQL offers a variety of index access methods, execution plans, and statistics that all depend on understanding your data types and how to operate on them. In PostgreSQL, a user-defined data type is truly first-class, meaning that you can make use of all of this infrastructure in creative ways. However, creating a good data type is not a trivial process, so it’s best to approach it as a real project by itself.

I started the Temporal PostgreSQL project quite a while ago to offer a “period” data type (a time interval with an absolute beginning and end), because I think that is an important noun that is underserved by database systems. More on this later.

5 thoughts on “Choosing Data Types

  1. If you have an attribute that holds “notes”, there’s not much you can do with a set of such values.

    This is an important point. Asking users why the need notes and how they use them is very enlightening. In my case, I’ve completely re-factored some schema designs to eliminate the need for the note field. The resulting application allows to users to easily identify the set of records that met a criteria that was formally stored as messages in user notes.

    • I’m sure there are a lot of great stories about “notes” fields :)

      I wonder how common it is to enter some specific value in another field that is known to be wrong, and then explaining the real value in the note to CYA. That way, the reports all reflect the lie, but the person who entered it can claim that they entered it correctly.

  2. I see what you’re saying about strings and dates seeming to be non-atomic, but as I’m reading “SQL and Relational Theory” by C. J. Date I’m learning that they really are.

    Date says that a data type is a “named finite set of values” and that to be a type, the equality operation has to be defined. That is, you can compare any two values in the type and say they are equal or not. That seems like a pretty straightforward basic definition of a data type.

    A string or date can be decomposed or transformed, but I would say that such operations yield a different value, not a part of the original value. Just like an integer expression like x+2 yields a different value from x.

    • The problem is: where do you draw the line? Normal forms are *formally* defined, and so they cannot rely on any informal definitions.

      Conceptually, yes: you should be skeptical of designs that use extraordinarily complex types. But (a) that does not constitute a violation of 1NF; and (b) complex types are certainly useful for attributes in many situations.

      I prefer to have access to the full power of a good type system when working with an RDBMS. I’m not suggesting we go down the road of OODBMSs or anything of the sort, but I do think that we should have access to an adequate set of types to describe a business.

      • Sorry, I may not have been clear. I do support using complex types. And I support the statement that a complex type (or even a semi-complex type like a string or a date) is *not* a violation of 1NF. A given string is an atomic value, because it passes the test that it can be equal to or not equal to another string in the same column.

        Even Date agrees in his book. He says that the relational model should permit even relation-valued columns. There’s no reason not to.