A recent thread on pgsql-hackers warrants some more extensive discussion. In the past, I’ve criticized NULL semantics, but in this post I’d just like to explain some corner cases that I think you’ll find interesting, and try to straighten out some myths and misconceptions.
First off, I’m strictly discussing SQL NULL here. SQL NULL is peculiar in a number of ways, and the general excuse for this is that there is a need to represent “missing information” — which may be true. But there are lots of ways to represent missing information, as I pointed out in a previous post, and SQL’s approach to missing information is, well, “unique”.
- “NULL is not a value” — If you hear this one, beware: it’s in direct contradiction to the SQL standard, which uses the phrase “null value” dozens of times. It’s hard to imagine that NULL is not any kind of value at all; because it’s routinely passed to functions and operators, predicates can evaluate to NULL, and SQL uses a kind of three-valued logic (3VL) in some contexts. The phrase “NULL is not a value” also raises the question: “what is it, then?”.
- NULL means “unknown” (i.e. the third truth value) — This doesn’t hold up either. SUM of no tuples returns NULL, but clearly the SUM of no tuples is not unknown! SQL will happily generate NULLs from aggregates or outer joins without any NULLs at all in the database. Do you not know something you did know before, or do you now know that you don’t know something that you didn’t know you didn’t know before? Also, if NULL means “unknown”, how do you differentiate a boolean field for which you do not know the value, and a boolean field for which you do know the value, and it happens to be “unknown” (perhaps this is why boolean columns are a PostgreSQL extension and not part of the core SQL standard)?
- “NULL is false-like” — Don’t think of NULL as false-like, or “more false than true”. It’s a tempting rule of thumb, but it’s misleading. For instance, in a WHERE clause, a NULL predicate is treated like FALSE. However, in a constraint (like a
CHECKconstraint), NULL is treated like TRUE. Perhaps most importantly, when in a 3VL context (like a boolean expression), this misconception leads to problems when you try to invert the logic, e.g., use
- “Oh, that makes sense” — When you see individual behaviors of NULL, they look systematic, and your brain quickly sees a pattern and extrapolates what might happen in other situations. Often, that extrapolation is wrong, because NULL semantics are a mix of behaviors. I think the best way to think about NULL is as a Frankenstein monster of several philosophies and systems stitched together by a series of special cases.
p OR NOT p— Everyone should know that this is not always true in SQL. But most people tend to reason assuming that this is always true, so you have to be very careful, and work against your intuition very deliberately, in order to form a correct SQL query.
- SUM() versus + (addition) — SUM is not repeated addition. SUM of
1 + NULLis NULL.
- Aggregates ignore NULLs — According to the standard, aggregates are supposed to ignore NULLs, because the information is missing. But why is it OK to ignore the missing information in an aggregate, but not, say, with the + operator? Is it really OK to just ignore it?
- Aggregates return NULL — According to the standard, aggregates are supposed to return NULL when they have no non-NULL input. Just because you don’t have any input tuples, does that really mean that the result is undefined, missing, or unknown? It’s certainly not unknown! What about SUM over zero tuples, wouldn’t the most useful result be zero?
- SQL breaks its own rules — The aforementioned aggregate rules don’t work very well for
COUNT(), the simplest of all aggregates. So, they have two versions of count:
COUNT(*)breaks the “aggregates ignore nulls” rule and the “aggregates return null” rule, and COUNT(x) only breaks the latter. But wait! There’s more:
ARRAY_AGG()breaks the former but not the latter. But no exception is made for SUM — it still returns NULL when there are no input tuples.
- NULLs appear even when you have complete information — Because of OUTER JOIN and aggregates, NULLs can appear even when you don’t have any NULLs in your database! As a thought experiment, try to reconcile this fact with the various “definitions” of NULL.
WHERE NOT IN (SELECT ...)— This one gets everyone at one point or another. If the subselect produces any NULLs, then NOT IN can only evaluate to FALSE or NULL, meaning you get no tuples. Because it’s in a WHERE clause, it will return no results. You are less likely to have a bunch of NULLs in your data while testing, so chances are everything will work great until you get into production.
x >= 10 or x <= 10— Not a tautology in SQL.
x IS NULL AND x IS DISTINCT FROM NULL— You probably don’t know this, but this expression can evaluate to TRUE! That is, if x = ROW(NULL).
NOT x IS NULLis not the same as
x IS NOT NULL— If
ROW(1,NULL), then the former will evaluate to TRUE, and the latter will evaluate to FALSE. Enjoy.
NOT x IS NULL AND NOT x IS NOT NULL— Want to know if you have a value like
ROW(1, NULL)? To distinguish it from NULL and also from values like
ROW(NULL,NULL), this expression might help you.
- NULLs can exist inside some things, but not others — If you concatenate:
firstname || mi || lastname, and “mi” happens to be null, the entire result will be null. So strings cannot contain a NULL, but as we see above, a record can.
I believe the above shows, beyond a reasonable doubt, that NULL semantics are unintuitive, and if viewed according to most of the “standard explanations,” highly inconsistent. This may seem minor; that is, if you’re writing SQL you can overcome these things with training. But it is not minor, because NULL semantics are designed to make you think you understand them, and think that the semantics are intuitive, and think that it’s part of some ingenious consistent system for managing missing information. But none of those things are true.
I have seen lots of discussions about NULL in various forums and mailing lists. Many of the participants are obviously intelligent and experienced, and yet make bold statements that are, quite simply, false. I’m writing this article to make two important points:
- There is a good case to be made that NULL semantics are very counterproductive; as opposed to a simple “error early” system that forces you to write queries that explicitly account for missing information (e.g. with
COALESCE). “Error early” is a more mainstream approach, similar to null pointers in java or None in python. If you want compile-time checking, you can use a construct like Maybe in haskell. SQL attempts to pass along the problem, hoping the next operator will turn ignorance into knowledge — but it does not appear that anyone thought through this idea, quite frankly.
- You should not attempt to apply your intellect to NULL, it will lead you in the wrong direction. If you need to understand it, understand it, but always treat it with skepticism. Test the queries, read the standard, do what you need to do, but do not attempt to extrapolate.