None, nil, Nothing, undef, NA, and SQL NULL

In my last post, Why DBMSs are so complex, I raised the issue of type mismatches between the application language and the DBMS.

Type matching between the DBMS and the application is as important as the types themselves for successful application development. If a type behaves one way in the DBMS, and a “similar” type behaves slightly differently in the application, that can only cause confusion. And it’s a source of unnecessary awkwardness: you already have to define the types that suit your business best in one place; why should you have to redefine them somewhere else, on top of a different basic type system?

At least we’re using PostgreSQL, the most extensible database available, where you can define sophisticated types and make them perform like native features.

But there are still problems. Most notably, it’s a non-trivial challenge to find an appropriate way to model NULLs in the application language. You can’t not use them in the DBMS, because the SQL spec generates them from oblivion, e.g. from an outer join or an aggregate function, even when you have no NULLs in your database. So the only way to model the same semantics in your application is to somehow make your application language understand NULL semantics.

Here’s how SQL NULL behaves:

=> -- aggregate with one NULL input
=> select sum(column1) from (values(NULL::int)) t;
sum
-----

(1 row)

=> -- aggregate with two inputs, one of them NULL
=> select sum(column1) from (values(1),(NULL)) t;
sum
-----
1
(1 row)

=> -- aggregate with no input
=> select sum(column1) from (values(1),(NULL)) t where false;
sum
-----

(1 row)

=> -- + operator
=> select 1 + NULL;
?column?
----------

(1 row)

I’ll divide the “NULL-ish” values of various languages into two broad categories:

  1. Separate type, few operators defined, error early, no 3VL — Python, Ruby, and Haskell fall into this category, because their “NULL-ish” types (None, nil, and Nothing, respectively) usually result in an immediate exception, unless the operator receiving the NULL-ish value handles it as a special case. Few built-in operators are defined for arguments of these types. These fail to behave like SQL NULL because they employ no three-valued logic (3VL) at all, and thus fail the fourth portion of the SQL example.
  2. Member of all types, every operator defined — Perl and R fall into this category. Perl’s undef can be passed through many built-in operators (like +), but never uses 3VL, so it fails the fourth portion of the SQL example. R uses a kind of 3VL for its NA value, but it uses it everywhere, so sum(c(1,NA)) results in NA (thus failing the second portion of the SQL example). In R, you can explicitly omit NAs from the sum (not a very good solution, by the way), but then it fails the first portion of the SQL example.

As far as I can tell (correct me if I’m mistaken), none of these languages support the third portion of the SQL example: the sum of an empty list in SQL is NULL. The languages that I tested with a built-in sum operator (Python, R, Haskell) all return 0 when passed an empty list.
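To make the first category concrete, here is how Python’s None fares against the parts of the SQL example above (Ruby and Haskell behave analogously; this is just an illustrative sketch):

```python
# Python's None against the SQL example: no three-valued logic,
# errors come early instead.

# Aggregate with a None input: immediate TypeError,
# where SQL's SUM silently skips the NULL.
try:
    sum([1, None])
except TypeError as e:
    print("sum([1, None]) raised TypeError:", e)

# Aggregate with no input: 0, where SQL's SUM returns NULL.
print(sum([]))  # prints 0

# The + operator: TypeError, where SQL's 1 + NULL returns NULL.
try:
    1 + None
except TypeError as e:
    print("1 + None raised TypeError:", e)
```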

Languages from the first category appear safer, because you catch errors earlier rather than later. However, translating SQL NULLs into None, nil, or Nothing in these languages is actually quite dangerous: a change in the data you store in your database (inserting NULLs, or deleting records that may be aggregated), or even a change in a query (an outer join, or an aggregate that may have no input), can produce NULLs, and therefore exceptions, that evade even rigorous testing procedures and sneak into production.

Languages from the second category tend to pass the “undef” or “NA” along deeper into the application, which can cause unintuitive and difficult-to-trace problems. Perhaps worse, something will always happen, and usually the result will take the form of the correct answer even if it is wrong.

So where does that leave us? I think the blame here rests entirely on the SQL standard’s definition of NULL, and the inconsistency between “not a value at all” and “the third logical value” (both of which can be used to describe NULL in different contexts). Not much can be done about that, so I think the best strategy is to try to interpret and remove NULLs as early as possible. They can be removed from result sets before returning to the client by using COALESCE, and they can be removed after they reach the client with client code. Passing them along as some kind of special value is only useful if your application already must be thoroughly aware of that special value.
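On the client side, that interpretation can be a one-liner applied at the boundary. A minimal Python sketch (the row shape and the function name are hypothetical, with None standing in for SQL NULL):

```python
# Sketch: interpret NULLs at the client boundary, before they spread.
# Rows are (name, total) pairs as a DB driver might return them.

def strip_nulls(rows, default=0):
    """Replace a NULL aggregate with an explicit default immediately,
    so no None escapes into the rest of the application."""
    return {name: default if total is None else total
            for name, total in rows}

rows = [("Ace Autos", 150000), ("Empty Lot Motors", None)]
print(strip_nulls(rows))  # {'Ace Autos': 150000, 'Empty Lot Motors': 0}
```

The point is that the decision about what a NULL *means* is made exactly once, at the edge, rather than leaking into every downstream computation.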

Note: Microsoft has defined a “DBNull” value, and from browsing the docs, it appears a substantial amount of work went into making it behave like SQL NULL. This includes a special set of SQL types and operators. Microsoft appears to be making a lot of progress matching DBMS and application types more closely, but I think the definition of SQL NULL is a lost cause.

7 thoughts on “None, nil, Nothing, undef, NA, and SQL NULL”

  1. “Languages from the second category tend to pass the “undef” or “NA” along deeper into the application, which can cause unintuitive and difficult-to-trace problems. Perhaps worse, something will always happen, and usually the result will take the form of the correct answer even if it is wrong.”

    At least for R, I disagree with the statement that the results will take the form of the correct answer even if wrong. In fact, I think that the exact opposite is true. In SQL, the inconsistent way in which NULLs are treated can yield incorrect answers that look correct.

    One of the advantages of mapping NULLs to NAs is that the NAs propagate through the standard operators in a natural way. It is for that reason that I often prefer to do analysis in R rather than in the database. Consider if one were collecting anonymous data on the physical characteristics of a population. For heights, we could store the data in the table below.

    CREATE TABLE heights (
        gender enum('male', 'female'),
        height_meters float
    );

    with the data set

    female 1.48
    female 1.54
    female 1.57
    female 1.41
    male NULL
    male 1.72
    male 1.79
    male NULL

    If someone asks what the population’s expected average height is, the typical answer is

    select avg(height_meters) from heights;

    Which would return the wrong answer. For the above population, a better estimate for the average height would be

    select (select avg(height_meters) from heights where gender = 'male') / 2
         + (select avg(height_meters) from heights where gender = 'female') / 2;

    In R, under the NULL NA mapping, this becomes immediately apparent:

    select height_meters from heights

    then mean( heights ) in R yields NA – a correct answer. Given the above data set, the mean IS unknown. I would much prefer to receive NA as my answer than to receive an answer that looks correct but isn’t.

    The real problem with NULLs isn’t that they exist: it’s how entrenched they are in SQL. It would be far better for the standard to raise errors than to silently insert NULLs into result sets. This fits with your assertion that “the best strategy is to try to interpret and remove NULLs as early as possible”. Right now, NULLs are hard to intercept and deal with, particularly at the application level. A better error mechanism would help this. In addition, clearly “the inconsistency between ‘not a value at all’ and ‘the third logical value’” is bad: the third-logical-value component should be handled at the individual type level ( i.e. int_w_null and int_wo_null ) and the ‘not a value’ component ( again ) should be handled by an error hierarchy.

    However, reading between the lines of this post and others, the idea that we could eliminate NULLs and be better off is, I think, incorrect. NULLs are a solution ( albeit a poor one ) to a complex, very general, and difficult problem. I think that sum( empty set ) -> NULL is stupid; however, if I know the standard, I can determine with one statement whether a result set is empty AND what the sum of the non-NULL data is. There’s no extra database call to find the result set cardinality, as I need in R. There’s no need to go back to the server, as I would in 99% of cases if, like R, the database returned NA for the sum over any NAs. And the missing case – the check for NULL existence – is fast and explicit.

    • “I disagree with the statements that the results will take the form of the correct answer even if wrong”

      Take a look at this example:
      > 1 %in% c(3,2,NA)
      [1] FALSE

      That takes the form of a correct answer, but it looks wrong to me.

      However, I’ll partially retract my claim that R passes the values along, because I did not realize that R throws an exception when testing NA (e.g. if(NA)). This puts R closer to the first category.

      “In SQL the inconsistent way in which NULLS are treated can yield incorrect answers that look correct.”

      I agree with that statement.

      “then mean( heights ) in R yields NA – a correct answer”

      Of course, if you start by using NULL to represent the third logical value, and then translate that to the third logical value in R — NA — you may get a correct result.

      But that’s not always the case. Say you have a query to find the total value of cars by dealership, something like:

      SELECT dealership_name, sum(car_value) FROM car GROUP BY dealership_name;

      If some dealership has no cars, you get a NULL. That NULL should really be a 0. But because it’s a NULL, you end up with an NA in R, even though *there are no NULLs in your base data*. Now, there are all kinds of answers that R can’t give you, because it thinks that you’re missing information, even though your information is complete.

      “NA’s propagate through the standard operators in a natural way.”

      It’s certainly better than in SQL, I agree with that.

      But how natural is this?:
      > x <- NA
      > (x > 5 || x < 10)
      [1] NA
      > (x || !x)
      [1] NA

      Humans generally reason using basic tautologies, like “P OR NOT P” is always true. When you introduce 3VL, you lose useful tautologies such as that one. To the extent that “P OR NOT P” is intuitive, 3VL is unintuitive. So I don’t think it can be said without explanation that it is “natural”.

      That being said, just because 3VL is unintuitive or unnatural doesn’t imply that it’s bad. What it does mean is that, if you base a system on 3VL, the users need to be educated in detail about your particular brand of 3VL semantics to overcome that intuition.
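      To make that concrete, here is a minimal Kleene-style 3VL sketch in Python (using None as the unknown value; the helper names are mine), showing how “P OR NOT P” stops being a tautology:

```python
# A minimal Kleene 3VL sketch: None plays the role of "unknown".

def not3(p):
    # NOT of unknown is unknown.
    return None if p is None else (not p)

def or3(p, q):
    # True dominates; otherwise unknown if either side is unknown.
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

for p in (True, False, None):
    print(p, "OR NOT", p, "=>", or3(p, not3(p)))
# True and False both give True; None gives None (unknown),
# so "P OR NOT P" is no longer always true.
```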

      “the idea that we could eliminate NULL’s and be better off is, I think, incorrect”

      That’s a pretty broad statement. First of all, there are at least a couple questions here:

      1. Do we want 3VL?
      2. If so, what is a good 3VL system to use?
      3. Is there any 3VL system in existence (and clearly documented) with some kind of careful analysis behind it, that can stand up to criticism?
      4. If 3VL is so important, why do most languages not provide it?
      5. Are we talking about 3VL at all, or are we talking about something else (like “not applicable” or “absent value”)?

      I don’t like the vague notion that we should have something like NULL, regardless of the costs — often with no real analysis about what those costs may be. R seems to be much more sane, but it’s hardly without costs. And there are really no guidelines anywhere about how to manage those costs in day-to-day programming. In other words, which functions/methods should pass along an unknown value, which should raise an exception, and which should produce a value? 3VL permeates every aspect of application and database design, and I think that it’s unwise to approach it as though any idea is a good solution.

      “[with NULLs] I can determine with one statement whether a result set is empty AND what the sum of the non-null data is.”

      Actually, a NULL return from SUM() is still ambiguous. You’re just shifting the ambiguity somewhere else.

      There are 4 cases:
      (1) You are passed no values
      (2) You are passed only non-NULL values
      (3) You are passed only NULLs
      (4) You are passed a mix of NULLs and non-NULL values

      In SQL, when SUM() returns NULL, that may mean either case #1 or case #3. When SUM() returns a non-NULL value that may mean either case #2 or case #4.

      In R, with NAs, if sum() returns a value, that may mean case #1 or case #2, and if it returns NA, that may mean case #3 or case #4.

      You can’t eliminate ambiguity through crazy NULL semantics. I strongly prefer R’s semantics; I believe that it’s much more important to distinguish between #1 and #3 than #1 and #2.
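      The four cases can be played out side by side. A quick Python sketch, using None for both NULL and NA:

```python
# Sketch: SQL-style vs R-style SUM over the four input cases.

def sql_sum(xs):
    # SQL's SUM skips NULLs and returns NULL when nothing remains.
    vals = [x for x in xs if x is not None]
    return sum(vals) if vals else None

def r_sum(xs):
    # R's sum propagates NA; the empty sum is 0.
    return None if any(x is None for x in xs) else sum(xs)

cases = [
    ("1: no values",      []),
    ("2: only non-NULLs", [1, 2]),
    ("3: only NULLs",     [None, None]),
    ("4: mixed",          [1, None]),
]
for label, xs in cases:
    print(label, "-> SQL:", sql_sum(xs), " R:", r_sum(xs))
# SQL returns NULL for cases 1 and 3, a number for 2 and 4;
# R returns a number for cases 1 and 2, NA for 3 and 4.
```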

      “There’s no need to go back to the server,”

      Why go back to the server? SELECT COUNT(*), SUM(foo) …

  2. Pingback: What is the deal with NULLs? « Experimental Thoughts

  3. About Haskell:

    1) Nothing isn’t a type. It is a value of the type Maybe a (for any type a).

    2) It doesn’t “usually result in immediate exception”. All functions and operators which take Maybe a at all will normally handle the Nothing case; if they didn’t, they wouldn’t accept Maybe a.

    Functions which don’t accept Maybe (e.g. +) may be lifted into the Maybe monad. In this case, if any arguments are Nothing, so is the result.

    To take your 4th SQL example:

    Just 1 + Nothing won’t compile;

    liftM2 (+) (Just 1) Nothing will return Nothing.

    • > [Haskell] doesn’t “usually result in immediate exception”

      Yeah, it will generally happen even earlier: at compile time.

      I grouped Haskell in the first category because you generally need to handle the cases of missing data, and also you can generally do whatever you want with the missing data.

      SQL forces you into certain behaviors by defining things like SUM to completely skip over missing data. To try to handle it in some other way is awkward, and to try to force an exception to be raised when data is missing is even more awkward.

      But yes, you’re right, I was speaking too loosely, and to include Haskell in this discussion warranted a little more explanation (and a little more knowledge; I don’t really know Haskell).

  4. In my opinion, NULL’s behaviour is deeply considered, and NULLs behave in exactly the way they should to provide the required functionality. NULL can be considered as ‘does not apply for that row’, and aggregate functions should ignore NULL values. However, adding anything to something that ‘does not apply’ makes the whole expression ‘not applicable’.
    NULLs are very logical, as long as you don’t look for similarities in more primitive (3rd-generation) languages.

    • > aggregate functions should ignore NULL values

      Then why doesn’t COUNT(*) ignore NULL values? Or ARRAY_AGG(a)?

      > NULLs are very logical

      Can you point me to a specific algebra of NULLs, similar to the standard 2VL boolean logic algebra?