Comments on: None, nil, Nothing, undef, NA, and SQL NULL
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/
Ideas on Databases, Logic, and Language by Jeff Davis

By: Jeff Davis
Sun, 09 Aug 2009 00:15:00 +0000
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/comment-page-1/#comment-89

> [Haskell] doesn’t “usually result in immediate exception”

Yeah, it will generally happen even earlier: at compile time.

I grouped Haskell in the first category because you generally need to handle the cases of missing data, and also you can generally do whatever you want with the missing data.

SQL forces you into certain behaviors by defining things like SUM to completely skip over missing data. To try to handle it in some other way is awkward, and to try to force an exception to be raised when data is missing is even more awkward.
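
For example, a sketch (the table t and column x are hypothetical):

-- x contains the values 1, 2, NULL
SELECT SUM(x) FROM t;                  -- 3: the NULL row is silently skipped
SELECT SUM(COALESCE(x, 0)) FROM t;     -- explicit workaround: treat missing data as 0
SELECT CASE WHEN COUNT(*) = COUNT(x)
            THEN SUM(x) END FROM t;    -- awkward: NULL result whenever any x IS NULL
-- raising a real error on missing data requires procedural code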

But yes, you’re right, I was speaking too loosely, and to include Haskell in this discussion warranted a little more explanation (and a little more knowledge; I don’t really know Haskell).

By: Jeff Davis
Fri, 07 Aug 2009 15:08:47 +0000
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/comment-page-1/#comment-84

> aggregate functions should ignore NULL values

Then why doesn’t COUNT(*) ignore NULL values? Or ARRAY_AGG(a)?
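
A quick sketch of the inconsistency (table t and column a are hypothetical; a holds the values 1 and NULL):

SELECT COUNT(*)     FROM t;   -- 2: counts rows, NULLs included
SELECT COUNT(a)     FROM t;   -- 1: this form does skip NULLs
SELECT SUM(a)       FROM t;   -- 1: the NULL is skipped
SELECT ARRAY_AGG(a) FROM t;   -- {1,NULL}: the NULL is kept (PostgreSQL)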

> NULLs are very logical

Can you point me to a specific algebra of NULLs, similar to the standard 2VL boolean logic algebra?

By: Łukasz Lech
Thu, 06 Aug 2009 06:30:40 +0000
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/comment-page-1/#comment-77

In my opinion, NULL’s behaviour is deeply considered, and NULLs behave exactly as they should to provide the required functionality. NULL can be read as ‘does not apply to that row’, and aggregate functions should ignore NULL values. However, adding anything to something that ‘does not apply’ makes the whole expression ‘not applicable’.
NULLs are very logical, as long as you don’t look for similarities in more primitive (3rd-generation) languages.

By: Alexey Romanov
Mon, 03 Aug 2009 12:26:33 +0000
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/comment-page-1/#comment-55

About Haskell:

1) Nothing isn’t a type. It is a value of the type Maybe a (for any type a).

2) It doesn’t “usually result in immediate exception”. All functions and operators which take Maybe a at all will normally handle the Nothing case; if they didn’t, they wouldn’t accept Maybe a.

Functions which don’t accept Maybe (e.g. +) may be lifted into the Maybe monad. In this case, if any arguments are Nothing, so is the result.

To take your 4th SQL example:

Just 1 + Nothing won’t compile;

liftM2 (+) (Just 1) Nothing will return Nothing.

By: What is the deal with NULLs? « Experimental Thoughts
Sun, 02 Aug 2009 22:40:26 +0000
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/comment-page-1/#comment-35

[...] which may be true. But there are lots of ways to represent missing information, as I pointed out in a previous post, and SQL’s approach to missing information is, well, [...]

By: Jeff Davis
Thu, 14 Aug 2008 08:33:49 +0000
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/comment-page-1/#comment-27

“I disagree with the statements that the results will take the form of the correct answer even if wrong”

Take a look at this example:
> 1 %in% c(3,2,NA)
[1] FALSE

That takes the form of a correct answer, but it looks wrong to me.

However, I’ll partially retract my claim that R passes the values along, because I did not realize that R throws an exception when testing NA (e.g. if(NA)). This puts R closer to the first category.

“In SQL the inconsistent way in which NULLS are treated can yield incorrect answers that look correct.”

I agree with that statement.

“then mean( heights ) in R yields NA – a correct answer”

Of course, if you start by using NULL to represent the third logical value, and then translate that to the third logical value in R — NA — you may get a correct result.

But that’s not always the case. Say you have a query to find the total value of cars by dealership, something like:

SELECT dealership_name, sum(car_value) FROM car GROUP BY dealership_name;

If some dealership has no cars, you get a NULL. That NULL should really be a 0. But because it’s a NULL, you end up with an NA in R, even though *there are no NULLs in your base data*. Now, there are all kinds of answers that R can’t give you, because it thinks that you’re missing information, even though your information is complete.
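
A minimal sketch of that empty-set case (the dealership name is hypothetical; note the GROUP BY query above would simply omit a dealership with no car rows, so the NULL arises when summing an empty selection, or when left-joining from a dealership table):

select sum(car_value) from car where dealership_name = 'Empty Lot Motors';
-- no matching rows, so SUM returns NULL, where 0 is arguably the right total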

“NA’s propagate through the standard operators in a natural way.”

It’s certainly better than in SQL, I agree with that.

But how natural is this?:
> x <- NA
> (x > 5 || x < 10)
[1] NA
> (x || !x)
[1] NA

Humans generally reason using basic tautologies, like “P OR NOT P” is always true. When you introduce 3VL, you lose useful tautologies such as that one. To the extent that “P OR NOT P” is intuitive, 3VL is unintuitive. So I don’t think it can be said without explanation that it is “natural”.

That being said, just because 3VL is unintuitive or unnatural doesn’t imply that it’s bad. What it does mean is that, if you base a system on 3VL, the users need to be educated in detail about your particular brand of 3VL semantics to overcome that intuition.

“the idea that we could eliminate NULL’s and be better off is, I think, incorrect”

That’s a pretty broad statement. First of all, there are several questions here:

1. Do we want 3VL?
2. If so, what is a good 3VL system to use?
3. Is there any 3VL system in existence (and clearly documented) with some kind of careful analysis behind it, that can stand up to criticism?
4. If 3VL is so important, why do most languages not provide it?
5. Are we talking about 3VL at all, or are we talking about something else (like “not applicable” or “absent value”)?

I don’t like the vague notion that we should have something like NULL, regardless of the costs — often with no real analysis about what those costs may be. R seems to be much more sane, but it’s hardly without costs. And there are really no guidelines anywhere about how to manage those costs in day-to-day programming. In other words, which functions/methods should pass along an unknown value, which should raise an exception, and which should produce a value? 3VL permeates every aspect of application and database design, and I think that it’s unwise to approach it as though any idea is a good solution.

“[with NULLs] I can determine with one statement whether a result set is empty AND what the sum of the non-null data is.”

Actually, a NULL return from SUM() is still ambiguous. You’re just shifting the ambiguity somewhere else.

There are 4 cases:
(1) You are passed no values
(2) You are passed only non-NULL values
(3) You are passed only NULLs
(4) You are passed a mix of NULLs and non-NULL values

In SQL, when SUM() returns NULL, that may mean either case #1 or case #3. When SUM() returns a non-NULL value, that may mean either case #2 or case #4.

In R, with NAs, if sum() returns a value, that may mean case #1 or case #2, and if it returns NA, that may mean case #3 or case #4.

You can’t eliminate ambiguity through crazy NULL semantics. I strongly prefer R’s semantics; I believe that it’s much more important to distinguish between #1 and #3 than #1 and #2.

“There’s no need to go back to the server,”

Why go back to the server? SELECT COUNT(*), SUM(foo) …
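
A sketch of how a single query can distinguish all four cases (table t and column foo are hypothetical):

SELECT COUNT(*) AS n_rows, COUNT(foo) AS n_nonnull, SUM(foo) AS total FROM t;
-- n_rows = 0                       -> case 1: no rows at all
-- n_nonnull = n_rows, n_rows > 0   -> case 2: only non-NULL values
-- n_rows > 0, n_nonnull = 0        -> case 3: only NULLs
-- 0 < n_nonnull < n_rows           -> case 4: a mix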

By: Nathan Boley
Thu, 14 Aug 2008 06:47:37 +0000
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/comment-page-1/#comment-26
“Languages from the second category tend to pass the “undef” or “NA” along deeper into the application, which can cause unintuitive and difficult-to-trace problems. Perhaps worse, something will always happen, and usually the result will take the form of the correct answer even if it is wrong.”

At least for R, I disagree with the statement that the results will take the form of the correct answer even if wrong. In fact, I think that the exact opposite is true. In SQL, the inconsistent way in which NULLs are treated can yield incorrect answers that look correct.

One of the advantages of mapping NULLs to NAs is that the NAs propagate through the standard operators in a natural way. It is for that reason that I often prefer to do analysis in R rather than in the database. Consider collecting anonymous data on the physical characteristics of a population. For heights, we could store the data in the table below.

CREATE TABLE heights (
    gender ENUM('male', 'female'),
    height_meters FLOAT
);

with the data set

female 1.48
female 1.54
female 1.57
female 1.41
male NULL
male 1.72
male 1.79
male NULL

If someone asks what the population’s expected average height is, the typical answer is

select AVG(height_meters) from heights;

This would return the wrong answer: AVG silently skips the two NULL male heights, so the female values are over-weighted (4 of the 6 remaining values). For the above population, a better estimate for the average height would be

select (select AVG(height_meters) from heights where gender = 'male') / 2
     + (select AVG(height_meters) from heights where gender = 'female') / 2;

In R, under the NULL-to-NA mapping, this becomes immediately apparent:

select height_meters from heights

then mean(heights) in R yields NA – a correct answer. Given the above data set, the mean IS unknown. I would much prefer to receive NA as my answer than to receive an answer that looks correct but isn’t.

The real problem with NULLs isn’t that they exist: it’s how entrenched they are in SQL. It would be far better for the standard to raise errors than to silently insert NULLs into result sets. This fits in line with your assertion that “the best strategy is to try to interpret and remove NULLs as early as possible”. Right now, NULLs are hard to intercept and deal with, particularly at the application level. A better error mechanism would help this. In addition, clearly “the inconsistency between ‘not a value at all’ and ‘the third logical value’” is bad: the third-logical-value component should be handled at the individual type level (i.e. int_w_null and int_wo_null) and the ‘not a value’ component (again) should be handled by an error hierarchy.
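
The closest existing SQL mechanism to that per-type split is the column-level NOT NULL constraint; a sketch (the table and columns are hypothetical):

CREATE TABLE readings (
    reading_id int NOT NULL,  -- the 'int_wo_null' flavor: inserting NULL raises an error
    reading    float          -- the 'int_w_null' flavor: NULL is permitted
);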

However, reading between the lines of this post and others, the idea that we could eliminate NULLs and be better off is, I think, incorrect. NULLs are a solution (albeit a poor one) to a complex, very general, and difficult problem. I think that sum(empty set) -> NULL is stupid; however, if I know the standard, I can determine with one statement whether a result set is empty AND what the sum of the non-NULL data is. There’s no extra database call to find the result-set cardinality, as there is in R. There’s no need to go back to the server, as I would in 99% of cases if, like R, the database returned NA for the sum over any NAs. And the missing case – the check for NULL existence – is fast and explicit.
